Role overview
**AI/ML Solutions Architect – Distributed Training & GPU Infrastructure**
**Location:**
Remote from anywhere in the U.S.
**Salary:**
Up to $230k base + bonus + RSUs, depending on seniority
Join a fast-moving AI infrastructure team working on the cutting edge of large-scale ML workloads. This role is ideal for engineers who enjoy solving deep technical challenges in distributed training, multi-GPU systems, and scalable AI inference infrastructure. You'll work directly with AI-focused clients, helping them get the most out of modern GPUs (H100, B200, etc.) and ML frameworks like PyTorch and JAX.
**Team & Responsibilities:**
Work alongside world-class engineers building the infrastructure behind next-gen AI systems. As part of the customer solutions team, you'll:
* Design and deploy high-performance ML pipelines across hundreds to thousands of GPUs
* Guide customers in optimizing distributed training and inference setups
* Deliver tech talks, contribute to whitepapers, and gather feedback for product teams
* Work cross-functionally with engineering, product, and R&D to shape our AI platform
**Required Skills:**
* 5+ years in ML infrastructure, MLOps, or similar roles
* Deep experience with PyTorch or TensorFlow and multi-node training
* Strong understanding of ML pipeline design, performance tuning, and deployment
* Kubernetes, Slurm, Terraform, Git, Docker
* Programming in Python (Go, Java, or C++ a plus)
**Bonus:**
Experience moving ML systems from POC to production scale; familiarity with Hugging Face, TensorFlow, or inference optimization
We’re looking for hands-on engineers who understand real-world ML problems and love building scalable, robust systems. If you thrive at the intersection of infrastructure and AI, this is your next move.