Role overview
AI/ML Solutions Architect – Distributed Training & GPU Infrastructure
Location:
Remote from anywhere in the U.S.
Salary:
Up to $230k base + bonus + RSUs, depending on seniority
Join a fast-moving AI infrastructure team working on the cutting edge of large-scale ML workloads. This role is ideal for engineers who enjoy solving deep technical challenges in distributed training, multi-GPU systems, and scalable AI inference infrastructure. You'll work directly with AI-focused clients, helping them get the most out of modern GPUs (H100, B200, etc.) and ML frameworks like PyTorch and JAX.
What you'll work on
Work alongside world-class engineers building the infrastructure behind next-gen AI systems. As part of the customer solutions team, you'll:
- Design and deploy high-performance ML pipelines across hundreds to thousands of GPUs
- Guide customers in optimizing distributed training and inference setups
- Deliver tech talks, contribute to whitepapers, and gather feedback for product teams
- Work cross-functionally with engineering, product, and R&D to shape our AI platform
What we're looking for
- 5+ years in ML infrastructure, MLOps, or similar roles
- Deep experience with PyTorch or TensorFlow and multi-node training
- Strong understanding of ML pipeline design, performance tuning, and deployment
- Hands-on experience with Kubernetes, Slurm, Terraform, Git, and Docker
- Programming in Python (Go, Java, or C++ a plus)
We’re looking for hands-on engineers who understand real-world ML problems and love building scalable, robust systems. If you thrive at the intersection of infrastructure and AI, this is your next move.