AI

Machine Learning Engineer

Doghouse Recruitment

Actively hiring Posted 6 days ago

Role overview

**AI/ML Solutions Architect – Distributed Training & GPU Infrastructure**

**Location:**
Remote from anywhere in the U.S.

**Salary:**
Up to $230k base + bonus + RSUs, depending on seniority

Join a fast-moving AI infrastructure team working on the cutting edge of large-scale ML workloads. This role is ideal for engineers who enjoy solving deep technical challenges in distributed training, multi-GPU systems, and scalable AI inference infrastructure. You'll work directly with AI-focused clients, helping them get the most out of modern GPUs (H100, B200, etc.) and ML frameworks like PyTorch and JAX.

**Team & Responsibilities:**

Work alongside world-class engineers building the infrastructure behind next-gen AI systems. As part of the customer solutions team, you'll:

* Design and deploy high-performance ML pipelines across hundreds to thousands of GPUs
* Guide customers in optimizing distributed training and inference setups
* Deliver tech talks, contribute to whitepapers, and gather feedback for product teams
* Work cross-functionally with engineering, product, and R&D to shape our AI platform

**Required Skills:**

* 5+ years in ML infrastructure, MLOps, or similar roles
* Deep experience with PyTorch or TensorFlow and multi-node training
* Strong understanding of ML pipeline design, performance tuning, and deployment
* Experience with Kubernetes, Slurm, Terraform, Git, and Docker
* Programming in Python (Go, Java, or C++ a plus)

**Bonus:**
Experience moving ML systems from POC to production scale; familiarity with Hugging Face, TensorFlow, or inference optimization

We’re looking for hands-on engineers who understand real-world ML problems and love building scalable, robust systems. If you thrive at the intersection of infrastructure and AI, this is your next move.

Tags & focus areas

Full-time, Remote, Machine Learning, PyTorch