**Role Overview**
We are looking for an AI Infrastructure Engineer to join our growing team. Our stack includes Kubernetes, Slurm, Python, C++, and PyTorch, running primarily on AWS. In this role, you will partner closely with our Inference and Research teams to build, deploy, and optimize our large-scale AI training and inference clusters.
**Responsibilities**
* Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads
* Manage and optimize Slurm-based HPC environments for distributed training of large language models
* Develop robust APIs and orchestration systems for both training pipelines and inference services
* Implement resource scheduling and job management systems across heterogeneous compute environments
* Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure
* Build monitoring, alerting, and observability solutions tailored to ML workloads running on Kubernetes and Slurm
* Respond swiftly to system outages and collaborate across teams to maintain high uptime for critical training runs and inference services
* Optimize cluster utilization and implement autoscaling strategies for dynamic workload demands (see the illustrative sketch below)
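
To give a concrete, deliberately simplified flavor of the day-to-day work above, here is a minimal sketch using the official Kubernetes Python client to report allocatable GPUs per node. The `nvidia.com/gpu` resource name assumes the standard NVIDIA device plugin; the whole snippet is illustrative only and not a description of our production clusters.

```python
"""Illustrative only: a minimal GPU-capacity report built on the official
`kubernetes` Python client. The `nvidia.com/gpu` resource key assumes the
standard NVIDIA device plugin and is a placeholder, not our actual setup."""
from kubernetes import client, config


def gpu_allocatable_per_node() -> dict[str, int]:
    # Load credentials from the local kubeconfig (inside a cluster you would
    # use config.load_incluster_config() instead).
    config.load_kube_config()
    v1 = client.CoreV1Api()

    report = {}
    for node in v1.list_node().items:
        # Allocatable resources are reported as strings, e.g. "8".
        gpus = node.status.allocatable.get("nvidia.com/gpu", "0")
        report[node.metadata.name] = int(gpus)
    return report


if __name__ == "__main__":
    for name, gpus in gpu_allocatable_per_node().items():
        print(f"{name}: {gpus} allocatable GPU(s)")
```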
**Qualifications**
* Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management
* Hands-on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization
* Experience with deploying and managing distributed training systems at scale
* Deep understanding of container orchestration and distributed systems architecture
* High-level familiarity with LLM architecture and training processes (e.g., Multi-Head Attention, Multi-Query/Grouped-Query Attention, distributed training strategies); a brief illustrative sketch follows this list
* Experience managing GPU clusters and optimizing compute resource utilization
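
As a rough illustration of the grouped-query attention mentioned above, here is a toy PyTorch sketch in which several query heads share each key/value head. The head counts and dimensions are arbitrary placeholders, not taken from any particular model.

```python
"""Illustrative only: a toy sketch of grouped-query attention (GQA), where
groups of query heads share one key/value head. Shapes are arbitrary."""
import torch
import torch.nn.functional as F

batch, seq_len, head_dim = 2, 16, 64
n_q_heads, n_kv_heads = 8, 2          # 4 query heads per shared KV head
group_size = n_q_heads // n_kv_heads

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Repeat each shared KV head so it lines up with its group of query heads,
# then run standard scaled dot-product attention (causal, as in decoding).
k = k.repeat_interleave(group_size, dim=1)
v = v.repeat_interleave(group_size, dim=1)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape)  # torch.Size([2, 8, 16, 64])
```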
**Required Skills**
* Expert-level Kubernetes administration and YAML configuration management
* Proficiency with Slurm job scheduling, resource management, and cluster configuration
* Python and C++ programming with focus on systems and infrastructure automation
* Hands-on experience with ML frameworks such as PyTorch in distributed training contexts (see the sketch after this list)
* Strong understanding of networking, storage, and compute resource management for ML workloads
* Experience developing APIs and managing distributed systems for both batch and real-time workloads
* Solid debugging and monitoring skills with expertise in observability tools for containerized environments
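
For the distributed-training side of the role, here is a minimal sketch of a PyTorch DistributedDataParallel (DDP) training step, intended to be launched with `torchrun`. The model, data, and hyperparameters are placeholders; it is a sketch of the pattern, not production code.

```python
"""Illustrative only: a minimal DDP training loop, launched with
`torchrun --nproc_per_node=<gpus> ddp_sketch.py`. Model, data, and
hyperparameters are placeholders."""
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):                              # placeholder training loop
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        loss = model(x).pow(2).mean()                # dummy loss
        optimizer.zero_grad()
        loss.backward()                              # gradients all-reduced by DDP
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```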
**Preferred Skills**
* Experience with Kubernetes operators and custom controllers for ML workloads
* Advanced Slurm administration, including multi-cluster federation and advanced scheduling policies (see the sketch after this list)
* Familiarity with GPU cluster management and CUDA optimization
* Experience with other ML frameworks like TensorFlow or distributed training libraries
* Background in HPC environments, parallel computing, and high-performance networking
* Knowledge of infrastructure as code (Terraform, Ansible) and GitOps practices
* Experience with container registries, image optimization, and multi-stage builds for ML workloads
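
As a simplified example of the Slurm-facing orchestration touched on above, here is a thin Python wrapper around `sbatch`. The partition name, GPU counts, and the federated `--clusters` target are hypothetical; real pipelines typically template a full batch script rather than hard-coding flags like this.

```python
"""Illustrative only: a thin wrapper around `sbatch` of the kind an
orchestration layer might use. Partition, cluster, and GPU counts are
hypothetical placeholders."""
import subprocess
from typing import Optional


def submit_training_job(script: str, nodes: int = 4, gpus_per_node: int = 8,
                        partition: str = "a100",
                        cluster: Optional[str] = None) -> str:
    cmd = [
        "sbatch", "--parsable",               # print only the job id
        f"--nodes={nodes}",
        f"--gres=gpu:{gpus_per_node}",
        f"--partition={partition}",
    ]
    if cluster is not None:                   # route to a federated cluster
        cmd.append(f"--clusters={cluster}")
    cmd.append(script)

    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()              # Slurm job id


if __name__ == "__main__":
    job_id = submit_training_job("train_llm.sbatch", nodes=8)
    print(f"submitted job {job_id}")
```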
**Required Experience**
* Demonstrated experience managing large-scale Kubernetes deployments in production environments
* Proven track record with Slurm cluster administration and HPC workload management
* Previous roles in SRE, DevOps, or Platform Engineering with focus on ML infrastructure
* Experience supporting both long-running training jobs and high-availability inference services
* Ideally, 3-5 years of relevant experience in ML systems deployment, with a specific focus on cluster orchestration and resource management
The cash compensation range for this role is $190,000 - $250,000.
Final offer amounts are determined by multiple factors, including experience and expertise, and may vary from the amounts listed above.
*Equity: In addition to the base salary, equity may be part of the total compensation package.*
*Benefits: Comprehensive health, dental, and vision insurance for you and your dependents. Includes a 401(k) plan.*