Role overview
- 3+ years working with large-scale ML systems or training pipelines
- Deep familiarity with PyTorch, especially distributed training via FSDP, DeepSpeed, or DDP
- Comfortable navigating training libraries like TorchTune, Accelerate, or Trainer APIs
- Practical experience with multi-node GPU training, including profiling, debugging, and optimizing jobs
- Understanding of low-level components like NCCL, InfiniBand, CUDA memory, and model partitioning strategies
- You enjoy bridging research and engineering, making messy ideas actually run on hardware
Nice to have
- Experience maintaining Slurm, Ray, or Kubernetes clusters
- Past contributions to open-source ML training frameworks
- Exposure to model scaling laws, checkpointing formats (e.g., HF sharded safetensors vs. distcp), or mixed precision training
- Familiarity with on-policy reinforcement learning setups where inference (policy rollouts) is part of the training loop, such as GRPO, PPO, or A2C
- Experience working at a startup
Interview process
- Initial screening - Head of Talent (30 mins)
- Hiring manager interview - Head of AI (45 mins)
- Technical interview - AI Chief Scientist and/or Head of AI (45 mins)
- Culture fit / Q&A (potentially in person) - co-founder & CEO (45 mins)
Tags & focus areas
Used for matching and alerts on DevFound: Full-time, AI, AI Engineer, Machine Learning, Deep Learning, Generative AI