P-1 AI

Machine Learning Engineer, Training Infrastructure

P-1 AI · San Francisco, CA

Actively hiring · Posted about 2 months ago

Role overview

  • 3+ years working with large-scale ML systems or training pipelines
  • Deep familiarity with PyTorch, especially distributed training via FSDP, DeepSpeed, or DDP
  • Comfortable navigating training libraries like TorchTune, Accelerate, or Trainer APIs
  • Practical experience with multi-node GPU training, including profiling, debugging, and optimizing jobs
  • Understanding of low-level components like NCCL, InfiniBand, CUDA memory, and model partitioning strategies
  • You enjoy bridging research and engineering, making messy ideas actually run on hardware

Nice to have

  • Experience maintaining Slurm, Ray, or Kubernetes clusters
  • Past contributions to open-source ML training frameworks
  • Exposure to model scaling laws, checkpointing formats (e.g., HF sharded safetensors vs. distcp), or mixed precision training
  • Familiarity with on-policy reinforcement learning setups with inference (policy rollouts) as part of the training loop, such as GRPO, PPO, or A2C
  • Experience working at a startup

Interview process

  • Initial screening - Head of Talent (30 mins)
  • Hiring manager interview - Head of AI (45 mins)
  • Technical interview - AI Chief Scientist and/or Head of AI (45 mins)
  • Culture fit / Q&A, possibly in person - Co-founder & CEO (45 mins)

Tags & focus areas

Full-time · AI · AI Engineer · Machine Learning · Deep Learning · Generative AI