Role overview
You will own the internal training + research MLOps platform: scalable PEFT post-training (LoRA/QLoRA), dataset/label pipelines, data acquisition + ingestion, evaluation automation, experiment tracking, and cost-efficient GPU orchestration (including spot/preemptible strategies). You will also own the research inference layer (model serving, batching/caching, version routing) that closes the loop between training, rollouts, and evaluation. This is a “systems + research acceleration” role: your output is making research iterations reliable, fast, cost-efficient, and auditable.
What you'll work on
Training + post-training pipelines (PEFT-first):
- Build reproducible LoRA/QLoRA fine-tuning pipelines for SFT, verifier/PRM training, and RLVR-style post-training (where used), optimized for cost and iteration speed
- Robust checkpointing/resume and failure handling for long-running jobs (see the sketch after this list)
- Artifact management: dataset versions, configs, checkpoints, eval results, and model registry with lineage
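To make the checkpoint/resume expectation concrete, here is a minimal sketch of the pattern, assuming plain PyTorch and a local checkpoint path (both assumptions; a real pipeline would layer PEFT adapters, remote artifact storage, and the model registry on top):

```python
# Minimal checkpoint/resume sketch for a long-running training job.
# Assumptions: plain PyTorch and a local path; the model and objective
# below are stand-ins, not the actual training stack.
import os
import torch
from torch import nn, optim

CKPT = "checkpoint.pt"  # hypothetical path; real jobs would use versioned remote storage

def save_checkpoint(step, model, optimizer):
    # Write to a temp file, then atomically rename, so a preemption
    # mid-write never leaves a corrupt "latest" checkpoint.
    tmp = CKPT + ".tmp"
    torch.save(
        {"step": step, "model": model.state_dict(), "optimizer": optimizer.state_dict()},
        tmp,
    )
    os.replace(tmp, CKPT)

def load_checkpoint(model, optimizer):
    # Resume from the last completed step if a checkpoint exists.
    if not os.path.exists(CKPT):
        return 0
    state = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1

model = nn.Linear(16, 1)  # stand-in for the real policy/verifier model
optimizer = optim.AdamW(model.parameters(), lr=1e-4)
start = load_checkpoint(model, optimizer)

for step in range(start, 1000):
    x = torch.randn(8, 16)
    loss = model(x).pow(2).mean()  # dummy objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        save_checkpoint(step, model, optimizer)
```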
Inference serving + rollout collection (research-grade):
- Operate an LLM serving stack (e.g., vLLM/SGLang) for policy + verifier/PRM models
- Optimize throughput/cost via batching, caching, scheduling, and profiling
- Build reliable rollout collection and replay tooling (configs, model versions, artifacts, traces)
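As a rough illustration of replayable rollout collection, here is a sketch against an OpenAI-compatible completions endpoint (which vLLM and SGLang both expose); the URL, model name, and trace schema are illustrative assumptions, not a fixed internal format:

```python
# Sketch: collect one rollout and append a replayable trace record.
# Assumptions: a local OpenAI-compatible server and a JSONL trace sink.
import json
import time
import uuid

import requests

ENDPOINT = "http://localhost:8000/v1/completions"  # hypothetical vLLM server
TRACE_FILE = "rollouts.jsonl"                      # hypothetical trace sink

def collect_rollout(prompt: str, model: str, **sampling):
    payload = {"model": model, "prompt": prompt, **sampling}
    t0 = time.time()
    resp = requests.post(ENDPOINT, json=payload, timeout=120)
    resp.raise_for_status()
    body = resp.json()
    # Record everything needed to replay or audit this rollout later:
    # model version, sampling config, raw output, and timing.
    record = {
        "rollout_id": str(uuid.uuid4()),
        "model": model,
        "sampling": sampling,
        "prompt": prompt,
        "completion": body["choices"][0]["text"],
        "latency_s": round(time.time() - t0, 3),
        "ts": t0,
    }
    with open(TRACE_FILE, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

collect_rollout("def fib(n):", model="policy-v3", max_tokens=64, temperature=0.8)
```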
GPU orchestration + cost efficiency:
- Multi-GPU training reliability (single-node initially; scale up over time)
- Spot/preemptible strategy: interruption-tolerant training, autoscaling, queueing, capacity-aware scheduling (see the sketch after this list)
- Performance tuning: profiling, dataloading, communication overhead reduction, utilization improvements
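A minimal sketch of the interruption-tolerance piece, assuming the scheduler delivers SIGTERM ahead of reclamation (as Kubernetes and most cloud spot mechanisms do); the idea is to checkpoint at a safe step boundary and exit cleanly so the job can be requeued:

```python
# Sketch: interruption-tolerant training loop on spot/preemptible capacity.
# Assumption: the orchestrator sends SIGTERM before reclaiming the node.
import signal
import sys
import time

preempted = False

def handle_sigterm(signum, frame):
    # Only set a flag here; do the actual checkpoint at a safe step boundary.
    global preempted
    preempted = True

signal.signal(signal.SIGTERM, handle_sigterm)

def checkpoint(step):
    print(f"checkpointing at step {step}")  # stand-in for a real save

for step in range(10_000):
    time.sleep(0.01)  # stand-in for one training step
    if preempted:
        checkpoint(step)
        sys.exit(0)  # exit cleanly; requeueing policy belongs to the scheduler
    if step % 500 == 0:
        checkpoint(step)
```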
Data acquisition + ingestion (training/eval):
- Build ingestion pipelines for code/text/trace datasets, including programmatic collection from select web sources where appropriate
- Implement deduping, normalization, provenance tracking, and dataset versioning (see the sketch after this list)
- Ensure operational robustness (rate limiting, retries, incremental crawls, change detection) and practical compliance hygiene (respect access policies/ToS where required)
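To illustrate the dedup/normalization/provenance bullet, a stdlib-only sketch of one ingestion step; it covers exact-hash dedup only (near-duplicate detection such as MinHash, and the crawling/rate-limiting concerns above, are out of scope here), and all names and paths are illustrative assumptions:

```python
# Sketch: one ingestion step with normalization, exact dedup, and provenance.
# Assumptions: documents arrive as (source_url, text) pairs; the seen-hash set
# and output path would be persistent, versioned stores in practice.
import hashlib
import json
import time
import unicodedata

seen_hashes = set()  # in practice, a persistent store

def normalize(text: str) -> str:
    # Cheap normalization: unicode NFC plus whitespace collapsing.
    return " ".join(unicodedata.normalize("NFC", text).split())

def ingest(source_url: str, text: str, out_path: str = "dataset.jsonl") -> bool:
    doc = normalize(text)
    digest = hashlib.sha256(doc.encode()).hexdigest()
    if digest in seen_hashes:
        return False  # exact duplicate, skip
    seen_hashes.add(digest)
    record = {
        "text": doc,
        "sha256": digest,          # content address for dataset versioning
        "source_url": source_url,  # provenance
        "fetched_at": time.time(),
    }
    with open(out_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return True

ingest("https://example.com/a", "Some  raw\ntext")
ingest("https://example.com/b", "Some raw text")  # dedups against the first
```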
What we're looking for
- Strong systems + ML infra experience: training pipelines, data systems, reliability engineering
- Strong data engineering fundamentals: building ingestion pipelines, handling messy sources, deduping, and dataset versioning/provenance
- Experience running LLM inference serving (vLLM/SGLang/TGI), including batching/caching and performance tuning
- Hands-on experience running multi-GPU training (PyTorch distributed: DDP, FSDP, DeepSpeed, etc.)
- Strong cloud + IaC skills (AWS/GCP; Terraform/CloudFormation/Pulumi)
- Track record building reproducible pipelines (artifact/version management, experiment tracking)
- Performance mindset: profiling, bottleneck identification, cost/perf tradeoffs