Role overview
You will own the internal training + research MLOps platform: scalable PEFT post-training (LoRA/QLoRA), dataset/label pipelines, data acquisition + ingestion, evaluation automation, experiment tracking, and cost-efficient GPU orchestration (including spot/preemptible strategies). You will also own the research inference layer (model serving, batching/caching, version routing) that closes the loop between training, rollouts, and evaluation. This is a “systems + research acceleration” role: your output is making research iterations reliable, fast, cost-efficient, and auditable.
What you'll work on
Training + post-training pipelines (PEFT-first):
- Build reproducible LoRA/QLoRA fine-tuning pipelines for SFT, verifier/PRM training, and RLVR-style post-training (where used), optimized for cost and iteration speed
- Robust checkpointing/resume and failure handling for long-running jobs (see the sketch after this list)
- Artifact management: dataset versions, configs, checkpoints, eval results, and model registry with lineage
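To make the checkpoint/resume expectation concrete, here is a minimal sketch of the pattern, assuming plain PyTorch and a local checkpoint path (both assumptions; a real pipeline would layer PEFT adapters, remote artifact storage, and the model registry on top):

```python
# Minimal checkpoint/resume sketch for a long-running training job.
# Assumptions: plain PyTorch and a local path; the model and objective
# below are stand-ins, not the actual training stack.
import os
import torch
from torch import nn, optim

CKPT = "checkpoint.pt"  # hypothetical path; real jobs would use versioned remote storage

def save_checkpoint(step, model, optimizer):
    # Write to a temp file, then atomically rename, so a preemption
    # mid-write never leaves a corrupt "latest" checkpoint.
    tmp = CKPT + ".tmp"
    torch.save(
        {"step": step, "model": model.state_dict(), "optimizer": optimizer.state_dict()},
        tmp,
    )
    os.replace(tmp, CKPT)

def load_checkpoint(model, optimizer):
    # Resume from the last completed step if a checkpoint exists.
    if not os.path.exists(CKPT):
        return 0
    state = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1

model = nn.Linear(16, 1)  # stand-in for the real policy/verifier model
optimizer = optim.AdamW(model.parameters(), lr=1e-4)
start = load_checkpoint(model, optimizer)

for step in range(start, 1000):
    x = torch.randn(8, 16)
    loss = model(x).pow(2).mean()  # dummy objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        save_checkpoint(step, model, optimizer)
```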
Inference serving + rollout collection (research-grade):
- Operate an LLM serving stack (e.g., vLLM/SGLang) for policy + verifier/PRM models
- Optimize throughput/cost via batching, caching, scheduling, and profiling
- Build reliable rollout collection and replay tooling (configs, model versions, artifacts, traces)
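As a rough illustration of replayable rollout collection, here is a sketch against an OpenAI-compatible completions endpoint (which vLLM and SGLang both expose); the URL, model name, and trace schema are illustrative assumptions, not a fixed internal format:

```python
# Sketch: collect one rollout and append a replayable trace record.
# Assumptions: a local OpenAI-compatible server and a JSONL trace sink.
import json
import time
import uuid

import requests

ENDPOINT = "http://localhost:8000/v1/completions"  # hypothetical vLLM server
TRACE_FILE = "rollouts.jsonl"                      # hypothetical trace sink

def collect_rollout(prompt: str, model: str, **sampling):
    payload = {"model": model, "prompt": prompt, **sampling}
    t0 = time.time()
    resp = requests.post(ENDPOINT, json=payload, timeout=120)
    resp.raise_for_status()
    body = resp.json()
    # Record everything needed to replay or audit this rollout later:
    # model version, sampling config, raw output, and timing.
    record = {
        "rollout_id": str(uuid.uuid4()),
        "model": model,
        "sampling": sampling,
        "prompt": prompt,
        "completion": body["choices"][0]["text"],
        "latency_s": round(time.time() - t0, 3),
        "ts": t0,
    }
    with open(TRACE_FILE, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

collect_rollout("def fib(n):", model="policy-v3", max_tokens=64, temperature=0.8)
```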
GPU orchestration + cost efficiency:
- Multi-GPU training reliability (single-node initially; scale up over time)
- Spot/preemptible strategy: interruption-tolerant training, autoscaling, queueing, capacity-aware scheduling (see the sketch after this list)
- Performance tuning: profiling, dataloading, communication overhead reduction, utilization improvements
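A minimal sketch of the interruption-tolerance piece, assuming the scheduler delivers SIGTERM ahead of reclamation (as Kubernetes and most cloud spot mechanisms do); the idea is to checkpoint at a safe step boundary and exit cleanly so the job can be requeued:

```python
# Sketch: interruption-tolerant training loop on spot/preemptible capacity.
# Assumption: the orchestrator sends SIGTERM before reclaiming the node.
import signal
import sys
import time

preempted = False

def handle_sigterm(signum, frame):
    # Only set a flag here; do the actual checkpoint at a safe step boundary.
    global preempted
    preempted = True

signal.signal(signal.SIGTERM, handle_sigterm)

def checkpoint(step):
    print(f"checkpointing at step {step}")  # stand-in for a real save

for step in range(10_000):
    time.sleep(0.01)  # stand-in for one training step
    if preempted:
        checkpoint(step)
        sys.exit(0)  # exit cleanly; requeueing policy belongs to the scheduler
    if step % 500 == 0:
        checkpoint(step)
```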
Data acquisition + ingestion (training/eval):
- Build ingestion pipelines for code/text/trace datasets, including programmatic collection from select web sources where appropriate
- Implement deduping, normalization, provenance tracking, and dataset versioning (see the sketch after this list)
- Ensure operational robustness (rate limiting, retries, incremental crawls, change detection) and practical compliance hygiene (respect access policies/ToS where required)
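To illustrate the dedup/normalization/provenance bullet, a stdlib-only sketch of one ingestion step; it covers exact-hash dedup only (near-duplicate detection such as MinHash, and the crawling/rate-limiting concerns above, are out of scope here), and all names and paths are illustrative assumptions:

```python
# Sketch: one ingestion step with normalization, exact dedup, and provenance.
# Assumptions: documents arrive as (source_url, text) pairs; the seen-hash set
# and output path would be persistent, versioned stores in practice.
import hashlib
import json
import time
import unicodedata

seen_hashes = set()  # in practice, a persistent store

def normalize(text: str) -> str:
    # Cheap normalization: unicode NFC plus whitespace collapsing.
    return " ".join(unicodedata.normalize("NFC", text).split())

def ingest(source_url: str, text: str, out_path: str = "dataset.jsonl") -> bool:
    doc = normalize(text)
    digest = hashlib.sha256(doc.encode()).hexdigest()
    if digest in seen_hashes:
        return False  # exact duplicate, skip
    seen_hashes.add(digest)
    record = {
        "text": doc,
        "sha256": digest,          # content address for dataset versioning
        "source_url": source_url,  # provenance
        "fetched_at": time.time(),
    }
    with open(out_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return True

ingest("https://example.com/a", "Some  raw\ntext")
ingest("https://example.com/b", "Some raw text")  # dedups against the first
```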
What we're looking for
- Strong systems + ML infra experience: training pipelines, data systems, reliability engineering
- Strong data engineering fundamentals: building ingestion pipelines, handling messy sources, deduping, and dataset versioning/provenance
- Experience running LLM inference serving (vLLM/SGLang/TGI), including batching/caching and performance tuning
- Hands-on experience running multi-GPU training (PyTorch distributed: DDP, FSDP, DeepSpeed, etc.)
- Strong cloud + IaC skills (AWS/GCP; Terraform/CloudFormation/Pulumi)
- Track record building reproducible pipelines (artifact/version management, experiment tracking)
- Performance mindset: profiling, bottleneck identification, cost/perf tradeoffs