
MLOps Engineer

Arrayo

Actively hiring Posted 28 days ago

**MLOps Engineer (Training Scalability & Workflow Optimization)**

**Overview**

We are seeking an **MLOps Engineer** to lead the scaling of machine learning training pipelines and ensure the robustness and efficiency of our end-to-end ML workflows. This role focuses on leveraging **Flyte**, **Kubernetes (GPU optimization)**, **Docker**, and distributed training frameworks such as **Ray** to optimize and streamline our ML infrastructure.

**Responsibilities**

* **Workflow Orchestration:** Develop and maintain ML workflows using **Flyte** to manage complex ML pipelines for training, testing, and deployment.
* **Training Scalability:** Architect and scale ML training systems on **GPU-backed Kubernetes clusters**, including auto-scaling and performance tuning for multi-node/multi-GPU workloads.
* **Distributed Computing:** Implement distributed model training pipelines using frameworks like **Ray** for parallelization and resource efficiency.
* **Containerization:** Design, build, and optimize Docker images for ML workloads with a focus on reproducibility and security.
* **Resource Optimization:** Debug and optimize GPU utilization, memory, and compute bottlenecks during training and inference phases.
* **Monitoring & Maintenance:** Integrate monitoring for ML jobs, track resource consumption, and enforce cost-efficient resource utilization.
* **Collaboration:** Work closely with data scientists and ML engineers to productize and scale ML experiments.

**Qualifications**

* Strong proficiency with **Kubernetes** (GPU scheduling, Helm, cluster autoscaling).
* Hands-on experience with **Flyte** or similar workflow orchestration tools (Airflow, Prefect).
* Deep knowledge of distributed ML training (e.g., PyTorch DDP, Ray, Horovod).
* Expertise in **Docker** and container lifecycle management.
* Solid understanding of the GPU hardware/software stack (CUDA, NCCL).
* Familiarity with CI/CD for ML (MLOps pipelines using tools like GitHub Actions, ArgoCD).
* Bonus: Familiarity with observability tools for ML systems (Prometheus, Grafana).
