Role overview

**MLOps Engineer (Training Scalability & Workflow Optimization)**
We are seeking an MLOps Engineer to lead the scaling of machine learning training pipelines and ensure the robustness and efficiency of our end-to-end ML workflows. This role focuses on leveraging Flyte, Kubernetes (with an emphasis on GPU optimization), Docker, and distributed training frameworks such as Ray to optimize and streamline our ML infrastructure.
What you'll work on
- Workflow Orchestration: Develop and maintain Flyte workflows that orchestrate complex ML pipelines for training, testing, and deployment.
- Training Scalability: Architect and operate large-scale ML training systems on GPU-backed Kubernetes clusters, including auto-scaling and performance tuning for multi-node/multi-GPU workloads.
- Distributed Computing: Implement distributed model training pipelines using frameworks like Ray for parallelization and resource efficiency.
- Containerization: Design, build, and optimize Docker images for ML workloads with a focus on reproducibility and security.
- Resource Optimization: Debug and optimize GPU utilization, memory, and compute bottlenecks during training and inference phases.
- Monitoring & Maintenance: Integrate monitoring for ML jobs, track resource consumption, and enforce cost-efficient resource utilization.
- Collaboration: Work closely with data scientists and ML engineers to productize and scale ML experiments.
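To give a concrete flavor of the Kubernetes-side work described above, here is a minimal sketch of a Kubernetes Job that schedules a GPU-backed training run. The image name, job name, and resource counts are illustrative assumptions, not part of this role description; GPU scheduling via `nvidia.com/gpu` assumes the NVIDIA device plugin is installed on the cluster.

```yaml
# Hypothetical sketch: a Kubernetes Job requesting GPUs for a training run.
# Names, image, and resource figures are illustrative assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: train-job                # hypothetical job name
spec:
  backoffLimit: 2                # retry a failed training pod up to twice
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: registry.example.com/ml/trainer:latest  # hypothetical image
          command: ["python", "train.py"]                # hypothetical entrypoint
          resources:
            limits:
              nvidia.com/gpu: 4  # GPUs per pod; requires the NVIDIA device plugin
              memory: "64Gi"
              cpu: "16"
```

In practice, resource limits like these are a common lever for the "Resource Optimization" work above: right-sizing GPU, memory, and CPU requests per workload keeps utilization high and cluster costs predictable.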