Role overview
**Job**: Computer Vision/AI Engineer
**Duration**: Long term contract
**Location**: Orlando, FL
**Job Type**: Hybrid
**Job Description:**
**How You’ll Make an Impact**
• Designing, building, and optimizing all aspects of large-scale training and fine-tuning, from data loading to inference, to maximize Model FLOPs Utilization (MFU) on large compute clusters.
• Working closely and proactively with research scientists to translate models and algorithms into high-performance, production-ready code, integrating and testing the latest advancements.
• Relentlessly profiling and resolving training performance bottlenecks, optimizing the entire training stack for speed and efficiency.
• Contributing to the evaluation and selection of hardware, software, and cloud services for the AI infrastructure platform.
• Using MLOps frameworks (MLflow, Weights & Biases, etc.) to apply best practices across the model lifecycle, ensuring reproducibility, reliability, and continuous improvement.
• Creating thorough documentation for infrastructure and training procedures, staying updated on advancements in training strategies, and driving improvements in workflows and infrastructure.
**What You Bring**
• Master's degree or higher in Computer Science, Engineering, or a related technical field.
• 5 or more years of experience as a Data & AI (Artificial Intelligence) Engineer or Machine Learning Engineer, focusing on building and optimizing infrastructure for large-scale machine learning systems. \*Candidates with more or less experience may be considered for a higher or lower level, respectively.
• Deep practical expertise with AI frameworks (PyTorch, JAX, PyTorch Lightning, etc.), large-scale multi-node GPU training, and optimization strategies for large foundation models on distributed compute infrastructure.
• Excellent problem-solving, debugging, and performance optimization skills, with a data-driven approach to identifying and resolving technical challenges.