AI Infrastructure Engineer

Electria Group · California, United States

Actively hiring · Posted 23 days ago

Role overview

**AI Infrastructure Engineer**

**San Francisco Bay Area, CA**

We're partnering with a well-funded AI startup building next-generation infrastructure to power large-scale machine learning applications. Their platform enables companies to train, deploy, and scale AI models efficiently across distributed systems and cloud environments.

They're expanding their infrastructure team and are looking for engineers who can architect and optimize the systems behind modern AI workloads.

**Key Responsibilities:**

* Design and build scalable infrastructure for distributed model training and inference across GPU clusters
* Develop high-performance model serving platforms that handle thousands of requests per second with low latency
* Optimize data pipelines and storage systems to support massive-scale ML workloads
* Build tools and frameworks to improve developer productivity and streamline model deployment
* Collaborate with ML engineers and researchers to understand infrastructure requirements and bottlenecks
* Implement monitoring, observability, and reliability systems for production AI applications
* Work on edge AI infrastructure and distributed computing architectures

**Required Experience:**

* 5+ years in infrastructure engineering, distributed systems, or platform engineering
* Strong programming skills in Python and C++ (or Go/Rust)
* Deep experience with cloud platforms (AWS, GCP, or Azure) and infrastructure tooling such as Terraform and Kubernetes
* Hands-on experience with GPU infrastructure, CUDA, or ML frameworks (PyTorch, TensorFlow, JAX)
* Understanding of distributed computing, parallel processing, and system optimization
* Experience building APIs, microservices, or platform tools used by engineering teams

**Nice to Have:**

* Familiarity with containers and container orchestration (Docker, Kubernetes) and CI/CD pipelines
* Experience with Ray, Kubeflow, MLflow, or other ML infrastructure frameworks
* Prior work at a high-growth startup or building 0-to-1 infrastructure products

Tags & focus areas

Full-time · AI · Machine Learning · PyTorch · TensorFlow