Role overview
**NVIDIA AI Infrastructure & Kubernetes Platform Engineer (DGX Systems), Remote**
- Related certifications required
- Experience: 6 months to 1+ years
- Compensation: open
- US citizenship or green card required
Alternate titles depending on context:
- AI Platform Architect – DGX & SuperPOD
- AI Infrastructure DevOps Engineer – NVIDIA DGX Stack
- Senior AI Systems Engineer – DGX | Kubernetes | InfiniBand

**Job Description:**
We are seeking a highly skilled AI Infrastructure & Kubernetes Platform Engineer with a proven track record of deploying and managing NVIDIA DGX-based AI clusters, orchestrating containerized AI workloads with Kubernetes, and ensuring secure, high-throughput operations across InfiniBand-powered networks. The ideal candidate holds a combination of Kubernetes certifications (CKA, CKAD, CKS) and NVIDIA certifications (NCA-AIIO, NCP-AIO, NCP-AII, NCP-AIN), along with hands-on training in DGX, BlueField, and high-speed network operations.
This position plays a key role in supporting AI/ML infrastructure at scale, enabling efficient training and inference for complex models, and integrating NVIDIA's cutting-edge compute, storage, and fabric solutions with modern DevOps practices.
**Core Responsibilities:**

**AI Infrastructure Operations**
- Deploy and manage NVIDIA DGX BasePODs and SuperPODs for high-performance AI workloads.
- Oversee DGX system lifecycle operations including provisioning, monitoring, firmware upgrades, and capacity planning.
- Operate Base Command Manager to manage GPU clusters, schedule workloads, and integrate with MLOps tools.
- Perform DGX node health validation, NCCL interconnect testing, and NVLink topology verification following new deployments or hardware changes.
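Post-deployment health validation of the kind described above is typically scripted. A minimal sketch, assuming `nvidia-smi` query output in CSV form; the sample output, thresholds, and function name here are illustrative placeholders, not a real DGX tool:

```python
# Sketch: flag unhealthy GPUs from nvidia-smi CSV output after a deployment.
# On a real node the CSV would come from something like:
#   nvidia-smi --query-gpu=index,temperature.gpu,ecc.errors.uncorrected.volatile.total \
#              --format=csv,noheader
# Here it is hard-coded so the example is self-contained.
import csv
import io

SAMPLE = """\
0, 41, 0
1, 44, 0
2, 87, 3
"""

def unhealthy_gpus(csv_text, max_temp_c=85):
    """Return indices of GPUs that are too hot or report uncorrected ECC errors."""
    bad = []
    for row in csv.reader(io.StringIO(csv_text)):
        index, temp, ecc = (field.strip() for field in row)
        if int(temp) > max_temp_c or int(ecc) > 0:
            bad.append(int(index))
    return bad

print(unhealthy_gpus(SAMPLE))  # GPU 2 exceeds both thresholds -> [2]
```

In practice a check like this would run as part of node validation after provisioning or hardware changes, alongside NCCL bandwidth tests and NVLink topology verification.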
**What we're looking for:**
- Kubernetes, Helm, GPU Operator, Kubeflow
- DevOps tools: Ansible, Terraform, GitOps, CI/CD pipelines
- Storage: NFS, BeeGFS, Lustre
- Networking: RoCE, InfiniBand, DPU offload, gRPC, RDMA
- Programming/scripting: Python, YAML, Bash
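As a concrete example of the Kubernetes/GPU Operator work above, scheduling a containerized workload onto a GPU node comes down to a resource request the NVIDIA device plugin advertises. A minimal sketch; the pod name and image tag are illustrative, and it assumes the GPU Operator (or device plugin) is already installed on the cluster:

```yaml
# Minimal pod spec requesting one GPU via the nvidia.com/gpu resource.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # illustrative tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1  # scheduled only on nodes advertising GPUs
```

Running `nvidia-smi` inside the pod is a common smoke test that the driver, container toolkit, and device plugin are all wired up correctly.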