DealerCX

DevOps / MLOps Engineer

DealerCX · US · $120k - $150k

Actively hiring · Posted 16 days ago

The Role

You'll own the entire deployment pipeline and model serving infrastructure. This is a hybrid DevOps + MLOps role – you'll ensure our application deploys reliably AND that our AI models (both frontier and local) serve efficiently.

Our cost optimization strategy requires routing between expensive frontier models (Claude, GPT) and cost-effective local models (Llama, Mistral) based on task complexity. You'll build and own this infrastructure.
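A router like this can start as a heuristic complexity score gating model choice. A minimal Python sketch — the model names, keywords, and threshold are illustrative placeholders, not DealerCX's actual implementation:

```python
# Toy complexity-based model router: cheap local model by default,
# frontier model for tasks that look complex. All names are placeholders.

FRONTIER_MODEL = "claude-sonnet"  # expensive API model (illustrative name)
LOCAL_MODEL = "llama-8b"          # cheap self-hosted model (illustrative name)

def estimate_complexity(prompt: str) -> float:
    """Toy heuristic: longer prompts and reasoning keywords score higher."""
    score = min(len(prompt) / 2000, 1.0)
    if any(k in prompt.lower() for k in ("analyze", "plan", "multi-step")):
        score += 0.5
    return min(score, 1.0)

def route(prompt: str, threshold: float = 0.5) -> str:
    """Send complex tasks to the frontier model, everything else local."""
    return FRONTIER_MODEL if estimate_complexity(prompt) >= threshold else LOCAL_MODEL
```

A production version would likely replace the heuristic with a learned classifier or per-task-type rules, but the abstraction boundary — one `route()` call hiding model selection — stays the same.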

What you'll work on

DevOps

  • CI/CD pipelines – Automated build, test, and deploy on every push
  • Infrastructure as code – Terraform/Pulumi for reproducible environments
  • Monitoring & alerting – Know when things break before customers do
  • Incident response – Own uptime and reliability
  • Daily deploys – Enable the team to ship to production every day safely
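Since the current stack already uses GitHub Actions and Docker, a deploy-on-every-push pipeline could start as a workflow like the sketch below. The job names, Docker-based test step, and deploy script are assumptions for illustration, not the team's actual workflow:

```yaml
name: ci
on:
  push:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t app .
      - run: docker run --rm app pytest   # gate: tests must pass
  deploy:
    needs: test                           # only deploy if tests passed
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh          # placeholder deploy step
```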

MLOps

  • Model serving infrastructure – Deploy and serve LLMs (local and API-based)
  • Model router – Build the abstraction layer that routes requests to appropriate models
  • GPU infrastructure – Manage inference servers for local models (Llama, Mistral)
  • Cost optimization – Track and optimize model usage costs
  • Model versioning – Safe rollouts and rollbacks for prompt/model changes
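Cost tracking at the request level is mostly bookkeeping over token counts. A toy Python sketch — the per-token rates are placeholders, not real vendor pricing:

```python
# Toy per-request cost tracker. Rates are USD per 1K tokens as
# (input, output) pairs; the numbers are illustrative, not real pricing.

RATES = {
    "frontier": (0.003, 0.015),
    "local": (0.0001, 0.0001),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request from token counts and the model's rate card."""
    rate_in, rate_out = RATES[model]
    return input_tokens / 1000 * rate_in + output_tokens / 1000 * rate_out
```

Aggregating these per model, per customer, and per task type is what makes the frontier-vs-local routing decision measurable rather than guesswork.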

Platform

  • Developer experience – Make the team faster through better tooling
  • Scaling – Prepare infrastructure for growth

Security (Critical)

  • Infrastructure security – Server hardening, network security, firewall configuration, VPC design
  • Secrets management – Vault, AWS Secrets Manager, or similar; no secrets in code
  • Access control – IAM policies, least-privilege principles, SSO integration
  • Vulnerability scanning – Automated scanning in CI/CD, dependency audits, container scanning
  • Intrusion detection – CloudTrail, GuardDuty, or similar; alert on suspicious activity
  • Encryption – Data at rest and in transit; key management
  • Incident response – Work with fractional CISO to implement detection, containment, and recovery procedures
  • Compliance – Support audits and maintain security documentation
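Least-privilege access control here means each service gets only the actions it needs on the resources it owns. An illustrative IAM policy — account ID, region, and secret path are placeholders — that lets an application read only its own secrets from AWS Secrets Manager:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/app/*"
    }
  ]
}
```

Scoping `Resource` to a path prefix like this, rather than `*`, is what keeps a compromised service from reading every secret in the account.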

Quality & Testing Infrastructure

  • CI/CD quality gates – Automated tests run on every push; bad code doesn't deploy
  • Test environment management – Staging environments that mirror production
  • LLM output monitoring – Track hallucinations, wrong tool calls, response quality in production
  • Security scanning – Automated vulnerability scanning in CI pipeline
  • Alerting & anomaly detection – Detect failures and regressions before customers do
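Monitoring for wrong tool calls can start with simple schema checks against a registry of allowed tools. A toy Python sketch — the tool names and required arguments are invented for illustration:

```python
# Toy LLM tool-call monitor: flag calls to unknown tools or calls missing
# required arguments. The registry below is illustrative, not a real schema.

ALLOWED_TOOLS = {
    "lookup_vehicle": {"vin"},
    "schedule_service": {"vin", "date"},
}

def check_tool_call(name: str, args: dict) -> list[str]:
    """Return a list of problems with this tool call (empty list = OK)."""
    if name not in ALLOWED_TOOLS:
        return [f"unknown tool: {name}"]
    missing = ALLOWED_TOOLS[name] - args.keys()
    return [f"missing arg: {a}" for a in sorted(missing)]
```

In production these checks would feed counters and alerts in the observability stack, so a spike in malformed tool calls after a prompt or model change pages someone before customers notice.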

Tech Stack

Current

  • Cloud: AWS (EC2, RDS, S3, Lambda)
  • Containers: Docker
  • CI/CD: GitHub Actions
  • Database: PostgreSQL (RDS)
  • Caching: Redis

You'll Build

  • Model serving: vLLM, Ollama, or similar for local inference
  • GPU compute: AWS/GCP GPU instances or dedicated inference providers
  • Model routing: Custom abstraction layer for model selection
  • Observability: Datadog, Grafana, or similar for unified monitoring

What we're looking for

  • MLOps experience – Model deployment, serving, monitoring
  • GPU infrastructure – Managing inference workloads
  • Experience with LLM serving (vLLM, TGI, Ollama)
  • Kubernetes experience
  • Cost optimization mindset
  • Experience serving both frontier APIs and local models
  • LangChain/LangSmith or similar LLM observability
  • Startup experience – comfort with ambiguity and speed
  • Texas location

Tags & focus areas

Full-time · Remote · AI · MLOps · Generative AI