Role overview
Job ID 2511322
**Location**
REMOTE WORK, VA, US
**Date Posted**
2025-11-04
**Category**
Engineering and Sciences
**Subcategory**
Solutions Archt
**Schedule**
Full-time
**Shift**
Day Job
**Travel**
No
**Minimum Clearance Required**
None
**Clearance Level Must Be Able to Obtain**
Public Trust
**Potential for Remote Work**
Yes
**Description**
We are seeking a versatile **SRE/MLOps Engineer with DevSecOps expertise** to design, automate, and operate secure, scalable, and repeatable **model deployment workflows** across the AI/ML Common Services environment. This role bridges **infrastructure reliability, CI/CD automation, and model operations**, enabling IRS mission teams to move from experimentation to production with confidence.
The engineer will not only support **ML lifecycle operations** (Databricks, MLflow, AWS SageMaker/Bedrock) but also bring **DevSecOps rigor** to ensure compliance, monitoring, and infrastructure-as-code are embedded in every step. By partnering with Infrastructure, Security, and Architecture teams, this role ensures the AAP environment is **resilient, automated, and compliance-ready** at enterprise scale.
**Key Responsibilities**
* Enable secure, scalable, and repeatable deployment workflows for both ML models and supporting infrastructure.
* Build and maintain runtime environments, service accounts, and orchestration logic for Databricks, MLflow, and AWS AI services.
* Implement and maintain CI/CD pipelines (Bitbucket, Bamboo, Jenkins, or equivalent) for code, data, and model deployments.
* Apply DevSecOps practices — integrating security scans, compliance checks, and audit logging into deployment pipelines.
* Collaborate with the Infrastructure DSO and Solutions Architect to integrate Terraform-based IaC for consistent, automated provisioning.
* Implement observability, alerting, and logging (CloudWatch, Datadog, Prometheus) to monitor both application and ML workloads.
* Align infrastructure with ML lifecycle needs — including staging, promotion, rollback, retraining, and compliance-aware tracking.
* Develop automation templates, reusable workflows, and guardrails to accelerate onboarding of mission team models while ensuring security.
* Contribute to incident response, performance tuning, and reliability engineering across ML and non-ML workloads.
**Qualifications**
**Required Qualifications**
* Bachelor’s or Master’s degree in Computer Science, Data Engineering, or a related technical discipline.
* 5+ years of experience in Site Reliability Engineering, DevOps, or MLOps with production-grade systems.
* Must be a U.S. Citizen with the ability to obtain and maintain a Public Trust security clearance.
* Hands-on experience with Databricks, MLflow, or AWS SageMaker/Bedrock for ML model lifecycle operations.
* Strong proficiency in Terraform, CI/CD pipelines, and container orchestration (Docker, Kubernetes).
* Experience implementing security automation (e.g., IaC scanning, container security, SAST/DAST tools) within CI/CD workflows.
* Solid understanding of observability stacks (logs, metrics, tracing) and operational best practices.
**Desired Skills**
* Active IRS clearance highly desired.
* Experience in federal or regulated environments with security, audit, and compliance requirements (FedRAMP, NIST 800-53).
* Knowledge of Trustworthy AI monitoring (bias detection, drift monitoring, explainability).
* Familiarity with Unity Catalog, Delta Lake, and data pipeline orchestration in Databricks.
* Hands-on experience with Zero Trust security models and secure boundary implementations.
* Relevant certifications such as:
  + Databricks Certified Machine Learning Professional.
  + AWS DevOps Engineer – Professional.
  + Certified Kubernetes Administrator (CKA).
  + Security+ or equivalent security cert.
Target salary range $120,001 - $160,000. The estimate displayed represents the typical salary range for this position based on experience and other factors.