Role overview
Greetings
We're seeking a hands-on MLOps Solution Architect to design and implement scalable, secure, and cost-effective ML platforms on AWS. You'll lead the end-to-end architecture for model training, CI/CD pipelines, deployment strategies, monitoring, and governance across teams of data scientists and engineers.
Location: Toronto, ON (Hybrid, 3 days onsite per week)
Client: One of the largest banks in Canada
Duration: Long-term contract
What we're looking for
- Architect MLOps frameworks using AWS SageMaker, EKS, ECR, CodePipeline, and Step Functions
- Design pipelines for data prep, training, evaluation, registry, and automated deployment
- Integrate MLflow or SageMaker Model Registry for model tracking and lifecycle management
- Implement model serving strategies: batch, online, A/B, shadow, and canary rollouts
- Set up monitoring with CloudWatch, Evidently AI, Prometheus, or WhyLabs
- Establish governance: lineage, audit trails, model approvals, and access controls (IAM, KMS)
- Drive standardization across MLOps templates and Infrastructure as Code (Terraform or CloudFormation)
- Collaborate with Data Engineering and DevOps to align ML pipelines with enterprise architecture
Must-Have Skills
- 14+ years of experience in ML/AI platform design and data infrastructure
- Deep expertise in AWS services:
  - Compute: EC2, EKS, Batch, Lambda
  - Storage: S3, Lake Formation, Glue Catalog
  - Pipeline: Step Functions, CodePipeline, Airflow
  - Training/Serving: SageMaker (Studio, Training, Model Registry, Endpoints)
  - Monitoring: CloudWatch, CloudTrail, Prometheus
  - Security: IAM, Secrets Manager, KMS, VPC
- Proficient in Python and infrastructure scripting (Terraform, CloudFormation)
- Experience building and deploying models in production environments (CI/CD)
- Familiar with data versioning (DVC, Delta Lake) and experiment tracking (MLflow)
- Strong understanding of containerization (Docker, EKS) and Kubernetes-based serving
- Excellent communication and stakeholder management
- Knowledge of Generative AI and LLM deployment using AWS Bedrock or custom endpoints
- Familiarity with event-driven pipelines using SNS/SQS or Kinesis
- Model performance optimization with GPU instances and autoscaling
- Cost governance and monitoring for ML workloads
- Experience in financial or regulated industries (governance, model risk)
Best Regards,