Site Reliability Engineer

Actively hiring Posted over 4 years ago 2 min read

Role overview

Site reliability engineers are responsible for the overall performance and reliability of Kronos’ machine learning and trading infrastructure. Some call it DevOps or MLOps. Site reliability engineers design and implement the tools that automate building reliable and performant systems.

You will join a team managing hundreds of servers across tens of data centers, in a hybrid of on-premises and cloud environments. Your job is to keep the infrastructure stable through redundancy and failover, both achieved by automation.

What you'll work on

Automation. Site reliability engineers are obsessed with automation and tooling
Deployment and change management, including canary and release processes, for Kronos’ machine learning and trading infrastructure
Drive efficiencies in systems and processes: capacity planning, configuration management, performance tuning, and monitoring
Availability, performance, efficiency & scaling
Incident response, including on-call duty and a comprehensive postmortem process

What we're looking for

Experience in building and deploying machine learning models
Understanding of Unix/Linux systems from kernel to shell and beyond, taking in system libraries, file systems, and client-server protocols along the way
Networking: knowledge and understanding of network theory, such as different protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, OSI layers, and load balancing
Experience migrating production databases, or knowledge of how to minimize service downtime during a migration
Interest in trading and financial markets

Tags & focus areas

Dev Sys Admin Python Docker AWS GCP
