Job Title : Site Reliability Engineer
Location : Ottawa, Ontario
Pay Range : $85
What's the Job?
- Build, operate, and continuously improve a reliable on-premises development and testing environment, including integration of AI/GPU servers.
- Design and implement pipelines for continuous verification and validation to ensure software quality throughout the development lifecycle.
- Develop and maintain monitoring, alerting, and observability systems to enable proactive issue detection and resolution.
- Automate infrastructure provisioning, configuration, and lifecycle management using Infrastructure as Code (IaC) practices.
- Collaborate with cross-functional teams to embed reliability, security, and quality into system design and operations.
What's Needed?
- Experience working with high-assurance or regulated systems, such as defense, government, or critical infrastructure.
- Strong expertise with Kubernetes orchestration and Linux system administration.
- Proficiency in designing and operating CI/CD pipelines, including automation of testing and verification processes.
- Hands-on experience with infrastructure automation tools like Terraform, Ansible, or GitOps.
- Knowledge of monitoring and observability platforms such as Prometheus, Grafana, ELK, or OpenSearch.
What's in it for me?
- Opportunity to work on cutting-edge AI/ML workloads and GPU-enabled infrastructure.
- Engagement in a collaborative and innovative environment focused on reliability and security.
- Chance to develop skills in high-assurance and air-gapped environments.
- Work with a dedicated team committed to continuous improvement and excellence.
- Contribute to impactful projects supporting critical systems and infrastructure.
If this is a role that interests you and you'd like to learn more, click apply now and a recruiter will be in touch with you to discuss this great opportunity. We look forward to speaking with you!