Senior Lead Site Reliability Engineer

IFS • Vancouver, British Columbia, Canada
12 hours ago
Job type
  • Full-time
  • Permanent
Job description

Role Overview

  • As a Senior Lead Site Reliability Engineer (SRE) specializing in Azure, you will be a hands-on technical owner of our cloud infrastructure. You will architect, build, and operate the systems that underpin our Azure-based SaaS offerings, with ownership of reliability, scalability, and security from the infrastructure layer up. You will work in close partnership with R&D to embed operational excellence into the software delivery lifecycle, and you will take full ownership of every system within Cloud Operations' purview. You bring deep Azure and DevOps expertise, thrive in complex distributed environments, and raise the technical bar through the quality of your engineering work.

Key Responsibilities

  • Design, implement, and continuously improve Azure-based infrastructure for high-availability, mission-critical SaaS services — owning the full lifecycle from architecture through to production operation.
  • Own, operate, and continuously improve CI/CD pipelines across Jenkins, Azure DevOps, and GitHub Actions — including pipeline architecture, build performance, deployment reliability, secrets handling, and migration work as we evolve our toolchain. This is active ownership, not support.
  • Configure and maintain Ansible playbooks for configuration management, provisioning automation, and drift remediation across the infrastructure estate.
  • Build and maintain Infrastructure as Code using Terraform and/or ARM/Bicep, covering the full provisioning lifecycle — from initial environment build through to day-two operations and ongoing change management.
  • Work directly and continuously with R&D engineering teams to embed reliability, operability, and deployment quality into the software development lifecycle — including pipeline design reviews, pre-production environment ownership, release readiness, and incident learnings fed back into build practices.
  • Own the observability and alerting stack across Azure Monitor, Log Analytics, Application Insights, Prometheus, Grafana, and Pingdom — including metric collection, synthetic monitoring coverage, alerting thresholds, and dashboard design. Own the PagerDuty configuration end-to-end: escalation policies, routing rules, service integrations, and on-call schedule management. Act as the technical escalation point for complex incidents and participate in the team's on-call rotation.
  • Design, operate, and optimize AKS clusters for production workloads — including node pool configuration, autoscaling, network policy, ingress architecture, workload identity, and persistent storage patterns. Own cluster health, upgrade lifecycle, and capacity planning end-to-end.
  • Instrument Kubernetes workloads with Prometheus exporters and build Grafana dashboards that give engineering teams genuine operational visibility into service health, latency, error rates, and resource consumption.
  • Take full technical ownership of all systems within Cloud Operations' scope — infrastructure, tooling, pipelines, observability, and security controls. If it lives in our environment, you own its reliability, its documentation, and its improvement roadmap.
  • Lead root cause analysis on production incidents; author post-mortems with actionable engineering remediation, not just process changes.
  • Define, instrument, and own SLOs, SLIs, and error budgets for Azure-hosted SaaS services; use data to drive reliability investment decisions.
  • Engineer and enforce security controls across identity, access, secrets, and certificate management in Azure — including hands-on implementation, not just policy definition. Contribute directly to the technical controls, evidence collection, and continuous compliance posture required to maintain SOC 2 Type II, ISO 27001, and ISO 9001 certification across the Cloud Operations environment.
  • Evaluate emerging Azure services and features against real production requirements; build proof-of-concepts, validate at scale, and drive adoption where the engineering case is clear.
  • Produce and maintain architecture documentation, runbooks, and operational playbooks that are technically precise enough for an on-call engineer to execute under pressure — and meet the documentation standards required under our ISO 9001 quality management obligations.

Qualifications

Required

  • 7+ years in SRE, Cloud Operations, or DevOps roles, with at least 4 years of hands-on Microsoft Azure focus.
  • Deep expertise across Azure services including App Services, AKS, Azure SQL, Storage, Networking, Security Center, and Monitor.
  • Hands-on experience building, maintaining, and improving CI/CD pipelines in Jenkins, Azure DevOps, and GitHub Actions — including real ownership of pipeline failures, performance, and evolution, not just consumption.
  • Working experience with Ansible for configuration management and infrastructure automation.
  • Production-grade Kubernetes/AKS experience — cluster operations, workload troubleshooting, RBAC, network policies, Helm, and upgrade management in a live SaaS environment.
  • Hands-on experience with Prometheus and Grafana in a production context — metric instrumentation, alerting rule design, and dashboard development, not just consumption.
  • Experience with Pingdom for synthetic monitoring and PagerDuty for incident alerting and on-call management — including configuration of escalation policies, alert routing, and participation in a 24/7 on-call rotation.
  • Strong scripting and automation skills in PowerShell, Python, Bash, or equivalent — with a track record of using code to eliminate operational toil.
  • Proven, production-grade experience with Infrastructure as Code using Terraform and/or ARM/Bicep.
  • Advanced troubleshooting ability across distributed systems, network layers, and application performance in Azure — comfortable owning a complex outage end-to-end.
  • Demonstrated ability to work closely and effectively with software development teams — contributing to SDLC processes, pipeline standards, and release quality as a technical peer, not a service desk.
  • Strong working knowledge of security protocols, certificate lifecycle management, secrets management, and compliance controls in Azure — including practical experience supporting or maintaining SOC 2 Type II, ISO 27001, or ISO 9001 audits in an infrastructure or cloud operations context.
  • Demonstrated experience leading incident response and driving post-mortem remediation to completion.

Preferred

  • Azure certifications (Azure Solutions Architect, Azure DevOps Engineer Expert, or equivalent).
  • Experience with hybrid or multi-cloud environments, including AWS.
  • Familiarity with Azure cost management tooling and hands-on optimisation work.
  • Experience operating large-scale SaaS platforms with multi-tenant infrastructure.
  • Experience with Grafana alerting, Grafana OnCall, or similar on-call routing tooling.


Additional Information

What We’re Offering

  • Salary Range: $133k to $151k CAD
  • Permanent, Full-time

Use of Artificial Intelligence in Recruitment
As part of our recruitment process, we may use automated tools, including artificial intelligence, to help screen and assess applications based on job‑related criteria such as skills, experience, and qualifications.
These tools do not make hiring decisions. All employment decisions are reviewed and made by members of our hiring team.

We embrace flexibility and hybrid work opportunities to support diverse needs and lifestyles, while also valuing inclusive workplace experiences. By fostering a sense of community, we drive innovation, strengthen connections, and nurture belonging. Our commitment ensures you can work in a way that suits you best, while also engaging with colleagues to share ideas and build meaningful relationships.
