Join our SRE squad supporting ~1000 AWS-hosted services for BMO. You’ll own operational reliability, rapid triage, and proactive maintenance across production and non-prod, partnering closely with Cloud Engineering, SOC, and application teams.
Key Responsibilities
Deliver 24×7 monitoring, incident response, and problem management; drive MTTA / MTTR reduction and SLO / SLI adherence.
Perform preventive health checks; analyze ticket trends to drive continual service improvement (CSI) and automate away toil.
Conduct blameless postmortems and produce high-quality RCAs; maintain SOPs / runbooks and reliability dashboards.
Configure / tune observability (Dynatrace, CloudWatch, ELK); enable self-healing workflows and workload optimizations.
Support change / service requests within agreed SLAs; collaborate during transitions and onboard new AWS services.
Core Skills & Tools
AWS:
Lambda, ECS / Fargate / EC2, API Gateway, SNS / SQS, Kinesis, RDS; IAM / KMS foundations.
Observability & ITSM:
Dynatrace, CloudWatch, ELK; ServiceNow for incidents / changes; SLI / SLO dashboards.
Reliability Practices:
Error budgets, capacity / performance benchmarking, automation / runbook execution, FinOps awareness.
Qualifications
5+ years in SRE / DevOps or L2 operations for cloud-native stacks; strong AWS production experience.
Proven incident / change / problem management in 24×7 environments; adept at RCA and postmortems.
Hands‑on with observability tooling and operational automation; excellent collaboration and documentation skills.
Shift Coverage & Locations
Follow-the-sun model with overlapping handoffs across Canada / India to ensure continuous support. Success is measured by uptime, MTTR / MTTD, change failure rate, error‑budget consumption, SLO adherence, RCA quality, and CSI throughput.
Site Reliability Engineer • Toronto, Canada