Site Reliability Engineering Manager

CarltonOne • Markham, York Region, CA

7 days ago

Job type

Full-time

Job description

CarltonOne is a global B2B technology leader, and part of the Goldman Sachs portfolio, helping organizations around the world reward and inspire exceptional people. Our solutions empower employees to be more productive, sales teams to perform at their best, and customers to stay engaged and loyal.

Our platform powers the global engagement industry, enabling companies to deliver impactful employee recognition, customer loyalty, rewards, sales, and channel incentive programs. We partner with over 450 clients, 500 vendors, and serve 14 million members across 185 countries.

Beyond engagement, every CarltonOne solution drives our eco-action mission: funding tree planting to help restore the planet. To date, we’ve funded over 20 million trees and are on track to plant millions more each year. Learn more at carltonone.com.

About the Opportunity

We are seeking a strategic and technically adept SRE Manager to lead our Site Reliability Engineering team. This role is pivotal in ensuring the reliability, scalability, and performance of our cloud‑native infrastructure and services. You will guide a team of SREs, collaborate cross‑functionally with DevOps, Security, and Engineering, and champion best practices in observability, incident response, and automation.

Responsibilities

Lead, mentor, and grow a team of Site Reliability Engineers, fostering a culture of ownership, continuous learning, and operational excellence
Define and drive SRE strategy, including Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budget management
Collaborate with cross‑functional teams (Engineering, DevOps, Security, Product) to align reliability goals with business objectives
Build and maintain strong relationships with stakeholders across the organization

Reliability & Incident Management

Establish and continuously improve the end‑to‑end incident management lifecycle, from detection through post‑incident review

Lead coordination of incident response efforts across engineering, DevOps, and support teams during major outages

Implement and maintain runbooks and playbooks for common incident scenarios

Facilitate blameless postmortems to identify root causes, document findings, and ensure follow‑up actions are completed

Track and report on incident metrics (MTTR, MTTD, frequency, severity) to identify trends and drive continuous improvement

Drive automation initiatives to reduce toil, eliminate manual effort, and improve system resilience

Monitoring, Observability & Performance

Design and implement comprehensive monitoring and observability strategies using industry‑leading tools including Datadog, Grafana, CloudWatch, and Prometheus

Deploy and optimize cloud security monitoring using Rapid7 InsightCloudSec and Wiz for threat detection and compliance

Leverage Cloudflare for edge performance monitoring and DDoS protection

Establish actionable alerting systems with proper thresholds and escalation paths

Analyze performance, availability metrics, and capacity trends to proactively identify and resolve issues

Create and maintain dashboards that provide visibility into system health and business‑critical metrics

Operational Excellence & Cloud Infrastructure

Lead root cause analysis for recurring issues and implement long‑term preventative solutions

Optimize cloud resource usage and costs through automation, right‑sizing, and performance tuning

Oversee disaster recovery planning and testing to meet Recovery Point Objective (RPO) and Recovery Time Objective (RTO) requirements

Implement and maintain Infrastructure‑as‑Code (IaC) practices using Terraform, CloudFormation, and Helm

Champion security best practices including RBAC, IAM policies, encryption, and vulnerability management

Drive capacity planning initiatives to ensure infrastructure scales with business growth

Qualifications

Bachelor’s degree in Computer Science, Engineering, or a related field

7+ years of experience in cloud infrastructure, DevOps, or SRE roles, with 2+ years in a leadership capacity

Proven experience managing incident response and reliability programs at scale

Deep expertise in AWS services (EKS, EC2, S3, VPC, IAM, RDS Aurora, Lambda)

Strong background in Kubernetes, container orchestration, and service meshes

Proficiency in Infrastructure‑as‑Code (Terraform, CloudFormation, Helm)

Experience with CI/CD pipelines and automation (Bamboo, Jenkins, Ansible)

Solid understanding of networking concepts (TCP/IP, DNS, load balancing, CDN)

Familiarity with monitoring and observability platforms (Datadog, Grafana, CloudWatch)

Excellent communication, stakeholder management, and cross‑functional collaboration skills

Strong incident management and crisis leadership capabilities

Strategic thinking with a focus on long‑term reliability and scalability goals

Nice to Have

AWS Certified Solutions Architect or SRE‑related certifications (SRE Practitioner, CKA, CKAD)

Experience with ITIL or other incident management frameworks

Solid understanding of security frameworks and tools (RBAC, IAM, KMS, Wiz, Rapid7)

Experience with multi‑cloud environments (Azure, GCP)

Familiarity with Cloudflare, Ubuntu Server, VMware vSphere, and on‑premises hosting

Experience with observability tools such as OpenTelemetry, Honeycomb, or New Relic

Familiarity with chaos engineering principles and tools (Chaos Monkey, Gremlin)

Background in high‑scale, high‑availability systems (99.99%+ uptime SLOs)

Perks

Competitive salary and benefits package

Health, dental, and vision coverage

Access to our employee benefits portal for exclusive discounts

Monthly company‑wide events, celebrations, and team activities

Bravo reward points program for recognition and appreciation

Convenient office location close to public transit

How to Apply

If this great opportunity looks rewarding to you, let’s connect. Our online application will give you the option to apply to this role directly.

We value diversity and inclusion and encourage all qualified people to apply. If we can make this easier through accommodation in the recruitment process, or if you need assistance to accommodate a disability, please contact us with the “Help” button in the application.

We will review applications, with priority given to those who have completed the assessment, and look forward to hearing from you.

Seniority level

Mid‑Senior level

Employment type

Full‑time

Job function

Information Technology

Industries

IT Services and IT Consulting
