Talent.com
Site Reliability Engineer
Sitetracker • Toronto, Canada
No longer accepting applications
7 days ago
Job type
  • Full-time
Job description
The Opportunity

This is your chance to build a reliability practice from the ground up and establish the engineering standards (including SLOs, error budgets, and observability) that will protect our platform as we scale for enterprise customers and expand our AI capabilities. You’ll have the autonomy to set the strategy and the trust to execute it, ensuring that our AI workloads (Evals, RAG, and LLM processing) meet the highest reliability standards. If you are a proactive problem solver who treats toil as an engineering challenge and wants the agency to decide which technologies to adopt and when, you will find this to be a career-defining role.

What You'll Do

As a Staff or Senior Staff SRE, you’ll hit the ground running by partnering with the engineers currently managing reliability to transition the organization from reactive firefighting to a proactive, disciplined reliability practice. You will lead the deliberate evolution of our infrastructure, recognizing the inflection point for new tooling and leading migrations away from manual scripts and templates only when they’ve earned their keep. Whether you are architecting incident response structures or solving novel reliability problems for AI agents, your work will act as a multiplier that empowers the entire engineering team.

By bringing a consulting mindset to every challenge, you’ll propose technical trade-offs based on evidence rather than reflex, ensuring our roadmap for multi-region or service mesh adoption is built for tomorrow. You won't just be handed tasks; you will own the strategy for production-readiness and deploy safety, building the organizational trust needed to make reliability a core differentiator of our product.

The Skills You'll Have

Deep SRE Expertise

Define SLIs and SLOs for critical user journeys and use them to drive proactive engineering decisions.

Run live production incident response as an Incident Commander and lead blameless postmortems that result in shipped follow-up actions.

Build observability that tells a story: dashboards that explain a system's behavior to someone seeing it for the first time, paired with actionable alerts.

Take an organization from reactive firefighting to a working reliability practice with measurable improvements in paging volume.

Design error-budget policies and use them to make data-driven trade-offs between shipping features and maintaining reliability.

Deep Technical Expertise in AWS

Design and operate services on AWS competently: VPC, IAM, compute (ECS/EC2/Lambda), managed data services, and load balancing.

Navigate our current setup of CloudFormation and bash scripts via GitHub Actions effectively without reaching for Terraform reflexively.

Debug production AWS issues at the network and IAM level without escalating to AWS support as a first step.

Design and roll out production workloads across multiple regions and countries while accounting for data residency and regional failure modes.

Lead high-stakes tooling migrations into established environments and manage the long-term consequences of those architectural choices.

Impact, Leadership & Team Enablement

Mentor engineers through pair debugging, postmortem coaching, and runbook reviews to leave the team more capable.

Define alerts for impactful metrics and write the clear, actionable runbooks that go with them.

Work with engineering teams to gather requirements for new infrastructure and conduct constructive production-readiness reviews.

Teach teams how to build their own observability dashboards, raising the technical floor across the entire organization.

Use AI tooling aggressively, including coding agents and log analysis, to accelerate the delivery of impactful changes.

Communication & Influence

Communicate scheduled downtime and infrastructure changes to stakeholders proactively with clear timing and expected impact.

Write postmortems that both engineers and non-engineers can read, understand, and learn from.

Act as the recognized Subject Matter Expert for AWS-related questions across the engineering organization.

Influence product and engineering roadmap decisions by using data and evidence rather than opinion when reliability is a factor.

Build organizational trust so that teams seek out the SRE practice early in the development cycle to make their work better.

Within 90 Days, You'll

Fully onboard and partner with the engineers currently managing reliability to review and revise the existing operational plan.

Operationalize high-leverage items to transition the team out of reactive firefighting and into a more stable, proactive state.

Establish a baseline for current system behavior by identifying the most critical user journeys that require immediate SLI/SLO definitions.

Within 180 Days, You'll

Independently drive the revised reliability plan, ensuring SLIs/SLOs are in place and actively used to guide engineering decisions.

Standardize the incident response structure, including severity definitions, Incident Commander roles, and a cadence for blameless postmortems.

Measurably reduce paging volume and ensure that every alert that pages an engineer is backed by a clear, effective runbook.

Within 365 Days, You'll

Establish a mature reliability practice where production-readiness reviews and error-budget conversations are default parts of the development lifecycle.

Define a clear, evidence-based tooling roadmap for the next phase of our evolution, such as Terraform, service mesh, or multi-region expansion.

Serve as an organizational multiplier, having built the observability and culture necessary for other engineers to reason about reliability without constant supervision.

$97,000 - $149,200 a year

