Talent.com
Site Reliability Engineer

Scotiabank • Toronto, ON, CA
30+ days ago
Job type
  • Full-time
Job description

Requisition ID: 251796

Join a purpose-driven winning team, committed to results, in an inclusive and high-performing culture.

We’re looking for an SRE with deep experience in production observability and incident response to raise the reliability and transparency of our customer-facing services. You will own the end-to-end observability stack across Dynatrace, Splunk, Power BI, and Google Cloud (GCP) Monitoring, drive proactive detection and reduction of toil, and lead major incident response. This role focuses on operational excellence and service health, NOT on platform engineering or DevOps provisioning.

Is this role right for you? In this role you will:

  • Design and maintain end-to-end monitoring for critical services using Dynatrace (APM, Real User Monitoring, Synthetic, Davis AI, Smartscape) and GCP Cloud Monitoring (metrics, alerting policies, SLOs/SLIs, uptime checks, dashboards).
  • Build service maps, dependency models, and problem detection in Dynatrace; tune Davis AI problem rules and reduce alert noise through thresholds, baselining, and tagging.
  • Implement SLOs/SLIs with error budgets; continuously review burn rates and align alerting to customer impact.
  • Partner with application teams to instrument code paths (e.g., Dynatrace OneAgent), trace distributed transactions, and validate golden signals (latency, traffic, errors, saturation).
  • Create and optimize Splunk data models, indexes, sourcetypes, ingestion pipelines, and SPL searches; build actionable dashboards for NOC/SRE/Engineering.
  • Develop operational analytics and executive reporting in Power BI (data modeling, DAX/Measures, scheduled refresh) to track reliability KPIs, incident trends, MTTR/MTTD, SLO compliance, and capacity signals.
  • Establish governance for data quality, field extractions, and retention to ensure fast, accurate investigations.
  • Lead incident response (Sev1/Sev2): run bridges, coordinate SMEs, communicate status/timelines, drive mitigation and customer updates.
  • Maintain runbooks, decision trees, and standard operating procedures; ensure blameless post-incident reviews (PIRs) with clear RCA, corrective actions, and preventative measures.
  • Track and close problem tickets tied to recurring failure modes; verify effectiveness of fixes via metrics and error budgets.
  • Use light coding/scripting to automate recurring tasks: alert tuning, data enrichment, log parsing, playbook triggers, service health checks.
  • Build small utilities or bots for on-call workflows (e.g., auto-triage, context gathering, incident timelines).
  • Contribute to observability standards and best practices (naming, tags, SLIs, alert policies), and mentor teams on instrumenting for reliability.

Note: This role does NOT manage CI/CD, infrastructure provisioning, or platform build (Terraform/Kubernetes cluster ops). Collaboration with those teams is expected, but ownership remains on monitoring, analytics, incident response, and reliability outcomes.

Do you have the skills that will enable you to succeed in this role? We’d love to work with you if you have:

  • 5+ years in SRE/Production Operations/Observability with Dynatrace and Splunk in high-availability environments.
  • Hands-on experience with GCP operations: Cloud Monitoring, Cloud Logging, Alerting Policies, Uptime Checks, SLOs/SLIs; familiarity with Error Reporting/Trace is a plus.
  • Strong SPL (Splunk) and Dynatrace (APM/RUM/Synthetic) expertise, including alert design, dashboards, and noise reduction.
  • Power BI proficiency: data modeling, DAX measures, role-level security, and scheduled refresh for operational/Exec reporting.
  • Proven incident commander experience for Sev1/Sev2 with clear comms, stakeholder management, and PIR facilitation.
  • Coding/scripting for automation and data manipulation (e.g., Python or PowerShell; Go/Bash a plus).
  • Solid understanding of service reliability concepts: golden signals, SLOs/error budgets, capacity and saturation, graceful degradation.
  • Strong analytical mindset with a bias to measurable outcomes (MTTD/MTTR, alert volume, SLO compliance).

What's in it for you?

  • Diversity, Equity, Inclusion & Allyship - We strive to create an inclusive culture where every employee is empowered to reach their fullest potential, respected for who they are, and embraced through bias-free practices and inclusive values across Scotiabank. We embrace diversity and provide opportunities for all employees to learn, grow & participate through our various Employee Resource Groups (ERGs) that span across diverse gender identities, ethnicity, race, age, ability & veterans.
  • Accessibility and Workplace Accommodations - We value the unique skills and experiences each individual brings to the Bank, and are committed to creating and maintaining an inclusive and accessible environment for everyone. Scotiabank continues to locate, remove and prevent barriers so that we can build a diverse and inclusive environment while meeting accessibility requirements.
  • Upskilling through online courses, cross-functional development opportunities, and tuition assistance.
  • Competitive Rewards program including bonus, flexible vacation, personal and sick days, and benefits that start on day one.
  • Dynamic Ecosystem - Free tea & coffee, universal washrooms, and lots of space for team collaboration.
  • Community Engagement - No matter where you choose to work from, we offer opportunities for community engagement & belonging through our various programs.

Location(s): Canada : Ontario : Toronto

Scotiabank is a leading bank in the Americas. Guided by our purpose: "for every future", we help our customers, their families and their communities achieve success through a broad range of advice, products and services, including personal and commercial banking, wealth management and private banking, corporate and investment banking, and capital markets.

At Scotiabank, we value the unique skills and experiences each individual brings to the Bank, and are committed to creating and maintaining an inclusive and accessible environment for everyone. If you require accommodation (including, but not limited to, an accessible interview site, alternate format documents, ASL Interpreter, or Assistive Technology) during the recruitment and selection process, please let our Recruitment team know. If you require technical assistance, please click here. Candidates must apply directly online to be considered for this role. We thank all applicants for their interest in a career at Scotiabank; however, only those candidates who are selected for an interview will be contacted.
