Site Reliability Engineer

Tecsys Inc. • Toronto, ON, CA

30+ days ago

Job type

Permanent

Job description

Having recognized the advantages of remote work, including improved employee morale and productivity and the benefits of reduced commuting for employee wellbeing and the environment, we are proud to be a digital-first company. The technologies and programs in which we invested have provided a fantastic foundation to this end. Our digital-first work environment, together with our conveniently located offices and collaborative workspaces, provides our team with the freedom and flexibility to work in the way that makes our employees most productive.

About Us

Tecsys is a fast-growing innovator offering supply chain solutions to industry-leading healthcare systems, hospitals, and pharmacy businesses, as well as distributors, retailers, and 3PLs. We work with industry leaders to transform their supply chains through technology. If you thrive on tackling interesting challenges with continuous learning opportunities, then Tecsys could be a good fit for you!

About The Role

We are looking for a Site Reliability Engineer to join our Network and Security Operations Center (NOC), a team at the heart of platform reliability for mission-critical SaaS environments. You will help maintain, optimize, and ensure the reliability and performance of the systems that power our cloud infrastructure across AWS and Kubernetes, with a strong focus on automation, observability, and continuous improvement. This role blends reliability engineering with incident command, giving you real ownership over uptime, performance, and innovation. You will be part of a highly skilled team that values creative problem-solving, operational excellence, and continuous improvement through automation and resilience engineering.

Your Responsibilities

Collaborate with other Engineering teams to support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews
Innovate relentlessly: Identify pain points, propose creative solutions, and drive initiatives that simplify, scale, and strengthen the platform
Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
Own observability: Enhance and expand monitoring and alerting using Datadog; define SLOs/SLIs and create actionable dashboards that drive reliability outcomes
Drive automation: Develop and improve internal tooling, IaC frameworks, and pipelines (Terraform, GitLab CI/CD) to reduce manual intervention and enable self-healing systems
Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity
Be on-call
Practice sustainable incident response and blameless postmortems. Lead post-incident reviews (RCAs) and identify long-term fixes that improve stability, reliability, and developer experience
Implement monitoring, logging, alerting, and SLA reporting
Create and maintain technical documentation
Implement, maintain and mature SRE best practices
Lead incidents: Act as Incident Commander; coordinate cross-team response, manage communications, and ensure rapid service restoration
Provide support for our planning and deployment teams to enable stability, predictability, and scale in our continued growth
Collaborate with members of the Platform Engineering team to implement and support far-reaching strategic efforts, provide constructive feedback, and foster a collaborative environment
Work cross-functionally with internal teams and vendors to manage our growth around the globe, with a strong focus on maintaining the high level of performance, availability, and reliability for our users

Requirements

5+ years in Site Reliability, Cloud, or DevOps Engineering, ideally in SaaS or large-scale production environments

Experience designing and deploying large scale systems, multi-vendor platforms and globally distributed infrastructure

Proven experience managing cloud infrastructure in AWS (multi-account, VPC, EC2, EKS) and Kubernetes at scale

Strong hands-on experience with IaC and automation (Terraform, Ansible, or similar)

Familiarity with CI/CD pipelines and release automation (GitLab preferred, Jenkins acceptable)

Deep understanding of monitoring and observability using Datadog (or equivalent), including metric design, log pipelines, alerting, and dashboards

Experience with incident management, on-call participation, escalation, and structured postmortems

Scripting skills in Python, Bash, Java or equivalent for automation and diagnostics

Curiosity, ownership, and a bias for action; you see a problem, you solve it, and you share the lessons learned

Experience with FedRAMP compliance is a strong asset

Basic knowledge of Java- or .NET-based development required

Strong English communication skills, both written and spoken, are essential for effective correspondence with customers, business partners and colleagues beyond the province of Quebec

Additional requirements

Escalation on-call rotation

Occasional travel (quarterly offsites and conferences; less than 10%)

At Tecsys, we are committed to fostering a diverse and inclusive workplace where all employees feel valued, respected, and empowered. We believe that diversity drives innovation and strengthens our ability to deliver exceptional solutions. We welcome and encourage applicants from all backgrounds, experiences, and perspectives to join our team.

Tecsys is an equal opportunity employer. Accommodation is available for applicants selected for an interview.

NB: If you are applying to this position, you must be a Canadian citizen or a permanent resident of Canada, or have a valid Canadian work permit.
