No longer accepting applications
Senior Manager, Site Reliability Engineering

Tubi Tv • Toronto, Canada
10 days ago
Job type
  • Full-time
Job description
Overview

About Tubi: Boldly built for every fandom, Tubi is a free streaming service that entertains over 100 million monthly active users. Tubi offers the world's largest collection of Hollywood movies and TV shows, thousands of creator-led stories and hundreds of Tubi Originals made for the most passionate fans. Headquartered in San Francisco and founded in 2014, Tubi is part of Tubi Media Group, a division of Fox Corporation.

About the Role:

Site Reliability Engineering (SRE) at Tubi is not a traditional operations team. We are a software engineering organization that applies a developer's mindset and toolkit to the challenges of building and running large-scale, distributed systems. Our mission is to engineer resilience from the ground up, enabling our product teams to innovate rapidly while ensuring our users have a stellar experience. We own the availability, latency, performance, and capacity of our platform, and we achieve our goals through a culture of data-driven decision-making, blameless learning, and relentless automation.

We are seeking an experienced and visionary Senior SRE Manager to lead and grow our newly built Site Reliability Engineering team. You are more than a people manager or a tech lead; you are the strategic leader responsible for architecting our reliability roadmap. You will build and mentor a team of talented engineers, foster a culture of blameless learning and continuous improvement, and champion the engineering practices that allow us to balance rapid innovation with rock-solid stability. You will be a key influencer in our engineering leadership, partnering with peers across the organization to ensure reliability is a shared responsibility and a core tenet of our engineering culture.

What You'll Do

  • Lead, mentor, and grow a team of Site Reliability Engineers.
  • Foster a culture of innovation and technical excellence where engineers feel empowered to do their best work.
  • Provide personalized coaching, create professional development plans, and guide the careers of senior and emerging talent within the team.
  • Establish equitable, sustainable on-call practices (including global coverage where applicable) that protect focus time and avoid burnout.
  • Define team rituals - runbook reviews, game days, and incident retros - that reinforce quality and learning.

Strategic Planning & Vision:

  • Define and drive the multi-year technical strategy and vision for Tubi’s observability and automation platforms.
  • Partner with the infra lead to align Tubi’s infrastructure & SRE roadmap.
  • Partner with tech leaders to align the SRE roadmap with business objectives.
  • Champion a data-driven approach to reliability, using Service Level Objectives (SLOs) and error budgets to facilitate productive conversations about risk and feature velocity.

Operational Excellence & Incident Management:

  • Own the end-to-end availability, performance, and efficiency of our critical user-facing services.
  • Evolve our incident response practice to reduce MTTR and increase MTBF.
  • Champion a rigorous, blameless, and data-driven post-mortem culture to ensure we learn from both successes and failures, driving engineering teams toward systemic fixes and automation to prevent recurrence of incidents.
  • Streamline our existing processes and practices, and collaborate with other teams to raise our production release standards.
  • Define and tune a 24x7 on-call rotation for low noise and fast response; act as executive escalation partner during major incidents.
  • Own disaster-recovery strategy (playbooks, failover drills, recovery simulations) and track SLO gaps with time-bound remediations.

Financial & Vendor Management:

  • Own the SRE budget, tooling, and headcount.
  • Manage relationships with key third-party vendors for observability and SRE-related AI platforms; work with the infra lead and finance team on contract negotiations and ensure value from investments.

Cross-Functional Collaboration:

  • Act as a key influencer and strategic partner to leaders in Software Engineering, Product Management, and Infra/Sec.
  • Drive the adoption of SRE best practices and principles throughout the organization, ensuring new services are designed for reliability, scalability, and observability from day one.

Your Background

  • 8+ years of experience in a technical field, with at least 3 years in an engineering leadership position managing SRE, DevOps, or Production Engineering teams.
  • A deep, principled understanding of SRE tenets, including SLIs, SLOs, error budgets, toil reduction, and capacity planning.
  • Exceptional communication, negotiation, and influencing skills, with the ability to articulate complex technical concepts and strategies to both technical and non-technical stakeholders at all levels of the organization.
  • A strong technical background as a hands-on software engineer or site reliability engineer prior to moving into management.
  • Deep knowledge of AWS services (networking, IAM, EKS, ALBs/NLBs, Route 53, CloudWatch).
  • Proven experience with Kubernetes in production (EKS preferred), including service exposure, networking, and availability engineering.
  • Hands-on familiarity with modern SRE tools and technologies, including Infrastructure as Code (Terraform, Ansible), container orchestration (Kubernetes), observability platforms (Prometheus, Grafana, Datadog, Splunk), incident tooling (PagerDuty, FireHydrant), deployment-safety tooling (Argo Rollouts, LaunchDarkly), and observability standards (OpenTelemetry).

Preferred Qualifications (Nice-to-Haves)

  • Executive-caliber incident communication/storytelling skills (clear status, stakeholder alignment, and post-incident narratives).
  • Demonstrated success in hiring, developing, and mentoring high-performing engineers, including managing senior and principal-level talent.
  • Experience managing globally distributed teams and developing equitable and sustainable on-call rotation practices.
  • Experience in financial planning, budget management, and vendor contract negotiation for technical infrastructure and tooling.

The AI Mandate: Building the Future of Observability with AI

You will not just manage a team that uses AI; you will lead the charge in building an AI-native SRE function.

This is a strategic mandate that requires a forward-thinking leader who understands both the potential and the pitfalls of integrating intelligent systems into critical operations. This includes:

AIOps Strategy Development:

Developing and executing the strategy for integrating AIOps and machine learning into our observability stack. Move the team from a reactive monitoring posture to predictive maintenance and automated anomaly detection, fundamentally changing how we ensure reliability.

Accelerating Automation with AI:

Championing the effective and responsible use of AI-assisted coding tools within the SRE team. Set standards and practices to leverage these tools to accelerate automation, tooling, and infrastructure code.

Building the Business Case:

Building the techno-economic case for new AI tooling, managing vendor relationships, and ensuring cost-effective and secure implementation. Articulate ROI in terms of reduced downtime, improved efficiency, and faster incident resolution.

Fostering Critical AI Literacy:

Fostering a culture that can evaluate, debug, and learn from AI outputs, extending blameless post-mortems to AI-driven actions and recommendations.

EEO Statement:

We are an equal opportunity employer and all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, gender identity, disability, protected veteran status, or any other characteristic protected by law. We will consider qualified applicants with criminal histories consistent with applicable law.
