Talent.com
No longer accepting applications
Senior Site Reliability Engineer

Parallel Domain • Winnipeg, Canada
3 days ago
Salary
CA$145,000.00 yearly
Job type
  • Full-time
Job description
About the Role

Before an autonomous vehicle navigates a busy intersection, before a robot learns to pick and place in a warehouse, before any Physical AI system is trusted in the real world, it has to prove itself in ours. Parallel Domain builds the platform that validates the next generation of autonomous systems in high‑fidelity virtual environments, and the infrastructure underneath that platform is what makes simulation at scale possible.

We're hiring a Senior Site Reliability Engineer to help build and operate that infrastructure. This role sits at the core of how we run large‑scale, distributed simulation workloads for autonomous‑systems testing and validation. You'll work across multi‑region AWS infrastructure, operate Kubernetes at scale, and contribute directly to reliability, security, and deployment systems that the rest of the engineering org depends on.

This is a hands‑on role with the broad ownership typical of a startup. You'll partner closely with platform, simulation, and ML teams to keep the system running smoothly and evolving. We're growing the team—two of these roles are open—and the work is substantive: multi‑region GPU scheduling, Windows workloads on Kubernetes, large‑scale batch simulation, and an enterprise product direction that will require rethinking parts of how we deploy and operate.

Responsibilities

Infrastructure ownership and cloud operations.

  • Design, build, and maintain multi‑region AWS infrastructure using Terraform.
  • Operate and scale EKS clusters across production regions: autoscaling, node lifecycle, and workload health.
  • Manage networking across environments: VPC design, DNS, load balancing, and cross‑region connectivity.
  • Support infrastructure changes, migrations, and expansions into new regions.
  • Contribute to and improve GitOps‑based deployment workflows using GitHub Actions, Helm, and Kustomize.

Reliability engineering and incident response.

  • Build and run incident management processes: severity definitions, escalation paths, and on‑call practices.
  • Lead incident response, debugging, and root‑cause analysis.
  • Write postmortems and drive systemic reliability improvements.
  • Improve observability across metrics, logging, tracing, and dashboards.
  • Support GPU and batch workloads running on Kubernetes.

Security and access management.

  • Provide security‑conscious feedback on platform architecture decisions.
  • Own cloud IAM governance: roles, policies, and access boundaries across accounts and services.
  • Lead compliance‑adjacent work including audit‑readiness, partner certification requirements, and responses to customer security questionnaires.

Platform tooling and developer experience.

  • Improve CI/CD pipelines and infrastructure validation.
  • Support engineers with infrastructure debugging, environment setup, and performance issues.
  • Contribute to tooling and automation in Python and Bash.
  • Take on adjacent responsibilities as needed in a startup environment.

Required Qualifications

Experience.

5+ years in SRE, DevOps, or infrastructure engineering roles, with a track record of operating production systems across multiple regions.

Terraform.

Modules, state management, and multi‑environment patterns.

AWS depth.

Solid experience across VPC, IAM, EKS, S3, and CloudWatch.

Kubernetes expertise.

Cluster operations, autoscaling, RBAC, and Helm.

CI/CD and GitOps.

Experience with GitHub Actions, ArgoCD, or similar workflows.

Networking fundamentals.

CIDR, DNS, load balancing, VPN, and cross‑region connectivity.

Observability.

Experience with tooling such as Prometheus and Grafana.

Scripting.

Comfort with Python and Bash for tooling and automation.

Cross‑platform familiarity.

Working knowledge of both Linux and Windows environments. Operational experience supporting Windows‑based workloads is a meaningful advantage.

Pragmatism and ownership.

Comfortable in a fast‑moving startup with evolving priorities. You take ownership of systems while collaborating closely with other teams, and you're pragmatic about tradeoffs between speed, reliability, and complexity.

Preferred Qualifications

Windows on Kubernetes.

Experience with Windows node pools, Windows AMIs, and GPU‑adjacent components on K8s.

GPU scheduling.

Familiarity with GPU scheduling on Kubernetes, including NVIDIA device plugin configuration.

Domain workloads.

Experience supporting simulation, ML, or rendering workloads in cloud infrastructure.

AWS extras.

Exposure to AWS Storage Gateway, Active Directory integrations, or AWS Transfer Family.

Service mesh.

Familiarity with service proxy or service mesh patterns.

Container OS.

Experience with container‑optimized OS images (e.g., Bottlerocket) and image‑build tooling (e.g., Packer).

Cost optimization.

Cloud cost optimization at scale.

Core Tools

Terraform

AWS

Kubernetes

Helm

Kustomize

ArgoCD

GitHub Actions

Prometheus

Grafana

Docker

Python

Bash

What Makes a Great Candidate

You think in failure modes and proactively surface issues. You hold a principled view on security and push back constructively when designs introduce unnecessary risk. You communicate clearly across engineering, product, and customer‑facing teams, flagging issues with urgency proportional to customer impact. You take end‑to‑end ownership of complex efforts and know when to push for the clean solution versus the pragmatic one.

The base salary range is CAD $145,000–$185,000, depending on skills, qualifications, and experience. Compensation also includes equity, full health/dental/vision coverage, a learning stipend, and generous vacation. This role is remote‑friendly across Canada and the US Pacific Northwest.
