Talent.com
Site Reliability Engineer – GenAI Platform
Astra North Infoteck Inc. • MONTREAL & MIRABEL, QC, ca
8 hours ago
Job type
  • Full-time
Job description

Experience: 8+ years as a Site Reliability Engineer or in a similar role, with hands-on experience supporting IaaS platforms and knowledge of networking and systems engineering.

Roles and Responsibilities:

Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)

Design and build automation for core platform capabilities, reducing manual toil

Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.

Establish, monitor, and enforce SLOs / SLIs / SLAs, error budgets, alerting, and dashboards

Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation

Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting

Optimize cost vs. performance tradeoffs in large-scale compute environments

Harden systems for security, compliance, auditability, and data governance

Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems

Define disaster recovery (DR) strategies, backup / restore practices, and fault-tolerance mechanisms

Maintain runbooks, operational playbooks, documentation, and training materials

Participate in on-call rotations and respond to production incidents 24/7 as needed

Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability

Skills:

Production experience in SRE / Infrastructure / ops for large-scale systems

Strong programming / scripting skills (Python, Go, Java, or equivalent)

Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)

Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)

Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures

Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)

Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)

Solid experience in capacity planning, performance tuning, scaling, and incident response

Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements

Experience in regulated environments (financial services, compliance, audit, security) is a strong plus

Excellent communication, documentation, and cross-team collaboration skills

Proven track record of reducing operational toil via automation

Requirements

Android and iOS

