SRE for Gen AI App Infrastructure and Operations

Astra North Infoteck Inc. • Laval, Qc

24 days ago

Job type

Full-time

Job description

"AI Infra Ops and SRE engineer

Must work from the office 3 days a week

Skills:

Production experience in SRE / Infrastructure / ops for large-scale systems
Strong programming / scripting skills (Python, Go, Java, or equivalent)
Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)
Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
Networking & systems engineering knowledge (TCP / IP, DNS, routing, load balancing, distributed storage)
Solid experience in capacity planning, performance tuning, scaling, and incident response
Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
Excellent communication, documentation, and cross-team collaboration skills
Proven track record of reducing operational toil via automation

Experience: 8+ years as a Site Reliability Engineer or in a similar role, with hands-on experience supporting IaaS platforms, plus networking and systems engineering knowledge.

Roles and Responsibilities:

Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)

Design and build automation for core platform capabilities, reducing manual toil

Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.

Establish, monitor, and enforce SLOs / SLIs / SLAs, error budgets, alerting, and dashboards

Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation

Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting

Optimize cost vs. performance tradeoffs in large-scale compute environments

Harden systems for security, compliance, auditability, and data governance

Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems

Define disaster recovery (DR) strategies, backup / restore practices, and fault tolerance mechanisms

Maintain runbooks, operational playbooks, documentation, and training materials

Participate in on-call rotations and respond to production incidents 24/7 as needed

Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability"
