Talent.com
AI Site Reliability Engineer
No longer accepting applications
Astra North Infoteck Inc. • Laval, Qc
30+ days ago
Job type
  • Full-time
Job description
Role: SRE + AI

Hybrid: 3 days in office

Location: Montreal

Experience: 8+ years as a Site Reliability Engineer or in a similar role, with hands-on experience supporting IaaS platforms and knowledge of networking and systems engineering.
Roles and Responsibilities:
• Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
• Design and build automation for core platform capabilities, reducing manual toil
• Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
• Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
• Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
• Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting
• Optimize cost vs. performance tradeoffs in large-scale compute environments
• Harden systems for security, compliance, auditability, and data governance
• Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems
• Define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms
• Maintain runbooks, operational playbooks, documentation, and training materials
• Participate in on-call rotations and respond to production incidents 24/7 as needed
• Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability

Skills:
• Production experience in SRE / Infrastructure / ops for large-scale systems
• Strong programming/scripting skills (Python, Go, Java, or equivalent)
• Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)
• Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
• Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
• Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
• Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
• Solid experience in capacity planning, performance tuning, scaling, and incident response
• Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
• Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
• Excellent communication, documentation, and cross-team collaboration skills
• Proven track record of reducing operational toil via automation


