Talent.com
SRE for Gen AI App Infrastructure and Operations
Astra North Infoteck Inc. • Laval, Qc
24 days ago
Job type
  • Full-time
Job description

"AI Infra Ops and SRE engineer

Must work from the office 3 days a week

Skills:

  • Production experience in SRE / Infrastructure / ops for large-scale systems
  • Strong programming / scripting skills (Python, Go, Java, or equivalent)
  • Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)
  • Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
  • Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
  • Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
  • Networking & systems engineering knowledge (TCP / IP, DNS, routing, load balancing, distributed storage)
  • Solid experience in capacity planning, performance tuning, scaling, and incident response
  • Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
  • Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
  • Excellent communication, documentation, and cross-team collaboration skills
  • Proven track record of reducing operational toil via automation

Experience: 8+ years as a Site Reliability Engineer or in a similar role, with hands-on experience supporting IaaS platforms and solid networking and systems engineering knowledge.

Roles and Responsibilities:

  • Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
  • Design and build automation for core platform capabilities, reducing manual toil
  • Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
  • Establish, monitor, and enforce SLOs / SLIs / SLAs, error budgets, alerting, and dashboards
  • Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
  • Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting
  • Optimize cost vs. performance tradeoffs in large-scale compute environments
  • Harden systems for security, compliance, auditability, and data governance
  • Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems
  • Define disaster recovery (DR) strategies, backup / restore practices, and fault tolerance mechanisms
  • Maintain runbooks, operational playbooks, documentation, and training materials
  • Participate in on-call rotations and respond to production incidents 24 / 7 as needed
  • Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability"