Talent.com
Software Engineer - Reliability
lumalabs.ai • London, ON, CA
30+ days ago
Job type
  • Full-time
Job description

Luma’s mission is to build multimodal AI to expand human imagination and capabilities. We believe that multimodality is critical for intelligence. This requires a massive, reliable, and performant GPU infrastructure that pushes the boundaries of scale. Our SRE team is the foundation of our research and product velocity, responsible for the thousands of NVIDIA and AMD GPUs across multiple providers that power our work.

Where You Come In

We are looking for a hands‑on, first‑principles engineer who is fluent in Linux, comfortable operating close to the metal, and capable of architecting systems for the next generation of AI infrastructure.

You will build, maintain, and scale Luma’s infrastructure across on‑prem and multi‑vendor clouds (AWS & OCI), serving as the bridge between hardware vendors, cloud providers, and our research teams.

What You’ll Do

  • Architect for Reliability & Scale: Participate in critical re‑architecture sessions to redesign our systems for higher efficiency and scale. You won't just maintain existing clusters; you will help define how our next‑generation infrastructure operates.
  • Own Multi‑Cloud GPU Clusters: Take end‑to‑end ownership of our production clusters for training and inference across AWS and OCI, ensuring high availability and peak performance.
  • Drive Security & Compliance: Assist in achieving and maintaining security certifications (SOC 2 Type 1 & 2, ISO standards) by implementing robust infrastructure security practices in a fast‑moving AI startup environment.
  • Deep Linux Performance Tuning: Use your mastery of Linux systems to troubleshoot and optimise performance at the OS and kernel level.
  • Build Robust Automation: Write high‑quality tools and automation in Python, Go, or Bash to manage, monitor, and heal our infrastructure without relying on heavy operational toil.
  • Debug Complex Hardware / Software Failures: Serve as the final escalation point for the most challenging GPU, networking (InfiniBand / RDMA), and system‑level issues, often collaborating directly with hardware vendors like NVIDIA.

Who You Are

  • 8+ years of experience as an SRE, production engineer, or infrastructure engineer in a fast‑paced, large‑scale environment.
  • Deep Linux Mastery: You possess deep, hands‑on expertise in Linux, containerised systems, and debugging low‑level system performance.
  • Cloud Infrastructure Expert: You have strong experience with providers like AWS or OCI.
  • Tenacious Troubleshooter: You thrive on solving complex, low‑level problems where hardware and software intersect.
  • Startup DNA: You are energetic and thrive in a less structured, fast‑paced environment.
  • Security‑Minded: You possess a working knowledge of security best practices and familiarity with compliance frameworks, such as SOC 2 and ISO.
  • Expert in High‑Performance Networking: You have practical experience with InfiniBand, RDMA, or RoCE and understand how to optimise throughput for massive distributed training jobs.

What Sets You Apart (Bonus Points)

  • Deep expertise with GPU tooling such as DCGM (NVIDIA) or ROCm (AMD).
  • Experience managing large‑scale GPU clusters for AI / ML workloads (training or inference).
  • Familiarity with job management systems based on Kubernetes or orchestration frameworks like Ray.

Compensation

The base pay range for this role is $170,000 – $360,000 per year.
