Talent.com
Software Engineer - Reliability

lumalabs.ai • London, ON, CA
30+ days ago
Job type
  • Full-time
Job description

Luma’s mission is to build multimodal AI to expand human imagination and capabilities. We believe that multimodality is critical for intelligence. This requires a massive, reliable, and performant GPU infrastructure that pushes the boundaries of scale. Our SRE team is the foundation of our research and product velocity, responsible for the thousands of NVIDIA and AMD GPUs across multiple providers that power our work.

Where You Come In

We are looking for a hands‑on, first‑principles engineer who is fluent in Linux, comfortable operating close to the metal, and capable of architecting systems for the next generation of AI infrastructure.

You will build, maintain, and scale Luma’s infrastructure across on‑prem and multi‑vendor clouds (AWS & OCI), serving as the bridge between hardware vendors, cloud providers, and our research teams.

What You’ll Do

  • Architect for Reliability & Scale: Participate in critical re‑architecture sessions to redesign our systems for higher efficiency and scale. You won’t just maintain existing clusters; you will help define how our next‑generation infrastructure operates.
  • Own Multi‑Cloud GPU Clusters: Take end‑to‑end ownership of our production clusters for training and inference across AWS and OCI, ensuring high availability and peak performance.
  • Drive Security & Compliance: Assist in achieving and maintaining security certifications (SOC 2 Type 1 & 2, ISO standards) by implementing robust infrastructure security practices in a fast‑moving AI startup environment.
  • Deep Linux Performance Tuning: Use your mastery of Linux systems to troubleshoot and optimise performance at the OS and kernel level.
  • Build Robust Automation: Write high‑quality tools and automation in Python, Go, or Bash to manage, monitor, and heal our infrastructure without relying on heavy operational toil.
  • Debug Complex Hardware/Software Failures: Serve as the final escalation point for the most challenging GPU, networking (InfiniBand/RDMA), and system‑level issues, often collaborating directly with hardware vendors like NVIDIA.

Who You Are

  • 8+ years of experience as an SRE, production engineer, or infrastructure engineer in a fast‑paced, large‑scale environment.
  • Deep Linux Mastery: You possess deep, hands‑on expertise in Linux, containerised systems, and debugging low‑level system performance.
  • Cloud Infrastructure Expert: You have strong experience with providers like AWS or OCI.
  • Tenacious Troubleshooter: You thrive on solving complex, low‑level problems where hardware and software intersect.
  • Startup DNA: You are energetic and thrive in a less structured, fast‑paced environment.
  • Security‑Minded: You possess a working knowledge of security best practices and familiarity with compliance frameworks, such as SOC 2 and ISO.
  • Expert in High‑Performance Networking: You have practical experience with InfiniBand, RDMA, or RoCE and understand how to optimise throughput for massive distributed training jobs.

What Sets You Apart (Bonus Points)

  • Deep expertise with GPU tooling for NVIDIA and AMD GPUs, such as DCGM or ROCm.
  • Experience managing large‑scale GPU clusters for AI/ML workloads (training or inference).
  • Familiarity with job management systems based on Kubernetes or orchestration frameworks like Ray.

Compensation

The base pay range for this role is $170,000 – $360,000 per year.
