Talent.com
Software Engineer - Reliability

lumalabs.ai • London, ON, CA
Posted more than 30 days ago
Contract type
  • Full-time
Job description

Luma’s mission is to build multimodal AI to expand human imagination and capabilities. We believe that multimodality is critical for intelligence. This requires a massive, reliable, and performant GPU infrastructure that pushes the boundaries of scale. Our SRE team is the foundation of our research and product velocity, responsible for the thousands of NVIDIA and AMD GPUs across multiple providers that power our work.

Where You Come In

We are looking for a hands‑on, first‑principles engineer who is fluent in Linux, comfortable operating close to the metal, and capable of architecting systems for the next generation of AI infrastructure.

You will build, maintain, and scale Luma’s infrastructure across on‑prem and multi‑vendor clouds (AWS & OCI), serving as the bridge between hardware vendors, cloud providers, and our research teams.

What You’ll Do

  • Architect for Reliability & Scale: Participate in critical re‑architecture sessions to redesign our systems for higher efficiency and scale. You won't just maintain existing clusters; you will help define how our next‑generation infrastructure operates.
  • Own Multi‑Cloud GPU Clusters: Take end‑to‑end ownership of our production clusters for training and inference across AWS and OCI, ensuring high availability and peak performance.
  • Drive Security & Compliance: Assist in achieving and maintaining security certifications (SOC 2 Type 1 & 2, ISO standards) by implementing robust infrastructure security practices in a fast‑moving AI startup environment.
  • Deep Linux Performance Tuning: Use your mastery of Linux systems to troubleshoot and optimise performance at the OS and kernel level.
  • Build Robust Automation: Write high‑quality tools and automation in Python, Go, or Bash to manage, monitor, and heal our infrastructure without relying on heavy operational toil.
  • Debug Complex Hardware / Software Failures: Serve as the final escalation point for the most challenging GPU, networking (InfiniBand / RDMA), and system‑level issues, often collaborating directly with hardware vendors like NVIDIA.

Who You Are

  • 8+ years of experience as an SRE, production engineer, or infrastructure engineer in a fast‑paced, large‑scale environment.
  • Deep Linux Mastery: You possess deep, hands‑on expertise in Linux, containerised systems, and debugging low‑level system performance.
  • Cloud Infrastructure Expert: You have strong experience with providers like AWS or OCI.
  • Tenacious Troubleshooter: You thrive on solving complex, low‑level problems where hardware and software intersect.
  • Startup DNA: You are energetic and thrive in a less structured, fast‑paced environment.
  • Security‑Minded: You possess a working knowledge of security best practices and familiarity with compliance frameworks, such as SOC 2 and ISO.
  • Expert in High‑Performance Networking: You have practical experience with InfiniBand, RDMA, or RoCE and understand how to optimise throughput for massive distributed training jobs.

What Sets You Apart (Bonus Points)

  • Deep expertise with GPU tooling such as DCGM (NVIDIA) or ROCm (AMD).
  • Experience managing large‑scale GPU clusters for AI / ML workloads (training or inference).
  • Familiarity with job management systems based on Kubernetes or orchestration frameworks like Ray.

Compensation

The base pay range for this role is $170,000 – $360,000 per year.

