Talent.com
CARFAX
Senior Software Engineer - Dev Ops

CARFAX • London, Ontario
10 days ago
Salary
CA$92,500.00 yearly
Job type
  • Full-time
Job description

Join Team CARFAX as a Senior Software Engineer - Dev Ops

We are looking for a seasoned Senior Software Engineer - Dev Ops to join our platform team and take an active role in designing, scaling, and operating the infrastructure that powers Large Language Model (LLM) development and hosting. This is a high-impact, highly technical position where you will own critical platform components, drive architectural decisions, and directly shape the reliability, performance, and security of our AI infrastructure.

At its core, this is a Kubernetes-first, cloud-native platform engineering role. We care deeply about your ability to architect and operate scalable, resilient infrastructure for LLM workloads; the specific cloud or tooling you've built that experience on is secondary. Our current platform runs on AWS with EKS, Flyte, ArgoCD, JupyterHub, and the LGTM observability stack, and you'll be working within that environment, but we are far more interested in the depth of your platform thinking than in a specific vendor background.

If you are an engineer who thrives at the intersection of AI/ML and cloud-native infrastructure, who gets excited about solving the unique scaling and operational challenges that LLM workloads demand, and who wants to work on technology that sits at the absolute cutting edge of the AI industry, this role was built for you.

At CARFAX, we believe in the power of teamwork and value in-person interactions so that we can collaborate and thrive together. This position will require 2 days in the London, ON office per week, subject to change with future business needs.
One last thing: Our four-day week continues in Summer 2026.

What You'll Own:
  • LLM Platform Architecture: Actively participate in the design and evolution of the core infrastructure platform supporting LLM training, fine-tuning, and inference workloads at scale, contributing architectural decisions that balance performance, cost, and reliability across the full platform lifecycle.
  • Kubernetes & Advanced Autoscaling: Own the design and implementation of sophisticated K8s autoscaling strategies (HPA, VPA, KEDA, Cluster Autoscaler) tailored to the highly variable and GPU-intensive demands of LLM workloads. Our current environment is EKS, though equivalent production Kubernetes experience on GKE, AKS, or on-prem is equally valued.
  • ML Workflow Orchestration: Participate in the engineering and optimization of ML pipeline infrastructure, contributing to best practices for pipeline design, resource allocation, and workflow reliability across LLM training and evaluation workloads. We currently use Flyte; experience with comparable platforms such as Kubeflow, Airflow, or Prefect translates well.
  • AI Developer Platform: Own and contribute to the architecture and operations of interactive compute environments used by AI researchers and LLM engineers to develop, experiment, and prototype. We run JupyterHub today, though experience with equivalent multi-user ML development platforms is directly applicable.
  • CI/CD & GitOps: Participate in the development and ongoing improvement of GitOps workflows and CI/CD pipelines, contributing to deployment best practices and enabling rapid, reliable delivery of platform changes. Our current implementation uses ArgoCD; strong experience with GitOps principles and comparable tooling is what matters.
  • Observability & Reliability: Contribute to the full observability stack implementation: designing dashboards, defining SLOs, building alerting frameworks, and ensuring deep visibility into LLM workload performance and platform health. We use the LGTM stack (Loki, Grafana, Tempo, Mimir); experience with Prometheus, OpenTelemetry, ELK, Datadog, or equivalent platforms is welcomed.
  • Cloud Infrastructure: Participate in cloud infrastructure design across compute (including GPU instance families), storage, networking, and IAM, with a strong emphasis on cost optimization and operational excellence. Our primary cloud is AWS; candidates with strong GCP or Azure backgrounds who are prepared to work in AWS are encouraged to apply.
  • Security & Compliance: Engage actively in the vulnerability assessment and remediation program across all platform components, contributing to security standards and ensuring the LLM platform meets organizational and regulatory compliance requirements.
  • Collaborative Engineering: Participate in technical design reviews, contribute to roadmap discussions, and serve as a knowledgeable resource and collaborative partner across AIOps and MLOps disciplines.

Required Experience & Skills:
  • 7+ years of experience in DevOps, Platform Engineering, MLOps, or a closely related infrastructure discipline.
  • Deep Kubernetes expertise: production experience operating Kubernetes at scale on any major managed platform (EKS, GKE, AKS) or on-premises, with advanced knowledge of scheduling, autoscaling, networking, RBAC, and cluster operations.
  • Cloud infrastructure proficiency: extensive experience designing and operating production workloads on at least one major cloud provider (AWS, GCP, or Azure), covering compute, storage, networking, and identity and access management.
  • MLOps / AI Infrastructure experience: demonstrated experience building and operating infrastructure that supports ML training, model serving, or LLM workloads, including GPU resource management and scheduling at scale.
  • CI/CD & GitOps: strong hands-on experience with GitOps principles and modern CI/CD pipeline design, using any mainstream tooling (ArgoCD, Flux, GitHub Actions, Tekton, or equivalent).
  • Observability Engineering: production experience designing and operating observability platforms including metrics, logging, and distributed tracing, using any modern stack (Grafana/LGTM, Prometheus, Datadog, ELK, or equivalent).
  • Infrastructure as Code: strong proficiency with Terraform, Helm, or comparable IaC and configuration management tooling.
  • Programming & Scripting: solid coding ability in Python and/or Go, with experience writing automation, tooling, and infrastructure integrations.
  • Security Mindset: hands-on experience with vulnerability scanning, remediation workflows, and cloud security best practices including RBAC hardening and secrets management.

Strongly Preferred:
  • Direct experience with Flyte or comparable ML workflow orchestration platforms (Kubeflow, Airflow, Prefect, Metaflow).
  • Experience operating JupyterHub or equivalent multi-user interactive compute platforms at scale.
  • Familiarity with LLM-specific infrastructure: model serving frameworks (vLLM, Triton, TorchServe), GPU cluster management, large-scale distributed training setups.
  • Hands-on experience with AWS (EKS, EC2 GPU families, S3, IAM, VPC) as our current primary cloud environment.
  • Experience with FinOps practices: cloud cost attribution, rightsizing, and spot/preemptible instance strategies for ML workloads.
  • Relevant certifications: CKA / CKS, AWS/GCP/Azure Solutions Architect or DevOps Engineer, or equivalent.

Who You Are:
  • A systems thinker who understands how architectural decisions ripple across reliability, performance, cost, and security, regardless of which cloud or tooling stack those decisions are made within.
  • Operationally minded: you build things to be observable, maintainable, and resilient from day one.
  • Deeply curious about AI and LLMs: you understand why the infrastructure you build matters and stay current with how the AI landscape is evolving.
  • Proactive and ownership-driven: you identify problems before they become incidents and drive solutions to completion.
  • An effective collaborator and communicator who can translate complex infrastructure concepts for AI researchers, data scientists, and engineering leadership alike.
  • Comfortable operating with autonomy in a fast-moving environment where priorities evolve alongside the AI landscape.

Why This Role Stands Out:
LLM infrastructure is one of the most technically demanding and strategically important engineering domains in the industry today. As a senior member of our AIOps team you will:
  • Directly shape the platform that enables LLM development and productionization; your contributions will have immediate, measurable impact.
  • Work on genuinely hard infrastructure problems: GPU scheduling, large-scale distributed workloads, high-throughput model serving, and multi-tenant ML environments.
  • Be positioned at the epicenter of the AI infrastructure space, one of the fastest growing and highest-demand engineering disciplines in the industry.
  • Have a clear voice in technical direction; your experience and opinions on platform design are genuinely valued and actively sought.
  • Bring your full experience to the table; whether you've built on AWS, GCP, Azure, or hybrid environments, your platform engineering expertise is what drives impact here.

What’s in it for you:
  • Competitive Compensation: Attractive salary, comprehensive benefits, and generous time-off policies.
  • Flexible Work Schedules: Enjoy 4-day summer work weeks and a winter holiday break.
  • Retirement Support: 401(k) / DCPP matching.
  • Performance Rewards: Annual bonus program to recognize your contributions.
  • Innovative Workspace: Casual, dog-friendly offices designed for creativity and collaboration.
Our accolades speak for themselves:
  • 10X Virginia Business Best Places to Work
  • 9X Washingtonian Great Places to Work
  • 9X Washington Post Top Workplace
  • St. Louis Post-Dispatch Best Places to Work
Vacancy Status:
This posting is for an existing vacancy.

Base Salary:
The anticipated base salary range for this position is CAD $92,500 to $136,000 annually. Final base salary will be determined based on geographical location, experience, and qualifications.

Benefits:
Join a company that values your total wellbeing. Carfax offers competitive compensation, comprehensive healthcare coverage, and the chance to make a meaningful impact in an industry-leading organization. Our benefit offerings can be found at: .