Talent.com
Director of Software Validation Engineering – ROCm

Advanced Micro Devices, Inc • MARKHAM, Ontario, Canada
9 days ago
Job type
  • Full-time
Job description


WHAT YOU DO AT AMD CHANGES EVERYTHING

At AMD, our mission is to build great products that accelerate next-generation computing experiences, from AI and data centers to PCs, gaming, and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you’ll discover the real differentiator is our culture. We push the limits of innovation to solve the world’s most important challenges, striving for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.




THE TEAM

The ROCm software organization at AMD builds and maintains the open-source GPU software stack powering AI training, inference, and HPC workloads across AMD's data center and consumer GPU portfolio. ROCm is the foundation on which developers, researchers, and enterprises run their most demanding AI and HPC workloads. Quality and reliability are existential to our success. We operate at the intersection of cutting-edge hardware and software — and we move fast. Our team is deeply invested in open-source, community-driven development, and engineering excellence at every layer of the stack.

THE ROLE

We're looking for a hands-on Director of Test Engineering to lead and transform the quality function for ROCm. This is not a program management role — it's a deeply technical leadership position for someone who understands the hardware/software interface of GPUs, has built test engineering organizations from the ground up, and is ready to lead the next wave of AI-native, agentic quality engineering.

You will own the vision, strategy, and execution of test engineering for ROCm — from kernel-level driver validation to user-space ML framework testing. Critically, you will be the driving force behind scaling your team's impact through AI and agentic tooling, building a modern, autonomous quality organization that moves faster than any traditional QA team could.

THE IMPACT YOU WILL HAVE

  • Define and own the test engineering strategy for ROCm across the full HW/SW stack, from driver interfaces to ML framework validation.
  • Transform the quality organization into an AI-first, agentic team — scaling coverage, speed, and reliability without proportional headcount growth.
  • Build and operate continuous testing and validation infrastructure including long-running soak, stress, failure/recovery, and staging environments for product reliability.
  • Raise the bar on test engineering discipline: shift-left practices, SDET-caliber test development, and deep ownership of quality metrics.
  • Partner directly with hardware, firmware, and software engineers to ensure quality is embedded at every stage of development.
  • Drive adoption of AI-assisted testing workflows, intelligent test selection, automated root cause analysis, and agentic CI/CD pipelines across the organization.

THE PERSON

The ideal candidate is a technical leader who has built and scaled test engineering teams in complex, hardware-adjacent software environments. You are hands-on when it matters: able to prototype a test framework, debug a GPU driver failure, or design a validation architecture. You also understand how customers actually use the product: the AI inference and training workloads they run, the parallelism strategies they deploy, the performance they expect, and the failure modes they hit. That customer-workload knowledge is what separates a QA team that writes black-box sanity checks from one that designs tests targeting the exact code paths real users exercise. You see AI agents not as a novelty but as the primary lever for scaling your team's output. You are impatient with manual, reactive QA and energized by building systems that catch bugs before humans even see them.

KEY RESPONSIBILITIES

  • Own the overall test engineering strategy and architecture for ROCm, spanning driver validation, runtime testing, compiler/toolchain quality, and ML framework integration — with test coverage designed around real customer workload patterns, not synthetic benchmarks.
  • Lead, grow, and mentor a team of SDETs and test engineers, instilling SDET-level engineering discipline and a culture of automation-first quality.
  • Architect and operate continuous testing/validation infrastructure: staging environments for soak testing, stress testing, failure injection, recovery validation, and long-duration reliability runs.
  • Champion AI-first and agentic test engineering: drive adoption of LLM-assisted test generation, autonomous failure triage, intelligent test prioritization, and agentic CI/CD workflows.
  • Prototype new test frameworks, validation tooling, and agentic testing pipelines hands-on, especially in early-stage or high-ambiguity situations.
  • Define, track, and improve quality KPIs: test coverage, defect escape rate, time-to-detection, device utilization, and validation cycle time.
  • Collaborate closely with hardware, firmware, and software engineering teams to ensure quality is integrated from design through release.
  • Partner with DevOps and infrastructure teams to evolve the CI/CD pipeline with robust, scalable, GPU-aware test automation.
  • Engage with the open-source ROCm community and external customers on quality feedback loops and reliability expectations, translating their workload patterns and failure reports into structured test coverage.
  • Partner with compiler, runtime, and framework integration teams on numerical correctness validation — understanding shared scope boundaries and ensuring the test organization contributes meaningfully to catching precision regressions across floating-point formats and parallelism configurations.
  • Establish and maintain HW/SW test automation for both Linux and Windows platforms across AMD's GPU product lines.

REQUIRED QUALIFICATIONS

  • 12+ years of experience in software engineering or test engineering, with significant experience in hardware-adjacent or systems-level software.
  • 5+ years of engineering management, including building and scaling test engineering or SDET organizations.
  • Deep hands-on expertise in test automation at scale — framework design, CI/CD pipeline development, and continuous validation systems.
  • Demonstrated experience with hardware + software test automation, including HW bring-up, driver validation, or firmware/software co-testing.
  • Strong understanding of GPU architecture or hardware/software interfaces (PCIe, memory subsystems, compute kernels, or equivalent).
  • Experience designing and operating always-on test infrastructure: soak/stress environments, failure injection, and reliability/recovery validation pipelines.
  • Proven track record of adopting and scaling AI or automation tooling to multiply team throughput.
  • Python proficiency: able to write test automation, tooling, and scripted validation workflows independently.
  • Practical understanding of how AI inference and training workloads are deployed on GPU hardware — including common parallelism strategies (tensor parallel, pipeline parallel, data parallel), serving configurations, and performance expectations — sufficient to translate customer use cases into targeted test coverage.
  • Hands-on software development skills sufficient to prototype test frameworks, write automation tooling, and review SDET-level code.

PREFERRED QUALIFICATIONS

  • Direct experience with ROCm, CUDA, or GPU compute software stacks (runtime, compiler, ML frameworks).
  • Experience integrating LLMs, AI agents, or agentic workflows into software development or test engineering processes.
  • Expertise in open-source development practices and community-facing quality processes (GitHub Actions, open CI, etc.).
  • Background in SDET or test engineering in a semiconductor, HPC, or AI infrastructure company.
  • Experience with GPU-specific test challenges: non-determinism, thermal behavior, multi-device coordination, driver stability.
  • Track record of shipping test frameworks or validation tools used across large engineering organizations.
  • Familiarity with ML training/inference workload validation: throughput, latency, numerical stability across precision formats (FP32/BF16/FP8), and multi-GPU collective communication correctness.
  • Experience with GPU profiling and trace analysis tooling (e.g., rocprof, omniperf, PyTorch profiler) to identify kernel-level performance and correctness anomalies.
  • Familiarity with HIP, CUDA, or low-level GPU programming — sufficient to understand what is being tested at the runtime and kernel level, even if not writing kernels directly.


#LI-G11

#LI-HYBRID

Note: This role is intentionally scoped as a hands-on technical leadership position. Candidates whose primary background is program management or traditional QA management without deep engineering execution experience may not be the right fit.




Benefits offered are described in AMD benefits at a glance.

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.

AMD may use Artificial Intelligence to help screen, assess or select applicants for this position. AMD’s “Responsible AI Policy” is available here.

This posting is for an existing vacancy.
