Data engineer Jobs in Mirabel, QC
Site Reliability Engineer – GenAI Platform | Astra North Infoteck Inc. | Montreal & Mirabel, QC, CA
Quality Control Associate - Mirabel / Boisbriand / Saint-Jérôme - $23-25/hr | Network Wireless Solutions | Mirabel, QC, CA
Financial controller | Énairco Inc | Mirabel, QC, CA
Construction project manager | Mongrain | Mirabel, QC, CA
Specialist, Configuration & Data Management | L3Harris Technologies | Mirabel, Quebec
Ramp Handler | FedEx | Mirabel, Quebec, CA
Administrative assistant | Randstad Canada | Mirabel, Quebec, CA
Production IT Systems Lead (Product Owner) | Halo Pharmaceutical Canada | Mirabel, QC, CA
Virtual Data Entry Clerk | FocusGroupPanel | Mirabel, Quebec, Canada
Machining Method Technician | Safran Systèmes d’Atterrissage | Mirabel, QC, CA

- Kitchener, ON (from $ 84,695 to $ 285,460 year)
- Iroquois Falls, ON (from $ 134,453 to $ 181,248 year)
- Prince George, BC (from $ 92,469 to $ 180,500 year)
- Thunder Bay, ON (from $ 120,000 to $ 177,500 year)
- Glace Bay, NS (from $ 120,000 to $ 177,500 year)
- Niagara Falls, ON (from $ 128,175 to $ 176,710 year)
- Spruce Grove, AB (from $ 153,483 to $ 173,684 year)
- Chatham-Kent, ON (from $ 121,003 to $ 173,490 year)
- Surrey, BC (from $ 112,450 to $ 172,519 year)
- Medicine Hat, AB (from $ 117,476 to $ 170,000 year)
Site Reliability Engineer – GenAI Platform
Astra North Infoteck Inc., Montreal & Mirabel, QC, CA - Full-time
Experience: 8+ years as a Site Reliability Engineer or in a similar role, with hands-on experience supporting IaaS platforms and with networking and systems engineering knowledge.
Roles and Responsibilities:
Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
Design and build automation for core platform capabilities, reducing manual toil
Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting
Optimize cost vs. performance tradeoffs in large-scale compute environments
Harden systems for security, compliance, auditability, and data governance
Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems
Define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms
Maintain runbooks, operational playbooks, documentation, and training materials
Participate in on-call rotations and respond to production incidents 24/7 as needed
Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability
Skills:
Production experience in SRE / Infrastructure / ops for large-scale systems
Strong programming/scripting skills (Python, Go, Java, or equivalent)
Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)
Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
Solid experience in capacity planning, performance tuning, scaling, and incident response
Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
Excellent communication, documentation, and cross-team collaboration skills
Proven track record of reducing operational toil via automation