Pay at Intact is about much more than just salary.
Flexible work arrangements and a hybrid work model
Possibility to purchase up to 5 extra days off per year
Multiple benefits offered to support physical and mental wellbeing, including telemedicine, Wellness account and much more
Share plan & other savings: up to 12% of salary or even more (ask how you could earn guaranteed income for life)
Salary range (but not limited to):
149,600 - 182,800
Annual bonus target, based on the base salary, with a potential payout of up to double the target (subject to personal and company performance):
15%
As part of our commitment to Win As A Team, we share our success with employees through our annual bonus plan and Employee Share Purchase Plan (ESPP) – with Intact matching 50% of your net shares.
Our pension offerings provide flexibility and long-term security for our employees beyond their careers. We are one of the few companies offering the opportunity to receive guaranteed income for life via our defined benefit pension plan.
Salary for the candidate will be determined taking into consideration a number of factors including: experience, skills, qualifications, anticipated contribution to role, internal equity, etc. The salary range presented above is based on a 35-hour workweek and would represent a majority of different candidate profiles. However, we encourage candidates who may fall outside of this range to apply as well.
About the role
We are seeking a Resiliency Architect to define and drive our end-to-end resiliency architecture and production reliability posture across Azure, AWS, Google Cloud, and on‑prem environments.
This person will be responsible for design standards, production readiness, and enforcement mechanisms at enterprise scale.
The ideal candidate combines deep SRE expertise with advanced systems architecture, a strong vision for explicit blue/green and chaos engineering practices, and AI/GenAI experience. The goal is to make systems reliable, leverage AI as a force multiplier for resiliency, transform team workflows, and deliver resilient, intelligent user solutions.
What you'll do here:
Core objectives:
Establish the enterprise resiliency architecture, patterns, and production guardrails for all critical platforms and services.
Govern design quality through rigorous architecture reviews and production readiness assessments.
Make blue/green deployments and chaos engineering first-class, codified practices across the estate: design, tooling, automation, and continuous validation.
Integrate AI/GenAI into reliability engineering: robust AI system architectures, AI-assisted observability, causal detection, and autonomous remediation.
Lead the evolution of disaster recovery, ransomware protection, and continuity strategies grounded in hard SLAs/SLOs and measurable business outcomes.
Key responsibilities:
Own the resiliency reference architecture for multi-cloud/hybrid (multi-region/zone, active-active/passive, blast-radius reduction) and define/enforce NFRs (availability, latency, durability, RTO/RPO).
Establish governance via design reviews, production gates, policy-as-code, scorecards, and automated controls integrated with CI/CD, IaC, and runtime platforms.
Standardize blue/green deployment architecture and engineer safe traffic shifting, health gates, progressive cutovers, rollback, and zero-downtime data migrations.
Lead an enterprise chaos engineering program (experiments, failure injection, game days) and feed outcomes back into architecture guardrails and SLO improvements.
Define production readiness standards (capacity/saturation, graceful degradation, retries/backoff, circuit breakers, rate limiting) and codify runbooks, dependency maps, and failover topologies validated via DR drills and rehearsals.
Drive observability and SRE practices: OpenTelemetry adoption, distributed tracing, SLIs/SLOs/SLAs, error budgets, and executive reliability dashboards.
Architect DR and cyber-resilience (immutable/air-gapped backups, PITR, ransomware-resistant segmentation, recovery validation) aligned to regulatory and audit needs.
Guide platform and data resiliency across Kubernetes/service mesh, replication/consensus, geo-distribution, and event streaming (DLQs, backpressure, reprocessing).
Enable reliable AI/GenAI systems and AI-driven operations (monitoring/guardrails, anomaly detection, predictive modeling, human-in-the-loop remediation, ops copilots).
Serve as principal resilience authority: mentor teams, lead councils/forums, and communicate tradeoffs clearly to executives and engineers.
What you bring to the table:
10+ years in SRE/Platform/Infrastructure/Systems Architecture with proven large-scale, production-critical experience across Azure, AWS, GCP, and on‑prem.
Multi‑region traffic management, global load balancing, DNS/BGP, TLS/mTLS, CDN/edge patterns.
Kubernetes ecosystems (AKS/EKS/GKE), service meshes (Istio/Linkerd), autoscaling strategies, readiness/liveness, topology constraints.
Observability stacks: OpenTelemetry, Prometheus/Grafana, Jaeger/Tempo, ELK/OpenSearch, commercial APM; correlation and topology modeling.
Data resilience: consensus/replication (Raft/Paxos), partitioning, PITR, snapshots, CDC; caches (Redis), databases (Aurora, Cosmos DB, Spanner).
IaC and automation: Terraform/Pulumi, GitOps (Argo CD/Flux), policy‑as‑code (OPA), CI/CD patterns (blue/green, canary, progressive delivery).
Chaos engineering, DR orchestration, and automated failover at enterprise scale.
For candidates located in Quebec, bilingualism is required because the role involves regular interaction with English-speaking colleagues across the country.
No Canadian work experience is required; however, candidates must be eligible to work in Canada.
AI/GenAI competencies:
Architecting reliable AI systems: model serving (Ray/SageMaker/Vertex), vector stores (Pinecone/FAISS/pgvector), retrieval pipelines, guardrails and safety.
MLOps: model monitoring (drift, performance, hallucination detection), feature pipelines, lineage/observability, prompt/content governance.
Applying AI to operations: causal detection, predictive resiliency, autonomous remediation frameworks.
Strong software engineering skills (Go/Python/TypeScript) and systems thinking; excellent communication (written, visual, verbal) and executive presence.
This is a new role within our growing team.