Job Description
Site Reliability Engineer – APM, Dynatrace, Observability
Duration: 12 months
Location: Toronto
Hybrid: 2 days in office a week
SRE Lead
Deep application and system-level knowledge across complex end-to-end environments, including tightly integrated on-prem and cloud-native services, supporting large-scale, multi-tier transaction flows
Prior hands-on experience with APM and observability platforms, including Dynatrace or comparable enterprise observability tools, with the ability to instrument, analyze, and troubleshoot complex distributed applications
Proven deep troubleshooting experience resolving issues across multilayer, end-to-end (E2E) environments, spanning application, infrastructure, network, and platform layers across on-prem and cloud services
The person will drive and execute the SRE WCCS roadmap
Hands-on role from day 1
Observability experience expectations: please see the description for SRE Observability SME below
Deep knowledge and experience in implementing SRE practices and guiding complex SRE implementations across the industry
Would provide:
o Assessments of current capability to help identify gaps and contribute to the SRE WCCS roadmap
o Ability to navigate multi-team SRE and IT Ops organizations to drive results
o Creative workarounds and solutions
SRE Observability SME
Hands-on role from day 1
Day 1 Dynatrace expertise, i.e.:
o DQL
o Gen3 dashboards
o Traces on Grail
o ActiveGate plugins
o SRG Workflow development
o Biz Events
Prior hands-on experience with APM and observability platforms, including Dynatrace or comparable enterprise observability tools, with the ability to instrument, analyze, and troubleshoot complex distributed applications
Deep troubleshooting expertise leveraging observability signals (metrics, events, logs, and traces) to identify root causes and resolve failures across multilayer E2E environments
Deep background in observability fundamentals (MELT: metrics, events, logs, traces)
Expert-level dashboard design (including related UI/UX design)
Experienced in troubleshooting performance and other non-functional issues
Familiar with SRE concepts as outlined in the Google SRE book, workbook, etc.
Expertise in AWS observability: CloudWatch (CW), Application Signals, metrics, logs, traces, Lambda, API Gateway
Able to come up with creative ways to monitor and observe systems like IBM DataPower where sufficient observability isn't present
Development with Python, AWS Lambda, ECS, Azure Functions
Understands the fundamentals of how AI-based systems are built and monitored
Background in or knowledge of OpenTelemetry (OTel)
Experience in the financial services area or equivalent, i.e. very complex end-to-end transactions (e.g. 50 systems working together to fulfil one customer request)
Platform Engineering experience
Shipping platform capabilities (e.g., self-service onboarding pipeline, policy-as-code, golden signals-as-code, standardized instrumentation libraries).
Depth of knowledge for the role
Programming depth: requires strong programming skills in Python and Node.js and experience building backend integration components.
Looking for
Practical observability experience with multi-system integration
In-depth Observability
Requirements
60-70