Required Skills:
· Kubernetes
· Site Reliability Engineering (SRE)
· Dynatrace
Role Description:
· Responsible for developing and leading the company’s enterprise observability and reliability capability.
· The SRE and Observability Lead will collaborate across multiple teams to ensure comprehensive monitoring of all components in the environment.
· This role will establish Dynatrace as the system of record for platform health and apply SRE practices to improve availability, performance, and incident outcomes across applications, infrastructure, and integrations.
· Own enterprise observability using Dynatrace across cloud, on-prem, ERP, WMS, eCommerce, APIs, and integrations.
· Design service topology, dashboards, alerts, and health indicators that reflect business impact.
· Apply SRE principles (SLIs, SLOs, error budgets where appropriate) to reduce incidents and improve resilience.
· Accelerate incident detection and root-cause analysis; lead post-incident reviews focused on systemic fixes.
· Identify reliability, performance, and capacity risks before they impact the business.
· Define observability and SRE standards and enable teams to use them effectively.
· Must have 5 years of experience in infrastructure, platform, operations, or reliability engineering.
· Must demonstrate hands-on experience implementing and operating Dynatrace.
· Must have a strong understanding of distributed systems, hybrid cloud environments, and integrations.
· Must have practical experience with SRE or reliability engineering concepts.
· Must be comfortable operating in high-impact incident and production environments.