Lead (Infra Monitoring SRE)

E-Solutions, Toronto, ON
30+ days ago
Job description

Job Title : Lead (Infra Monitoring SRE)

Location : Toronto, ON

Skills and Responsibilities :

  • Technical Leadership : Provide architectural and technical guidance and mentorship to SRE teams, foster skill development, build strong and capable SRE practices, and promote a culture of SRE learning and growth within the team.
  • Performance & Optimization : Manage the production infrastructure.
  • Proactively identify and resolve potential performance and scalability bottlenecks in our front-end and back-end systems and underlying infrastructure.
  • Infrastructure Management & Disaster Recovery : Understand infrastructure design, distributed systems architecture, and clustering and failover mechanisms.
  • Develop a disaster recovery plan and facilitate disaster recovery testing.
  • Familiarity with DiRT (Disaster Recovery Testing) exercises, Linux, F5 load balancers, SAN TrueCopy, and OCAC Data Guard is a plus.
  • Software Engineering for Operations : Develop and maintain application features and services that enhance the efficiency and reliability of our operations.
  • Python / Java, Bash programming knowledge is a must.
  • Systems and Application Monitoring / Observability : Develop and maintain comprehensive monitoring and observability solutions using Splunk, Dynatrace & Sheller.
  • Ensure detailed visibility into system performance and application health.
  • Reliability and Production Environment Management : Ensure the reliability and stability of our production environments.
  • Continuously assess and improve system reliability, identifying and addressing potential points of failure.
  • Incident Response and Troubleshooting : Respond to incidents, perform root cause analysis, and implement solutions to prevent recurrence.
  • Participate in post-incident reviews and contribute to blameless postmortems.
  • Automation and Scripting : Develop automation scripts and tools to reduce manual intervention and improve system reliability using Python, Bash, or Java.
  • Implement and improve CI / CD.
  • SLOs and SLA Management : Define, monitor, and maintain Service Level Objectives (SLOs) and Service Level Agreements (SLAs).