Job Title : Lead (Infra Monitoring SRE)
Location : Toronto, ON
Skills and Responsibilities :
- Technical Leadership : Provide architectural and technical guidance and mentorship to SRE teams, fostering skill development, building strong and capable SRE practices, and promoting a culture of learning and growth within the team.
- Performance & Optimization : Manage the production infrastructure.
- Proactively identify and resolve potential performance and scalability bottlenecks in our front-end and back-end systems and underlying infrastructure.
- Infrastructure Management & Disaster Recovery : Understand infrastructure design, distributed systems architecture, and clustering and failover mechanisms.
- Develop a disaster recovery plan and facilitate disaster recovery testing.
- Familiarity with DiRT (Disaster Recovery Testing) exercises, Linux, F5 load balancers, SAN TrueCopy, and Oracle Data Guard is a plus.
- Software Engineering for Operations : Develop and maintain application features and services that enhance the efficiency and reliability of our operations.
- Knowledge of Python / Java and Bash programming is a must.
- Systems and Application Monitoring / Observability : Develop and maintain comprehensive monitoring and observability solutions using Splunk, Dynatrace & Sheller.
- Ensure detailed visibility into system performance and application health.
- Reliability and Production Environment Management : Ensure the reliability and stability of our production environments.
- Continuously assess and improve system reliability, identifying and addressing potential points of failure.
- Incident Response and Troubleshooting : Respond to incidents, perform root cause analysis, and implement solutions to prevent recurrence.
- Participate in post-incident reviews and contribute to blameless postmortems.
- Automation and Scripting : Develop automation scripts and tools to reduce manual intervention and improve system reliability using Python, Bash, or Java.
- Implement and improve CI / CD.
- SLOs and SLA Management : Define, monitor, and maintain Service Level Objectives (SLOs) and Service Level Agreement