Job Description

Key Responsibilities:
- Incident Management and Reliability: Lead the incident management process, ensuring high availability and performance of the applications. Develop and implement SRE practices to improve system reliability and resilience.
- Monitoring and Observability: Utilize Dynatrace, Splunk, and Grafana to monitor system health, detect anomalies, and provide actionable insights for performance optimization.
- Root Cause Analysis: Conduct thorough root cause analysis of incidents and outages, developing long-term solutions to prevent recurrence.
- DevOps Practices: Collaborate with development and operations teams to streamline CI/CD pipelines, automate workflows, and implement infrastructure as code (IaC) for efficient service deployment and management.
- Networking Expertise: Provide expertise in networking technologies (Cisco, Arista, AVI, etc.), ensuring robust network infrastructure design, implementation, and troubleshooting. Utilize tools like Wireshark for in-depth network analysis and debugging.
- Collaboration and Leadership: Work closely with cross-functional teams to share knowledge, mentor junior engineers, and lead by example in adopting best practices in SRE, DevOps, and networking.
- Innovation and Continuous Improvement: Stay abreast of industry trends and new technologies, advocating for and implementing innovative solutions to enhance system reliability and performance.

Qualifications:
- Bachelor’s or Master’s degree in Computer Science, Information Technology, or related field.
- 10+ years of experience in an SRE/DevOps role, with a proven track record in managing high-availability systems.
- Strong expertise in monitoring and observability tools (Dynatrace, Splunk, Grafana).
- Proficient in network debugging and analysis tools, including Wireshark.
- Solid understanding of on-prem and hybrid cloud infrastructure (VMware, Linux, Windows, Azure) and container orchestration (Kubernetes, Docker).
- Certifications in relevant technologies (Dynatrace, Splunk) are a plus.
- Excellent communication and leadership skills, capable of leading incident response initiatives and collaborating effectively across teams.
- Excellent problem-solving skills, with the ability to conduct comprehensive root cause analysis and troubleshooting.