Site Reliability Engineer
On behalf of our client, Procom is seeking a Site Reliability Engineer for a full-time permanent position that can be fully remote across Canada.
Site Reliability Engineer - Job Details
- We are looking for a self-driven Site Reliability Engineer (SRE) who likes taking engineering-based approaches to solving Supportability problems, with a history of engineering excellence and experience in supporting cloud services. You will be responsible for optimizing and operating supportability improvements in a data-driven manner, working closely with Software Engineers to design and deliver experiences that adhere to service best practices, are highly available, reliable, and scalable, provide a great user experience, and meet our compliance policies and requirements.
- You’ll be focused on driving continuous improvements across the lifecycle of our services with automation in mind. You’ll also demonstrate a history of managing multiple priorities, deep technical and online services skills, a focus on using metrics and data, and a strong supportability-first mindset.
Site Reliability Engineer - Main Responsibilities
- Collaborating closely with several engineering teams on building and enhancing tooling and automation solutions to resolve customer issues faster and, when possible, avoid them altogether.
- Partnering with external platform teams that build the support tooling, and extending that tooling to meet any special requirements.
- Designing and implementing changes to service telemetry so that automation can consume it when the required data is not already available.
- Enhancing the customer-facing experience through proactive alerting based on utilization, trends, resource health, etc.
- Analyzing data and providing operational insights into customer experience to Design and Product teams, so that we can design features with Supportability in mind.
- Seeking out and fostering opportunities to improve existing planning, processes, and automation.
Site Reliability Engineer - Mandatory Skills
- Bachelor’s degree in Computer Science, Engineering, or related technical field.
- 5+ years of SRE or SWE experience running large-scale online/hybrid services in cloud environments (Azure), applying site reliability principles and/or demonstrating sensitivity to operational concerns. Automation-related experience is valued.
- Experience with any of C#/Java/Python as a primary language.
- Fluency in one or more automation languages such as PowerShell or Python.
- A deep understanding of, and familiarity with, Observability and MELT (Metrics, Events, Logs, and Traces) design and implementation patterns for large-scale distributed services is specifically desired.
- Experience in hypothesis-driven development and test-driven or behavior-driven development is desirable.
- Familiarity with Agile/Scrum/Lean methodologies.
- Strong problem-solving, troubleshooting, and analytical skills.
- Ability to deal with the ambiguity of working in a fast-paced, changing environment, and a willingness to change things to make them better.
- Intellectual curiosity and high EQ (emotional intelligence) will serve the successful candidate well.
- Great communicator with the ability to analyze and clearly articulate complex issues.
- Ability to influence the product architecture and roadmap so that customer-experienced supportability is always a key consideration as the product evolves.
Site Reliability Engineer - Assignment Location
- Fully Remote, across Canada
Site Reliability Engineer - Assignment Location - Length