Site Reliability Engineer
On behalf of our client, Procom is seeking a Site Reliability Engineer for a full-time permanent position that can be fully remote across Canada.
Site Reliability Engineer - Job Details
- We are looking for a self-driven Site Reliability Engineer (SRE) who likes taking engineering-based approaches to solving Supportability problems, with a history of engineering excellence and experience supporting cloud services. You will be responsible for optimizing and operating supportability improvements in a data-driven manner, working closely with Software Engineers to design and deliver experiences that adhere to service best practices, are highly available, reliable, and scalable, provide a great user experience, and meet our compliance policies and requirements.
- You’ll be focused on driving continuous improvements across the lifecycle of our services with automation in mind. You’ll also demonstrate a history of managing multiple priorities, deep technical and online services skills, a focus on using metrics and data, and a strong supportability-first mindset.
Site Reliability Engineer - Main Responsibilities
- Collaborating closely with several engineering teams on building and enhancing tooling and automation solutions that resolve customer issues faster and avoid them altogether when possible.
- Partnering with the external platform teams that build the support tooling, and extending that tooling to meet any special requirements.
- Designing and implementing changes to service telemetry for the automation to consume when that telemetry is not already available.
- Enhancing the customer-facing experience through proactive alerting based on utilization, trends, resource health, etc.
- Analyzing data and providing operational insights into customer experience to Design and Product teams, so that we can design features with Supportability in mind.
- Engaging in and fostering opportunities to improve existing planning, processes, and automation.
Site Reliability Engineer - Mandatory Skills
- Bachelor’s degree in Computer Science, Engineering, or related technical field.
- 5+ years of SRE or SWE experience running large-scale online / hybrid services in cloud environments (Azure), applying site reliability principles and / or demonstrating sensitivity to operational concerns. Automation-related experience valued.
- Experience with any of C# / Java / Python as a primary language.
- Fluency in one or more automation languages such as PowerShell, Python, etc.
- Specifically desired is a deep understanding of and familiarity with Observability and MELT (Metrics, Events, Logs, and Traces) design and implementation patterns for large-scale distributed services.
- Experience in hypothesis-driven development, test-driven development / behavior-driven development desirable.
- Familiarity with Agile / Scrum / Lean Methodology.
- Strong problem-solving, troubleshooting, and analytical skills.
- Ability to deal with the ambiguity associated with working in a fast-paced and changing environment, and a willingness to change things to make them better.
- Intellectual curiosity and high EQ (emotional intelligence) will serve the successful candidate well.
- Great communicator with the ability to analyze and clearly articulate complex issues.
- Ability to influence the product architecture and roadmap so that customer-experienced supportability is always a key consideration when evolving the product.
Site Reliability Engineer - Assignment Location
Fully Remote, across Canada
Site Reliability Engineer - Assignment Location - Length
Permanent