Site Reliability Engineer

SAP
Montreal, Quebec, CA
Full-time

We help the world run better

At SAP, we enable you to bring out your best. Our company culture is focused on collaboration and a shared passion to help the world run better.

How? We focus every day on building the foundation for tomorrow and creating a workplace that embraces differences, values flexibility, and is aligned to our purpose-driven and future-focused work.

We offer a highly collaborative, caring team environment with a strong focus on learning and development, recognition for your individual contributions, and a variety of benefit options for you to choose from.

Montreal location.

PURPOSE AND OBJECTIVES

The Reliability Engineering organization provides a multitude of products and services related to operations and continuity of business delivery.

The Site Reliability Engineering teams make the SAP Business Technology Platform run better by providing 24x7 deep technical coverage for Incident Management (Outages and other incidents with major customer impact) applying SRE principles.

We share a Live Site First culture and care for the business continuity of our customers running mission critical applications in the Cloud.

We are looking for an engineer to join an already established SRE team for the SAP Business Technology Platform.

EXPECTATIONS AND TASKS

As a Site Reliability Engineer, you will have the opportunity to operate and support business critical Cloud services. As part of your daily job, you will proactively monitor the service behavior and identify areas for improvement.

You will participate in the development of tools for monitoring and troubleshooting cloud services built on the latest open-source and SAP technologies, following SRE principles.

Responsibilities

  • Act as technical expert during Live site incidents (downtimes of supported services in scope), investigate and solve incidents on a deep technical level.
  • Drive root cause analysis and follow-up improvements to prevent issues from reoccurring.
  • Perform in-depth troubleshooting and log analysis to identify and solve complex issues in accordance with internal and external SLAs.
  • Build software-based solutions to address improvements in service reliability and stability.
  • Enhance infrastructure and platform monitoring by gathering system metrics (4 Golden Signals) and implementing tools for recovery.
  • Integrate and collaborate closely with development teams and work with them on outputs from Postmortems and product improvements.
  • Learn new technologies and keep up to date with the latest development increments.
  • Create and maintain technical documentation.
  • Define, advocate for, and apply SRE best practices.
  • Participate in the on-call rotation (follow the sun approach) to react to major incidents. On-call has a special compensation package.

If you are interested in software engineering based on cutting-edge technology, you will find an inspiring and professional environment for your learning and growth.

You will be working in close collaboration with the development teams that build the services for which we share responsibility.

We emphasize teamwork and a trust-based working model. Collaboration with other teams in an international environment will be a regular part of your work.

EDUCATION AND QUALIFICATIONS / SKILLS AND COMPETENCIES

Required Skills and Competencies

  • Experience with Kubernetes and good understanding of container technologies.
  • Understanding of modern cloud architectures (experience with Cloud Platforms such as AWS, Azure, GCP is a plus).
  • Scripting skills, CI / CD (Concourse, GitHub Actions and ArgoCD are a plus) - enthusiasm for automation - make the computers do the work for you.
  • Working efficiently in emergency situations. Affinity to quickly analyze and solve problems in a global team setup.
  • Excellent team player, passionate about their work, self-motivated and driven.
  • Excellent communication skills - precise, based on facts.
  • Fluency in English, basic French.

Preferred Additional Skills and Competencies

  • Coding experience with Python, Bash, Go
  • CKA / CKAD / CKS certifications
  • Experience with Unix / Linux operating systems
  • Experience with modern monitoring, logging, and alerting tools (Grafana, Prometheus, Kibana, Loki, Splunk On-Call, Dynatrace)
  • Security best practices for application development and operations in a public Cloud Environment
  • Contribution to open-source projects

WORK EXPERIENCE

If you are interested in this position and would like to join our team, please apply even if you don’t meet all the qualifications listed in the job posting.

You may be offered a position according to your current working experience and expertise.

Posting in French

OBJECTIFS ET MISSIONS

L'organisation d'ingénierie de fiabilité fournit une multitude de produits et de services liés aux opérations et à la continuité des services.

Les équipes d'ingénierie de fiabilité des sites améliorent le fonctionnement de la plateforme technologique SAP en fournissant une couverture technique approfondie 24h / 24 et 7j / 7 pour la gestion des incidents (pannes et autres incidents ayant un impact majeur sur les clients) en appliquant les principes de l'ingénierie de fiabilité des sites.

Nous partageons une culture axée sur la disponibilité des services avant tout et nous nous soucions de la continuité des activités de nos clients qui exécutent des applications critiques dans le Cloud.

Nous recherchons un ingénieur pour rejoindre une équipe d'ingénierie de fiabilité des sites déjà établie pour la plateforme technologique SAP (SAP Business Technology Platform).

ATTENTES ET TÂCHES

En tant qu'ingénieur en fiabilité des sites, vous aurez l'opportunité d'exploiter et de prendre en charge des services critiques pour SAP et ses clients.

Dans le cadre de votre travail quotidien, vous surveillerez de manière proactive le comportement du service et identifierez les opportunités d'améliorations.

Vous participerez au développement d'outils de surveillance et de dépannage des services Cloud basés sur les dernières technologies open source et technologies SAP, en suivant les principes de l'ingénierie de fiabilité des sites.

Responsabilités

  • Agir en tant qu'expert technique lors d’incidents de nos services en production, investiguer et résoudre les incidents à un niveau technique approfondi.
  • Mener des analyses des causes sources (RCA) et faire le suivi sur les possibilités d’améliorations afin de prévenir que les problèmes se reproduisent.
  • Effectuer des investigations approfondies et des analyses de journaux d’événements pour identifier et résoudre des problèmes complexes conformément aux promesses de niveau de service (SLA).
  • Concevoir des solutions logicielles pour améliorer la fiabilité et la stabilité des services.
  • Améliorer la surveillance de l'infrastructure et de la plateforme en amassant des métriques système (4 signaux en or) et implanter des outils pour aider à la récupération des services.
  • Intégrer et collaborer étroitement avec les équipes de développement et travailler avec elles pour implémenter les améliorations identifiées lors des post-mortem.
  • Rester à l’affut des nouvelles technologies et se tenir à jour techniquement.
  • Créer et maintenir une documentation technique.
  • Définir, promouvoir et appliquer les meilleures pratiques de l'ingénierie de fiabilité des sites.
  • Être sur appel (rotation) afin de réagir aux alertes et prévenir les incidents majeurs. Le temps sur appel bénéficie d'un régime de compensation spécial.

Nous pratiquons l’approche suivi du soleil

ÉDUCATION ET QUALIFICATIONS / COMPÉTENCES ET APTITUDES

Compétences et aptitudes requises

  • Baccalauréat en informatique ou dans un domaine technique connexe.
  • Expérience avec Kubernetes et bonne compréhension des technologies de conteneurisation.
  • Compréhension des architectures cloud modernes (une expérience avec des plateformes cloud telles que AWS, Azure, GCP est un plus).
  • Compétences en scripting, CI / CD (Concourse, GitHub Actions et ArgoCD sont un plus) - enthousiasme pour l'automatisation - faire en sorte que les ordinateurs effectuent le travail à votre place.
  • Travailler efficacement dans des situations d'urgence. Affinité pour analyser et résoudre rapidement les problèmes au sein d'une équipe mondiale.
  • Excellent esprit d'équipe, passionné par son travail, motivé et dynamique.
  • Excellentes compétences en communication, précises et basées sur des faits.
  • Maîtrise de l'anglais, français de base.

Compétences et aptitudes supplémentaires donnant un avantage

  • Expérience de programmation avec Python, Bash, Go.
  • Certifications CKA / CKAD / CKS.
  • Expérience avec les systèmes d'exploitation Unix / Linux.
  • Expérience avec les outils modernes de surveillance, de journalisation et d'alerte (Grafana, Prometheus, Kibana, Loki, Splunk On-Call, Dynatrace).
  • Meilleures pratiques de sécurité pour le développement et l’opération d'applications cloud.
  • Participation à des projets open-source.

EXPÉRIENCE PROFESSIONNELLE

Si vous êtes intéressé par ce poste et souhaitez rejoindre notre équipe, veuillez postuler même si vous ne répondez pas à toutes les qualifications mentionnées dans l'offre d'emploi.

Vous pourriez vous voir offrir un poste en fonction de votre expérience professionnelle actuelle et de votre expertise.

Bring out your best

SAP innovations help more than four hundred thousand customers worldwide work together more efficiently and use business insight more effectively.

Originally known for leadership in enterprise resource planning (ERP) software, SAP has evolved to become a market leader in end-to-end business application software and related services for database, analytics, intelligent technologies, and experience management.

As a cloud company with two hundred million users and more than one hundred thousand employees worldwide, we are purpose-driven and future-focused, with a highly collaborative team ethic and commitment to personal development.

Whether connecting global industries, people, or platforms, we help ensure every challenge gets the solution it deserves.

At SAP, you can bring out your best.

We win with inclusion

SAP’s culture of inclusion, focus on health and well-being, and flexible working models help ensure that everyone regardless of background feels included and can run at their best.

At SAP, we believe we are made stronger by the unique capabilities and qualities that each person brings to our company, and we invest in our employees to inspire confidence and help everyone realize their full potential.

We ultimately believe in unleashing all talent and creating a better and more equitable world.

SAP is proud to be an equal opportunity workplace and is an affirmative action employer. We are committed to the values of Equal Employment Opportunity and provide accessibility accommodations to applicants with physical and / or mental disabilities.

If you are interested in applying for employment with SAP and are in need of accommodation or special assistance to navigate our website or to complete your application, please send an e-mail with your request to Recruiting Operations Team : [email protected].

For SAP employees: Only permanent roles are eligible for the SAP Employee Referral Program, according to the eligibility rules set in the SAP Referral Policy.

Specific conditions may apply for roles in Vocational Training.

EOE AA M / F / Vet / Disability :

Qualified applicants will receive consideration for employment without regard to their age, race, religion, national origin, ethnicity, gender (including pregnancy, childbirth, et al), sexual orientation, gender identity or expression, protected veteran status, or disability.

SAP believes the value of pay transparency contributes towards an honest and supportive culture and is a significant step toward demonstrating SAP’s commitment to pay equity.

SAP provides the annualized compensation range inclusive of base salary and variable incentive target for the career level applicable to the posted role.

The targeted combined range for this position is 71,000 - 150,000 CAD. The actual amount to be offered to the successful candidate will be within that range, dependent upon the key aspects of each case, which may include education, skills, experience, scope of the role, location, etc., as determined through the selection process.

Any SAP variable incentive includes a targeted dollar amount, and any actual payout amount is dependent on company and personal performance.

Please reference this link for a summary of SAP benefits and eligibility requirements: www.SAPNorthAmericaBenefits.com

Requisition ID : 399416

Work Area : Software-Development Operations

Expected Travel : 0 - 10%

Career Status : Professional

Employment Type : Regular Full Time

Additional Locations :

LI-Hybrid
