No longer accepting applications

Principal Site Reliability Engineer

Autodesk

Rivière-Des-Prairies-Pointe-Aux-Trembles, Canada, CA

3 days ago

Job type

Full-time

Job description

Job Requisition ID # 25WD85835

We are seeking a highly motivated and experienced Principal Site Reliability Engineer (SRE) to manage critical cloud infrastructure and site reliability operations for Autodesk's global Product Access journey. This pivotal role focuses on ensuring the highest reliability, availability, and performance of our AWS-hosted cloud infrastructure.

Reporting to the Engineering Manager, you will lead the design and development of resilient, scalable architecture and innovative solutions for the platform. You will independently manage and deliver end-to-end solutions while engaging with key stakeholders and partners.

Responsibilities

- Lead architecture, solution design, development, and maintenance of cloud infrastructure for a microservices architecture.
- Independently manage requirement analysis, solution design, implementation, and release planning.
- Ensure strict adherence to trust and security compliance guidelines and standards.
- Streamline CI/CD processes, improve system reliability, and ensure infrastructure scalability and security.
- Automate infrastructure deployment, scaling, and management using modern DevOps tools and practices.
- Implement and maintain configuration management and infrastructure as code (IaC) using Terraform.
- Lead Disaster Recovery (DR) strategies, failover exercises, gamedays, and periodic maintenance activities.
- Contribute to the remediation of critical vulnerabilities (CVEs).
- Promote and document security and best practices across all pillars of DevOps/SRE throughout system design.
- Provide real-time operational support and collaborate across functions to resolve system, infrastructure, and CI/CD issues.
- Participate in on-call rotations, providing critical 24x7 support for production systems.

Minimum Qualifications

- Bachelor’s degree or higher in Computer Science, Engineering, or a related field.
- 8+ years of progressive experience in Site Reliability Engineering, DevOps, or a similar field.
- Proficiency in managing AWS resources and an understanding of networking and security protocols.
- Expertise in infrastructure as code (IaC) and cloud automation tools such as Terraform, Serverless, and CloudFormation.
- Expertise in defining and building CI/CD processes with tools like Jenkins, GitHub, and Artifactory.
- Experience with container-based technologies like Docker and AWS ECS.
- Experience with monitoring and logging tools such as Dynatrace, Grafana, Datadog, ELK Stack, and CloudWatch.
- Experience in Linux systems administration, scripting, and troubleshooting in a production environment.
- Proficiency in programming and scripting in UNIX environments, with languages such as Python, Go, Bash, Groovy, and Node.js.
- Technology stack: Java/Spring Boot, AWS (ECS Fargate, ElastiCache, Lambda, Kinesis, DynamoDB, VPC, IAM policies, API Gateway, NLB/ALB, Route 53, CloudWatch, Kibana, OpenSearch), Kafka, Go, Node.js, Groovy, Python, Jenkins, GitHub, Jira, ServiceNow, and Splunk.

Preferred Qualifications

- Knowledge of applying AI and ML solutions to engineering processes and/or DevOps automation.
- Knowledge of standardized observability frameworks such as OpenTelemetry.
- Relevant certifications (e.g., AWS Certified DevOps Engineer, AWS Site Reliability Engineer).
- Broad knowledge of AWS, Redis, server programming, databases, and cloud architectures.
- Broad knowledge of data streaming pipelines like Kinesis, Firehose, and Kafka.
- Knowledge of core Java and Spring Boot concepts and JVM optimization.
- Knowledge of build tools, e.g., Gradle.
- Strong interpersonal and communication skills to collaborate effectively in an Agile/Scrum-oriented environment.
- Self-directed team player and independent contributor, demonstrating accountability and end-to-end ownership.