Sanlam Fintech is a newly established digital-first business within the Sanlam Group on a mission to democratize financial advice and solutions for everyone across the African continent. We exist to pioneer inclusive financial confidence, helping people build strong foundations to bridge the gap in generational wealth. Our culture is one of agility and constant deployment: we believe in learning fast, learning cheap and learning forward. Our aim is to provide a work environment where knowledge workers can accelerate the development of their ideas and bring innovation to market, while also offering a compelling career and development proposition that enables them to realize their dreams.

Position Overview

The Site Reliability Engineer (SRE) at Sanlam Fintech is responsible for ensuring the reliability, scalability, and performance of our cloud‑native infrastructure and services. This role bridges software engineering and operations, applying engineering principles to solve complex infrastructure challenges. The SRE will focus on building and maintaining resilient systems on AWS, implementing comprehensive observability solutions, and driving automation across the infrastructure lifecycle.

Operating in a DevOps environment, the SRE takes full ownership of the systems they build and operate, ensuring high availability and an optimal customer experience. They work closely with Software Engineers, Platform Engineers, and DevSecOps teams to deliver infrastructure solutions that support Sanlam Fintech's business objectives and uphold our commitment to operational excellence.

What will you do?

Reliability & Resilience
- Build highly available, fault‑tolerant systems on AWS
- Define SLIs, SLOs and error budgets to track and improve reliability
- Plan and implement disaster recovery strategies (RTO/RPO)
- Lead incident response and root cause analysis
- Build self‑healing systems with automated fixes for common failures
- Run chaos engineering tests to find and fix weaknesses

Observability & Monitoring
- Set up metrics, logs and traces for full system visibility
- Build dashboards and alerts for fast incident detection
- Implement distributed tracing to spot performance issues
- Set monitoring standards and maintain operational runbooks
- Publish regular uptime and operational metrics reports

Infrastructure Automation
- Write and maintain Infrastructure as Code using Terraform and CloudFormation
- Automate provisioning, configuration and deployments with DevOps/Platform teams
- Build and manage CI/CD pipelines using GitHub Actions
- Implement GitOps practices and self‑service automation to reduce manual work
- Design and optimise serverless solutions (Lambda, API Gateway, Step Functions)
- Implement cloud‑native patterns like event‑driven and microservices architectures
- Optimise cloud costs and evaluate new AWS services

Software Engineering & Development
- Build clean, well‑structured automation tools and scripts
- Apply Clean Architecture and Domain‑Driven Design to infrastructure code
- Improve internal tools to boost developer productivity
- Use AI tools (Claude, GPT) to automate routine tasks
- Work with cross‑functional teams using Jira, Confluence and JSM
- Participate in on‑call rotations and incident handoffs
- Mentor junior engineers in SRE practices
- Document decisions and procedures, and run blameless postmortems

Qualification and Experience
- 5+ years of experience in systems engineering, DevOps, or site reliability engineering roles
- 3+ years of hands‑on experience with AWS cloud services in production environments
- 2+ years of experience with Infrastructure as Code (Terraform and/or CloudFormation)
- Demonstrated experience in incident management and on‑call responsibilities
- Track record of implementing automation that reduced operational toil

Educational Background
- Bachelor's degree in Computer Science, Information Technology, Engineering or related field; or equivalent practical experience
- Relevant professional certifications are advantageous but not required

What will make you successful in this role?
- Strong expertise in AWS services including EC2, ECS, EKS, Lambda, API Gateway, Step Functions, S3, RDS, DynamoDB, CloudWatch and networking services such as VPC, Route53 and ALB/NLB
- Deep understanding of serverless architecture patterns and best practices
- Experience with Kubernetes cluster management, deployment strategies and service mesh concepts
- Knowledge of cloud security best practices including IAM, security groups and encryption

Infrastructure as Code & Automation
- Proficiency in Terraform for multi‑environment infrastructure management
- Experience with AWS CloudFormation for native AWS resource provisioning
- Strong scripting skills in Python for automation and tooling development
- Experience with configuration management tools and practices

Observability & Monitoring
- Expertise in Datadog, CloudWatch and OTEL for full‑stack observability including APM, infrastructure monitoring, log management and synthetic testing and monitoring
- Experience designing and implementing SLI/SLO frameworks
- Proficiency in creating effective dashboards, alerts and runbooks
- Understanding of distributed tracing and correlation across services

Development & Version Control
- Strong experience with GitHub for version control, code review and CI/CD workflows
- Understanding of Clean Architecture principles and their application to infrastructure code
- Familiarity with Domain‑Driven Design concepts for complex system design
- Experience building and maintaining CI/CD pipelines using GitHub Actions

Tools & Platforms
- Proficiency with Atlassian suite (Jira, Confluence) for project management and documentation
- Experience leveraging AI tools (Claude, GPT) for code generation, documentation and problem‑solving
- Familiarity with containerisation technologies (Docker) and orchestration platforms
- Experience with Linux system administration and troubleshooting

Nice To Have Skills
- Experience with additional cloud providers (Azure and GCP) for multi‑cloud strategies
- Knowledge of FinOps practices and cloud cost optimisation techniques
- Experience with chaos engineering tools (AWS Fault Injection Simulator, Gremlin and Chaos Monkey)
- Familiarity with service mesh technologies (Istio and AWS App Mesh)
- Experience with database reliability engineering and performance tuning
- Knowledge of compliance frameworks relevant to financial services (POPIA and PCI‑DSS)
- Contributions to open‑source projects or community involvement
- AWS certifications (Solutions Architect, DevOps Engineer or SysOps Administrator)
- Kubernetes certifications (CKA and CKAD)
- Experience with event‑driven architectures using AWS EventBridge, SNS, SQS or Kafka

Knowledge and Skills
- IT Data Analysis
- Software design and deployments
- Platform management and integration
- Business Requirements

Personal Attributes
- Organisational Savvy – Contributing through others
- Manages Complexity – Contributing through others
- Plans and Aligns – Contributing through others
- Optimises Work Processes – Contributing through others

Core Competencies
- Being Resilient – Contributing through others
- Collaborates – Contributing through others
- Cultivates Innovation – Contributing through others
- Customer Focus – Contributing through others
- Drives Results – Contributing through others

Senior Site Reliability Engineer
SANLAM LIMITED
Bellville
Published 6 days ago