Responsibilities

Own reliability, availability, scalability, and security of production systems
Design and operate highly available, fault‑tolerant, multi‑region cloud architectures
Define and manage SLOs, SLIs, SLAs, and error budgets for critical services
Lead high‑severity incidents and drive effective post‑incident reviews
Improve MTTD and MTTR through automation, tooling, and runbooks
Operate and evolve Kubernetes (EKS) platforms and multi‑tenant deployments
Work with Infrastructure‑as‑Code (Terraform, CloudFormation, Pulumi) at scale
Build and improve CI/CD pipelines and deployment safeguards
Design and maintain observability (metrics, logs, traces, alerting)
Drive capacity planning, performance optimisation, and cloud cost efficiency
Partner with Security & Compliance on SOC 2, ISO 27001, GDPR, and DORA controls
Mentor SREs and influence reliability‑first engineering practices across teams

Qualifications

6+ years in SRE, DevOps, or cloud infrastructure roles (2+ years in a senior/lead capacity)
Strong AWS experience (EKS, RDS/Aurora, S3, MSK, VPC, IAM, ALB/NLB)
Deep Kubernetes operational expertise
Proven incident management and post‑mortem leadership
Solid experience with IaC, CI/CD, and automation
Strong scripting or programming skills (Python, Go, Bash)
Hands‑on observability experience (Prometheus, Grafana, Datadog, ELK, OpenTelemetry)
Excellent communication and cross‑team collaboration skills
Senior Site Reliability Engineer
YELLOSA
Johannesburg, Johannesburg
Published 4 days ago