Electrum is a next-generation payment software technology company. Since 2012, we've delivered trusted, enterprise-grade, cloud-native software to optimise financial transaction processing. Our deep expertise has established us as a respected partner in high-volume, low-value payment schemes, enabling clients to deliver services to millions of South Africans daily. At Electrum, we are grounded in impact, creating together, making it safe, and backed by empowered, strong teams.

The Role

As a Core Reliability Engineer, you will act as a central software team enabler. You will define the standards, observability tooling, and automation frameworks that enable our stream-aligned product teams to own their service health independently. Reliability in our FinTech niche involves processing high-volume, widely impacting financial transactions where a dropped message has real-world consequences. Your goal is to ensure reliable software is easy to build and that we know about failures before our clients do.

Responsibilities

Implement the Observability Ladder: Guide teams from basic monitoring to high-signal metric tracking. Work with product teams to define SLAs, SLIs, and SLOs, and build dashboards that track specific error budgets.
Empower Product Teams: Build frameworks and deployment tooling (e.g., CI/CD, internal tooling integrations) that allow teams to make data-driven decisions on deployment safety and automate rollbacks when error budgets are depleted.
Champion Reliability: Drive a blameless post-mortem culture focused on actionable takeaways, system improvements, and measurable metrics (MTBF, MTTR).
Standardised Alerting & On-Call: Continuously improve our company-wide alerting and on-call frameworks to reduce alert fatigue and ensure that, when a pager goes off, the alert is highly actionable and symptom-based.
Disaster Recovery: Drive the evolution of DR strategies from manual processes into fully automated runbooks-as-code. Build the tooling that allows teams to prove and improve their service's recoverability through autonomous, evidence-based testing.
Eliminate Toil: Develop systems, automations, and tooling (e.g., pre- and post-deployment verification) to realise a hands-off reliability vision, primarily via Python or a similar language.
Reliability-as-Code: Lead the drive to manage our entire reliability suite through Infrastructure as Code. Use Terraform to architect, deploy, and configure our observability stack (including ELK, Grafana, Loki, Prometheus, and Tracing) to ensure our monitoring environment is as reliable as our production systems.

Requirements

Bachelor's degree in Computer Science, Information Technology, or a related field.
2+ years of experience in Software Engineering, SRE, DevOps, or Platform Engineering.
Strong Coding Fluency: Proficiency in Python (or similar) with the ability to develop automation scripts.
Cloud and IaC: Hands-on experience with AWS and a solid understanding of Infrastructure as Code (Terraform or CloudFormation).
Deep Observability Knowledge: Experience with monitoring tools such as DataDog, Prometheus, and the ELK stack. Understanding of SRE concepts such as Golden Signals, high-cardinality data handling, and error-rate calculations.
Systems Thinking: Ability to design for scale and resilience, including graceful failure, circuit breaking, connection pooling, and multi-AZ deployments.

Benefits

People-first culture that encourages learning, collaboration, and autonomy.
Career growth through exposure to products used by millions.
Transparent communication of strategy, finances, and salaries.
Flexible hours in an office-first environment with a fully stocked kitchen and daily catered lunch.
Generous leave starting at 20 days per year.
Regular team activities such as hikes, getaways, and dinners.
Software Engineer - Reliability/SRE
ELECTRUM
Cape Town, Cape Town
Published 1 day ago