Description

Own the operational reliability, scalability, and performance of Ignition’s AI platform infrastructure and production AI applications, ensuring uninterrupted delivery of AI-enabled services that support business operations, digital transformation, and employee enablement across the organisation.

Govern and optimise multi-vendor AI and LLM integrations, including API reliability, usage management, cost optimisation, fallback strategies, and platform resilience, enabling sustainable and cost-effective AI capability across the business. Rates maintained within acceptable thresholds; operational spend managed within approved parameters.

Design, implement, and continuously improve secure CI/CD pipelines, infrastructure-as-code standards, and deployment processes that improve release quality, reduce operational risk, and strengthen platform scalability and deployment efficiency.

Maintain accountability for platform observability, telemetry, monitoring, alerting, and incident management processes to ensure proactive issue detection, operational transparency, and data-driven platform optimisation across the AI ecosystem.

Evaluate, recommend, and operationalise emerging AI tooling, integrations, and infrastructure enhancements that improve platform capability, operational efficiency, automation maturity, and long-term business enablement.

Maintain technical governance through accurate runbooks, infrastructure documentation, operational standards, and knowledge management practices that improve operational continuity, reduce dependency risk, and strengthen support capability across the platform environment.

Partner with cross-functional technical and operational stakeholders to improve platform stability, AI service delivery, operational efficiency, and adoption of AI-enabled solutions across the organisation, contributing to broader digital transformation objectives.

Requirements

Matric / Grade 12 (Required)
Tertiary qualification in Computer Science, Information Technology, Software Engineering, Cloud Infrastructure, DevOps, or related technical field (Advantageous)

Professional Certifications

AWS, Azure, or Google Cloud certifications (Advantageous)
Terraform, Docker, Kubernetes, or CI/CD certifications (Advantageous)
Monitoring and observability certifications (Grafana, Datadog, OTEL) (Advantageous)
AI/LLM platform certifications or equivalent exposure (Advantageous)

Experience (Advantageous)

3–5 years of experience in platform engineering, DevOps, infrastructure engineering, systems administration, or cloud operations
Experience working with production infrastructure, CI/CD pipelines, monitoring frameworks, APIs, containers, and infrastructure-as-code environments
Exposure to AI tooling, LLM APIs, automation platforms, or agentic frameworks in either production or personal project environments
Experience troubleshooting operational incidents and managing production platform reliability
Exposure to fast-paced technology, AI, software engineering, or digital product environments (Advantageous)

Skills & Capabilities

Advanced infrastructure and platform engineering capability within cloud and AI-enabled environments, including CI/CD, infrastructure-as-code, deployment automation, and operational reliability management
Strong operational problem-solving capability, with the ability to diagnose root causes, evaluate technical risk, and implement sustainable platform improvements independently
Proficiency in cloud infrastructure, containers, monitoring platforms, observability tooling, telemetry pipelines, and AI integration environments to deliver scalable and resilient platform operations
Strong analytical and operational insight capability, including interpretation of telemetry, platform metrics, and AI usage data to drive optimisation and operational decision-making
Effective stakeholder communication and technical documentation capability, including the ability to translate complex technical concepts into clear operational guidance and recommendations
AI-enabled productivity and automation capability, with demonstrated ability to leverage AI tools and frameworks to improve operational efficiency, scalability, and delivery quality
Adaptability and learning agility within rapidly evolving AI and technology environments, with the ability to manage competing operational priorities effectively
Strong governance and operational control capability, including maintenance of platform standards, deployment discipline, documentation quality, and operational continuity practices

Agentic Platform Engineer
IGNITION GROUP
Umhlanga Rocks
Published 4 days ago