Overview

We’re looking for AI QA trainers who specialize in model evaluation, LLM safety, prompt robustness, data quality assurance, multilingual and domain-specific testing, grounding verification, and compliance readiness checks. You’ll evaluate advanced language models on tasks such as hallucination detection, factual consistency, prompt-injection and jailbreak resistance, bias/fairness audits, chain-of-reasoning reliability, tool-use correctness, retrieval-augmentation fidelity, and end-to-end workflow validation. You will document every failure mode to raise the bar for quality.

On a typical day, you will converse with the model on real-world scenarios and evaluation prompts, verify factual accuracy and logical soundness, design and run test plans and regression suites, build clear rubrics and pass/fail criteria, capture reproducible error traces with root-cause hypotheses, and suggest improvements to prompt engineering, guardrails, and evaluation metrics (e.g., precision/recall, faithfulness, toxicity, and latency SLOs). You’ll also partner on adversarial red-teaming, automation (Python/SQL), and dashboarding to track quality deltas over time.

We offer a pay range of $6 to $65 per hour, with the exact rate determined after evaluating your experience, expertise, and geographic location. As a contractor, you’ll supply a secure computer and high-speed internet; company-sponsored benefits such as health insurance and PTO do not apply.

Employment type: Contract
Workplace type: Remote
Seniority level: Mid-Senior Level

Responsibilities

Converse with the model on real-world scenarios and evaluation prompts to verify factual accuracy and logical soundness.
Design and run test plans and regression suites; build rubrics and pass/fail criteria.
Capture reproducible error traces with root-cause hypotheses; suggest improvements to prompts, guardrails, and evaluation metrics.
Collaborate on adversarial red-teaming, automation (Python/SQL), and dashboarding to track quality deltas over time.
Document failure modes to inform improvements in prompt engineering and evaluation tooling.

Qualifications

A Bachelor’s, Master’s, or PhD in computer science, data science, computational linguistics, statistics, or a related field is ideal.
Signals of fit: experience shipping QA for ML/AI systems, safety or red-team experience, familiarity with test automation frameworks (e.g., PyTest), and hands-on work with LLM eval tooling (e.g., OpenAI Evals, RAG evaluators, W&B).
Skills that stand out: evaluation rubric design, adversarial testing/red-teaming, regression testing at scale, bias/fairness auditing, grounding verification, prompt and system-prompt engineering, test automation (Python/SQL), and clear, metacognitive communication.
AI QA Trainer - LLM Evaluation - Freelance Project
INVISIBLE EXPERT MARKETPLACE
Remote
Published 10 days ago