How do you evaluate an LLM?
LLM evaluation replaces "it looks good" with a repeatable measurement. Because outputs are free-form text rather than labelled predictions, evaluation combines several methods rather than a single accuracy score.
Four approaches are common: a golden dataset of inputs with known-good answers to test against; rubric scoring, where each output is rated on defined dimensions such as correctness, faithfulness, and tone; LLM-as-judge, where a strong model scores another model's outputs at scale; and human review for the cases automation can't reliably judge. Specific failure modes get their own checks: hallucination detection for faithfulness, and safety filters for harmful content.
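As a rough illustration of how a golden dataset and a rubric fit together, here is a minimal sketch. Everything in it is an assumption made for the example: `GoldenCase`, `RUBRIC`, `evaluate`, and `toy_model` are hypothetical names, and the two scorers are crude string checks standing in for what would normally be an LLM judge or a human rating.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class GoldenCase:
    prompt: str    # input sent to the system under test
    expected: str  # known-good answer


# Each rubric dimension is a scorer returning a value in [0.0, 1.0].
Scorer = Callable[[str, GoldenCase], float]


def correctness(output: str, case: GoldenCase) -> float:
    # Stand-in check: does the known-good answer appear in the output?
    # A real setup might call an LLM judge here instead.
    return 1.0 if case.expected.lower() in output.lower() else 0.0


def concision(output: str, case: GoldenCase) -> float:
    # Stand-in for a style dimension: pass if the answer is short.
    return 1.0 if len(output.split()) <= 30 else 0.0


RUBRIC: dict[str, Scorer] = {"correctness": correctness, "concision": concision}


def evaluate(model: Callable[[str], str], cases: list[GoldenCase],
             rubric: dict[str, Scorer] = RUBRIC) -> dict[str, float]:
    """Run every golden case through the model and average each dimension."""
    scores: dict[str, list[float]] = {name: [] for name in rubric}
    for case in cases:
        output = model(case.prompt)
        for name, scorer in rubric.items():
            scores[name].append(scorer(output, case))
    return {name: mean(values) for name, values in scores.items()}


def toy_model(prompt: str) -> str:
    # Hypothetical system under test; it gets the second case wrong on purpose.
    answers = {
        "Capital of France?": "The capital of France is Paris.",
        "2 + 2?": "5",
    }
    return answers.get(prompt, "")


GOLDEN = [GoldenCase("Capital of France?", "Paris"), GoldenCase("2 + 2?", "4")]
report = evaluate(toy_model, GOLDEN)
print(report)  # {'correctness': 0.5, 'concision': 1.0}
```

Keeping each dimension as a separate function means a string check can later be swapped for a judge model without touching the loop that produces the per-dimension averages.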
The goal is to turn quality into numbers you can track over time, so you can tell whether a prompt change, model swap, or fine-tune actually made the system better or worse.
Offline vs online evaluation
Evaluation happens at two points in an LLM's lifecycle, and you need both.
| | Offline evaluation | Online evaluation |
|---|---|---|
| When | Before deployment | On live production traffic |
| Against | Golden datasets, test sets | Real user interactions |
| Catches | Regressions before release | Drift, edge cases, real-world quality |
| Method | Rubric + LLM-as-judge in CI | Sampling, logging, quality scoring |
Offline evaluation is your gate — it blocks a bad change from shipping. Online evaluation is your monitor — it catches problems that only appear with real users. A mature LLMOps setup runs offline checks in CI and continuously samples production for online scoring.
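The gate and the monitor can each be sketched in a few lines. This is an illustrative sketch only: `offline_gate`, `sample_for_review`, the 0.02 tolerance, and the 5% sampling rate are assumptions chosen for the example, not values taken from any particular tool.

```python
import hashlib


def offline_gate(baseline: dict[str, float], candidate: dict[str, float],
                 tolerance: float = 0.02) -> list[str]:
    """Return the rubric dimensions where the candidate regressed.

    An empty list means the change may ship; a CI job could fail the
    build whenever the list is non-empty.
    """
    return [
        dim for dim, base in baseline.items()
        if candidate.get(dim, 0.0) < base - tolerance
    ]


def sample_for_review(request_id: str, rate: float = 0.05) -> bool:
    """Deterministically pick roughly `rate` of production requests for scoring.

    Hashing the request ID (rather than calling random()) means the same
    request always gets the same decision, which keeps logs reproducible.
    """
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


baseline = {"correctness": 0.90, "tone": 0.80}
regressions = offline_gate(baseline, {"correctness": 0.85, "tone": 0.80})
print(regressions)  # ['correctness']
```

In this sketch the offline function runs once per change against stored baseline scores, while the sampling function runs on every live request and only decides which interactions are sent on for quality scoring.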
Why does evaluation matter in production?
Without evaluation, an LLM feature degrades silently. A prompt tweak, a model update, or a shift in user behaviour can quietly lower quality, and no one knows until customers complain. Evaluation makes that change visible and turns "ship and hope" into "measure and release."
It is also what makes cost optimisation safe. Prodinit distilled a GPT-4.1 teacher into a cheaper fine-tuned GPT-4o-mini student for a voice AI platform, and evaluation gates at every rollout stage — 10% → 25% → 50% → 75% → 90% — are precisely what allowed a 70% cost reduction with no measurable quality regression.
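A staged rollout with an evaluation gate at each step might look roughly like the sketch below. It is not Prodinit's implementation: the stage percentages come from the text above, but `staged_rollout`, `measure_quality`, the baseline score, and the tolerance are hypothetical and exist only to show the control flow.

```python
from typing import Callable

# Traffic percentages for each rollout stage, as listed in the text.
STAGES = [10, 25, 50, 75, 90]


def staged_rollout(measure_quality: Callable[[int], float],
                   baseline: float, tolerance: float = 0.01) -> int:
    """Advance through the stages while quality holds.

    `measure_quality(pct)` is assumed to return the student model's
    quality score while it serves `pct` percent of traffic. Returns the
    highest percentage that passed its gate, or 0 if the first stage failed.
    """
    reached = 0
    for pct in STAGES:
        score = measure_quality(pct)
        if score < baseline - tolerance:
            break  # gate failed: stop here and keep the previous stage
        reached = pct
    return reached


# A student that matches the teacher at every stage reaches the final stage.
print(staged_rollout(lambda pct: 0.90, baseline=0.90))  # 90

# A student that degrades at 50% of traffic is held at 25%.
print(staged_rollout(lambda pct: 0.90 if pct < 50 else 0.80, baseline=0.90))  # 25
```

The point of the design is that the cheaper model never takes on more traffic than it has already been measured on, so a regression is caught while most users are still served by the original model.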