Key Takeaways
- Most LLM eval setups fail for three structural reasons: evaluating on metrics that don't reflect production failure modes, using golden datasets that have silently rotted, and running evals on a separate schedule from deployments
- The four-layer eval stack — unit, reference, rubric, and behavioral — catches different regression types; shipping without all four leaves blind spots
- GPT-4 as judge agrees with human experts 85% of the time on general tasks (Zheng et al., NeurIPS 2023), but that agreement drops to 60–68% in expert domains — calibrate before you trust it
- A February 2025 Amazon study found INT4 quantization caused a 39.46% accuracy drop on Llama-3.3 70B — silent regressions from "safe" model changes are real and statistically detectable (Kübler et al., arXiv 2025)
- Block deployments on any unit eval failure, a reference score below your floor, or a rubric regression of more than 2% relative to the last passing run; warn on everything else
Why Most LLM Eval Setups Miss Regressions
42% of companies abandoned the majority of their AI initiatives in 2025, up from 17% in 2024 (S&P Global Market Intelligence, 2025). The default explanation is ROI. The technical explanation, in most cases, is that the system shipped fine and then quietly got worse — and nobody caught it until a customer did.
Three structural failure modes explain most missed regressions in production LLM systems.
Failure mode 1: Proxy metrics that don't predict production failure. Teams instrument BLEU score, exact match, or perplexity because those are easy to compute. A customer-facing summarisation model can maintain a BLEU score of 0.74 while its summaries become subtly contradictory after a retrieval change. BLEU measures token overlap; it doesn't measure factual consistency. The metric passed. The feature regressed.
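A minimal sketch of why overlap metrics miss this kind of regression. It uses a simplified unigram-overlap F1 as a stand-in for BLEU (an assumption for brevity, not the real BLEU computation), and the example sentences are invented for illustration:

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1: a simplified stand-in for BLEU-style metrics."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # tokens shared by both strings
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the patient should take the medication twice daily with food"
contradictory = "the patient should not take the medication twice daily with food"

# One inserted word reverses the meaning, yet the overlap score stays above 0.9.
score = unigram_f1(contradictory, reference)
print(f"overlap F1 = {score:.2f}")
```

A metric like this only counts shared tokens, so a single negation barely moves it. Catching the contradiction needs a check that looks at meaning, which is what the rubric layer is for.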
Failure mode 2: Golden datasets that have silently rotted. A golden dataset built during initial evaluation captures the distribution of inputs that existed at that moment. Six months later, real traffic has drifted: new document formats, new query patterns, edge cases the original set never covered. Evaluating against a stale golden set produces a green score against a test that no longer represents the problem you're actually solving.
Failure mode 3: Evals that don't run at deployment time. Evaluation suites that run weekly, on a separate schedule from code deploys, detect regressions after they've been live for days. The culprit PR has already been merged and three others have been built on top of it. What you needed was a gate, not a report.
The Four-Layer Eval Stack
The single strongest change you can make to your eval setup is adding layers. Each layer catches different failure modes; each is cheap to run for what it surfaces. Shipping any one layer in isolation leaves a class of regression invisible.
Layer 1: Unit Evals
Unit evals test individual capabilities in isolation: does the model correctly extract a date from a structured input? Does it refuse an off-topic request? Does it stay within a 200-word limit when instructed to? These are deterministic — the answer is either correct or it isn't.
Unit evals run in milliseconds, require no LLM calls for evaluation, and give you a precise signal when a model update breaks a capability it previously had. They are the first gate in the pipeline: cheap to fail, cheap to fix.
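The three checks mentioned above could look roughly like this. The function names, the ISO date format, and the refusal markers are all assumptions made for the sketch; a real suite would match whatever output contract your system defines:

```python
import re


def extracts_iso_date(output: str, expected: str) -> bool:
    """Pass if the first ISO-8601 date in the output equals the expected one."""
    match = re.search(r"\b\d{4}-\d{2}-\d{2}\b", output)
    return match is not None and match.group(0) == expected


def within_word_limit(output: str, limit: int = 200) -> bool:
    """Pass if the output respects the instructed word limit."""
    return len(output.split()) <= limit


def is_refusal(output: str) -> bool:
    """Pass if the output contains one of the (hypothetical) refusal markers."""
    markers = ("i can't help with", "i cannot help with")
    return any(marker in output.lower() for marker in markers)


# Each check is a plain boolean, so any False can fail the CI step directly.
checks = {
    "date": extracts_iso_date("The invoice is due on 2025-03-14.", "2025-03-14"),
    "length": within_word_limit("word " * 150),
    "refusal": is_refusal("Sorry, I can't help with that request."),
}
failed = [name for name, passed in checks.items() if not passed]
```

None of these checks calls a model to grade the output, which is what keeps this layer fast enough to run first.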
Layer 2: Reference Evals
Reference evals compare model output against a gold-standard answer using a similarity metric. They're appropriate when outputs have a correct or near-correct form: code generation, factual Q&A with a known answer, structured extraction against a schema.
The weakness: reference evals degrade with output diversity. A model that answers correctly but in different words than the reference will score low. Use them where correctness has a tight definition. Avoid them for open-ended generation where paraphrase is acceptable.
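To make the weakness concrete, here is one plausible definition of a normalized exact-match metric. The normalization rules (lowercasing, stripping ASCII punctuation, collapsing whitespace) are an assumption, not a standard:

```python
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match_normalized(output: str, reference: str) -> bool:
    return normalize(output) == normalize(reference)


# Tight-definition answer: formatting differences are forgiven.
tight = exact_match_normalized("  Paris. ", "paris")

# Correct paraphrase: still scored as a miss, which is the weakness described above.
paraphrase = exact_match_normalized("The capital of France is Paris.", "Paris")
```

The first comparison passes and the second fails even though both answers are correct, so this metric fits short, tightly specified outputs and little else.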
Layer 3: Rubric Evals (LLM-as-Judge)
Rubric evals ask a separate LLM to score the output against a defined rubric. This is the only practical approach for evaluating coherence, helpfulness, or factual consistency at scale, because human annotation doesn't scale to continuous deployment. Stanford's HELM benchmark takes a comparable multi-metric view at research scale, reporting seven evaluation metrics across 42 scenarios.
Rubric evals are powerful but require calibration. See the LLM-as-Judge section below for the documented failure modes.
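A minimal sketch of the mechanics, assuming a 1-5 factual-consistency rubric and a judge that replies with a "Score: <n>" line. The rubric wording, the reply format, and the `judge` callable are all hypothetical; the stub judge stands in for a real model call:

```python
import re
from typing import Callable, Optional

RUBRIC = (
    "Score the OUTPUT for factual consistency with the SOURCE on a 1-5 scale.\n"
    "5 = every claim is supported by the source; 1 = most claims are unsupported.\n"
    "Reply with a line of the form 'Score: <n>'."
)


def build_judge_prompt(source: str, output: str) -> str:
    return f"{RUBRIC}\n\nSOURCE:\n{source}\n\nOUTPUT:\n{output}\n"


def parse_score(reply: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Return the score, or None if the reply is malformed or out of range."""
    match = re.search(r"Score:\s*(\d+)", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if low <= score <= high else None


def rubric_score(judge: Callable[[str], str], source: str, output: str) -> Optional[int]:
    """`judge` is any function that sends a prompt to the judge model and returns text."""
    return parse_score(judge(build_judge_prompt(source, output)))


# Stub judge so the sketch runs without network access.
fake_judge = lambda prompt: "Reasoning: one claim is unsupported.\nScore: 3"
result = rubric_score(fake_judge, "source text", "model output")
```

Returning None for malformed replies keeps parsing failures visible instead of silently counting them as low scores.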
Layer 4: Behavioral Evals
Behavioral evals test system-level properties that don't reduce to a single output score: does the system stay in character across a 10-turn conversation? Does it escalate correctly when the user indicates distress? Does retrieval-augmented generation cite only sources it actually retrieved?
These require end-to-end test harnesses or carefully instrumented integration tests. They're more expensive to run but catch a class of regression that the other three layers cannot: failures that only manifest across interactions or under specific system conditions. They also run slower — which matters for your CI blocking policy, covered below.
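As one illustration, the retrieval-citation property can be checked deterministically if the system emits citation markers. The bracketed `[doc-id]` format and the ID scheme below are assumptions for the sketch:

```python
import re


def cited_ids(output: str) -> set:
    """Extract citation markers such as [doc-12] from the generated text."""
    return set(re.findall(r"\[([A-Za-z0-9_-]+)\]", output))


def ungrounded_citations(output: str, retrieved_ids) -> set:
    """Citations that do not correspond to any document actually retrieved."""
    return cited_ids(output) - set(retrieved_ids)


output = "Dosage was changed in March [doc-12]. Allergy noted [doc-99]."
missing = ungrounded_citations(output, retrieved_ids=["doc-12", "doc-15"])
```

Here `missing` contains the one cited ID that was never retrieved. In a real harness the output and the retrieved IDs would come from the same end-to-end request, so the check exercises the whole pipeline.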
Golden Datasets: How They Rot and How to Refresh Them
A golden dataset is the most valuable artifact your evaluation pipeline owns, and it has an expiry date nobody writes down.
Datasets rot in three ways. Input drift: real user queries evolve — new terminology, new intents, new edge cases — and your golden set stops representing them. Label rot: the correct answer changes. A customer service bot's golden dataset might contain ideal answers that reference a product feature that no longer exists. Coverage gaps: your initial dataset captured the happy path. Production traffic eventually surfaces the long tail that was never represented.
The practical fix is a two-track refresh strategy.
Track 1: Scheduled review. Every 90 days, pull a stratified sample of real production inputs — at minimum, 200 examples per major intent cluster — and manually verify that the golden labels are still correct. Flag rows where the ideal answer has changed. Retire rows from deprecated flows. Statsig's research on golden dataset maintenance recommends marking rows stale after 90 days unless re-verified; persistent drift is a signal the dataset no longer reflects reality.
Track 2: Failure-driven refresh. When a customer-reported regression reaches you, trace it back to the eval suite. If the failing case wasn't in the golden set, add it — annotated with why it failed and what the correct output should have been. A regression that reaches production is, at minimum, a contribution to the golden dataset. Don't waste the signal.
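Both tracks can be supported by a small amount of bookkeeping on each golden row. This is a sketch under assumed field names (`last_verified`, `note`) and invented example rows, not a prescribed schema:

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class GoldenRow:
    row_id: str
    input_text: str
    expected: str
    last_verified: date
    note: str = ""


def stale_rows(rows, today: date, max_age_days: int = 90):
    """Track 1: rows not re-verified within the review window."""
    cutoff = today - timedelta(days=max_age_days)
    return [row for row in rows if row.last_verified < cutoff]


def add_failure_case(rows, row_id, input_text, expected, why, today: date):
    """Track 2: record a production regression as a new golden row."""
    rows.append(GoldenRow(row_id, input_text, expected, today, note=why))


rows = [
    GoldenRow("r1", "reset my password", "link to reset flow", date(2025, 1, 2)),
    GoldenRow("r2", "cancel my plan", "link to billing page", date(2025, 5, 20)),
]
today = date(2025, 6, 1)
needs_review = [row.row_id for row in stale_rows(rows, today)]
add_failure_case(rows, "r3", "export invoice as CSV", "CSV export steps",
                 why="customer-reported; not covered by golden set", today=today)
```

Storing the verification date on the row itself makes the expiry date explicit, so staleness becomes a query rather than a guess.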
One diagnostic worth running: if your eval suite consistently scores above 90% but your support tickets are increasing, the dataset has drifted past the real problem space. That 90% is measuring something — it's just no longer measuring the right thing.
LLM-as-Judge: When It Works, When It Lies
LLM-as-judge is a necessary tool for evaluating open-ended outputs at scale. It's also unreliable in specific, documented ways. Use it without understanding those ways, and your rubric evals will give you false confidence.
What works. GPT-4 as judge achieves 85% agreement with human expert evaluators on general-task benchmarks (MT-Bench), and 83–87% agreement on Chatbot Arena evaluations (Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," NeurIPS 2023, via Eugene Yan). For general-purpose, non-expert tasks, LLM-as-judge is a defensible substitute for human annotation if you validate the judge against your specific rubric before deploying it.
What lies.
Verbosity bias. Both GPT-3.5 and Claude (v1) preferred longer responses over shorter ones more than 90% of the time, independent of correctness (Zheng et al., NeurIPS 2023). If your outputs tend to be long and verbose, a verbosity-biased judge will score them well even when they're wrong. Mitigate by normalising output length in your rubric prompt or running paired length-controlled evaluations.
Self-preference bias. GPT-4 as judge gave a 10% win-rate advantage to GPT-4-generated outputs; Claude v1 showed a 25% self-preference bias (Zheng et al., NeurIPS 2023). If your production model and your judge share a model family, expect inflated scores. Use a different model family for the judge.
Expert domain degradation. Agreement between LLM judges and human domain experts drops to 60–68% in fields like dietetics and mental health (ACL/EMNLP 2024, via ACM DL). If you're evaluating a healthcare, legal, or highly specialized technical application, LLM-as-judge is not a substitute for domain expert annotation on the rubric dimensions that matter most.
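One rough way to probe your own judge for the verbosity bias described above is to measure how often the longer response wins a pairwise comparison. The verdict tuple format is an assumption, and the example verdicts are invented:

```python
def longer_response_win_rate(verdicts) -> float:
    """Share of pairwise verdicts won by the longer response.

    Each verdict is (response_a, response_b, winner) with winner in {"a", "b"}.
    Pairs with equal word counts are skipped.
    """
    longer_wins = 0
    counted = 0
    for response_a, response_b, winner in verdicts:
        len_a, len_b = len(response_a.split()), len(response_b.split())
        if len_a == len_b:
            continue
        counted += 1
        longer = "a" if len_a > len_b else "b"
        if winner == longer:
            longer_wins += 1
    return longer_wins / counted if counted else 0.0


verdicts = [
    ("short answer", "a much longer answer with padding", "b"),
    ("a long and detailed but wrong answer", "correct", "a"),
    ("brief", "somewhat longer reply", "a"),
    ("same size", "equal len", "a"),  # skipped: equal word counts
]
rate = longer_response_win_rate(verdicts)
```

A high rate is not proof of bias on its own, since longer answers are sometimes genuinely better. It becomes a warning sign when human labels on the same pairs favour the longer response much less often.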
Calibration process. Before deploying a rubric eval in CI: (1) define explicit scoring criteria with labelled examples for each score level; (2) run the judge on 50–100 human-annotated examples and measure agreement; (3) if agreement is below 75% on your specific rubric, revise the rubric or change the judge model. The 2024 survey on LLM-as-a-Judge provides a comprehensive bias taxonomy useful as a calibration checklist. Treat LLM-as-judge as a probabilistic instrument you've validated — not a ground truth.
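Step (2) and the 75% threshold in step (3) reduce to a simple agreement calculation. This sketch assumes integer rubric scores and exact-match agreement by default; the score lists are invented:

```python
def agreement_rate(judge_scores, human_scores, tolerance: int = 0) -> float:
    """Fraction of examples where judge and human scores differ by at most `tolerance`."""
    if len(judge_scores) != len(human_scores) or not human_scores:
        raise ValueError("need two non-empty score lists of equal length")
    matches = sum(
        abs(j - h) <= tolerance for j, h in zip(judge_scores, human_scores)
    )
    return matches / len(human_scores)


def judge_is_calibrated(judge_scores, human_scores, min_agreement: float = 0.75) -> bool:
    return agreement_rate(judge_scores, human_scores) >= min_agreement


human = [5, 4, 2, 3, 5, 1, 4, 4, 2, 3]
judge = [5, 4, 3, 3, 5, 1, 4, 5, 2, 3]  # disagrees on two of ten examples
rate = agreement_rate(judge, human)
```

Raw agreement is the simplest measure. If your scores cluster heavily on one value, a chance-corrected statistic may be a better fit, but that is a separate design choice.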
Wiring Evals into CI: What to Block On, What to Warn On
Running evals in CI without a blocking policy produces reports, not gates. The purpose of CI eval integration is to make a shipping decision: does this diff change behavior in a way that crosses a regression threshold?
The integration pattern that works in production:
```python
# eval_pipeline.py: framework-agnostic eval runner
# Runs on every PR against main; blocks merge if BLOCK conditions fail.
# The run_*_evals helpers, evaluate_thresholds, and RUBRIC_CONFIG are
# project-specific and assumed to be defined elsewhere.

def run_eval_suite(model_version, golden_dataset, thresholds):
    results = {}

    # Layer 1: Unit evals. Run all, block on any failure.
    results["unit"] = run_unit_evals(model_version)

    # Layer 2: Reference evals. Block if accuracy drops below the floor.
    results["reference"] = run_reference_evals(
        model_version,
        golden_dataset,
        metric="exact_match_normalized",
    )

    # Layer 3: Rubric evals. Block on relative regression vs baseline.
    results["rubric"] = run_rubric_evals(
        model_version,
        golden_dataset,
        judge_model="gpt-4o",  # different family from production model
        rubric=RUBRIC_CONFIG,
    )

    # Layer 4: Behavioral evals. Warn only; too slow to block on every PR.
    results["behavioral"] = run_behavioral_evals(model_version)

    return evaluate_thresholds(results, thresholds)


THRESHOLDS = {
    "unit": {"block_on_any_failure": True},
    "reference": {"block_if_below": 0.92},
    "rubric": {"block_if_regression_vs_baseline": 0.02},  # 2% relative
    "behavioral": {"warn_only": True},
}
```
What to block on. Any unit eval failure. Reference accuracy falling below your defined floor. Rubric score dropping more than 2% relative to the last passing run on main. These signals have a high signal-to-noise ratio: when they fire, they reliably indicate a regression rather than measurement variance.
What to warn on. Behavioral eval regressions (too slow and too variable to block every PR), single-dimension rubric drops that don't cross the aggregate threshold, and latency increases above your SLO. Warnings go into the PR review, not the merge gate.
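The block and warn rules above could be implemented along these lines. The shape of the `results` dictionary is an assumption made for this sketch (the pipeline code does not define it), and the baseline rubric score is assumed to be positive:

```python
def evaluate_thresholds(results: dict, thresholds: dict) -> dict:
    """Turn per-layer results into a block/warn decision.

    Assumed result shape (hypothetical):
      results["unit"]       = {"failures": int}
      results["reference"]  = {"accuracy": float}
      results["rubric"]     = {"score": float, "baseline": float}
      results["behavioral"] = {"failures": int}
    """
    block, warn = [], []

    if thresholds["unit"]["block_on_any_failure"] and results["unit"]["failures"] > 0:
        block.append("unit: at least one failure")

    if results["reference"]["accuracy"] < thresholds["reference"]["block_if_below"]:
        block.append("reference: accuracy below floor")

    rubric = results["rubric"]
    drop = (rubric["baseline"] - rubric["score"]) / rubric["baseline"]
    if drop > thresholds["rubric"]["block_if_regression_vs_baseline"]:
        block.append(f"rubric: {drop:.1%} relative regression")

    if results["behavioral"]["failures"] > 0:
        target = warn if thresholds["behavioral"]["warn_only"] else block
        target.append("behavioral: at least one failure")

    return {"blocked": bool(block), "block_reasons": block, "warnings": warn}


thresholds = {
    "unit": {"block_on_any_failure": True},
    "reference": {"block_if_below": 0.92},
    "rubric": {"block_if_regression_vs_baseline": 0.02},
    "behavioral": {"warn_only": True},
}
results = {
    "unit": {"failures": 0},
    "reference": {"accuracy": 0.95},
    "rubric": {"score": 3.9, "baseline": 4.0},
    "behavioral": {"failures": 1},
}
decision = evaluate_thresholds(results, thresholds)
```

In this example the rubric score fell from 4.0 to 3.9, a 2.5% relative drop, so the run is blocked, while the behavioral failure is reported as a warning only.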
The baseline problem. Your blocking threshold needs a reference point. Store eval results in a persistent store (a JSON file in the repo works; a purpose-built eval tracking system works better) and compare each run to the last green run on main. For rubric scores, don't compare to a fixed absolute value. Compare to a rolling baseline that advances with intentional quality improvements.
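A sketch of the JSON-file variant, assuming the baseline only advances on unblocked runs on main. The file name and the score dictionary layout are hypothetical, and a temporary directory stands in for the repo:

```python
import json
import tempfile
from pathlib import Path


def load_baseline(path: Path) -> dict:
    """Return stored scores from the last green run, or {} if none exist yet."""
    return json.loads(path.read_text()) if path.exists() else {}


def maybe_advance_baseline(path: Path, scores: dict, blocked: bool, branch: str) -> bool:
    """Advance the baseline only for unblocked runs on main."""
    if blocked or branch != "main":
        return False
    path.write_text(json.dumps(scores, indent=2, sort_keys=True))
    return True


with tempfile.TemporaryDirectory() as tmp:
    baseline_path = Path(tmp) / "eval_baseline.json"  # hypothetical file name
    first = load_baseline(baseline_path)  # nothing stored yet
    advanced = maybe_advance_baseline(baseline_path, {"rubric": 4.1}, False, "main")
    skipped = maybe_advance_baseline(baseline_path, {"rubric": 3.2}, True, "main")
    stored = load_baseline(baseline_path)
```

Because blocked runs never overwrite the file, a regression cannot quietly become the new reference point.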
Our AI Infrastructure & LLMOps service wires eval pipelines directly into deployment workflows so that model updates, retrieval changes, and prompt edits all pass through the same gate before reaching production.
A Regression That Slipped Through (and the Eval That Would Have Caught It)
A retrieval-augmented clinical documentation system was producing accurate outputs in testing. Production ROUGE-L scores were stable at 0.81. An infrastructure team updated the vector database and reindexed the embeddings corpus. No model weights changed. The migration was flagged as non-breaking.
Two weeks later: escalating complaints from clinical staff. Summaries were citing facts from adjacent patient records in a multi-tenant environment. The retrieval had started returning higher-cosine-similarity results from nearby tenant partitions due to an index partitioning bug introduced in the new release.
What the eval suite had: ROUGE-L score on golden summaries (Layer 2).
What it didn't have: a cross-tenant citation check (Layer 4 behavioral), or a factual grounding check verifying that every claimed fact appeared in the retrieved source documents (Layer 3 rubric).
The eval that would have caught it: A rubric eval scoring "all factual claims in the output are supported by at least one retrieved source document" — rated by an LLM judge with access to both the output and the retrieved context. This would have flagged outputs immediately: claims were present in the generation, but the supporting documents in context were from different records.
A behavioral eval running 20 end-to-end test cases with known tenant isolation requirements would have caught the regression in the first CI run after the index migration.
Neither eval existed because both required knowing what to test before the failure occurred. The lesson isn't that you should anticipate every specific bug. It's that behavioral evals should cover the properties your system must hold regardless of what changes: tenant isolation, citation grounding, and output fidelity are invariants, not features. They belong in the eval suite from day one, not after the first production incident.
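A tenant-isolation invariant of this kind can be expressed as a small behavioral check. The `retrieve` callable, the document fields, and the test case are hypothetical; the stub retriever simulates the partitioning bug rather than reproducing the real system:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class RetrievedDoc:
    doc_id: str
    tenant_id: str


def isolation_failures(
    retrieve: Callable[[str, str], List[RetrievedDoc]],
    cases: List[tuple],
) -> list:
    """Run (tenant_id, query) cases and collect any cross-tenant documents."""
    failures = []
    for tenant_id, query in cases:
        leaked = [d for d in retrieve(tenant_id, query) if d.tenant_id != tenant_id]
        if leaked:
            failures.append((tenant_id, query, [d.doc_id for d in leaked]))
    return failures


# Stub retriever that simulates the bug: it ignores the tenant filter.
def buggy_retrieve(tenant_id: str, query: str) -> List[RetrievedDoc]:
    return [RetrievedDoc("rec-1", "tenant-a"), RetrievedDoc("rec-7", "tenant-b")]


failures = isolation_failures(buggy_retrieve, [("tenant-a", "latest lab results")])
```

The check knows nothing about the specific bug. It only asserts the invariant, which is why it would still apply after an index migration nobody expected to matter.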
A related pattern appears in model optimisation: the Amazon arXiv study (February 2025) found that INT4 quantization — routinely treated as a cost-reduction step with negligible quality impact — caused a 1.73% accuracy drop on Llama-3.1 8B and a 39.46% drop on Llama-3.3 70B. The study also showed the McNemar statistical test can detect accuracy degradations as small as 0.3% — meaning you don't need large regressions to justify measurement. You just need to be measuring.
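For reference, McNemar's test compares two models on the same examples by looking only at the cases where they disagree. The sketch below implements the exact (binomial) form with the standard library; the study may use a different variant, and the example data is invented:

```python
import math


def mcnemar_exact(baseline_correct, candidate_correct):
    """Two-sided exact McNemar test on paired per-example correctness flags.

    Returns (b, c, p_value), where b counts examples only the baseline got
    right and c counts examples only the candidate got right.
    """
    pairs = list(zip(baseline_correct, candidate_correct))
    b = sum(1 for base, cand in pairs if base and not cand)
    c = sum(1 for base, cand in pairs if cand and not base)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs: no evidence of a difference
    # Under H0 each discordant pair is equally likely to fall in b or c,
    # so b follows Binomial(n, 0.5); double the smaller tail for two sides.
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2**n
    return b, c, min(1.0, 2 * tail)


# 100 paired examples: the candidate loses 10 that the baseline got right.
baseline = [True] * 100
candidate = [False] * 10 + [True] * 90
b, c, p_value = mcnemar_exact(baseline, candidate)
```

Because the test ignores examples where both models agree, its sensitivity depends on the number of discordant pairs rather than on the raw accuracy gap alone.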
For a real example of how model optimisations require behavioral testing before production, see our LLM model distillation case study. Our model fine-tuning and optimisation engagements include eval suites built around the specific behavioral properties of each model before any weights change reaches a staging environment.
Frequently Asked Questions
Traditional software testing checks deterministic behavior: given input X, output is always Y. LLM evals check probabilistic behavior: given input X, the output should satisfy properties P1, P2, and P3 — but the exact text will vary. This requires rubric-based and statistical evaluation methods that don't exist in standard testing frameworks. The closest analogy is integration testing for probabilistic systems, where you define correctness criteria rather than exact expected outputs.
A minimum viable golden dataset for a focused LLM task needs at least 100–200 examples stratified across your main input categories. With a paired comparison against the baseline, that is typically enough to detect regressions of roughly 5% or more at p < 0.05, provided few examples flip in the opposite direction. For finer detection, such as catching a 1–2% regression before it affects users, aim for 500+ examples. The McNemar test can detect degradations as small as 0.3% with sufficient sample size, which matters for high-volume production systems where even small regressions affect many users (Kübler et al., arXiv 2025).
Use LLM-as-judge for continuous evaluation in CI pipelines where human annotation timelines would block deployments. Use human evaluation for: initial rubric calibration (validate the judge against 50–100 human-annotated examples before relying on it), expert-domain outputs where judge-human agreement drops to 60–68%, and any evaluation where the cost of a false positive — blocking a correct improvement — is high. Never use LLM-as-judge as the only gate for safety-critical properties (hallucination, medical accuracy, legal compliance) without human spot-checks.
To handle non-deterministic outputs, run each eval case 3–5 times with temperature > 0 and aggregate the scores. For pass/fail unit evals, require a majority pass (at least 3 of 5 runs). For rubric evals, use the mean score across runs. This smooths temperature variance without making your suite too slow for CI. For deterministic evaluation, set temperature = 0 and cache outputs, but note that temperature 0 does not guarantee determinism across API versions or model updates.
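The aggregation rules above are small enough to sketch directly. The helper names are hypothetical, and the hard-coded outcomes stand in for real sampled model runs:

```python
from statistics import mean
from typing import Callable, List


def majority_pass(outcomes: List[bool]) -> bool:
    """Pass if more than half of the repeated runs passed (3 of 5, 2 of 3)."""
    return sum(outcomes) > len(outcomes) / 2


def repeated(run_once: Callable[[], float], n_runs: int = 5) -> List[float]:
    """Call a single-run eval n_runs times and collect the results."""
    return [run_once() for _ in range(n_runs)]


# Stubbed runs standing in for sampled model outputs at temperature > 0.
unit_outcomes = [True, True, False, True, False]
rubric_scores = [4.0, 3.5, 4.5, 4.0, 4.0]

unit_verdict = majority_pass(unit_outcomes)
rubric_mean = mean(rubric_scores)
```

Using an odd number of runs avoids ties in the majority vote.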
Run the full eval suite after model weight updates (fine-tuning, distillation, quantization), retrieval index changes, prompt template changes, system message edits, and version bumps to any model inference library. Also run it after any production incident that revealed a failure mode not covered by the current golden dataset. A practical heuristic: if the change could affect output distribution at all, run the full suite. The cost of a false alarm is one blocked PR. The cost of a missed regression is a production incident.
