Why green tests stopped meaning shipped

We built an LLM-powered feature. The unit tests were green. The build was green. We shipped it. A week later someone pinged us saying the output looked "a bit weird," so we re-ran it on the same input and got something different from what they had seen. Different from what we had seen the day before, too.

Nobody had broken anything. Nothing had regressed. The contract we rely on every day (same input, same implementation, same output) becomes much weaker once an LLM is in the loop. Even when output can be made mostly deterministic, the behavior is still shaped by model version, prompt, context, retrieval data, decoding settings, and vendor-side changes. Most engineering teams haven't yet fully absorbed what that means for how they ship. Green tests stopped meaning shipped.

The contract that changed

In classic software development, you write a function and you can usually pretend it's a pure mapping from inputs to outputs.
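A minimal sketch of that contract, assuming a hypothetical deterministic `summarize` that simply returns the first sentence (illustrative only, not the code from our feature):

```python
def summarize(text: str) -> str:
    """Hypothetical deterministic summarizer: return the first sentence."""
    first, _, _ = text.partition(". ")
    return first.rstrip(".") + "."


REPORT = "Q3 revenue grew 12%, driven by North America. Costs were flat."

# Same input, same implementation, same output, every time.
assert summarize(REPORT) == "Q3 revenue grew 12%, driven by North America."
```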

That assertion is meaningful. If it passes today, it will pass tomorrow, on the same input, in the same environment. That's the assumption underneath unit tests, regression suites, and "green CI means I can ship."

Now run the same code three times against an LLM-backed summarize:
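The sketch below uses a stand-in for the model call, since a real call needs a provider and credentials. The `summarize` stub and its canned phrasings are assumptions made for illustration; it cycles through phrasings to mimic the run-to-run variation that sampling produces.

```python
import itertools

# Stand-in for a real model call (hypothetical): it cycles through phrasings
# to mimic the variation you get from sampling.
_PHRASINGS = itertools.cycle([
    "Q3 revenue rose 12%, led by North America.",
    "Revenue was up 12% in Q3, with North America driving growth.",
    "North America drove a 12% revenue increase in Q3.",
])


def summarize(text: str) -> str:
    """Pretend LLM-backed summarizer; ignores `text` and returns the next phrasing."""
    return next(_PHRASINGS)


REPORT = "Q3 revenue grew 12%, driven by North America. Costs were flat."

outputs = [summarize(REPORT) for _ in range(3)]
for out in outputs:
    # Passes for all three, but only because "12%" happens to appear in each.
    assert "12%" in out
    print(out)
```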

Three different strings. The test happens to pass, because we got lucky and "12%" was in all three. Swap the prompt next sprint, or upgrade the model, and you might get "The report notes growth in Q3" back, with no number at all. The test still says very little about whether the output is good.

The real issue is not merely that the strings differ. It’s that what we care about is semantic: did the summary preserve the key facts, omit unsupported claims, follow the requested format, and avoid introducing risk? Traditional assertions can catch some of this, but they are the wrong primitive for most of it.

A more honest mental model for what an LLM gives you is this:

Output is a sample from a probability distribution. Correctness stops being purely binary. It becomes graded, contextual, and multi-dimensional. Behavior depends on the data, model, prompt, and retrieval.

Once you accept that, a lot of the engineering hygiene we rely on stops doing what it used to.

Exact match is not semantic quality

Imagine a summarization step over this source text: "Q3 revenue grew 12%, driven by North America."

Three plausible model outputs:

"Q3 revenue grew 12%, led by North America." (accurate)
"The report discusses Q3 performance." (vague but not wrong)
"Revenue grew 12% in Q3, led by Europe." (hallucinated)

Run any of them through assertEqual(source, output) and all three fail, so the test cannot tell them apart. Yet the first one is the one we'd happily ship, and the third one is the one that ends up in a customer escalation. Exact text match is the wrong tool here. It still has a place: JSON schemas, deterministic transforms, tool-call contracts, and other known invariants. It is a poor default, however, for judging natural-language quality.
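To make the contrast concrete, here is a rough rule-based sketch that grades the three outputs on criteria rather than on string equality. The `Criteria` and `grade` names are hypothetical, and substring checks are only a crude stand-in for a grader that understands meaning.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criteria:
    must_include: tuple[str, ...]   # facts the summary has to preserve
    must_exclude: tuple[str, ...]   # claims the source does not support


def grade(output: str, criteria: Criteria) -> dict[str, bool]:
    """Return one boolean per dimension instead of a single pass/fail."""
    text = output.lower()
    return {
        "keeps_key_facts": all(f.lower() in text for f in criteria.must_include),
        "no_unsupported_claims": not any(c.lower() in text for c in criteria.must_exclude),
    }


criteria = Criteria(must_include=("12%", "North America"), must_exclude=("Europe",))

outputs = {
    "accurate": "Q3 revenue grew 12%, led by North America.",
    "vague": "The report discusses Q3 performance.",
    "hallucinated": "Revenue grew 12% in Q3, led by Europe.",
}
results = {label: grade(text, criteria) for label, text in outputs.items()}
for label, result in results.items():
    print(label, result)
```

Even this crude grader separates the three cases, which exact match cannot do.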

What an eval is

A proper evaluation needs three things: a dataset that reflects how your users will actually use the product, a grader that understands meaning, and metrics that go beyond a single pass/fail rate. That combination is what we've been calling an eval.

The output of an eval is a scorecard, for example:
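As a rough sketch, a scorecard can be computed from per-example results like this. The rows, field names, and numbers are made up for illustration and do not come from a real run.

```python
from statistics import mean

# Illustrative per-example results (made-up numbers, not from a real run).
results = [
    {"grounded": True,  "format_ok": True,  "latency_ms": 820,  "cost_usd": 0.004},
    {"grounded": True,  "format_ok": False, "latency_ms": 1310, "cost_usd": 0.006},
    {"grounded": False, "format_ok": True,  "latency_ms": 940,  "cost_usd": 0.005},
    {"grounded": True,  "format_ok": True,  "latency_ms": 2050, "cost_usd": 0.009},
]


def pass_rate(rows, key):
    """Fraction of rows where the boolean field `key` is true."""
    return sum(1 for r in rows if r[key]) / len(rows)


scorecard = {
    "groundedness": pass_rate(results, "grounded"),
    "format": pass_rate(results, "format_ok"),
    "max_latency_ms": max(r["latency_ms"] for r in results),
    "mean_cost_usd": mean(r["cost_usd"] for r in results),
}

for metric, value in scorecard.items():
    print(f"{metric:>16}: {value:.3f}")
```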

Quality is several columns, not one. Latency and cost are first-class signals too, because the best-quality model in the world isn't useful if it's too slow or too expensive in production.

Not every column on that scorecard should block a release. Some metrics should: severe safety failures, groundedness regressions, anything tied to a hard requirement. Others should warn: small cost increases, non-critical tone drift, latency creep within budget. Decide which is which up front; otherwise every regression turns into an argument.
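One way to write that decision down is a small policy table checked against the baseline. The metric names, tolerances, and scores below are hypothetical, and the sketch only covers metrics where higher is better; latency and cost would need the comparison flipped.

```python
# Hypothetical policy: metric -> (action on regression, allowed drop vs. baseline).
POLICY = {
    "safety": ("block", 0.00),
    "groundedness": ("block", 0.02),
    "tone": ("warn", 0.05),
}


def gate(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, list[str]]:
    """Compare candidate scores to the baseline and sort regressions by action."""
    outcome = {"block": [], "warn": []}
    for metric, (action, tolerance) in POLICY.items():
        drop = baseline[metric] - candidate[metric]
        if drop > tolerance:
            outcome[action].append(metric)
    return outcome


baseline = {"safety": 1.00, "groundedness": 0.91, "tone": 0.88}
candidate = {"safety": 1.00, "groundedness": 0.86, "tone": 0.80}

decision = gate(baseline, candidate)
print(decision)
release_allowed = not decision["block"]
```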

What changes for engineers

Almost every axis we care about shifts when an LLM is in the system.

Debugging in particular is no longer "read the stack trace and find the bad line." It's pulling fifty failed examples, reading them like a researcher, and asking what they have in common. Versioning isn't just git anymore; you have to pin the model, the prompt, the retrieval index, and the eval set together if you ever want to reproduce a result.
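A minimal sketch of pinning those four things together, assuming a hypothetical `EvalRun` record whose field values are placeholders. The fingerprint gives each combination a stable ID to attach to stored results.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalRun:
    model: str            # pinned model identifier
    prompt_sha: str       # hash or git SHA of the prompt template
    retrieval_index: str  # version tag of the retrieval index
    eval_set: str         # version tag of the eval dataset

    def fingerprint(self) -> str:
        """Stable short ID: identical pins always produce the same value."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


run_a = EvalRun("example-model-2024-01", "3f2a9c1", "index-v7", "evalset-v3")
run_b = EvalRun("example-model-2024-01", "3f2a9c1", "index-v8", "evalset-v3")

# Changing only the retrieval index yields a different fingerprint.
print(run_a.fingerprint(), run_b.fingerprint())
```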

This isn't a reason to abandon unit tests. Deterministic logic (parsing, auth, math, anything where the contract still holds) still belongs under unit tests. Evals are a new category, guarding a new category of risk: prompt changes, model upgrades, RAG retrieval tweaks, fine-tunes. Different tools, different risks. You need the full stack.

It helps to be explicit about which layer is guarding what:

Evals don't replace anything in that table. They add a row that didn't exist before.

Anatomy of a good eval

A bad eval is worse than no eval, because it gives you false confidence.

A quick shortlist:

Representative. The dataset mirrors production inputs, including the messy edge cases. If 30% of your traffic is "contact support," the dataset reflects that.
Clear rubric. A human can apply it consistently. This quickly gets tricky in regulated domains like medicine, where two clinicians can read the same answer and reach different conclusions about whether it's correct. The bar is not perfect agreement; it's calibrated disagreement. Reviewers may still disagree on borderline cases, but they should understand the rubric the same way and disagree for legible reasons. If they don't, the rubric isn't ready.
Severity-aware. A minor tone miss and a dangerous hallucination should not have the same weight. Track failure type and severity separately so the average score does not hide unacceptable risks.
Calibrated. If you use an LLM as a judge, validate it against human labels first and periodically re-check agreement. Prefer a judge independent from the system under test. A model grading its own outputs can overestimate quality, especially on style, reasoning, and borderline correctness. The goal is not a perfect judge; it is knowing where automation is reliable and where humans still need to review.
Versioned. Dataset, graders, and prompts live in git, alongside the code they evaluate.
CI-friendly. Fast subset runs on every MR; a full run goes nightly.
Split. A dev set you iterate against, and a held-out set you never tune on. Tune on the test set and your numbers go up while your product gets worse.
Maintained. Production inputs change over time. Users discover new edge cases. Model behavior shifts. An eval set that was representative six months ago may no longer cover the risk surface. Treat eval maintenance like test maintenance: add cases from incidents, near misses, support tickets, and recurring failure modes.
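For the "Calibrated" item in the list above, a rough sketch of checking an LLM judge against human labels on binary pass/fail verdicts. It reports raw agreement and Cohen's kappa, which corrects for agreement expected by chance. The labels are made up, and what counts as acceptable agreement is left to the team.

```python
def agreement(human: list[bool], judge: list[bool]) -> float:
    """Fraction of examples where the judge matches the human label."""
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[bool], judge: list[bool]) -> float:
    """Agreement corrected for chance, for binary pass/fail labels."""
    n = len(human)
    observed = agreement(human, judge)
    p_human = sum(human) / n
    p_judge = sum(judge) / n
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch reports full agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels for eight examples (True = pass).
human = [True, True, False, True, False, False, True, True]
judge = [True, True, False, False, False, True, True, True]

print(f"agreement={agreement(human, judge):.2f} kappa={cohens_kappa(human, judge):.2f}")
```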

Don't use 1-to-5 Likert scales in your rubric. Annotators (and judges) collapse to the middle, and the resulting score is mostly noise. Break each rubric into binary checks instead:
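A minimal sketch of a rubric expressed as binary checks. The check names are hypothetical and would come from your own rubric.

```python
# Hypothetical rubric for one summary: each check is a yes/no question.
CHECKS = [
    "states_the_growth_figure",
    "names_the_correct_region",
    "adds_no_unsupported_claims",
    "follows_requested_format",
]


def score(answers: dict[str, bool]) -> float:
    """Fraction of binary checks that passed."""
    missing = [c for c in CHECKS if c not in answers]
    if missing:
        raise ValueError(f"unanswered checks: {missing}")
    return sum(answers[c] for c in CHECKS) / len(CHECKS)


graded = {
    "states_the_growth_figure": True,
    "names_the_correct_region": False,
    "adds_no_unsupported_claims": False,
    "follows_requested_format": True,
}
print(f"{score(graded):.0%}")
```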

You can always average binary checks back into a percentage.

The new dev loop

In classical software, the inner loop is roughly: write code → run tests → ship. With AI features, it looks more like:

The step most teams underinvest in is failure analysis, at the end of the loop. The eval tells you "73%". Staring at the 27% that failed is what tells you why, and why is what you need to fix it. Tooling that makes failure inspection fast (clustering, side-by-side diffs of inputs and outputs, easy filtering by category) pays for itself almost immediately.
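Failure inspection does not need heavy tooling to start. The sketch below groups failed examples by failure type and by input category; the records and labels are made up, and it assumes each result already carries a failure label from the grader or a reviewer.

```python
from collections import Counter, defaultdict

# Illustrative eval results (made-up); `failure` is None when the example passed.
results = [
    {"id": "ex-01", "category": "billing", "failure": None},
    {"id": "ex-02", "category": "billing", "failure": "missing_number"},
    {"id": "ex-03", "category": "refunds", "failure": "wrong_entity"},
    {"id": "ex-04", "category": "refunds", "failure": "wrong_entity"},
    {"id": "ex-05", "category": "support", "failure": None},
]

failures = [r for r in results if r["failure"] is not None]

# Which failure types dominate?
by_type = Counter(r["failure"] for r in failures)

# Which input categories do they cluster in?
by_category = defaultdict(list)
for r in failures:
    by_category[r["category"]].append(r["id"])

print(by_type.most_common())
print(dict(by_category))
```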

This is also where most "vibes-based" AI work falls down. Five hand-picked examples aren't an eval. Twenty examples are enough to start catching obvious failures, but not enough to measure small regressions with confidence. If you don't have an eval in CI, regressions ship. None of these are theoretical; we hit all of them at one point or another.

The minimum viable eval

You don't need a platform to start. You need:

20 to 50 representative inputs from production, staging, or realistic synthetic cases.
Expected criteria, not necessarily expected strings: what must be present, what must be absent, and what counts as a severe failure.
A grader: rule-based, human, LLM-as-judge, or a hybrid.
A baseline from the current system.
A threshold that blocks or warns on regression.
A CI job that runs on prompt, model, retrieval, or code changes.
An owner who reviews failures and updates the eval when production behavior changes.

That's it. Your code, a CSV, and a CI job. You can grow into a real platform later; what you can't recover is the year of prompt changes that shipped without a baseline to compare against.
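A sketch of how small that can be. The CSV columns, the baseline, the tolerance, and the `system_under_test` placeholder are all assumptions for illustration; the placeholder simply echoes its input so the example runs without a model.

```python
import csv
import io

# In practice this would be a CSV file in the repo; inlined here to stay runnable.
CASES_CSV = """input,must_include,must_exclude
"Q3 revenue grew 12%, driven by North America.",12%|North America,Europe
"Churn fell to 3% after the pricing change.",3%,increase
"""

BASELINE_PASS_RATE = 0.50   # hypothetical score of the current system
ALLOWED_DROP = 0.05         # hypothetical regression tolerance


def system_under_test(text: str) -> str:
    """Placeholder for the real prompt + model call; here it just echoes the input."""
    return text


def passes(output: str, must_include: str, must_exclude: str) -> bool:
    """Criteria check: every required term present, no banned term present."""
    lowered = output.lower()
    required = [t for t in must_include.split("|") if t]
    banned = [t for t in must_exclude.split("|") if t]
    return all(t.lower() in lowered for t in required) and not any(
        t.lower() in lowered for t in banned
    )


def run_eval() -> float:
    rows = list(csv.DictReader(io.StringIO(CASES_CSV)))
    passed = sum(
        passes(system_under_test(r["input"]), r["must_include"], r["must_exclude"])
        for r in rows
    )
    return passed / len(rows)


pass_rate = run_eval()
regressed = pass_rate < BASELINE_PASS_RATE - ALLOWED_DROP
print(f"pass rate {pass_rate:.0%}, regressed: {regressed}")
# A CI job would exit non-zero when `regressed` is true.
```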

How we run evals on agents: Promptfoo

For our practitioner agent we use Promptfoo, an open-source eval framework that runs scenarios against your prompts and models and scores the outputs against a mix of assertions and rubrics. Scenarios are authored as YAML, one per behavior we care about, covering the higher-risk surfaces of the agent: search, lookup, recommendation, safety and refusal, tool error handling, and multi-turn flows.

Two scoring paths run on the same scenarios. Deterministic assertions inspect the normalized execution trace: which tools were called, in what order, what state the agent ended up in, how long the response took, how many tokens it used. They're fast, cheap, and easy to explain, which makes them suitable for CI gating. LLM-as-judge rubrics handle what the assertions can't: factual correctness, retrieval groundedness, hallucination, safety refusal, conversation coherence, instruction adherence. We run every scenario across multiple model profiles, so a "regression" doesn't just mean "this prompt got worse on one model."
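To give a feel for the deterministic path, here is a plain-Python sketch of trace assertions. It is not Promptfoo's configuration format or API; the `Trace` shape, field names, and limits are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    """Hypothetical normalized execution trace for one scenario run."""
    tool_calls: tuple[str, ...]
    final_state: str
    latency_ms: int
    total_tokens: int


def check_trace(trace: Trace, *, expected_tools: tuple[str, ...], expected_state: str,
                max_latency_ms: int, max_tokens: int) -> list[str]:
    """Return a list of human-readable failures; an empty list means the run passed."""
    failures = []
    if trace.tool_calls != expected_tools:
        failures.append(f"tools {trace.tool_calls} != expected {expected_tools}")
    if trace.final_state != expected_state:
        failures.append(f"state {trace.final_state!r} != {expected_state!r}")
    if trace.latency_ms > max_latency_ms:
        failures.append(f"latency {trace.latency_ms}ms over {max_latency_ms}ms")
    if trace.total_tokens > max_tokens:
        failures.append(f"tokens {trace.total_tokens} over {max_tokens}")
    return failures


trace = Trace(("search", "lookup"), "answered", latency_ms=2400, total_tokens=900)
failures = check_trace(trace, expected_tools=("search", "lookup"), expected_state="answered",
                       max_latency_ms=2000, max_tokens=1500)
print(failures)
```

Each failure is a plain sentence, which is part of what makes this kind of check easy to explain in a CI log.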

CI runs in stages. Merge requests trigger a deterministic-only run as the first gate. Nightly jobs run the full suite with rubrics and archive the results so we can compare runs over time. The split is deliberate: deterministic checks are cheap, while LLM-graded checks are slow and add judge variance. We also evaluate production traces. A sampler pulls a batch of recent traces, replays them through the same scoring path, and surfaces real conversations that synthetic scenarios haven't covered yet.

How we evaluate clinical output: PracLLM

PracLLM is the function that turns a patient's intake and lab results into a personalized, practitioner-style interpretation. The eval problem is harder here than for the agent. We aren't checking whether the agent picked the right tool; we're checking whether the clinical narrative is safe, accurate, and grounded in the data we actually presented. A summary that confidently misreads sodium at 122 mmol/L should be treated as a patient-safety event, not as a tone issue.

Triage runs every output through a stack of judges before it reaches a practitioner:

First, a set of deterministic rules covers the things that don't need an LLM to decide, like catastrophic biomarker thresholds (sodium below 125 or above 155, glucose below 40 or above 400). We use regex for dangerous phrasing and a 732-entry blocklist of diagnostic markers we want to keep out of the output entirely.

Second, a layer of fourteen rubrics catches what rules can't: diagnostic language, false reassurance, intake fidelity. Tier-1 failures block the response. Tier-2 failures route to governance review. The specific patient information is injected into the judge prompt, so coverage is graded against what we expected the model to address rather than against a generic notion of thoroughness.

Third, a benchmark harness replays 11 named patient fixtures and grades them with a five-part 115-point rubric a clinician would recognize: section completeness, biomarker completeness, clinical accuracy, tone and language, recommendations. The rubric explicitly lists banned phrases ("confirms," "root cause," "excellent") and pins specific dose protocols so we catch the small things that look harmless until they don't.

Each judge is a different model class: Gemini for triage, OpenAI for ranking and the CI rubrics. No model judges its own output.
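The deterministic parts of that stack can be sketched in a few lines. This is an illustration under stated assumptions, not the PracLLM implementation: the thresholds and banned phrases are the ones quoted above, while the function names and the tier assignment in `TIERS` are hypothetical.

```python
import re

# Hard limits as stated above. Sodium is in mmol/L; the glucose unit is not
# given in the text, so inputs are assumed to use the same unit as the limits.
CATASTROPHIC_RANGES = {"sodium": (125, 155), "glucose": (40, 400)}

# Phrases the benchmark rubric bans outright.
BANNED_RE = re.compile(r"\b(confirms|root cause|excellent)\b", re.IGNORECASE)


def catastrophic_flags(labs: dict[str, float]) -> list[str]:
    """Biomarkers whose values fall outside the hard limits."""
    return [
        name
        for name, (low, high) in CATASTROPHIC_RANGES.items()
        if name in labs and not (low <= labs[name] <= high)
    ]


def banned_hits(text: str) -> list[str]:
    """Banned phrases found in the text, lowercased, in order of appearance."""
    return [m.group(1).lower() for m in BANNED_RE.finditer(text)]


def route(failed_rubrics: set[str], tiers: dict[str, int]) -> str:
    """Any tier-1 failure blocks; otherwise any tier-2 failure goes to review."""
    failed_tiers = {tiers[name] for name in failed_rubrics}
    if 1 in failed_tiers:
        return "block"
    if 2 in failed_tiers:
        return "governance_review"
    return "pass"


# Hypothetical tier assignment, for illustration only.
TIERS = {"diagnostic_language": 1, "false_reassurance": 1, "intake_fidelity": 2}

print(catastrophic_flags({"sodium": 122, "glucose": 95}))
print(banned_hits("This confirms the root cause."))
print(route({"intake_fidelity"}, TIERS))
```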

Evals as optimization targets

A good eval doubles as something else: the target function an AI coding tool can optimize against.

Without an eval, "make this prompt better" is a vibes task. The tool guesses, you review by feel, nothing is reproducible. With an eval, "improve groundedness on the 12 failing cases" is an optimization problem. The tool runs the eval, sees the failures, proposes a fix, re-runs, iterates. You stop being the bottleneck on every micro-iteration and become the reviewer of objective deltas: "v2 beats v1 by 8%" is a much better claim to weigh in on than "does this look better to you?"
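Reduced to its core, that loop is a search that only accepts measured improvements. The sketch below assumes a hypothetical `run_eval` callable that maps a prompt to a score; here it is faked with a lookup table of made-up scores.

```python
from typing import Callable


def optimize(candidates: list[str], run_eval: Callable[[str], float],
             baseline: str) -> tuple[str, float]:
    """Keep the candidate prompt with the highest eval score, starting from the baseline."""
    best_prompt, best_score = baseline, run_eval(baseline)
    for prompt in candidates:
        score = run_eval(prompt)
        if score > best_score:   # accept only measured improvements
            best_prompt, best_score = prompt, score
    return best_prompt, best_score


# Stand-in eval with made-up scores; a real one would run the eval set.
FAKE_SCORES = {"v1": 0.73, "v2": 0.81, "v3": 0.78}
best, best_score = optimize(["v2", "v3"], FAKE_SCORES.__getitem__, baseline="v1")
print(best, best_score)
```

Scoring candidates against the dev set and confirming the winner on the held-out set keeps this loop from overfitting to the eval itself.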

You also get to refactor more aggressively, because the eval guards the behavior even when the implementation underneath is changing.

Takeaway

The contract changed. Classical testing assumes determinism that LLM-backed code doesn't have, and our testing strategy has to change with it. Evals are the regression suite for behavior, and the infrastructure around them deserves the same care we give CI: versioned, automated, fast, trusted.

Retrofitting evals onto a year of prompt changes is painful, so don't put it off. The smallest useful eval is something you can build this week.

Green tests still matter. They just stopped being enough.