What an AI Eval Harness Actually Looks Like in Production

A founder told me recently that their team had “evals” for their AI feature. I asked what was in them. The answer: a Notion page with five example prompts and the hand-typed responses someone thought were good. Whenever a prompt change happened, an engineer would paste the prompts into the model, eyeball the new outputs, and decide if it still felt right.

That’s not an eval. That’s a vibe check.

Most teams shipping LLM features think they have evals because they have some version of “let me try a few prompts and see what happens.” The gap between that and a real eval harness is the gap between “we’ll find out in production” and “we know before we ship.” It’s also the single most common reason production AI features quietly degrade for weeks before anyone notices.¹

This is what an actual eval harness includes. Use it as a checklist against whatever you’re calling evals today.

What an eval is, operationally

An eval is a test that produces a numeric score, runs against a fixed dataset, and gates deployment. Three properties matter, and most “evals” fail at least one:

Reproducible. The same model, same prompt, same input → the same score. If your eval depends on someone’s judgment in the moment, it’s not reproducible. (LLM-as-judge can be reproducible if the judge prompt is fixed and the judge model is pinned.)
Bound to a test set. A fixed, version-controlled set of input cases that represents the real distribution of work. Curated, not just whatever was in last week’s logs.
Comparable across runs. Today’s score on prompt v3 is comparable to yesterday’s score on prompt v2. This requires the test set, scorers, and judge model to be locked at version-time.

If your “evals” can’t tell you whether prompt v4 is better than prompt v3 with a number, you have a vibe check.

The four parts of a real eval harness

1. The test set

This is the hardest part to get right and the most often skipped. A test set isn’t 5 example prompts. It’s 30 to 200+ cases, intentionally diverse, version-controlled in your repo, with an expected output (or a rubric for what counts as a passing output) for each.

What goes in:

Golden-path cases. The thing the feature is supposed to do, in the cleanest version. 10–30% of the set.
Edge cases. Empty inputs, oversized inputs, weird inputs (emoji-only, single-word, all-caps), inputs in the second-most-common language your users speak. 20–40%.
Adversarial cases. Prompt-injection attempts, jailbreak strings, attempts to extract the system prompt, attempts to make the model do something it shouldn’t. OWASP LLM Top 10’s prompt-injection category lists patterns; use them.² 20–30%.
Regression cases. Specific prompts that broke in production at some point. Every time something goes wrong in prod, that case enters the test set. The set grows with the feature.
Distribution-aligned samples. Real (anonymized) inputs from production logs, sampled to match the actual distribution of user behavior. This is what catches “the model works in our heads but not for our users.”

A 50-case test set, curated this way, is more useful than a 5,000-case test set of randomly sampled production traffic. The intent matters.

2. The scorers

Each test case needs to produce a pass/fail or a numeric score. There are three valid approaches; pick the one that matches the case.

Exact match / regex. When the output is structured (JSON, function call, classification label), exact match or regex extraction works. Fast, deterministic, free.
Programmatic checks. When the output should satisfy properties — “must include the customer’s order number,” “must not contain the customer’s social security number,” “must be valid SQL” — write the check as code. Fast, deterministic, free.
LLM-as-judge. When the output is open-ended and must be evaluated against a rubric — “is this response empathetic,” “does this summary preserve the key facts” — use a stronger model with a structured judge prompt to score. Pin the judge model. Score 1–5 against named criteria, not a single “is it good” question.³

The mistake: defaulting to LLM-as-judge for everything because it’s the most flexible. LLM judges are slow, expensive, and noisier than they look. Use programmatic scorers wherever possible. Reserve judges for the genuinely subjective.

3. The regression gate

The regression gate is the policy that says prompt changes don’t ship without an eval pass. In CI terms, it’s a job that runs the harness on every PR that touches a prompt, model parameter, or system instruction. If the score drops below threshold or any flagged case fails, the PR doesn’t merge.

This is where most teams fold. Running the eval is fine. Blocking deployment when the eval fails is hard, because (a) someone has to maintain the test set, (b) someone has to investigate every regression, (c) the gate slows down the team’s ship velocity from “minutes” to “hours.”

That’s the cost of production discipline. Federal Reserve SR 11-7 has required exactly this for financial models since 2011 — what they call “ongoing monitoring” with “specific procedures and triggers” for when a model is unfit for use.⁴ The same standard applies to LLM features now; most teams just haven’t been told it does yet.

4. Drift detection

A regression gate catches prompt changes. Drift detection catches model changes — when your provider silently rolls a new version, fine-tunes, or rate-limits in a way that affects output quality. NIST’s AI Risk Management Framework calls this out under MEASURE 2 (performance and robustness) as an ongoing requirement, not a launch-only one.⁵

In practice, this means: run the eval harness on a schedule (nightly is fine for most features) against a stable subset of cases. Track the score over time. Alert on a drop. Investigate.

You will discover that vendor API behavior changes more than the changelog admits. This is not gossip; it’s been documented across multiple LLM providers, including silent updates to base models that shifted output distributions in production-affecting ways.⁶ The eval harness is what tells you first, before your customers do.

What anti-patterns look like

If your team has any of these, you have a vibe check, not an eval harness:

The test set lives in someone’s head or in a one-off script that gets re-run manually. It must be in version control alongside the prompt.
The “score” is “looks good to me.” Not reproducible.
The eval runs only when someone remembers to run it. Should be in CI.
Failing eval doesn’t block merge. Then it’s a status check, not a gate.
Every test case has the same scorer. Programmatic and judge scorers should be mixed; using one for everything is a sign of laziness.
The judge model is the same as the production model. Reduces signal — the judge can have the same blind spots as what’s being tested.
Adversarial cases aren’t represented. OWASP’s prompt-injection category alone should generate 5–15 test cases for any user-facing LLM feature.²

The minimum viable harness for a five-engineer team

You don’t need a platform. You need:

A evals/ directory in the repo with a cases.jsonl file. One case per line: {"id": ..., "input": ..., "expected": ..., "scorer": "programmatic" | "judge" | "exact"}.
A evals/scorers/ directory with a function for each programmatic scorer and the judge-prompt for the LLM-as-judge cases.
A runner — typically 50–100 lines of Python or TypeScript — that loads cases, calls the model under test, runs scorers, and emits a JSON report (per-case pass/fail + aggregate score).
A CI job that runs the harness on every PR touching prompts/, evals/, or model config. The job fails if aggregate score drops below the configured threshold or any case marked gating: true fails.
A nightly job that runs the same harness against the stable case set and posts the score to a tracked location (Slack, KV, a dashboard — anywhere durable).

That’s it. The above is achievable in a week of focused work for someone who’s done it before. The hard part is curating the test set; the runner is mechanical.

If you’ve been thinking about evals as a future investment for after the feature is built, you have it inverted. The eval harness is what makes the feature shippable in the first place — the rest of “production AI” is downstream of it.

Why this matters for hiring and outsourcing

Most LLM-integration packages priced at $497 or even $5,000 don’t ship an eval harness. The build is “wire Claude into your app” with a vibe-check at the end. The work doesn’t survive contact with real production load — not because the integration was wrong, but because there’s no instrument that can tell when it goes wrong.

Production AI without an eval harness is not production AI. It’s a demo that happens to be reachable from the internet.

The Foundation SKU at Mastascusa Holdings exists because this gap is consistent. Every two-week Foundation engagement ships:

A test set of 50–150 cases curated against your real use case
Programmatic and LLM-as-judge scorers per case type
A regression gate wired into your CI
Drift-detection nightly runs with alerting
Documentation your team can extend without me

That’s what mastascusa.com/build is selling. The article you just read is what the deliverable looks like before it lives in your repo.

If you’re scoping AI features and “evals” is still a vibe in your head, tell me what you’re building and we’ll see if a Foundation engagement is the right fit. If you’re an audit buyer and you want to read how the same eval discipline gets scored under our methodology, /audit/methodology covers it under the Process Documentation pillar.

Sources

Stanford Institute for Human-Centered AI, The 2025 AI Index Report, Responsible AI chapter (2024 incident data: 233 documented incidents, 56.4% YoY increase). https://hai.stanford.edu/ai-index/2025-ai-index-report/responsible-ai ↩
OWASP Gen AI Security Project, OWASP Top 10 for LLM Applications 2025 — see LLM01 Prompt Injection for the canonical taxonomy of patterns to include in adversarial test cases. https://genai.owasp.org/llm-top-10/ ↩ ↩²
Anthropic, Building effective evaluations — official guidance on rubric-based LLM-as-judge scoring, including pinning the judge model and using named criteria. https://docs.anthropic.com/en/docs/build-with-claude/develop-tests ↩
Board of Governors of the Federal Reserve System / Office of the Comptroller of the Currency, SR 11-7: Supervisory Guidance on Model Risk Management, April 4, 2011. Section V — Ongoing Monitoring requires “specific procedures and triggers” for when a model is unfit for use. https://www.federalreserve.gov/boarddocs/srletters/2011/sr1107.htm ↩
National Institute of Standards and Technology, Artificial Intelligence Risk Management Framework (AI RMF 1.0), NIST AI 100-1, January 26, 2023. MEASURE 2 (performance and robustness) explicitly requires ongoing measurement, not launch-only assessment. https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf ↩
National Institute of Standards and Technology, Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, NIST AI 600-1, July 26, 2024. The Profile names model-update risk explicitly under “Information Integrity” and “Value Chain and Component Integration.” https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf ↩