A Concrete Evaluation Framework for LLM-Powered Pipelines
The Model Landscape Changes Every 90 Days.
Does Your Evaluation Strategy?
Over the past couple of years I noticed a pattern: I was constantly distracted by the "model du jour." Six months ago I picked a model that performed well on my test documents and suited the use case, then moved on to the next task. Since then, there have been twelve significant model releases. Three of them specifically claimed improvement on document understanding tasks. Two of them are cheaper than what is running now. One of them might outperform the current model on the specific extraction task that matters most.
I needed a way to know whether to switch models or whether each release was just this week's AI slop. When I realized how much time I was wasting on that vetting, I saw an opportunity to create an approach and a framework, because most teams tasked with AI engineering are not "engineers" by training and tend to lack a testing mindset.
This is not because they are lazy. The problem is that most teams don’t have an evaluation strategy — they have an evaluation event. They ran some tests before launch, the tests were good enough, they shipped. There’s no mechanism to re-run those tests when a new model drops, no structured way to compare candidates, no definition of “better” that isn’t someone’s gut feeling.
This article discusses that approach.
Why Generic Benchmarks Lie to You
Everyone has a plan until they get punched in the face. — Mike Tyson
When a new model releases, the announcement includes benchmark scores. MMLU, HumanEval, MATH, HellaSwag — a table of numbers that says this model is 4.2 points better than the last one. Those benchmarks tell you almost nothing useful about whether that model is better for your use case.
A benchmark is a standardized test against a fixed dataset that the model has likely seen, directly or indirectly, during training. It measures general capability across a broad distribution of tasks. Your pipeline is not a broad distribution of tasks. It’s a specific, narrow set of inputs — your documents, your schema, your domain — and the benchmark distribution doesn’t match your production distribution.
I’ve seen teams select models based entirely on leaderboard position and be genuinely surprised when performance degraded on their actual workload. A model that scores well on general reasoning benchmarks can still butcher entity extraction from financial disclosure documents, because financial disclosure documents are not well-represented in MMLU.
The only benchmark that matters is the one you build from your own data. Which brings us to the framework.
The Framework: Five Steps That Turn Model Selection into an Engineering Discipline
This isn’t a new idea — it’s borrowed directly from software testing. What BDD and specification testing did for application development, this framework does for AI model selection: it turns a subjective judgment call into a repeatable, automatable process.
1. Standardize Task Schemas
Before you can evaluate anything, you need to define what “correct” looks like. For AI pipelines processing structured outputs, this means defining the input/output contract for each task.
Treat each task as a typed transaction:
- Input: document (image or text), optional context, optional ontology
- Output: structured JSON conforming to a defined schema
- Contract: the schema is the definition of correctness
Entity extraction from a KYC document, for example, has a clear contract: the output must be a valid JSON array where each element has a type field that is a member of your entity taxonomy, a value field, and a confidence score. Anything that doesn't conform to that schema is a failure before you even look at the values.
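As a minimal sketch of what that contract looks like in code (the taxonomy and field names below are illustrative, not a prescribed format):

```python
# Hypothetical entity taxonomy for a KYC extraction task.
ENTITY_TAXONOMY = {"PERSON", "ORGANIZATION", "DATE", "ACCOUNT_NUMBER"}

def conforms(output) -> bool:
    """Return True only if the model output satisfies the contract:
    a list of objects, each with a type from the taxonomy, a string
    value, and a confidence score in [0, 1]."""
    if not isinstance(output, list):
        return False
    for item in output:
        if not isinstance(item, dict):
            return False
        if item.get("type") not in ENTITY_TAXONOMY:
            return False
        if not isinstance(item.get("value"), str):
            return False
        conf = item.get("confidence")
        if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
            return False
    return True

# A conforming output passes; a free-text answer fails the contract
# before we ever compare values.
good = [{"type": "PERSON", "value": "Jane Doe", "confidence": 0.97}]
bad = "Jane Doe is the beneficial owner."
```

In practice you'd likely express this as a JSON Schema or Pydantic model, but the point stands: conformance is a yes/no question you can answer without reading the values.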
This step is unglamorous and most teams skip it. Don’t. The schema is your specification. Without it, you’re evaluating vibes.
2. Build Your Gold Set
The gold set is the foundation of everything else. It’s a collection of real inputs paired with verified correct outputs — the ground truth your evaluation runs against.
This is where most frameworks break down in practice, because teams use synthetic data or toy examples instead of real production documents. Synthetic data is easy to generate and tells you almost nothing. Your model will face your documents, not the documents someone generated to approximate your documents.
A few principles for building a gold set that actually works:
Start silver, promote to gold. Generate candidate annotations synthetically or with a strong model, then have a subject matter expert review and correct them. Hand-annotating everything from scratch, while it produces the highest quality, is generally abandoned before you have enough examples.
25 examples to start, 100 before a production decision. Below 25 you’re not learning much. Below 100 you don’t have enough statistical reliability to make a defensible choice between two close candidates. 500 is better.
Cover the full distribution. Happy path documents, edge cases, adversarial cases, documents that have caused failures in production. If your gold set only contains clean, well-structured documents, you’ll select for models that perform well on clean, well-structured documents — which is not what you’ll get at 2am on a Tuesday.
Expand and rotate. A gold set that never changes becomes something your team implicitly optimizes for. Refresh it on a schedule — every six months is a reasonable cadence — with new examples from production.
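The silver-to-gold workflow can be sketched as a simple record with a review status. The field names here are assumptions for illustration, not a required format:

```python
from dataclasses import dataclass, field

@dataclass
class GoldExample:
    """One gold-set record: a real input paired with verified output."""
    doc_id: str
    input_text: str
    expected: list                            # verified structured output
    status: str = "silver"                    # "silver" until an SME reviews it
    tags: list = field(default_factory=list)  # e.g. ["edge-case", "adversarial"]

def promote(example: GoldExample, reviewer: str) -> GoldExample:
    """Mark a silver example as gold after expert review,
    recording who signed off on it."""
    example.status = "gold"
    example.tags.append(f"reviewed-by:{reviewer}")
    return example
```

Tagging each example by category (happy path, edge case, adversarial) also makes it trivial to check whether your set actually covers the full distribution.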
3. Define Assertions, Not Just Test Cases
A test case is an input paired with an expected output. An assertion is the claim you’re making about the relationship between them. This distinction matters more than it sounds.
For structured output tasks, most assertions are purely deterministic and don’t require an LLM-as-judge at all:
- Does the output conform to the schema? Pure schema validation.
- Are the extracted entity types valid members of the ontology? Set membership check.
- Does the F1 score against the gold set meet the threshold? Arithmetic.
LLM-as-judge evaluation is useful — but only for the genuinely subjective things. Whether an extracted assertion is semantically faithful to the source text, for example, is hard to express as a deterministic rule. Whether an entity type is a valid string from a defined vocabulary is not.
The mistake I see constantly: teams reach for LLM-as-judge as the default evaluation method because it feels sophisticated, and in doing so they introduce cost, latency, and non-determinism into evaluations that should be running in milliseconds against a schema validator. Evaluation needs to be objective — use deterministic assertions for deterministic questions, and LLM-as-judge only for the rest.
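Two of those deterministic checks can be sketched in a few lines; scoring entities as (type, value) pairs is one reasonable choice among several, not the only way to do it:

```python
def entity_f1(predicted: list, gold: list) -> float:
    """Micro F1 over (type, value) pairs. Pure arithmetic,
    no LLM judge, runs in microseconds."""
    pred = {(e["type"], e["value"]) for e in predicted}
    true = {(e["type"], e["value"]) for e in gold}
    if not pred and not true:
        return 1.0  # both empty: perfect agreement
    tp = len(pred & true)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def types_in_ontology(predicted: list, ontology) -> bool:
    """Set-membership check: every extracted type is a known type."""
    return {e["type"] for e in predicted} <= set(ontology)
```

Everything these functions answer is an objective question. Reserve the judge for the residue that genuinely isn't.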
4. Build and Version Test Case Definitions
Test cases are the pairing of a gold set example with its assertions. They’re the executable unit of your evaluation — the equivalent of a unit test in your CI pipeline.
A few things that matter here that most teams miss:
Version your test cases alongside your other assets, like your ontology. If your entity taxonomy changes, your test cases need to change with it. A test case that was correct against version 1 of your schema may be incorrect against version 2. If they’re not versioned together, you’ll get false positives and never know.
Cover all three categories explicitly:
- Happy path: clean documents, unambiguous entities, high-confidence extractions
- Edge cases: ambiguous entity types, partial information, poor document quality
- Adversarial: documents designed to stress the model — unusual formatting, multilingual content, documents where the correct answer is “nothing to extract here”
The adversarial category is the one teams skip. It’s also the one that predicts production failures.
5. Runner → Dashboard
The runner is the execution layer: for each candidate model, for each test case, run inference, collect the output, score it against the assertions, and record the result.
The critical design principle: the runner must be model-agnostic. Any model is just an inference endpoint. GPT-4o, Claude, Qwen running locally on your hardware, a fine-tuned model on Azure — the runner doesn’t care. It sends a prompt, receives a response, scores it. The model identifier is a configuration parameter, not a hardcoded dependency.
This is what enables the horse-race. You run the same test suite against five candidate models and get a comparable score for each. The dashboard then answers the questions that actually matter for a decision:
- Which model has the highest F1 on entity extraction across the gold set?
- Which model has the fewest ontology violations?
- Where does each model fail specifically — not just an aggregate score, but which test cases failed?
- How many tokens were consumed?
- What was the average inference speed?
- What was the cost per inference?
- How does the leading candidate compare to the current production model?
That last question is the regression view. Before you replace your current model with a newer one, you need to know definitively: is it better on the things that matter, and is it not worse on anything that currently works?
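The regression view can be expressed as a per-test-case diff between the production model's scores and the candidate's, rather than a single aggregate number. A minimal sketch, assuming scores are keyed by case ID:

```python
def regression_check(current: dict, candidate: dict, tolerance: float = 0.0):
    """Compare candidate scores against the production model per test
    case. Returns (improved, regressed) lists of case IDs; a nonzero
    tolerance ignores differences smaller than measurement noise."""
    improved, regressed = [], []
    for case_id, cur_score in current.items():
        cand_score = candidate.get(case_id, 0.0)
        if cand_score > cur_score + tolerance:
            improved.append(case_id)
        elif cand_score < cur_score - tolerance:
            regressed.append(case_id)
    return improved, regressed
```

An aggregate score can mask a regression: a candidate can raise the mean while breaking specific cases that currently work, and the `regressed` list is what surfaces that.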
Evaluation as a Discipline, Not an Event
The value of this framework isn’t in any individual evaluation run. It’s in the cadence.
Every 90 days — roughly the cycle time for meaningful model releases — you run the suite against your current model and against the most promising new candidates. You get a scored comparison. You make a data-driven decision about whether to upgrade, stay, or wait.
That’s it. The framework turns a scary, expensive, opinionated decision into a routine process. The teams that will use AI effectively over the next five years are not the ones who pick the best model in 2025 and stick with it. They’re the ones who build the infrastructure to keep picking the best model as the landscape evolves — and who have the documented, reproducible evidence to justify every decision they make.
The model landscape changes every 90 days.
Your evaluation strategy should too.
I’ve built an open evaluation harness — quintus-tester — that implements this framework for document intelligence pipelines. The full implementation, schemas, and gold set curation tooling are available to Plaiground AI Advisory paid subscribers.