Hunter Gerlach
Evals are systematic methods for measuring the quality of AI and LLM outputs. They serve the same purpose as Test Automation
giving you confidence that your system behaves correctly - but they are fundamentally different.
Traditional tests give you certainty. Evals give you confidence. A unit test is binary: the function returns 4 or it does not, and you can verify that with 100% accuracy. AI outputs are non-deterministic - the same prompt can produce different responses each time, and "correct" is a spectrum, not a binary. No single eval can tell you the system is working perfectly. Instead, multiple evals across different dimensions combine to build a level of confidence about your system's quality - confidence that is measurable, trackable, and sufficient to make informed decisions, but never 100% certain.
This is a critical distinction for leaders and stakeholders to understand. Expecting evals to provide the same pass/fail certainty as traditional tests sets the wrong expectation and leads to either false confidence or unnecessary disappointment. Evals tell you "we are 85% confident the system produces helpful, accurate responses across our test scenarios" - and that level of confidence, tracked over time, is what you use to decide whether to ship, roll back, or invest more in quality.
The right confidence threshold depends on the use case. If your AI system is doing something that was previously impossible or prohibitively expensive, even modest confidence may be a massive improvement over the status quo. But if humans are already doing the task well, the bar is much higher - and getting there follows a pattern that Andrej Karpathy calls "the march of nines": every additional nine of reliability (90% to 99% to 99.9%) requires roughly as much effort as all the previous nines combined. Set your target confidence level based on what the use case actually requires and what existed before, not on an assumption that every AI system must reach near-perfection to be valuable.
An AI system is made up of many parts, and you can eval at different scopes:
While much of the eval conversation focuses on subjective quality judgments, some evals are fully deterministic. Checking that a required field is present in the output, that a response is valid JSON, that a summary stays under a token limit, or that no forbidden content appears - these are simple binary checks, just like traditional tests. They either pass or they fail.
These deterministic evals form the base of the eval pyramid (see Step 3). They are fast, cheap, and reliable. The layers above - model-graded evals and human evals - handle the judgments that code cannot make on its own, like whether a response is helpful, accurate, or appropriate in tone. A good eval strategy starts with deterministic checks where they make sense and layers on the more expensive, less certain eval types only where needed.
Evals and A/B tests answer different questions. Evals measure quality against a rubric or reference - you can run them with zero users. A/B tests compare two variants by measuring user behavior at scale - you need significant traffic to reach statistical significance.
In practice, the choice often depends on the number of users: low-traffic AI features may never get enough volume for meaningful A/B tests, making evals the primary quality signal. High-traffic features can and should use both. Evals are also essential for higher-risk or regulated use cases where you need to demonstrate quality against specific criteria before exposing users to a change, not after.
Work with domain experts to create a rubric with 3–5 evaluation criteria. For each criterion, document examples of good, acceptable, and poor outputs. This rubric is the foundation everything else builds on - without it, evals produce inconsistent, unreproducible results.
Start with 20-50 test cases that cover common scenarios, edge cases, adversarial inputs, and historical failures. It is perfectly reasonable to start with synthetic data - either model-generated examples or real data that has been altered to suit your needs. You do not need production traffic to get started. Grow the dataset over time from production data and Human-in-the-Loop reviews. A small, well-curated dataset is more valuable than a large, sloppy one.
Because eval costs can spiral quickly (see "Cost optimization" above), the order in which you build your eval suite matters. Like the testing pyramid, start with high-volume, low-cost evals at the base and work upward. The base layer is deterministic and nearly free. Each layer above adds nuance but also adds cost. The pyramid ensures you are not paying LLM-as-judge prices to catch problems that a simple format check would have caught for free:
Run evals locally during prompt engineering for fast feedback. Add evals to your Continuous Integration pipeline so that prompt, model, or retrieval changes are automatically evaluated on every change. Define quality thresholds and treat them the same way you treat failing tests - the build is red until the issue is fixed. This is the same principle behind Test Driven Development applied to AI systems. A companion practice, Eval-Driven Development, applies TDD's red-green-refactor cycle directly to evals.
Run evals on sampled live traffic to detect drift and regressions in real time. Offline evals validate before deployment; online evals verify that things are working in the wild. When online eval scores diverge from offline scores, it is a signal that your eval dataset no longer represents real-world usage.
Leverage both explicit and implicit user feedback as a source of new eval cases. Explicit feedback includes thumbs up/down, corrections, escalations, and complaints. Implicit feedback includes behavior signals like users rephrasing a question, abandoning a flow, copying an answer versus ignoring it, or escalating to a human. Bucket feedback to find patterns: individual complaints are noise; clusters of similar feedback reveal the true pattern worth evaluating. Once you identify a pattern, codify it as a new eval test case. This is the MLOps lifecycle applied to generative AI: users surface issues, patterns emerge, new evals catch regressions, and quality improves.
Track eval scores as trends over time, not just point-in-time snapshots. Share results with stakeholders to maintain visibility and trust. Treat your eval suite as a living artifact - regularly refresh datasets with new cases from production and user feedback to prevent staleness.
Check out these great links which can help you dive a little deeper into running the Evals practice with your team, customers or stakeholders.