Agent Evaluation Suites: Testing What Your Agent Does
Unit tests cover deterministic functions; agent loops are not deterministic. Most production agent failures live in that evaluation gap, and it is also where regressions are easiest to catch with a small amount of disciplined infrastructure. The agents that survive contact with real users are the ones with a regression suite the team can run on every model bump.
Three Dimensions of Agent Eval
An agent has three things worth evaluating, and a useful eval suite covers all three. Confusing them is the most common mistake.
- Output correctness. Did the final answer or final action achieve the user's goal?
- Trajectory quality. Did the agent take a sensible path to get there? Were the tool calls reasonable, the loop short, the reasoning sound?
- Cost and latency. Did the agent stay inside the acceptable token budget and time budget?
A model upgrade can hold output correctness steady while degrading trajectory: same right answer, but ten tool calls instead of three. The cost regression is invisible in an output-only eval. A new tool can preserve trajectory while breaking outputs: the agent uses the tool with the same pattern, but the tool now returns subtly wrong data. The eval suite has to detect both.
Building a Labelled Eval Set
The labelled set is the single highest-value asset for an agent in production. The structure that works:
{
  "id": "incident_triage_017",
  "input": {
    "alert": "Database connection pool exhausted on shard 3",
    "context": { ... }
  },
  "expected": {
    "classification": "database_capacity",
    "severity": "P2",
    "tools_used": ["query_metrics", "lookup_runbook"],
    "tools_not_used": ["page_oncall"],
    "max_steps": 6,
    "max_cost_usd": 0.08
  },
  "rubric": "Should classify as database capacity, not connectivity. Should consult the metrics tool before paging. P2 because shard isolation prevents customer impact."
}
The labels are explicit about what should and should not happen. The rubric explains the reasoning so future maintainers do not lose the intent. Aim for 50 to 150 cases at the start, weighted toward the failure modes you have already seen in production. The eval set grows with the agent.
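A typed version of the case format makes malformed labels fail at load time rather than mid-run. The sketch below assumes the field names from the example above; `EvalCase` and `validateEvalCase` are hypothetical names, and `context` is left untyped because its contents are not specified here.

```typescript
// Shape of one labelled case, mirroring the JSON example above.
// EvalCase and validateEvalCase are illustrative names, not a library API.
interface EvalCase {
  id: string;
  input: { alert: string; context?: unknown };
  expected: {
    classification: string;
    severity: string;
    tools_used: string[];
    tools_not_used: string[];
    max_steps: number;
    max_cost_usd: number;
  };
  rubric: string;
}

// Returns a list of problems; an empty list means the case is usable.
function validateEvalCase(c: EvalCase): string[] {
  const problems: string[] = [];
  if (c.id.trim() === "") problems.push("id is empty");
  if (c.rubric.trim() === "") problems.push("rubric is empty");
  if (c.expected.max_steps <= 0) problems.push("max_steps must be positive");
  if (c.expected.max_cost_usd <= 0) problems.push("max_cost_usd must be positive");
  // A tool cannot be both required and forbidden.
  const overlap = c.expected.tools_used.filter((t) =>
    c.expected.tools_not_used.some((n) => n === t)
  );
  for (const t of overlap) {
    problems.push(`tool "${t}" is both expected and forbidden`);
  }
  return problems;
}

// Example case; values come from the JSON above, with context omitted.
const triageCase: EvalCase = {
  id: "incident_triage_017",
  input: { alert: "Database connection pool exhausted on shard 3" },
  expected: {
    classification: "database_capacity",
    severity: "P2",
    tools_used: ["query_metrics", "lookup_runbook"],
    tools_not_used: ["page_oncall"],
    max_steps: 6,
    max_cost_usd: 0.08,
  },
  rubric: "Should classify as database capacity, not connectivity.",
};

const caseProblems = validateEvalCase(triageCase); // empty for this case
```

Running a check like this when the suite loads keeps contradictory labels from silently skewing the trajectory metrics later.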
Where do the cases come from? Production traces. Every interesting failure, every customer escalation, every weird trajectory in the observability tool becomes an eval case. Production is the corpus.
Trajectory Evaluation
Trajectory eval asks: was the path reasonable? The metrics that travel well:
- Steps taken vs expected. If the labelled max is 6 and the agent took 14, that is a regression even if the answer is right.
- Tool selection precision and recall. Precision: of the tools the agent called, how many were expected? Recall: of the expected tools, how many did the agent actually call?
- Argument validity. Did the agent pass sensible arguments to each tool, or did it hallucinate fields?
- Loop completion. Did the agent terminate on a final answer, or hit max_steps?
These are programmatic checks. No LLM judge needed. They run on every eval pass in seconds and they catch the most common regressions.
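The four checks above can be sketched as one pure function over a recorded trace. This is a minimal sketch under stated assumptions: the `Trajectory` shape, the `ToolSchemas` map of allowed argument names, and the function name `checkTrajectory` are all hypothetical, and one tool call is counted as one step.

```typescript
// Hypothetical trace shape: one entry per tool call, plus how the loop ended.
interface ToolCall { tool: string; args: { [key: string]: unknown } }
interface Trajectory { calls: ToolCall[]; terminated: "final_answer" | "max_steps" }
interface TrajectoryLabels { tools_used: string[]; tools_not_used: string[]; max_steps: number }
// Allowed argument names per tool; a stand-in for real tool schemas.
type ToolSchemas = { [tool: string]: string[] };

interface TrajectoryReport {
  withinStepBudget: boolean;
  toolPrecision: number;
  toolRecall: number;
  forbiddenToolsUsed: string[];
  unknownArguments: string[];
  completed: boolean;
}

function unique(xs: string[]): string[] {
  return xs.filter((x, i) => xs.indexOf(x) === i);
}

function checkTrajectory(t: Trajectory, labels: TrajectoryLabels, schemas: ToolSchemas): TrajectoryReport {
  const used = unique(t.calls.map((c) => c.tool));
  const hits = used.filter((u) => labels.tools_used.indexOf(u) !== -1);
  // Argument validity: flag any argument name the tool's schema does not list.
  const unknownArguments: string[] = [];
  for (const call of t.calls) {
    const allowed = schemas[call.tool] || [];
    for (const argName of Object.keys(call.args)) {
      if (allowed.indexOf(argName) === -1) unknownArguments.push(`${call.tool}.${argName}`);
    }
  }
  return {
    withinStepBudget: t.calls.length <= labels.max_steps,
    // By convention here, an empty denominator scores 1 (nothing to get wrong).
    toolPrecision: used.length === 0 ? 1 : hits.length / used.length,
    toolRecall: labels.tools_used.length === 0 ? 1 : hits.length / labels.tools_used.length,
    forbiddenToolsUsed: used.filter((u) => labels.tools_not_used.indexOf(u) !== -1),
    unknownArguments,
    completed: t.terminated === "final_answer",
  };
}

// Made-up trace: the agent paged on-call and invented an "urgency" field.
const trace: Trajectory = {
  calls: [
    { tool: "query_metrics", args: { shard: 3 } },
    { tool: "lookup_runbook", args: { topic: "connection_pool" } },
    { tool: "page_oncall", args: { team: "db", urgency: "high" } },
  ],
  terminated: "final_answer",
};
const trajectoryLabels: TrajectoryLabels = {
  tools_used: ["query_metrics", "lookup_runbook"],
  tools_not_used: ["page_oncall"],
  max_steps: 6,
};
const toolSchemas: ToolSchemas = {
  query_metrics: ["shard"],
  lookup_runbook: ["topic"],
  page_oncall: ["team"],
};
const trajectoryReport = checkTrajectory(trace, trajectoryLabels, toolSchemas);
```

In this example recall is 1 (both expected tools were called) while precision is 2/3, and the forbidden call and the invented argument are reported separately so a failure points at its cause.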
Output Evaluation: When LLM-as-Judge Works
Output correctness often resists programmatic checking. Was the summary good? Did the response answer the question? Did the agent's plan make sense? An LLM judge model can score these, with two important caveats.
The judge is reliable when the rubric is concrete and the judge has the reference answer. It is unreliable when asked to grade open-ended quality without a reference. "Is this a good response?" produces noise. "Does this response contain the three facts in the rubric, in any order?" produces signal.
// Judge prompt that produces signal.
// `candidate` is the agent-written summary being graded.
const judgePrompt = (candidate: string) => `
You are evaluating an incident summary against a rubric.
REQUIRED FACTS (all three must be present):
1. The root cause was database connection pool exhaustion on shard 3.
2. The mitigation was a pool size increase, not a failover.
3. Customer impact was limited to users on shard 3.
SUMMARY TO EVALUATE:
${candidate}
Return JSON: { "fact_1_present": bool, "fact_2_present": bool,
  "fact_3_present": bool, "hallucinations": [] }
`;
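The judge's reply still has to be parsed defensively, since a model can return malformed JSON. The sketch below assumes the reply arrives as a raw string and that `hallucinations` is a list of strings; `parseJudgeVerdict` and `verdictPasses` are hypothetical names, and treating any reported hallucination as a failure is one possible policy, not the only one.

```typescript
interface JudgeVerdict {
  fact_1_present: boolean;
  fact_2_present: boolean;
  fact_3_present: boolean;
  hallucinations: string[];
}

// Parse the judge's raw reply; return null when it does not match the requested shape.
function parseJudgeVerdict(raw: string): JudgeVerdict | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch (_err) {
    return null; // not JSON at all
  }
  if (typeof data !== "object" || data === null) return null;
  const d = data as { [key: string]: unknown };
  const f1 = d["fact_1_present"];
  const f2 = d["fact_2_present"];
  const f3 = d["fact_3_present"];
  const h = d["hallucinations"];
  if (typeof f1 !== "boolean" || typeof f2 !== "boolean" || typeof f3 !== "boolean") return null;
  if (!Array.isArray(h)) return null;
  const hallucinations: string[] = [];
  for (const item of h) {
    if (typeof item !== "string") return null;
    hallucinations.push(item);
  }
  return { fact_1_present: f1, fact_2_present: f2, fact_3_present: f3, hallucinations };
}

// Assumed policy: pass only if all three facts are present and nothing was invented.
function verdictPasses(v: JudgeVerdict): boolean {
  return v.fact_1_present && v.fact_2_present && v.fact_3_present && v.hallucinations.length === 0;
}

// Made-up judge reply in which the third fact is missing.
const judgeReply =
  '{"fact_1_present": true, "fact_2_present": true, "fact_3_present": false, "hallucinations": []}';
const judgeVerdict = parseJudgeVerdict(judgeReply);
const summaryPassed = judgeVerdict !== null && verdictPasses(judgeVerdict); // false: fact 3 is missing
```

Counting unparseable replies as their own failure category, rather than as a failed case, keeps judge flakiness from being mistaken for agent regressions.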
Calibrate the judge by having a human grade 20 to 30 cases first, then comparing judge scores to human scores. If agreement is below 80%, the rubric is too vague; tighten it before relying on the judge for thousands of cases.
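The calibration step reduces to a simple agreement rate over the human-graded cases. A minimal sketch, assuming both the human and the judge give a pass/fail label per case; `GradedPair`, `agreementRate`, and `disagreeingCases` are hypothetical names, and the five cases below are made up for brevity (a real calibration set would have the 20 to 30 mentioned above).

```typescript
// One graded case: the human's pass/fail label and the judge's.
interface GradedPair { caseId: string; human: boolean; judge: boolean }

// Fraction of cases where judge and human gave the same label.
function agreementRate(pairs: GradedPair[]): number {
  if (pairs.length === 0) throw new Error("no graded cases");
  const matches = pairs.filter((p) => p.human === p.judge).length;
  return matches / pairs.length;
}

// The cases worth reading by hand when tightening the rubric.
function disagreeingCases(pairs: GradedPair[]): string[] {
  return pairs.filter((p) => p.human !== p.judge).map((p) => p.caseId);
}

const AGREEMENT_THRESHOLD = 0.8; // the 80% bar from the text

// Made-up grades for illustration.
const graded: GradedPair[] = [
  { caseId: "case_01", human: true, judge: true },
  { caseId: "case_02", human: false, judge: false },
  { caseId: "case_03", human: true, judge: false },
  { caseId: "case_04", human: true, judge: true },
  { caseId: "case_05", human: false, judge: false },
];

const agreement = agreementRate(graded); // 4 of 5 labels match
const judgeIsCalibrated = agreement >= AGREEMENT_THRESHOLD;
const toReview = disagreeingCases(graded); // ["case_03"]
```

The list of disagreeing cases is usually more useful than the rate itself, because each one shows where the rubric leaves room for interpretation.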
Cost and Latency Regression
Every eval run records total tokens, total cost, total wall time, and number of steps. A regression here is silent in output-only evals and is often the first signal that a model bump needs investigation. The cost line in the eval report is as important as the accuracy line.
A useful default: fail the suite if median cost per case rises more than 25% or median latency rises more than 30% versus the prior baseline, even if accuracy holds. The investigation is cheap; the silent cost climb is not.
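That default can be expressed as a small gate that compares medians between two runs. A minimal sketch, assuming each run is stored as per-case arrays of cost and latency; `RunStats` and `checkCostAndLatency` are hypothetical names, and the numbers are made up.

```typescript
// Per-case measurements from one eval run (hypothetical shape).
interface RunStats { costUsd: number[]; latencyMs: number[] }

function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty list");
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Fractional rise over the baseline, e.g. 0.4 for a 40% increase.
function relativeIncrease(baselineValue: number, currentValue: number): number {
  if (baselineValue <= 0) throw new Error("baseline must be positive");
  return (currentValue - baselineValue) / baselineValue;
}

// Returns one message per breached limit; an empty list means the gate passes.
function checkCostAndLatency(
  baseline: RunStats,
  current: RunStats,
  maxCostRise = 0.25,
  maxLatencyRise = 0.3
): string[] {
  const pct = (x: number) => (x * 100).toFixed(0);
  const failures: string[] = [];
  const costRise = relativeIncrease(median(baseline.costUsd), median(current.costUsd));
  const latencyRise = relativeIncrease(median(baseline.latencyMs), median(current.latencyMs));
  if (costRise > maxCostRise) {
    failures.push(`median cost up ${pct(costRise)}% (limit ${pct(maxCostRise)}%)`);
  }
  if (latencyRise > maxLatencyRise) {
    failures.push(`median latency up ${pct(latencyRise)}% (limit ${pct(maxLatencyRise)}%)`);
  }
  return failures;
}

// Made-up runs: median cost rises from 0.05 to 0.07, median latency from 1000 to 1200 ms.
const baselineRun: RunStats = { costUsd: [0.04, 0.05, 0.06], latencyMs: [900, 1000, 1100] };
const currentRun: RunStats = { costUsd: [0.05, 0.07, 0.08], latencyMs: [1000, 1200, 1300] };
const gateFailures = checkCostAndLatency(baselineRun, currentRun); // cost breaches, latency does not
```

Medians are used instead of means so that one pathological case does not trip the gate on its own.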
Where to Run the Suite
Three contexts, with different cadences:
- Pre-merge on agent code changes. A fast subset (10–20 cases) runs on every PR. Cheap, and catches obvious breakage.
- Nightly on the full set. The complete eval runs against the current production configuration. Report goes to the team channel.
- On every model or tool change. A new model version, a tool schema change, or a prompt template change triggers the full suite before deploy.
Continuous eval in production (sampling live traffic and judging it asynchronously) is a fourth option. It is valuable for very high-volume agents but adds infrastructure cost, so it is not the first investment for most teams.
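The three cadences can share one entry point that picks cases by trigger. This is a sketch only: the `Trigger` values, the `smoke` flag that marks a case as part of the fast subset, and `selectCases` are all hypothetical, and how the trigger reaches the script depends on the CI system in use.

```typescript
type Trigger = "pull_request" | "nightly" | "model_or_tool_change";

// `smoke` marks cases chosen by the team for the fast PR subset (an assumed convention).
interface SuiteCase { id: string; smoke: boolean }

function selectCases(trigger: Trigger, all: SuiteCase[], fastLimit = 20): SuiteCase[] {
  if (trigger === "pull_request") {
    // Fast subset: smoke cases only, capped so PR runs stay cheap.
    return all.filter((c) => c.smoke).slice(0, fastLimit);
  }
  // Nightly runs and model, tool, or prompt changes get the full set.
  return all;
}

// Made-up suite for illustration.
const suite: SuiteCase[] = [
  { id: "incident_triage_001", smoke: true },
  { id: "incident_triage_002", smoke: false },
  { id: "incident_triage_017", smoke: true },
];

const prCases = selectCases("pull_request", suite);
const nightlyCases = selectCases("nightly", suite);
```

Marking the fast subset explicitly, rather than sampling at random, keeps PR results comparable from one run to the next.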
Tools to Consider in 2026
Several frameworks make eval suites less work to build:
- Promptfoo. YAML-driven, runs locally or in CI, good fit for teams that want everything in the repo. Handles judges, golden-dataset comparison, cost tracking.
- DeepEval. Pytest-integrated, useful when the team already lives in Python testing. Strong on RAG-specific metrics.
- LangSmith. If you are on LangGraph or LangChain, the integration cost is near zero. Strong dashboard, paid service.
- Azure AI Studio evaluations. Native to Azure AI Foundry deployments; covers grounding, relevance, fluency on RAG flows. Useful when the platform constraint is already Azure.
The choice matters less than the discipline. A team that builds a 100-case suite in a week and runs it on every change tends to ship more reliable agents than a team that evaluates three frameworks for three months and runs no evals at all.
The Failure Mode Eval Catches Best
Silent degradation on a model upgrade. The provider releases a new model checkpoint. The team upgrades. Accuracy on visible tasks holds. Cost climbs. Latency climbs. The trajectory shifts to use more tools. A week later, a user notices the agent feels slower. The team has no evidence either way.
An eval suite turns this into a single graph. Old model versus new model, on the same 150 cases, with median cost, median latency, and trajectory-step distribution. The team sees the regression in an hour, not a week.
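The trajectory-step part of that comparison can be sketched as a histogram per model plus a per-case diff. The `CaseRun` shape and both function names are hypothetical, and the step counts below are made up; the same pairing by case ID would apply to cost and latency.

```typescript
// Steps taken on one case in one run (hypothetical shape).
interface CaseRun { caseId: string; steps: number }

// Count how many cases finished in each number of steps.
function stepHistogram(runs: CaseRun[]): { [steps: string]: number } {
  const hist: { [steps: string]: number } = {};
  for (const r of runs) {
    const key = String(r.steps);
    hist[key] = (hist[key] || 0) + 1;
  }
  return hist;
}

// IDs of cases where the new run took more steps than the old run on the same case.
function casesWithMoreSteps(oldRuns: CaseRun[], newRuns: CaseRun[]): string[] {
  const oldSteps: { [caseId: string]: number } = {};
  for (const r of oldRuns) oldSteps[r.caseId] = r.steps;
  return newRuns
    .filter(
      (r) =>
        Object.prototype.hasOwnProperty.call(oldSteps, r.caseId) &&
        r.steps > oldSteps[r.caseId]
    )
    .map((r) => r.caseId);
}

// Made-up runs of the same three cases on two model versions.
const oldModelRuns: CaseRun[] = [
  { caseId: "a", steps: 3 },
  { caseId: "b", steps: 3 },
  { caseId: "c", steps: 4 },
];
const newModelRuns: CaseRun[] = [
  { caseId: "a", steps: 3 },
  { caseId: "b", steps: 7 },
  { caseId: "c", steps: 10 },
];

const oldHistogram = stepHistogram(oldModelRuns); // { "3": 2, "4": 1 }
const newHistogram = stepHistogram(newModelRuns); // { "3": 1, "7": 1, "10": 1 }
const regressedCases = casesWithMoreSteps(oldModelRuns, newModelRuns); // ["b", "c"]
```

Pairing by case ID matters: two runs can have similar aggregate distributions while individual cases move sharply in opposite directions.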
Start Small, Add to the Set Forever
The first eval suite does not need to be exhaustive. Fifty cases representing the top five workflows beats two thousand cases that nobody updates. Every production failure becomes an eval case before it is forgotten. Every customer escalation produces one or two new cases. Over a year, the suite grows to several hundred genuinely representative cases, and the agent's reliability story is the suite.
Agent code without an eval suite is a system the team is afraid to change. Agent code with one is a system the team can iterate on for years.