You have built an agent. It has tools, memory, and a planning loop. It can retrieve information, take actions, and converse with users. But how do you know if it actually works well?
Testing an agent is fundamentally different from testing traditional software. With a regular function, you pass in inputs and check that the outputs match your expectations. With an agent, the same question can produce different valid answers, the agent might take different paths to reach the same goal, and “correct” is often subjective. The output is non-deterministic, the process is multi-step, and the quality dimensions are more nuanced than pass/fail.
This lesson covers how to evaluate agents rigorously - what to measure, how to measure it, and how to build evaluation into your development workflow so that quality improves continuously.
Ask a traditional function to add 2 + 2 and it always returns 4. Ask an agent “What should I do about our declining user retention?” and you might get a different answer every time - and multiple different answers could all be good. Temperature settings, model updates, and slight prompt variations all change the output.
An agent that books a flight might search by price first, then by time. Or it might search by time first, then filter by price. Both paths reach the same goal, but the sequence of tool calls is different. Testing that the agent took exact step X at position Y would be too rigid and would break with any reasonable change.
An agent that makes five decisions in sequence has five chances to make a mistake, and errors compound. A small mistake in step 2 might lead to a completely wrong outcome by step 5. Testing only the final output misses where things went wrong.
Is a summary “good”? Is a customer service response “helpful”? These are judgment calls that depend on context, user expectations, and organizational standards. Binary pass/fail testing is not sufficient.
Testing an agent is like evaluating a new employee during their probation period. You would not just check whether they produced the right final deliverable. You would also observe:
You would also look at their process, not just their output. If they got the right answer by sheer luck after a chaotic process, that is different from getting the right answer through a methodical, reliable approach.
Agent evaluation works the same way. You check the final output, the process, the efficiency, and the safety - and you do it systematically.
Every agent evaluation should assess four dimensions:
This is the most basic question: did the agent do what it was supposed to do? If you asked it to book a flight to Tokyo, did it book a flight to Tokyo?
What to measure:
Example metrics:

| Metric | How to Measure | Target |
|---|---|---|
| Task completion rate | Automated checks against expected outcomes | > 90% |
| Answer correctness | Human evaluation or automated fact-checking | > 85% |
| User satisfaction | Post-interaction surveys or thumbs up/down | > 4.0/5.0 |
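The task completion rate can be computed mechanically from logged results. A minimal sketch, assuming a simple list of result records (the field names are illustrative, not a fixed schema):

```python
# Sketch: computing task completion rate from evaluation results.
# The records and field names are assumptions for illustration.
results = [
    {"task_id": "t1", "completed": True},
    {"task_id": "t2", "completed": True},
    {"task_id": "t3", "completed": False},
    {"task_id": "t4", "completed": True},
]

completion_rate = sum(r["completed"] for r in results) / len(results)
meets_target = completion_rate > 0.90  # the > 90% target from the table
print(f"Task completion rate: {completion_rate:.0%}")
```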
An agent that achieves the goal but uses 50 API calls, takes 3 minutes, and costs $2 per query might not be viable. Efficiency measures how much it costs - in time, money, and compute - to get the job done.
What to measure:
Example metrics:

| Metric | How to Measure | Target |
|---|---|---|
| End-to-end latency | Time from user query to final response | < 10 seconds |
| Tokens per task | Sum of input + output tokens across all LLM calls | < 10,000 |
| LLM calls per task | Count of model invocations | < 5 |
| Cost per task | Token cost + API call costs | < $0.10 |
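Cost per task follows directly from token counts and pricing. A sketch with placeholder prices (these are not real model prices):

```python
# Sketch: estimating cost per task from token usage. The per-1K-token
# prices below are illustrative placeholders, not real model pricing.
INPUT_PRICE_PER_1K = 0.003   # dollars per 1,000 input tokens (assumed)
OUTPUT_PRICE_PER_1K = 0.015  # dollars per 1,000 output tokens (assumed)

def cost_per_task(input_tokens: int, output_tokens: int) -> float:
    """Token cost for one task, summed across all LLM calls."""
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K \
         + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

cost = cost_per_task(input_tokens=6000, output_tokens=1500)
print(f"Cost per task: ${cost:.4f}")  # under the < $0.10 target
```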
Real users do not send perfectly formatted, clear, unambiguous requests. They send typos, vague questions, contradictory instructions, and requests in unexpected formats. A robust agent handles these gracefully.
What to measure:
Example test cases:

| Test Case | Expected Behavior |
|---|---|
| Misspelled query | Agent still understands intent |
| Ambiguous request | Agent asks for clarification |
| Tool returns an error | Agent retries or uses a fallback |
| Contradictory instructions | Agent flags the contradiction |
| Empty or null input | Agent responds gracefully, does not crash |
| Very long input | Agent handles within context limits |
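These cases translate directly into automated checks: pair each awkward input with a predicate on acceptable behavior. A sketch with a stub agent standing in for the real one (the stub's behavior is invented purely to show the test shape):

```python
def stub_agent(query: str) -> str:
    """Hypothetical stand-in for a real agent call."""
    if not query.strip():
        return "I didn't catch a question there -- could you rephrase?"
    if query.strip().lower() in {"do the thing", "handle it"}:
        return "Could you clarify what you'd like me to do?"
    return f"Handling request: {query.strip()}"

# Each case pairs an awkward input with a predicate on acceptable behavior.
robustness_cases = [
    ("", lambda r: len(r) > 0),                          # empty input: no crash
    ("do the thing", lambda r: "clarify" in r.lower()),  # ambiguous: ask back
    ("whats ur refnd polcy", lambda r: len(r) > 0),      # typos: still responds
]

failures = [q for q, ok in robustness_cases if not ok(stub_agent(q))]
print(f"{len(robustness_cases) - len(failures)} of {len(robustness_cases)} checks passed")
```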
An agent with access to tools can do real damage. It might send emails it should not, delete data, or reveal sensitive information. Safety evaluation checks that the agent respects its boundaries.
What to measure:
Example test cases:

| Test Case | Expected Behavior |
|---|---|
| User asks agent to perform unauthorized action | Agent refuses and explains why |
| User tries to get agent to reveal system prompt | Agent declines |
| Agent encounters sensitive data during retrieval | Agent does not include it in the response |
| User asks agent to take an action outside its domain | Agent redirects to appropriate resource |
Agent evaluation requires two categories of metrics that serve different purposes:
System metrics tell you whether the agent is running well from an infrastructure perspective. These are the metrics your SRE team cares about.
| Metric | What It Tells You | How to Collect |
|---|---|---|
| Latency (p50, p95, p99) | How long users wait | Request timing in your application |
| Error rate | How often the agent fails entirely | Error counting in logs |
| Tokens per task | How much compute each task requires | LLM API response metadata |
| Cost per task | How much money each task costs | Token counts multiplied by pricing |
| Tool call success rate | How reliable external integrations are | Tool wrapper instrumentation |
| Throughput | How many requests the system handles | Request counting |
Quality metrics tell you whether the agent is doing a good job from the user’s perspective. These are harder to measure but more important.
| Metric | What It Tells You | How to Measure |
|---|---|---|
| Correctness | Is the answer right? | Ground truth comparison, human evaluation |
| Trajectory quality | Did the agent take a reasonable path? | Trajectory evaluation (see below) |
| Helpfulness | Did the user get what they needed? | User feedback, LLM-as-a-Judge |
| Safety compliance | Did the agent stay within bounds? | Red-team testing, policy checkers |
| Groundedness | Is the answer supported by evidence? | Source attribution checking |
| Coherence | Does the response make sense? | LLM-as-a-Judge, human evaluation |
When evaluating an agent, start from the outside and work your way in. This mirrors how users experience the agent and ensures you catch the most impactful issues first.
Start by treating the agent as a black box. Give it inputs and check the outputs. Do not look at how it arrived at the answer - just check whether the answer is correct.
How to do it:
Example test set:
| Input | Expected Output | Pass Criteria |
|---|---|---|
| “What is our refund policy?” | Includes 30-day window and receipt requirement | Contains key policy elements |
| “Cancel order #12345” | Order is cancelled, confirmation provided | Order status changed + confirmation message |
| “What time does the store close?” | Correct closing time for today | Matches actual hours |
When this is sufficient: For agents with clear, verifiable outputs where there is one right answer. If the agent’s job is to look up facts or execute well-defined actions, end-to-end testing covers most of what you need.
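A minimal black-box harness can be as simple as predicates over the output. A sketch; `run_agent` is a stub standing in for a call to your deployed agent, and the pass criterion mirrors the refund-policy row in the table above:

```python
def run_agent(query: str) -> str:
    """Stub standing in for a real agent invocation."""
    return "Refunds are accepted within 30 days with a valid receipt."

# Pass criteria are plain predicates over the output -- no inspection
# of how the agent got there.
test_cases = [
    {
        "input": "What is our refund policy?",
        "passes": lambda out: "30 day" in out.lower() and "receipt" in out.lower(),
    },
]

results = {case["input"]: case["passes"](run_agent(case["input"]))
           for case in test_cases}
print(results)
```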
When end-to-end testing is not sufficient - when you need to understand why the agent succeeded or failed - you open the box and inspect the trajectory.
A trajectory is the full sequence of the agent’s actions: every thought, tool call, observation, and decision, from receiving the input to producing the output.
Example trajectory:
1. User: "What was our revenue last quarter?"
2. Agent thinks: "I need to look up revenue data for Q2 2025"
3. Agent calls: search_financial_reports(query="Q2 2025 revenue")
4. Tool returns: [Q2 2025 Financial Summary document]
5. Agent thinks: "Found the report. Revenue was $12.4M"
6. Agent responds: "Our revenue last quarter (Q2 2025) was $12.4M,
up 8% from Q1 2025."
What to check in the trajectory:
Why trajectory evaluation matters: Two agents might produce the same final answer, but one took a clean, efficient path while the other made several wrong turns, called irrelevant tools, and got lucky. The first agent is more reliable. Trajectory evaluation reveals this difference.
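Recorded trajectories can be checked automatically. A sketch over a run shaped like the revenue example above; the step dictionaries and field names are assumptions, not a standard format:

```python
# Sketch: automated checks over a recorded trajectory. The step
# dictionaries and field names are illustrative assumptions.
trajectory = [
    {"type": "thought", "content": "I need to look up revenue for Q2 2025"},
    {"type": "tool_call", "tool": "search_financial_reports",
     "args": {"query": "Q2 2025 revenue"}},
    {"type": "tool_result", "content": "Q2 2025 Financial Summary"},
    {"type": "response", "content": "Revenue was $12.4M, up 8% from Q1."},
]

tool_calls = [s["tool"] for s in trajectory if s["type"] == "tool_call"]

# Check tool selection and step efficiency, not just the final answer.
correct_tools = tool_calls == ["search_financial_reports"]
efficient = len(trajectory) <= 6
print(f"correct tools: {correct_tools}, efficient: {efficient}")
```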
Trajectory evaluation checks the full execution path of the agent. This is one of the most powerful evaluation techniques because it catches problems that end-to-end testing misses.
| Component | Description | Example |
|---|---|---|
| User input | The original request | “Book me a flight to Tokyo next Tuesday” |
| Agent reasoning | The agent’s internal thoughts | “I need to search for flights on March 25” |
| Tool calls | Actions the agent took | search_flights(destination="Tokyo", date="2025-03-25") |
| Tool results | What the tools returned | List of 5 available flights |
| Agent decisions | Choices the agent made | Selected the cheapest direct flight |
| Final output | The response to the user | “I booked flight JL001, departing at 10:30 AM…” |
You can evaluate trajectories along several dimensions:
Tool Selection Accuracy: Did the agent pick the right tool for each step?
Good: Agent needs weather data -> calls get_weather()
Bad: Agent needs weather data -> calls search_web("weather forecast")
when get_weather() is available
Parameter Correctness: Did the agent pass the right arguments to each tool?
Good: search_flights(destination="NRT", date="2025-03-25")
Bad: search_flights(destination="Tokyo", date="next Tuesday")
(did not resolve "next Tuesday" to an actual date)
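Parameter checks like this can run automatically over recorded tool calls. A sketch that validates the flight-search parameters from the example above (the 3-letter-code rule is an assumption about this particular tool's contract):

```python
from datetime import date

# Sketch: validating recorded tool-call parameters, e.g. confirming that
# relative dates were resolved to ISO format. Parameter names mirror the
# example above; the 3-letter-code rule is an assumed tool contract.
def validate_flight_params(params: dict) -> list[str]:
    errors = []
    try:
        date.fromisoformat(params.get("date", ""))
    except ValueError:
        errors.append(f"date is not ISO-formatted: {params.get('date')!r}")
    if len(params.get("destination", "")) != 3:
        errors.append("destination is not a 3-letter airport code")
    return errors

good = validate_flight_params({"destination": "NRT", "date": "2025-03-25"})
bad = validate_flight_params({"destination": "Tokyo", "date": "next Tuesday"})
print(f"good: {good}, bad: {len(bad)} errors")
```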
Step Efficiency: Did the agent achieve the goal without unnecessary steps?
Efficient (3 steps):
1. Search flights
2. Select best option
3. Book flight
Inefficient (7 steps):
1. Search flights to Tokyo
2. Search flights to Osaka (not requested)
3. Compare Tokyo and Osaka options (not requested)
4. Search Tokyo hotels (not requested)
5. Go back to flights
6. Select a flight
7. Book flight
Error Handling: When a tool failed, did the agent recover appropriately?
Good: Tool returns error -> Agent retries with modified parameters
Good: Tool returns error -> Agent informs user and suggests alternatives
Bad: Tool returns error -> Agent hallucinates a result
Bad: Tool returns error -> Agent crashes
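Recovery behavior is easier to evaluate when every attempt is recorded. A sketch of a tool wrapper that retries once and logs each attempt, so an evaluator can verify the agent recovered rather than hallucinated (all names here are illustrative):

```python
# Sketch: a tool wrapper that retries once and records each attempt,
# giving the evaluator an audit trail of error recovery.
def call_with_retry(tool, args, retries=1):
    attempts = []
    for _ in range(retries + 1):
        try:
            result = tool(**args)
            attempts.append("success")
            return result, attempts
        except RuntimeError as exc:
            attempts.append(f"error: {exc}")
    return None, attempts

state = {"calls": 0}
def flaky_search(query):
    """Stub tool that fails on its first call, then succeeds."""
    state["calls"] += 1
    if state["calls"] == 1:
        raise RuntimeError("timeout")
    return f"results for {query!r}"

result, attempts = call_with_retry(flaky_search, {"query": "weather"})
print(result, attempts)
```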
Human evaluation is the gold standard for quality assessment, but it is slow and expensive. LLM-as-a-Judge uses one model to evaluate another model’s output, giving you automated quality assessment that approximates human judgment.
You give a judge LLM the following:
The judge LLM then scores the response based on the criteria.
Single scoring asks the judge to rate one response on a scale (e.g., 1-5 for helpfulness). This is simple but tends to suffer from inconsistent calibration: the same judge can map the same quality of response to different scores across runs.
Pairwise comparison shows the judge two responses and asks which is better. This is more reliable because relative comparisons are easier than absolute ratings.
Recommendation: Prefer pairwise comparison when possible. It produces more consistent and actionable results.
Judge prompt:
"You are evaluating two responses to a customer question.
The customer asked: 'How do I reset my password?'
Response A:
'Click on Forgot Password on the login page, enter your email,
and follow the link in the reset email.'
Response B:
'You can reset your password through the account settings page
or by contacting our support team at support@example.com.'
Which response is more helpful and complete? Explain your reasoning
and declare a winner."
| Practice | Why It Matters |
|---|---|
| Use a strong model as judge | Weaker models make worse judgments |
| Provide clear evaluation criteria | Vague criteria lead to inconsistent scoring |
| Use pairwise comparison over single scoring | More reliable and consistent |
| Randomize response order | Prevents position bias (models tend to prefer the first response) |
| Include reference answers when available | Gives the judge a baseline for comparison |
| Validate judge scores against human scores | Ensure the judge correlates with human judgment |
| Run multiple judge evaluations | Reduce variance by averaging across evaluations |
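Two of these practices, randomizing response order and averaging repeated runs, can be combined in one small harness. A sketch; the `judge` stub stands in for a real LLM call (here it simply prefers the longer response, an invented behavior used only to make the example runnable):

```python
import random

# Sketch: pairwise judging with randomized order and repeated runs.
# The judge stub prefers the longer response regardless of position,
# standing in for a real LLM call.
def judge(first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"

def pairwise_win_rate(resp_a: str, resp_b: str, runs: int = 5) -> float:
    """Fraction of runs in which response A wins, order randomized per run."""
    wins_a = 0
    for _ in range(runs):
        if random.random() < 0.5:
            wins_a += judge(resp_a, resp_b) == "first"
        else:
            wins_a += judge(resp_b, resp_a) == "second"
    return wins_a / runs

rate = pairwise_win_rate(
    "Click Forgot Password on the login page, enter your email, "
    "and follow the link in the reset email.",
    "Contact support.",
)
print(f"Response A win rate: {rate:.0%}")
```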
Despite the power of automated evaluation, human evaluation remains essential for certain aspects of agent quality.
Rating scales: Have evaluators rate responses on specific dimensions (correctness, helpfulness, safety) using a defined scale (1-5 with clear descriptions for each level).
Annotation guidelines: Provide detailed guidelines with examples of what constitutes a 1, 3, and 5 on each dimension. Without this, evaluators will interpret the scale differently.
Inter-annotator agreement: Have multiple evaluators score the same responses. If they disagree significantly, your guidelines need improvement.
Example rating rubric:
| Score | Correctness Criteria |
|---|---|
| 1 | Answer is factually wrong or completely off-topic |
| 2 | Answer has significant errors but shows some understanding |
| 3 | Answer is mostly correct with minor errors or omissions |
| 4 | Answer is correct and complete |
| 5 | Answer is correct, complete, and includes helpful additional context |
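Inter-annotator agreement on a rubric like this can be quantified with a raw percent-agreement check. A sketch with invented scores; Cohen's kappa, which corrects for chance agreement, is the more robust choice in practice:

```python
# Sketch: raw percent agreement between two annotators using a 1-5
# rubric. The scores are invented; Cohen's kappa is more robust because
# it corrects for agreement expected by chance.
annotator_a = [4, 5, 3, 4, 2, 5, 4, 3]
annotator_b = [4, 4, 3, 4, 2, 5, 3, 3]

matches = sum(a == b for a, b in zip(annotator_a, annotator_b))
agreement = matches / len(annotator_a)
print(f"Percent agreement: {agreement:.0%}")
```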
Evaluation is not a one-time activity. It is a continuous cycle that drives quality improvement over time.
Before you can measure quality, you need to define it. What does “good” mean for your agent? This is specific to your use case.
Questions to answer:
You cannot improve what you cannot see. Add instrumentation to capture everything the agent does.
What to instrument:
Run your evaluation framework regularly - not just once at launch.
Cadence:
Use evaluation results to improve the agent. This is where the flywheel spins.
Feedback loop types:
Define Quality --> Instrument --> Evaluate --> Improve
      ^                                           |
      |                                           |
      +-------------------------------------------+
                 (Continuous Improvement)
To evaluate and debug agents in production, you need observability. The three pillars of observability apply to agent systems just as they do to any distributed system.
Logs are records of discrete events. For agents, every significant event should be logged.
What to log:
Log structure example:
{
"timestamp": "2025-03-18T10:30:00Z",
"session_id": "sess_abc123",
"step": 3,
"type": "tool_call",
"tool": "search_documents",
"input": {"query": "refund policy"},
"output": {"documents": ["doc_456"], "count": 1},
"latency_ms": 230,
"status": "success"
}
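Events shaped like this are easy to emit from a small helper. A sketch; the field names follow the example above, and the transport (printing vs. shipping to a log backend) is left open:

```python
import json
import time

# Sketch: emitting structured log events like the example above.
# Field names follow that example; the transport is up to you.
def log_event(session_id: str, step: int, event_type: str, **fields) -> dict:
    event = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "session_id": session_id,
        "step": step,
        "type": event_type,
        **fields,
    }
    print(json.dumps(event))  # one JSON object per line
    return event

event = log_event("sess_abc123", 3, "tool_call",
                  tool="search_documents",
                  input={"query": "refund policy"},
                  latency_ms=230, status="success")
```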
Traces follow a single request through the entire system, connecting all the steps into a coherent story. This is critical for multi-step agents where a single user query might trigger dozens of LLM calls and tool invocations.
What a trace looks like:
Trace: user_query_789
|-- LLM Call 1: Parse user intent (120ms)
|-- Tool Call 1: search_orders (250ms)
|-- LLM Call 2: Evaluate results (90ms)
|-- Tool Call 2: get_order_details (180ms)
|-- LLM Call 3: Generate response (150ms)
Total: 790ms, 3 LLM calls, 2 tool calls, 4,200 tokens
Traces let you answer questions like:
Metrics are numerical measurements aggregated over time. They tell you about trends and patterns rather than individual events.
Key metrics to track:
| Metric | Aggregation | Alert Threshold |
|---|---|---|
| Task completion rate | Daily average | Below 85% |
| Average latency | p50, p95, p99 per hour | p95 above 15 seconds |
| Error rate | Per hour | Above 5% |
| Token cost | Daily total | Above daily budget |
| Tool call failure rate | Per tool per hour | Above 10% |
| User satisfaction | Weekly average | Below 3.5/5.0 |
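Thresholds like these reduce to a simple comparison over a metrics snapshot. A sketch; the snapshot values are invented for illustration:

```python
# Sketch: checking a metrics snapshot against alert thresholds like
# those in the table above. The snapshot values are invented.
thresholds = {
    "task_completion_rate": ("min", 0.85),
    "error_rate": ("max", 0.05),
    "tool_call_failure_rate": ("max", 0.10),
}

snapshot = {
    "task_completion_rate": 0.91,
    "error_rate": 0.07,
    "tool_call_failure_rate": 0.04,
}

alerts = []
for name, (kind, limit) in thresholds.items():
    value = snapshot[name]
    breached = value < limit if kind == "min" else value > limit
    if breached:
        alerts.append(f"{name}={value} breaches {kind} threshold {limit}")

print(alerts)
```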
Here is a practical approach to building an evaluation suite for your agent:
Build a set of 50-100 test cases that cover your agent’s core use cases, edge cases, and failure modes.
Structure each test case:
{
"id": "test_001",
"input": "What is the status of order #12345?",
  "expected_output": "Order #12345 was shipped on March 15 and is expected to arrive by March 18.",
"expected_tool_calls": ["lookup_order"],
"category": "order_status",
"difficulty": "easy",
"tags": ["happy_path", "single_tool"]
}
Categories to include:
For each test case, define automated pass/fail criteria:
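For example, a sketch of one golden test case run through automated checks; `run_agent` is a stub, and the substring criteria are invented to match the sample test case above:

```python
def run_agent(query: str):
    """Stub returning (response, tool_calls_made) in place of a real agent."""
    response = ("Order #12345 was shipped on March 15 and is expected "
                "to arrive by March 18.")
    return response, ["lookup_order"]

case = {
    "id": "test_001",
    "input": "What is the status of order #12345?",
    "expected_substrings": ["shipped on March 15", "arrive by March 18"],
    "expected_tool_calls": ["lookup_order"],
}

response, tools = run_agent(case["input"])
output_ok = all(s in response for s in case["expected_substrings"])
tools_ok = tools == case["expected_tool_calls"]
verdict = "PASS" if output_ok and tools_ok else "FAIL"
print(f"{case['id']}: {verdict}")
```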
For test cases where automated checks are insufficient, add LLM-as-a-Judge evaluation:
Judge prompt template:
"Evaluate the following agent response on a scale of 1-5
for each dimension:
User question: {question}
Agent response: {response}
Reference answer: {reference}
Dimensions:
1. Correctness (1-5): Is the information accurate?
2. Completeness (1-5): Does it address all parts of the question?
3. Helpfulness (1-5): Would the user find this useful?
4. Safety (1-5): Does it stay within appropriate bounds?
Provide a score and one-sentence justification for each dimension."
Integrate your evaluation suite into your continuous integration pipeline:
Store evaluation results in a database or spreadsheet and track trends:
Google Cloud provides tools for evaluating agents at scale:
Vertex AI’s evaluation capabilities let you run structured evaluations on agent outputs. You can define evaluation criteria, run evaluations at scale, and track results over time.
The Google Agent Development Kit evaluation framework provides built-in support for testing agents during development. It integrates with the ADK’s agent definition format and supports both automated checks and LLM-as-a-Judge evaluation.
If your test set only contains well-formatted, clear, unambiguous inputs, you are not testing what happens in the real world. At least 30% of your test cases should cover edge cases, error conditions, and adversarial inputs.
An agent that produces the right answer through a wrong process is unreliable. It might have gotten lucky. Always evaluate trajectories alongside final outputs.
If your test set never changes, it becomes stale. New features, new failure modes, and new user patterns all require new test cases. Review and update your test set monthly.
An agent that is 95% correct but costs $5 per query and takes 30 seconds is not production-ready for most use cases. Always include efficiency metrics in your evaluation.
Agent behavior can change in production due to different data, higher load, model updates, and real user inputs that differ from your test cases. Monitor agent quality continuously in production, not just during development.
Without a baseline, you cannot tell if your agent is improving. Before making changes, always measure current performance so you have a point of comparison.
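A baseline comparison can be as small as a dictionary diff. A sketch of a regression check before shipping a change; all numbers are invented, and the 10% slack is an arbitrary choice, not a standard:

```python
# Sketch: comparing a candidate build against a stored baseline.
# All numbers are invented; the 10% slack is an arbitrary choice.
baseline = {"completion_rate": 0.92, "p95_latency_s": 8.1, "cost_per_task": 0.06}
candidate = {"completion_rate": 0.94, "p95_latency_s": 9.5, "cost_per_task": 0.05}

regressions = []
if candidate["completion_rate"] < baseline["completion_rate"]:
    regressions.append("completion_rate")
for metric in ("p95_latency_s", "cost_per_task"):  # lower is better
    if candidate[metric] > baseline[metric] * 1.10:
        regressions.append(metric)

print("regressions:", regressions or "none")
```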
Build an evaluation suite for an agent of your choice:
Define quality dimensions: List 3-5 specific quality criteria for your agent (e.g., correctness, tone, efficiency).
Create a golden test set: Write 20 test cases covering happy path (10), edge cases (5), and safety (5). Include expected outputs and pass/fail criteria.