Lesson 9: evaluating and testing agents - how to know if your agent works

Introduction

You have built an agent. It has tools, memory, and a planning loop. It can retrieve information, take actions, and converse with users. But how do you know if it actually works well?

Testing an agent is fundamentally different from testing traditional software. With a regular function, you pass in inputs and check that the outputs match your expectations. With an agent, the same question can produce different valid answers, the agent might take different paths to reach the same goal, and “correct” is often subjective. The output is non-deterministic, the process is multi-step, and the quality dimensions are more nuanced than pass/fail.

This lesson covers how to evaluate agents rigorously - what to measure, how to measure it, and how to build evaluation into your development workflow so that quality improves continuously.

Why agent evaluation is hard

Non-Deterministic Outputs

Ask a traditional function to add 2 + 2 and it always returns 4. Ask an agent “What should I do about our declining user retention?” and you might get a different answer every time - and multiple different answers could all be good. Temperature settings, model updates, and slight prompt variations all change the output.

Many valid paths

An agent that books a flight might search by price first, then by time. Or it might search by time first, then filter by price. Both paths reach the same goal, but the sequence of tool calls is different. Testing that the agent took exact step X at position Y would be too rigid and would break with any reasonable change.

Compound errors

An agent that makes five decisions in sequence has five chances to make a mistake, and errors compound. A small mistake in step 2 might lead to a completely wrong outcome by step 5. Testing only the final output misses where things went wrong.

Subjective quality

Is a summary “good”? Is a customer service response “helpful”? These are judgment calls that depend on context, user expectations, and organizational standards. Binary pass/fail testing is not sufficient.

ELI5: the new employee analogy

Testing an agent is like evaluating a new employee during their probation period. You would not just check whether they produced the right final deliverable. You would also observe:

You would also look at their process, not just their output. If they got the right answer by sheer luck after a chaotic process, that is different from getting the right answer through a methodical, reliable approach.

Agent evaluation works the same way. You check the final output, the process, the efficiency, and the safety - and you do it systematically.

Four pillars of agent quality

Every agent evaluation should assess four dimensions:

1. effectiveness - does it achieve the goal?

This is the most basic question: did the agent do what it was supposed to do? If you asked it to book a flight to Tokyo, did it book a flight to Tokyo?

What to measure:

Example metrics: | Metric | How to Measure | Target | |—|—|—| | Task completion rate | Automated checks against expected outcomes | > 90% | | Answer correctness | Human evaluation or automated fact-checking | > 85% | | User satisfaction | Post-interaction surveys or thumbs up/down | > 4.0/5.0 |

2. efficiency - at what cost?

An agent that achieves the goal but uses 50 API calls, takes 3 minutes, and costs $2 per query might not be viable. Efficiency measures how much it costs - in time, money, and compute - to get the job done.

What to measure:

Example metrics: | Metric | How to Measure | Target | |—|—|—| | End-to-end latency | Time from user query to final response | < 10 seconds | | Tokens per task | Sum of input + output tokens across all LLM calls | < 10,000 | | LLM calls per task | Count of model invocations | < 5 | | Cost per task | Token cost + API call costs | < $0.10 |

3. robustness - does it handle edge cases?

Real users do not send perfectly formatted, clear, unambiguous requests. They send typos, vague questions, contradictory instructions, and requests in unexpected formats. A robust agent handles these gracefully.

What to measure:

Example test cases: | Test Case | Expected Behavior | |—|—| | Misspelled query | Agent still understands intent | | Ambiguous request | Agent asks for clarification | | Tool returns an error | Agent retries or uses a fallback | | Contradictory instructions | Agent flags the contradiction | | Empty or null input | Agent responds gracefully, does not crash | | Very long input | Agent handles within context limits |

4. safety - does it stay within bounds?

An agent with access to tools can do real damage. It might send emails it should not, delete data, or reveal sensitive information. Safety evaluation checks that the agent respects its boundaries.

What to measure:

Example test cases: | Test Case | Expected Behavior | |—|—| | User asks agent to perform unauthorized action | Agent refuses and explains why | | User tries to get agent to reveal system prompt | Agent declines | | Agent encounters sensitive data during retrieval | Agent does not include it in the response | | User asks agent to take an action outside its domain | Agent redirects to appropriate resource |

System metrics vs. quality metrics

Agent evaluation requires two categories of metrics that serve different purposes:

System metrics (operational health)

System metrics tell you whether the agent is running well from an infrastructure perspective. These are the metrics your SRE team cares about.

Metric What It Tells You How to Collect
Latency (p50, p95, p99) How long users wait Request timing in your application
Error rate How often the agent fails entirely Error counting in logs
Tokens per task How much compute each task requires LLM API response metadata
Cost per task How much money each task costs Token counts multiplied by pricing
Tool call success rate How reliable external integrations are Tool wrapper instrumentation
Throughput How many requests the system handles Request counting

Quality metrics (output goodness)

Quality metrics tell you whether the agent is doing a good job from the user’s perspective. These are harder to measure but more important.

Metric What It Tells You How to Measure
Correctness Is the answer right? Ground truth comparison, human evaluation
Trajectory quality Did the agent take a reasonable path? Trajectory evaluation (see below)
Helpfulness Did the user get what they needed? User feedback, LLM-as-a-Judge
Safety compliance Did the agent stay within bounds? Red-team testing, policy checkers
Groundedness Is the answer supported by evidence? Source attribution checking
Coherence Does the response make sense? LLM-as-a-Judge, human evaluation

The Outside-In Approach

When evaluating an agent, start from the outside and work your way in. This mirrors how users experience the agent and ensures you catch the most impactful issues first.

Level 1: Black-Box End-to-End Testing

Start by treating the agent as a black box. Give it inputs and check the outputs. Do not look at how it arrived at the answer - just check whether the answer is correct.

How to do it:

  1. Create a test set of input-output pairs (questions and expected answers)
  2. Run each input through the agent
  3. Compare the agent’s output to the expected output
  4. Track pass/fail rates

Example test set:

Input Expected Output Pass Criteria
“What is our refund policy?” Includes 30-day window and receipt requirement Contains key policy elements
“Cancel order #12345” Order is cancelled, confirmation provided Order status changed + confirmation message
“What time does the store close?” Correct closing time for today Matches actual hours

When this is sufficient: For agents with clear, verifiable outputs where there is one right answer. If the agent’s job is to look up facts or execute well-defined actions, end-to-end testing covers most of what you need.

Level 2: glass-box trajectory evaluation

When end-to-end testing is not sufficient - when you need to understand why the agent succeeded or failed - you open the box and inspect the trajectory.

A trajectory is the full sequence of the agent’s actions: every thought, tool call, observation, and decision, from receiving the input to producing the output.

Example trajectory:

1. User: "What was our revenue last quarter?"
2. Agent thinks: "I need to look up revenue data for Q2 2025"
3. Agent calls: search_financial_reports(query="Q2 2025 revenue")
4. Tool returns: [Q2 2025 Financial Summary document]
5. Agent thinks: "Found the report. Revenue was $12.4M"
6. Agent responds: "Our revenue last quarter (Q2 2025) was $12.4M,
   up 8% from Q1 2025."

What to check in the trajectory:

Why trajectory evaluation matters: Two agents might produce the same final answer, but one took a clean, efficient path while the other made several wrong turns, called irrelevant tools, and got lucky. The first agent is more reliable. Trajectory evaluation reveals this difference.

Trajectory evaluation in detail

Trajectory evaluation checks the full execution path of the agent. This is one of the most powerful evaluation techniques because it catches problems that end-to-end testing misses.

What a trajectory contains

Component Description Example
User input The original request “Book me a flight to Tokyo next Tuesday”
Agent reasoning The agent’s internal thoughts “I need to search for flights on March 25”
Tool calls Actions the agent took search_flights(destination=”Tokyo”, date=”2025-03-25”)
Tool results What the tools returned List of 5 available flights
Agent decisions Choices the agent made Selected the cheapest direct flight
Final output The response to the user “I booked flight JL001, departing at 10:30 AM…”

Evaluating trajectories

You can evaluate trajectories along several dimensions:

Tool Selection Accuracy: Did the agent pick the right tool for each step?

Good: Agent needs weather data -> calls get_weather()
Bad:  Agent needs weather data -> calls search_web("weather forecast")
      when get_weather() is available

Parameter Correctness: Did the agent pass the right arguments to each tool?

Good: search_flights(destination="NRT", date="2025-03-25")
Bad:  search_flights(destination="Tokyo", date="next Tuesday")
      (did not resolve "next Tuesday" to an actual date)

Step Efficiency: Did the agent achieve the goal without unnecessary steps?

Efficient (3 steps):
  1. Search flights
  2. Select best option
  3. Book flight

Inefficient (7 steps):
  1. Search flights to Tokyo
  2. Search flights to Osaka (not requested)
  3. Compare Tokyo and Osaka options (not requested)
  4. Search Tokyo hotels (not requested)
  5. Go back to flights
  6. Select a flight
  7. Book flight

Error Handling: When a tool failed, did the agent recover appropriately?

Good: Tool returns error -> Agent retries with modified parameters
Good: Tool returns error -> Agent informs user and suggests alternatives
Bad:  Tool returns error -> Agent hallucinates a result
Bad:  Tool returns error -> Agent crashes

LLM-as-a-Judge

Human evaluation is the gold standard for quality assessment, but it is slow and expensive. LLM-as-a-Judge uses one model to evaluate another model’s output, giving you automated quality assessment that approximates human judgment.

How it works

You give a judge LLM the following:

  1. The original question or task
  2. The agent’s response (or trajectory)
  3. Evaluation criteria
  4. Optionally, a reference answer

The judge LLM then scores the response based on the criteria.

Single scoring vs. pairwise comparison

Single scoring asks the judge to rate one response on a scale (e.g., 1-5 for helpfulness). This is simple but tends to suffer from position bias and inconsistent calibration.

Pairwise comparison shows the judge two responses and asks which is better. This is more reliable because relative comparisons are easier than absolute ratings.

Recommendation: Prefer pairwise comparison when possible. It produces more consistent and actionable results.

Pairwise comparison example

Judge prompt:
"You are evaluating two responses to a customer question.
The customer asked: 'How do I reset my password?'

Response A:
'Click on Forgot Password on the login page, enter your email,
and follow the link in the reset email.'

Response B:
'You can reset your password through the account settings page
or by contacting our support team at support@example.com.'

Which response is more helpful and complete? Explain your reasoning
and declare a winner."

Best practices for llm-as-a-judge

Practice Why It Matters
Use a strong model as judge Weaker models make worse judgments
Provide clear evaluation criteria Vague criteria lead to inconsistent scoring
Use pairwise comparison over single scoring More reliable and consistent
Randomize response order Prevents position bias (models tend to prefer the first response)
Include reference answers when available Gives the judge a baseline for comparison
Validate judge scores against human scores Ensure the judge correlates with human judgment
Run multiple judge evaluations Reduce variance by averaging across evaluations

Limitations of LLM-as-a-Judge

Human evaluation

Despite the power of automated evaluation, human evaluation remains essential for certain aspects of agent quality.

When you need humans

Structuring human evaluation

Rating scales: Have evaluators rate responses on specific dimensions (correctness, helpfulness, safety) using a defined scale (1-5 with clear descriptions for each level).

Annotation guidelines: Provide detailed guidelines with examples of what constitutes a 1, 3, and 5 on each dimension. Without this, evaluators will interpret the scale differently.

Inter-annotator agreement: Have multiple evaluators score the same responses. If they disagree significantly, your guidelines need improvement.

Example rating rubric:

Score Correctness Criteria
1 Answer is factually wrong or completely off-topic
2 Answer has significant errors but shows some understanding
3 Answer is mostly correct with minor errors or omissions
4 Answer is correct and complete
5 Answer is correct, complete, and includes helpful additional context

The agent quality flywheel

Evaluation is not a one-time activity. It is a continuous cycle that drives quality improvement over time.

Step 1: define quality

Before you can measure quality, you need to define it. What does “good” mean for your agent? This is specific to your use case.

Questions to answer:

Step 2: instrument for visibility

You cannot improve what you cannot see. Add instrumentation to capture everything the agent does.

What to instrument:

Step 3: evaluate the process

Run your evaluation framework regularly - not just once at launch.

Cadence:

Step 4: architect feedback loops

Use evaluation results to improve the agent. This is where the flywheel spins.

Feedback loop types:

Define Quality --> Instrument --> Evaluate --> Improve --> Define Quality
     ^                                                          |
     |                                                          |
     +----------------------------------------------------------+
                     (Continuous Improvement)

Observability: the three pillars

To evaluate and debug agents in production, you need observability. The three pillars of observability apply to agent systems just as they do to any distributed system.

Logs

Logs are records of discrete events. For agents, every significant event should be logged.

What to log:

Log structure example:

{
  "timestamp": "2025-03-18T10:30:00Z",
  "session_id": "sess_abc123",
  "step": 3,
  "type": "tool_call",
  "tool": "search_documents",
  "input": {"query": "refund policy"},
  "output": {"documents": ["doc_456"], "count": 1},
  "latency_ms": 230,
  "status": "success"
}

Traces

Traces follow a single request through the entire system, connecting all the steps into a coherent story. This is critical for multi-step agents where a single user query might trigger dozens of LLM calls and tool invocations.

What a trace looks like:

Trace: user_query_789
  |-- LLM Call 1: Parse user intent (120ms)
  |-- Tool Call 1: search_orders (250ms)
  |-- LLM Call 2: Evaluate results (90ms)
  |-- Tool Call 2: get_order_details (180ms)
  |-- LLM Call 3: Generate response (150ms)
  Total: 790ms, 3 LLM calls, 2 tool calls, 4,200 tokens

Traces let you answer questions like:

Metrics

Metrics are numerical measurements aggregated over time. They tell you about trends and patterns rather than individual events.

Key metrics to track:

Metric Aggregation Alert Threshold
Task completion rate Daily average Below 85%
Average latency p50, p95, p99 per hour p95 above 15 seconds
Error rate Per hour Above 5%
Token cost Daily total Above daily budget
Tool call failure rate Per tool per hour Above 10%
User satisfaction Weekly average Below 3.5/5.0

Building an evaluation suite

Here is a practical approach to building an evaluation suite for your agent:

1. create a golden test set

Build a set of 50-100 test cases that cover your agent’s core use cases, edge cases, and failure modes.

Structure each test case:

{
  "id": "test_001",
  "input": "What is the status of order #12345?",
  "expected_output": "Order #12345 was shipped on March 15
                      and is expected to arrive by March 18.",
  "expected_tool_calls": ["lookup_order"],
  "category": "order_status",
  "difficulty": "easy",
  "tags": ["happy_path", "single_tool"]
}

Categories to include:

2. implement automated checks

For each test case, define automated pass/fail criteria:

3. add llm-as-a-judge scoring

For test cases where automated checks are insufficient, add LLM-as-a-Judge evaluation:

Judge prompt template:
"Evaluate the following agent response on a scale of 1-5
for each dimension:

User question: {question}
Agent response: {response}
Reference answer: {reference}

Dimensions:
1. Correctness (1-5): Is the information accurate?
2. Completeness (1-5): Does it address all parts of the question?
3. Helpfulness (1-5): Would the user find this useful?
4. Safety (1-5): Does it stay within appropriate bounds?

Provide a score and one-sentence justification for each dimension."

4. run evaluations in CI/CD

Integrate your evaluation suite into your continuous integration pipeline:

5. track results over time

Store evaluation results in a database or spreadsheet and track trends:

Evaluation with Google Cloud

Google Cloud provides tools for evaluating agents at scale:

Vertex AI Evaluation

Vertex AI’s evaluation capabilities let you run structured evaluations on agent outputs. You can define evaluation criteria, run evaluations at scale, and track results over time.

Google ADK Evaluation

The Google Agent Development Kit evaluation framework provides built-in support for testing agents during development. It integrates with the ADK’s agent definition format and supports both automated checks and LLM-as-a-Judge evaluation.

Common evaluation mistakes

Mistake 1: only testing the happy path

If your test set only contains well-formatted, clear, unambiguous inputs, you are not testing what happens in the real world. At least 30% of your test cases should cover edge cases, error conditions, and adversarial inputs.

Mistake 2: only checking the final output

An agent that produces the right answer through a wrong process is unreliable. It might have gotten lucky. Always evaluate trajectories alongside final outputs.

Mistake 3: static test sets

If your test set never changes, it becomes stale. New features, new failure modes, and new user patterns all require new test cases. Review and update your test set monthly.

Mistake 4: ignoring cost and latency

An agent that is 95% correct but costs $5 per query and takes 30 seconds is not production-ready for most use cases. Always include efficiency metrics in your evaluation.

Mistake 5: testing only in development

Agent behavior can change in production due to different data, higher load, model updates, and real user inputs that differ from your test cases. Monitor agent quality continuously in production, not just during development.

Mistake 6: no baseline comparison

Without a baseline, you cannot tell if your agent is improving. Before making changes, always measure current performance so you have a point of comparison.

Hands-On Exercise

Build an evaluation suite for an agent of your choice:

  1. Define quality dimensions: List 3-5 specific quality criteria for your agent (e.g., correctness, tone, efficiency).

  2. Create a golden test set: Write 20 test cases covering happy path (10), edge cases (5), and safety (5). Include expected outputs and pass/fail criteria.

  3. Implement automated evaluation: Build a script that:
    • Runs each test case through the agent
    • Checks automated pass/fail criteria
    • Uses LLM-as-a-Judge for subjective dimensions
    • Produces a summary report
  4. Evaluate trajectory quality: For 5 test cases, capture the full trajectory and evaluate:
    • Were the right tools called?
    • Were parameters correct?
    • Was the path efficient?
  5. Document findings: Write up what you learned about your agent’s strengths and weaknesses.

Key takeaways

Further reading