You have built a working agent. It handles your test cases, impresses your team in demos, and feels like magic. Now you need to ship it to real users. This is where things get hard.
The gap between “works on my laptop” and “works reliably for thousands of users” is enormous. In traditional software, production readiness mostly means handling edge cases, scaling, and monitoring. With agents, you have all of that plus the fundamental challenge that your system’s behavior is non-deterministic and hard to fully predict.
You have been cooking great meals at home for your friends. Everyone loves your food. Now you want to open a restaurant. Suddenly you need to think about things that never mattered at home: health inspections, supply chains, consistent recipes so every dish tastes the same, training staff, handling complaints, managing costs, and making sure the kitchen does not catch fire on a busy Saturday night. The cooking skill is the same, but everything around it changes completely.
That is the prototype-to-production gap. Your agent’s core logic might not change much, but everything around it - evaluation, deployment, monitoring, cost management, team processes - needs to be built from scratch.
Key takeaway: The “last mile” from prototype to production is often 80% of the total effort. Plan for it from the start.
Here is what changes when you move from prototype to production:
| Dimension | Prototype | Production |
|---|---|---|
| Users | You and your team | Hundreds or thousands of real users |
| Inputs | Curated test cases | Anything anyone types, including adversarial input |
| Uptime | Restart when it breaks | Must be available 24/7 with graceful degradation |
| Latency | “It takes a few seconds” is fine | Users expect sub-second responses for simple queries |
| Cost | Burn rate does not matter | Every token costs money at scale |
| Quality | “Usually works” is acceptable | Consistent quality is required; bad responses erode trust |
| Safety | Informal testing | Systematic guardrails, monitoring, and incident response |
| Debugging | Print statements | Structured logs, traces, and metrics |
| Updates | Edit and restart | CI/CD pipeline with evaluation gates |
Demos work because the person giving the demo knows what inputs work well. They avoid edge cases. They retry failures off-screen. They pick the best example from multiple runs.
Production is the opposite. Real users will:
- Type things you never anticipated: typos, slang, half-formed questions, other languages
- Probe edge cases, and some will deliberately try to break or manipulate the agent
- Hit transient failures and judge the agent on its worst response, not its best
- Walk away after one bad experience rather than retrying
This is why evaluation-gated deployment is so important. You should not ship an agent version that has not been tested against a comprehensive set of real-world scenarios.
Productionizing an agent is not a solo effort. It typically involves several roles working together:
| Role | Responsibility |
|---|---|
| AI Engineer | Agent logic, prompt design, tool integration, eval creation |
| Platform Engineer | Infrastructure, deployment pipelines, service mesh, scaling |
| Data Engineer | Data pipelines for RAG, knowledge bases, training data management |
| ML Ops / AI Ops | Model serving, versioning, A/B testing, monitoring dashboards |
| DevOps / SRE | Reliability, incident response, alerting, cost tracking |
| Product Manager | User requirements, success metrics, prioritization |
| Security / Trust & Safety | Guardrails, red teaming, compliance, safety reviews |
In smaller teams, one person might wear multiple hats. But the responsibilities still exist regardless of how many people share them.
This is the single most important practice for shipping agents safely. The principle is simple: no agent version ships without passing evals.
In Lesson 9, we covered how to build evals. Here is how they fit into the deployment process:
```
Code Change --> Evals Pass? --No--> Fix and retry
                    |
                   Yes
                    |
                    v
            Deploy to staging
                    |
                    v
        Staging evals pass? --No--> Fix and retry
                    |
                   Yes
                    |
                    v
     Deploy to production (canary)
                    |
                    v
     Production metrics OK? --No--> Rollback
                    |
                   Yes
                    |
                    v
       Full production rollout
```
Your eval suite for deployment should cover:
| Category | What to Test | Pass Criteria |
|---|---|---|
| Functional correctness | Does the agent produce correct answers? | >= threshold on accuracy metrics |
| Tool usage | Does the agent call the right tools with correct arguments? | Tools called correctly in >= X% of cases |
| Safety | Does the agent resist prompt injection and follow policies? | 100% pass rate on safety-critical cases |
| Latency | Does the agent respond within acceptable time? | P95 latency < target |
| Cost | Does the agent stay within token budgets? | Average cost per interaction < budget |
| Regression | Do previously passing cases still pass? | No regressions on known-good cases |
Safety evals should have a hard gate - any failure blocks deployment. Other categories might have softer thresholds where you accept small regressions if overall quality improves.
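The gate itself can be a small function over eval results. A sketch, assuming results arrive as pass/total counts per category (the category names and thresholds are illustrative, matching the table above):

```python
# Sketch of an evaluation gate: safety is a hard gate (any failure
# blocks deployment), other categories use soft pass-rate thresholds.
HARD_GATES = {"safety"}
SOFT_THRESHOLDS = {
    "functional": 0.90,
    "tool_usage": 0.85,
    "regression": 1.00,
}

def can_deploy(results: dict[str, tuple[int, int]]) -> tuple[bool, list[str]]:
    """results maps category -> (passed, total). Returns (ok, failures)."""
    failures = []
    for category, (passed, total) in results.items():
        rate = passed / total if total else 0.0
        if category in HARD_GATES:
            if passed != total:
                failures.append(f"{category}: hard gate failed ({passed}/{total})")
        elif rate < SOFT_THRESHOLDS.get(category, 0.0):
            failures.append(f"{category}: {rate:.0%} below threshold")
    return (not failures, failures)

ok, problems = can_deploy({
    "safety": (120, 120),
    "functional": (93, 100),
    "tool_usage": (88, 100),
    "regression": (200, 200),
})
```

In practice the soft thresholds would be tuned per product; the point is that the safety check is structural, not a judgment call made at deploy time.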
Continuous integration and continuous deployment for agents follow the same principles as traditional CI/CD but with agent-specific steps. Think of it in three phases.
These checks run quickly and catch obvious problems before code is merged.
```yaml
# Example: Pre-merge checks
pre_merge:
  - lint:
      - Check prompt formatting and syntax
      - Validate tool definitions match schemas
      - Static analysis of agent configuration
  - unit_tests:
      - Test individual tool functions
      - Test guardrail logic
      - Test input/output parsers
  - basic_evals:
      - Run a small, fast eval set (50-100 cases)
      - Focus on regression detection
      - Target: completes in < 5 minutes
```
After code is merged, run more comprehensive checks before promoting to staging.
```yaml
# Example: Post-merge validation
post_merge:
  - staging_deployment:
      - Deploy to staging environment
      - Verify health checks pass
  - broad_evals:
      - Run full eval suite (500-1000+ cases)
      - Include safety evals
      - Include latency and cost benchmarks
      - Target: completes in < 30 minutes
  - integration_tests:
      - Test end-to-end flows with real tool connections
      - Verify external service integrations
```
The final check before real users see the new version.
```yaml
# Example: Production gate
production_gate:
  - full_evals:
      - Complete eval suite including edge cases
      - Adversarial test cases
      - Cross-model consistency checks (if using multiple models)
  - safety_review:
      - Automated safety evals must pass 100%
      - Human review for significant prompt changes
      - Red team sign-off for major feature changes
  - approval:
      - Automated approval if all checks pass
      - Manual approval required if any check is marginal
```
Prompts deserve the same version control discipline as code:
```
prompts/
  customer_support/
    system_prompt.txt       # The main system instructions
    tool_descriptions.txt   # Tool descriptions and schemas
    safety_rules.txt        # Safety-specific instructions
    version.txt             # Current version identifier
```
Even with comprehensive evals, production can surprise you. Safe rollout strategies limit the blast radius when something goes wrong.
Route a small percentage of traffic to the new version. Monitor for problems before increasing the percentage.
```
Traffic ---> [Load Balancer]
                |        |
               95%       5%
                |        |
                v        v
          [Version 1]  [Version 2 - Canary]
           (current)        (new)
```
How it works:
- Deploy the new version alongside the current one
- Route a small slice of traffic (for example, 5%) to the canary
- Monitor completion rate, latency, error rate, and safety incidents for the canary cohort
- Gradually increase the canary's share if metrics hold; roll back immediately if they degrade
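A common way to implement the split is to hash a stable identifier such as the session ID, so each user consistently hits the same version. A minimal sketch (the 5% default mirrors the diagram above):

```python
import hashlib

def route_version(session_id: str, canary_percent: int = 5) -> str:
    """Deterministically route a session to the canary or stable version."""
    digest = hashlib.sha256(session_id.encode()).digest()
    bucket = digest[0] % 100  # stable bucket in [0, 100)
    return "canary" if bucket < canary_percent else "stable"

# The same session always lands in the same bucket, so a user's
# conversation never flips between versions mid-session.
```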
Maintain two identical production environments. Switch all traffic from one to the other.
Before: Traffic --> [Blue - v1.2 ACTIVE] [Green - idle]
During: Traffic --> [Blue - v1.2] [Green - v1.3 ACTIVE]
The advantage is a clean cutover and instant rollback (just switch back to Blue). The downside is you need double the infrastructure during the transition.
Route traffic to different agent versions and compare their performance on real interactions.
| Version A | Version B | Metric | Winner |
|---|---|---|---|
| GPT-based, verbose prompts | Gemini-based, concise prompts | Task completion rate | Compare after N interactions |
| ReAct pattern | Plan-then-execute | User satisfaction | Compare after N interactions |
| Model A, 3 tool retries | Model A, 1 tool retry | Cost per interaction | Compare after N interactions |
A/B testing is especially valuable for agents because it lets you compare different models, prompts, and architectures on real traffic.
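Variant assignment is usually deterministic per user so the experience stays consistent across a conversation. A sketch, assuming hash-based bucketing (the experiment and variant names are illustrative):

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants: tuple[str, ...] = ("A", "B")) -> str:
    """Deterministically assign a user to an experiment variant."""
    key = f"{experiment}:{user_id}".encode()
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return variants[bucket % len(variants)]

# The experiment name is part of the hash, so the same user can land
# in different variants across different experiments.
```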
Control agent capabilities with runtime flags that can be toggled without redeployment.
```python
# Example: Feature flags for agent capabilities
if feature_flags.is_enabled("new_refund_flow", user_id=user.id):
    agent.enable_tool("process_refund_v2")
else:
    agent.enable_tool("process_refund_v1")

if feature_flags.is_enabled("extended_context_window"):
    agent.set_max_context(128000)
else:
    agent.set_max_context(32000)
```
Feature flags let you gradually roll out new capabilities, quickly disable problematic features, and run experiments on subsets of users.
Once your agent is running in production, you need to see what it is doing. Observability for agents has three pillars, just like traditional systems - but the specifics are different.
Structured logs that capture every significant event in the agent’s lifecycle:
```json
{
  "timestamp": "2025-06-15T10:23:45Z",
  "session_id": "sess_abc123",
  "event": "tool_call",
  "tool": "search_knowledge_base",
  "arguments": {"query": "return policy for electronics"},
  "result_status": "success",
  "latency_ms": 234,
  "tokens_used": {"input": 1250, "output": 380}
}
```
What to log:
- Every LLM call: model, input and output tokens, latency
- Every tool call: tool name, arguments, result status, latency
- Guardrail triggers and safety events
- Errors and retries, with enough context to reproduce them
- Session boundaries, so events can be grouped into conversations
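A small helper that emits each event as one JSON line keeps logs machine-parseable. A sketch (the helper name and logger setup are illustrative, not tied to any framework):

```python
import json
import logging
import time

logger = logging.getLogger("agent")

def log_event(event: str, session_id: str, **fields) -> str:
    """Serialize one agent lifecycle event as a single JSON log line."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "session_id": session_id,
        "event": event,
        **fields,
    }
    line = json.dumps(record)
    logger.info(line)
    return line

line = log_event(
    "tool_call", "sess_abc123",
    tool="search_knowledge_base",
    result_status="success",
    latency_ms=234,
)
```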
Traces show the full journey of a single request through your agent, including all the steps, tool calls, and decisions along the way.
```
[User Request] "Help me return my order"
    |
    +-- [LLM Call 1] Understand intent (150ms)
    |     Model: gemini-2.0-flash, Tokens: 800 in / 120 out
    |
    +-- [Tool Call] lookup_order(order_id="12345") (340ms)
    |     Status: success
    |
    +-- [Tool Call] check_return_eligibility(order_id="12345") (180ms)
    |     Status: success, eligible=true
    |
    +-- [LLM Call 2] Generate response (200ms)
    |     Model: gemini-2.0-flash, Tokens: 1200 in / 250 out
    |
    +-- [Output Guardrail] PII check (15ms)
    |     Status: pass
    |
[Response] "Your order #12345 is eligible for return..."

Total: 885ms, Cost: $0.003
```
OpenTelemetry is the industry standard for distributed tracing. Many agent frameworks support OpenTelemetry out of the box, and Google Cloud’s operations suite (Cloud Logging, Cloud Trace, Cloud Monitoring) integrates with OpenTelemetry natively.
Aggregate metrics that tell you how your agent is performing overall:
| Metric | What It Tells You | Alert Threshold Example |
|---|---|---|
| Task completion rate | How often the agent successfully completes user requests | Drop below 85% |
| Average latency | How long users wait for responses | P95 exceeds 5 seconds |
| Cost per interaction | How much each conversation costs | Average exceeds $0.10 |
| Escalation rate | How often the agent hands off to humans | Exceeds 20% |
| Safety incident rate | How often guardrails are triggered | Any increase above baseline |
| Tool error rate | How often tool calls fail | Exceeds 5% |
| User satisfaction | Thumbs up/down or CSAT scores | Drops below 4.0/5.0 |
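Checking live metrics against these thresholds can be a simple table-driven function. A sketch using the example thresholds above (metric names are illustrative):

```python
# Illustrative alert check against the example thresholds in the table.
ALERTS = {
    "task_completion_rate": lambda v: v < 0.85,
    "p95_latency_s":        lambda v: v > 5.0,
    "cost_per_interaction": lambda v: v > 0.10,
    "escalation_rate":      lambda v: v > 0.20,
    "tool_error_rate":      lambda v: v > 0.05,
}

def fired_alerts(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics that breach their alert threshold."""
    return [name for name, breached in ALERTS.items()
            if name in metrics and breached(metrics[name])]

alerts = fired_alerts({
    "task_completion_rate": 0.923,
    "p95_latency_s": 1.2,
    "cost_per_interaction": 0.042,
    "escalation_rate": 0.081,
    "tool_error_rate": 0.004,
})
```

In a real system these checks would run in your monitoring stack rather than application code; the table form makes the thresholds easy to review and update.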
A production agent dashboard should show at a glance:
```
+-------------------------------------------------------+
|                 Agent Health Dashboard                |
+-------------------------------------------------------+
|                                                       |
|  Status: HEALTHY            Active Sessions: 1,247    |
|                                                       |
|  +-------------------+     +-------------------+      |
|  |  Completion Rate  |     |    Avg Latency    |      |
|  |   92.3% (+0.5%)   |     |   1.2s (-0.1s)    |      |
|  +-------------------+     +-------------------+      |
|                                                       |
|  +-------------------+     +-------------------+      |
|  |  Cost / Session   |     |  Escalation Rate  |      |
|  |  $0.042 (-$0.003) |     |   8.1% (+0.2%)    |      |
|  +-------------------+     +-------------------+      |
|                                                       |
|  Recent Safety Incidents: 0 (last 24h)                |
|  Recent Errors: 12 (last 24h, 0.04% of sessions)      |
|                                                       |
+-------------------------------------------------------+
```
Production is not a destination. It is the beginning of a continuous improvement cycle.
```
+----------+
| Observe  | <-- Collect metrics, logs, traces, user feedback
+----+-----+
     |
     v
+----+-----+
|   Act    | <-- Identify issues, prioritize improvements
+----+-----+
     |
     v
+----+-----+
|  Evolve  | <-- Update prompts, tools, evals, guardrails
+----+-----+
     |
     +-------> Back to Observe
```
Observe. Collect data about how your agent performs in production: metrics, structured logs, traces, and direct user feedback such as thumbs up/down and escalations.
Act. Turn observations into concrete actions: identify recurring failure patterns, prioritize fixes, and turn each notable failure into a new eval case.
Evolve. Deploy improvements through your evaluation-gated CI/CD pipeline: updated prompts, tools, guardrails, and the expanded eval suite.
The key insight is that your eval suite grows over time. Every production incident, every user complaint, and every edge case becomes a new eval. This means your agent gets harder to break with each iteration.
LLM-based agents can be expensive at scale. A single conversation might involve multiple LLM calls, each consuming thousands of tokens. Multiply that by thousands of users and costs add up fast.
Use the cheapest model that can handle each task. Not every step requires your most powerful model.
```
User Query
    |
    v
[Router] --Simple query-------> Gemini Flash-Lite ($)
    |
    +-----Medium complexity---> Gemini Flash ($$)
    |
    +-----Complex reasoning---> Gemini Pro ($$$)
```
| Task Type | Recommended Model Tier | Rationale |
|---|---|---|
| Intent classification | Small / Flash-Lite | Simple classification task |
| Information retrieval | Medium / Flash | Needs good comprehension, moderate generation |
| Complex reasoning | Large / Pro | Multi-step reasoning, nuanced judgment |
| Summarization | Medium / Flash | Good balance of quality and cost |
| Safety checks | Small / Flash-Lite | Pattern matching, classification |
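The router itself can be a small classifier model or plain heuristics. A deliberately naive heuristic sketch (the keyword rules and tier names are assumptions for illustration):

```python
# Naive complexity router: real systems often use a small classifier
# model for this step, but simple heuristics illustrate the idea.
TIERS = {"simple": "flash-lite", "medium": "flash", "complex": "pro"}

REASONING_HINTS = ("why", "compare", "plan", "step by step", "trade-off")

def classify(query: str) -> str:
    """Bucket a query into a complexity tier using crude signals."""
    q = query.lower()
    if any(hint in q for hint in REASONING_HINTS):
        return "complex"
    if len(q.split()) > 20:   # long queries get a mid-tier model
        return "medium"
    return "simple"

def pick_model(query: str) -> str:
    return TIERS[classify(query)]
```

A production router would likely use a cheap model to do the classification itself; the cost of that extra call is usually recovered many times over by avoiding the large model on simple queries.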
Cache responses for repeated or similar queries to avoid redundant LLM calls.
| Caching Strategy | When to Use |
|---|---|
| Exact match cache | FAQ-style queries where many users ask the same thing |
| Semantic cache | Queries that are different in wording but identical in meaning |
| Tool result cache | Tool outputs that do not change frequently (e.g., product catalog lookups) |
| Prompt cache | Reuse cached prefixes for system prompts across calls (Vertex AI supports context caching) |
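The exact-match strategy is the simplest to sketch: normalize the query, look it up, and expire stale entries. A minimal in-memory version (production systems would typically back this with Redis or a managed cache):

```python
import time
from typing import Optional

class ResponseCache:
    """Minimal exact-match response cache with per-entry TTL."""

    def __init__(self, ttl_seconds: float = 3600):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, str]] = {}

    def get(self, query: str) -> Optional[str]:
        entry = self._store.get(query.strip().lower())
        if entry is None:
            return None
        stored_at, response = entry
        if time.time() - stored_at > self.ttl:
            return None  # expired
        return response

    def put(self, query: str, response: str) -> None:
        self._store[query.strip().lower()] = (time.time(), response)

cache = ResponseCache(ttl_seconds=600)
cache.put("What is your return policy?", "Returns are accepted within 30 days.")
hit = cache.get("what is your return policy?  ")  # normalization gives a hit
```

A semantic cache replaces the exact-match lookup with an embedding similarity search, at the cost of an embedding call per query.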
Set hard limits on how many tokens an agent can consume per session.
```python
# Example: Token budget enforcement
class TokenBudget:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used_tokens = 0

    def can_proceed(self, estimated_tokens: int) -> bool:
        return (self.used_tokens + estimated_tokens) <= self.max_tokens

    def record_usage(self, actual_tokens: int):
        self.used_tokens += actual_tokens

# Usage
budget = TokenBudget(max_tokens=50000)  # per session

while agent.has_next_step():
    estimated = agent.estimate_next_step_tokens()
    if not budget.can_proceed(estimated):
        agent.respond("I have reached my processing limit for this session. "
                      "Let me summarize what I have found so far.")
        break
    result = agent.execute_next_step()
    budget.record_usage(result.tokens_used)
```
Track costs at multiple levels:
| Level | What to Track | Why |
|---|---|---|
| Per request | Tokens used, model tier, tool calls | Debug expensive individual requests |
| Per session | Total cost of a conversation | Set and enforce per-session budgets |
| Per user | Aggregate cost per user over time | Identify usage patterns and outliers |
| Per feature | Cost of specific agent capabilities | Decide which features are cost-effective |
| Overall | Daily/weekly/monthly spend | Budget planning and forecasting |
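Rolling request-level token usage up to session and overall totals can be sketched as follows (the per-1k-token prices are placeholders, not real pricing):

```python
from collections import defaultdict

# Placeholder per-1k-token (input, output) prices; real pricing varies
# by model and changes over time.
PRICE_PER_1K = {"flash-lite": (0.0001, 0.0004), "flash": (0.0003, 0.0012)}

class CostTracker:
    """Roll request-level token usage up to session and overall totals."""

    def __init__(self):
        self.by_session = defaultdict(float)
        self.total = 0.0

    def record(self, session_id: str, model: str,
               input_tokens: int, output_tokens: int) -> float:
        in_price, out_price = PRICE_PER_1K[model]
        cost = (input_tokens / 1000) * in_price + (output_tokens / 1000) * out_price
        self.by_session[session_id] += cost
        self.total += cost
        return cost

tracker = CostTracker()
tracker.record("sess_1", "flash", input_tokens=1200, output_tokens=250)
```

Per-user and per-feature rollups follow the same pattern with additional keys; the important part is capturing token counts and model tier on every request so aggregation is possible at all.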
Before launching your agent to production users, walk through this checklist:
- Eval suite covers functional correctness, tool usage, safety, latency, cost, and regressions
- Safety evals pass 100%, with a hard gate that blocks deployment on any failure
- Rollout plan defined (canary or blue-green) with a tested rollback procedure
- Structured logs, traces, and metrics are wired up, with dashboards and alert thresholds
- Cost controls in place: model routing where appropriate, caching, and per-session token budgets
- Prompts are under version control with a known-good version to roll back to
- Incident response owners and escalation paths are defined
The prototype-to-production gap is real and large. Plan for production concerns from the beginning. The “last mile” is the majority of the work.
Evaluation-gated deployment is non-negotiable. No agent version should reach production without passing a comprehensive eval suite. Your eval suite is your quality guarantee.
CI/CD for agents has three phases. Pre-merge checks catch obvious issues fast. Post-merge validation runs broader evals. Production gates ensure safety and quality before real users are affected.
Safe rollout strategies limit blast radius. Canary deployments, feature flags, and A/B testing let you catch problems before they affect all users.
Observability is essential. You cannot improve what you cannot see. Invest in logs, traces, and metrics from day one.
Cost management requires active attention. Model routing, caching, and token budgets can reduce costs dramatically without sacrificing quality.
Production is the beginning, not the end. The Observe-Act-Evolve loop means your agent continuously improves based on real-world usage.
Next lesson: Getting Started with Vertex AI and ADK