Imagine you are working at a desk. The desk is your agent’s context window - it is where you put all the stuff you are actively working with.
The desk has limited space. You can spread out notes, reference books, your laptop, and a cup of coffee. But if you keep piling things on, eventually the desk gets so cluttered that you cannot find anything. Important notes get buried under less relevant papers. You start losing track of what you were doing.
That is exactly what happens with an AI agent’s context. The context window is the agent’s working surface. Everything the agent needs to think about - the user’s question, conversation history, tool results, instructions - has to fit on this desk. When the desk gets full, the agent has to decide what stays and what goes.
Good memory management is like being organized at your desk: keeping the important stuff visible and accessible while storing less critical information somewhere you can find it later.
The context window is the total amount of text (measured in tokens) that an LLM can process in a single request. Think of it as the model’s working memory - everything it can “see” at once.
+------------------------------------------------------------------+
| CONTEXT WINDOW |
| |
| [System instructions] |
| [Tool definitions] |
| [Conversation history - user and agent messages] |
| [Retrieved documents / RAG results] |
| [Current user message] |
| [Agent's working thoughts] |
| |
| ... all must fit here ... |
+------------------------------------------------------------------+
Context windows have grown dramatically:
| Model | Context window |
|---|---|
| GPT-3 (2020) | 2K tokens |
| GPT-4 Turbo (2023) | 128K tokens |
| Gemini 1.5 Pro | 1M tokens |
| Gemini 2.0 | 1M+ tokens |
A token is roughly 3/4 of a word in English. So 1 million tokens is approximately 750,000 words - about 10 novels' worth of text.
You might think a bigger context window solves all problems. It helps, but there are trade-offs:
| Factor | Small context | Large context |
|---|---|---|
| Cost | Less expensive per request | More expensive per request |
| Latency | Faster responses | Slower responses |
| Accuracy | Focused, less noise | May struggle to find relevant info in a sea of text |
| Simplicity | Forces you to be selective | Tempting to dump everything in |
Research has shown that LLMs can struggle with information placed in the middle of very long contexts - sometimes called the “lost in the middle” effect. Just because you can fit 1M tokens does not mean you should.
Prompt engineering is about crafting the right input text to get a good output from an LLM. You write a static prompt, maybe with a few examples, and send it off.
This works well for simple tasks, but it falls apart for agents. Why? Because agents deal with dynamic, changing information: tool results arrive mid-task, conversation history keeps growing, and the relevant documents change from turn to turn.
Context engineering is the practice of dynamically assembling the right information into the context window at the right time. Instead of a static prompt, you are building a system that decides what goes into the context, when it goes in, and what gets left out.
Think of prompt engineering as writing a good email. Context engineering is building the mail system.
Prompt Engineering: Context Engineering:
+------------------+ +---------------------------+
| Static prompt | | System instructions |
| + user question | | + Relevant memory |
| | | + Session history (pruned) |
| | | + Tool schemas (selected) |
| | | + Retrieved context |
| | | + Current task state |
| | | |
| "Answer this" | | All dynamically assembled |
+------------------+ +---------------------------+
An agent making a hotel booking needs different context than one writing code. Even the same agent needs different context at different stages of a task. Context engineering is what makes this dynamic assembly possible.
Agents need different types of memory for different purposes. This mirrors how human memory works.
What it is: Short-term memory is the current conversation between the user and the agent. It lives directly in the context window.
Human analogy: Your working memory - what you are actively thinking about right now.
Characteristics:
- Lives in the context window, so no retrieval step is needed
- Lasts only for the current session
- Grows automatically as the conversation continues
- Limited by the size of the context window
Example:
User: "My name is Alex and I prefer dark mode."
Agent: "Got it, Alex. I will use dark mode settings."
[50 messages later]
User: "What theme am I using?"
Agent: "You are using dark mode, Alex." <-- Only works if the earlier
message is still in context
What it is: Long-term memory is information stored outside the context window that persists between sessions. The agent retrieves it when relevant.
Human analogy: Your long-term memories - facts you have learned and experiences you remember, even if you are not thinking about them right now.
Characteristics:
- Persists across sessions
- Stored externally (a database, vector store, or user profile)
- Written explicitly by the agent or system
- Loaded into context only when relevant

Example uses:
- Remembering user preferences ("prefers dark mode")
- Recalling facts from past conversations
- Personalizing behavior across repeated interactions
What it is: Procedural memory is knowledge about how to perform tasks - workflows, standard operating procedures, and step-by-step processes.
Human analogy: Knowing how to ride a bike or type on a keyboard. You do not think about each step - you just know how.
Characteristics:
- Effectively permanent
- Defined in the system prompt or configuration
- Updated by developers, not by the agent itself
Example:
Procedural memory: "When a user reports a bug:
1. Ask for reproduction steps
2. Check the error logs
3. Search for similar issues
4. Propose a fix or workaround"
What it is: Declarative memory is factual information the agent knows or has access to - data, documents, specifications, and reference material.
Human analogy: Knowing that Paris is the capital of France. A fact you can state, not a skill you perform.
Characteristics:
- Duration varies with the underlying data
- Stored in knowledge bases or accessed via RAG
- Updated by data pipelines rather than by conversations
| Memory type | Duration | Location | Updated how | Example |
|---|---|---|---|---|
| Short-term | One session | Context window | Automatically (conversation grows) | Current chat messages |
| Long-term | Across sessions | External storage | Explicitly by the agent or system | User preferences |
| Procedural | Permanent | System prompt / config | By developers | Workflow instructions |
| Declarative | Varies | Knowledge bases / RAG | By data pipelines | Product documentation |
A session is the container that holds a conversation between a user and an agent. Sessions give agents continuity - the ability to remember what has happened so far in this interaction.
Session
+-----------------------------------------------+
| Session ID: abc-123 |
| User ID: user-456 |
| Created: 2025-01-15T10:30:00Z |
| |
| Events: |
| [User message: "Help me plan a trip"] |
| [Agent message: "Where would you like..."] |
| [User message: "Tokyo, 5 days"] |
| [Tool call: flight_search(dest="Tokyo")] |
| [Tool result: {flights: [...]}] |
| [Agent message: "I found several..."] |
| |
| State: |
| destination: "Tokyo" |
| duration: "5 days" |
| budget: null |
+-----------------------------------------------+
State is useful because it gives the agent quick access to important facts without having to re-read the entire conversation history.
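A session like the one above can be modeled as a small container of events plus a state dictionary. This is a minimal sketch - the field and method names are illustrative, not the API of any particular framework:

```python
from dataclasses import dataclass, field

@dataclass
class Session:
    session_id: str
    user_id: str
    events: list = field(default_factory=list)   # chronological messages and tool calls
    state: dict = field(default_factory=dict)    # key facts extracted so far

    def add_event(self, event: dict):
        self.events.append(event)

    def update_state(self, key: str, value):
        # Quick-access facts, so the agent need not re-read all events
        self.state[key] = value

session = Session(session_id="abc-123", user_id="user-456")
session.add_event({"role": "user", "content": "Tokyo, 5 days"})
session.update_state("destination", "Tokyo")
session.update_state("duration", "5 days")
```

The state dictionary is the part the agent consults first - it answers "what do we know so far?" without a pass over the full event log.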
Google Cloud provides session management through the Agent Development Kit (ADK) and Vertex AI. These services handle the infrastructure of storing and retrieving sessions, so you can focus on the agent logic.
Memory and Retrieval-Augmented Generation (RAG) are related but serve different purposes. Engineers often confuse them.
Memory is like having a personal assistant who remembers your preferences, your schedule, and your past conversations. They know you.
RAG is like having access to a library. When you need a fact, you look it up. The library does not know you personally - it just has a lot of information available.
| Aspect | Memory | RAG |
|---|---|---|
| What it stores | User-specific, interaction-specific data | General knowledge, documents, data |
| Who it is about | This user, this agent, this context | Anyone - it is shared knowledge |
| When it is written | During agent interactions | During data ingestion (usually offline) |
| When it is read | At the start of or during a session | When the agent needs specific information |
| Personalization | High - unique to the user | Low - same for everyone |
| Example | “This user prefers concise answers” | “The API rate limit is 100 requests/minute” |
In practice, agents use both:
User: "Remind me, what was that deployment issue we had last month?"
Agent:
1. Check MEMORY: "This user works on the payments team and
had a Cloud Run deployment failure on Jan 3."
2. Use RAG: Search incident reports for "Cloud Run deployment
failure January" to get specific details.
3. Combine: "Last month you had a Cloud Run deployment failure
related to a misconfigured service account. Here are the
details from the incident report..."
Memory tells the agent what to look for. RAG provides the detailed information.
We will cover RAG in depth in Lesson 8: Agentic RAG.
Context rot happens when critical information gets lost, diluted, or buried as the context window fills up. The agent “forgets” important things - not because the information was deleted, but because it is no longer in the part of the context the model pays attention to.
Remember the desk analogy? Context rot is what happens when your desk gets so cluttered that the critical sticky note with the database password gets buried under a pile of meeting notes. The note is technically still on the desk, but you cannot find it when you need it.
Context rot has several common causes.

Long conversations. Each turn adds messages to the context. After 50+ turns, early messages are far from where the model focuses.
Verbose tool results. A tool returns a large JSON blob. Most of it is irrelevant, but it takes up valuable context space.
Accumulated instructions. System instructions, few-shot examples, and guardrails take up space that could hold user-relevant information.
Repetitive content. Similar messages or results pile up without being consolidated.
How it works: The sliding window keeps only the most recent N messages in the context. Older messages are dropped.
Window size: 10 messages
Messages 1-5: [dropped]
Messages 6-15: [in context]
Message 16: [new message arrives, message 6 gets dropped]
Pros: Simple to implement, predictable memory usage.
Cons: Loses important early context. The user might reference something from message 2 that is no longer in the window.
Best for: Casual conversational agents where recent context matters most.
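The sliding window amounts to a single slice. A minimal sketch (the function name is illustrative):

```python
def apply_sliding_window(messages, window_size=10):
    """Keep only the most recent window_size messages; older ones drop off."""
    return messages[-window_size:]

history = [f"message {i}" for i in range(1, 17)]   # messages 1-16
context = apply_sliding_window(history)            # messages 7-16 survive
```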
How it works: Summarization periodically condenses older conversation turns into a summary that replaces them. The summary takes up much less space than the original messages.
Before summarization:
[20 detailed messages about trip planning] -> 4,000 tokens
After summarization:
[Summary: "User is planning a 5-day trip to Tokyo with a budget
of $3,000. They prefer boutique hotels and want to visit temples
and try local food. Flights are booked for March 15-20."] -> 200 tokens
Pros: Preserves key information while saving space.
Cons: Summarization can lose nuance. The agent performing the summary might miss something important.
Best for: Long-running sessions where historical context matters.
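A sketch of this approach, with the LLM call abstracted behind a `summarize` callable so the compaction logic stands on its own:

```python
def compact_history(messages, summarize, keep_recent=5):
    """Replace all but the most recent messages with one summary message.

    `summarize` is any callable that condenses a list of messages into a
    short string - in a real agent this would be an LLM call.
    """
    if len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = {"role": "system",
               "content": "Summary of earlier conversation: " + summarize(older)}
    return [summary] + recent

# Stand-in summarizer for illustration; a real one would prompt an LLM
messages = [{"role": "user", "content": f"turn {i}"} for i in range(20)]
fake_summarize = lambda older: f"{len(older)} earlier turns condensed"
compacted = compact_history(messages, fake_summarize)  # 1 summary + 5 recent
```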
How it works: Token truncation sets a token budget for each section of the context (system instructions, conversation history, tool results) and truncates any section that exceeds its budget.
Total budget: 32,000 tokens
System instructions: 4,000 tokens (fixed)
Tool definitions: 2,000 tokens (fixed)
Conversation history: 20,000 tokens (sliding)
Current turn: 6,000 tokens (reserved)
Pros: Fine-grained control over context allocation. Ensures space is always available for the current task.
Cons: Requires careful tuning. Hard boundaries can cut off mid-message.
Best for: Production agents where you need predictable costs and latency.
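A sketch of truncating the history section to its budget, assuming a crude characters-per-token heuristic; real systems would count with the model's actual tokenizer:

```python
def estimate_tokens(text):
    """Crude heuristic: roughly 4 characters per token in English."""
    return max(1, len(text) // 4)

def truncate_history(messages, budget_tokens):
    """Drop the oldest messages until the history fits its token budget."""
    total = sum(estimate_tokens(m) for m in messages)
    start = 0
    while total > budget_tokens and start < len(messages):
        total -= estimate_tokens(messages[start])
        start += 1
    return messages[start:]

history = ["x" * 40] * 10                       # 10 messages, ~10 tokens each
fitted = truncate_history(history, budget_tokens=50)
```

Note the hard-boundary caveat from above: this drops whole messages from the oldest end, which is gentler than cutting mid-message but can still lose early context.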
How it works: Importance selection scores each piece of context by relevance to the current task and keeps only the most relevant items.
Current task: "Book a hotel in Tokyo"
High relevance (keep):
- User's destination: Tokyo
- User's dates: March 15-20
- User's budget: $3,000
- User's hotel preferences: boutique
Low relevance (drop):
- Discussion about restaurant recommendations
- Earlier conversation about flight options (already booked)
Pros: Maximizes the signal-to-noise ratio in context.
Cons: Relevance scoring is imperfect. The agent might drop something that turns out to be important later.
Best for: Complex agents with many types of context competing for space.
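A deliberately naive sketch that scores relevance by word overlap with the task; production systems would use an embedding model or an LLM-based scorer instead:

```python
def select_by_relevance(items, task, keep=3):
    """Rank context items by word overlap with the current task,
    keeping only the top `keep` items."""
    task_words = set(task.lower().split())
    def score(item):
        return len(task_words & set(item.lower().split()))
    return sorted(items, key=score, reverse=True)[:keep]

context_items = [
    "destination: tokyo",
    "dates: march 15-20",
    "hotel preference: boutique hotel in tokyo",
    "restaurant recommendations discussed earlier",
]
relevant = select_by_relevance(context_items, "book a hotel in tokyo", keep=2)
```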
How it works: Externalize-and-retrieve stores information outside the context window (in a database, vector store, or knowledge graph) and pulls it back in only when needed.
Instead of keeping everything in context:
User preferences -> stored in user profile database
Past conversations -> stored in conversation archive
Reference docs -> stored in vector database
When needed:
Agent queries the relevant store and loads just the
pieces it needs into the current context.
Pros: Virtually unlimited storage. Only relevant information enters the context.
Cons: Adds latency for retrieval. Retrieval quality depends on the search system.
Best for: Agents that need access to large amounts of information but only use a small fraction at any time.
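A minimal sketch of externalized memory, using an in-process dict as a stand-in for a real database or vector store:

```python
class ExternalMemory:
    """Minimal external store keyed by topic. A production agent would use
    a database, vector store, or knowledge graph here instead of a dict."""
    def __init__(self):
        self._store = {}

    def save(self, topic, fact):
        self._store.setdefault(topic, []).append(fact)

    def retrieve(self, topic):
        # Only the requested slice of memory is loaded into context
        return self._store.get(topic, [])

memory = ExternalMemory()
memory.save("preferences", "prefers boutique hotels")
memory.save("trip", "flights booked for March 15-20")
loaded = memory.retrieve("preferences")   # just the relevant pieces
```

The key property is the last line: the trip facts stay outside the context until some turn actually asks about the trip.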
| Strategy | Context savings | Information loss risk | Complexity | Latency impact |
|---|---|---|---|---|
| Sliding window | High | High | Low | None |
| Summarization | High | Medium | Medium | Some (summarization step) |
| Token truncation | Medium | Medium | Medium | None |
| Importance selection | High | Medium | High | Some (scoring step) |
| Externalize + retrieve | Very high | Low | High | Higher (retrieval step) |
When you externalize memory, you need somewhere to put it. Here are the main options.
What they do: Vector databases store information as mathematical vectors (embeddings) and retrieve it by similarity.
How it works:
1. Each piece of text is converted into an embedding (a list of numbers) by an embedding model.
2. The embeddings are stored in an index built for fast similarity search.
3. A query is embedded the same way, and the store returns the entries whose embeddings are closest to it.
Good for: Finding semantically similar content. “What did we discuss about deployment?” will match past conversations about deploying, even if they used different words.
Examples: Vertex AI Vector Search, Pinecone, Weaviate, ChromaDB
Trade-offs:
- Excellent for fuzzy, meaning-based lookups
- Retrieval quality depends on the embedding model
- Does not capture explicit relationships between entities
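The similarity lookup at the heart of a vector database can be sketched with cosine similarity over toy embeddings (real embeddings come from an embedding model and have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def most_similar(query_vec, store):
    """store: list of (text, embedding) pairs. Returns the stored text
    whose embedding is closest to the query vector."""
    return max(store, key=lambda entry: cosine_similarity(query_vec, entry[1]))[0]

# Toy 3-dimensional "embeddings" for illustration only
store = [
    ("past conversation about deployment", [0.9, 0.1, 0.0]),
    ("past conversation about restaurants", [0.0, 0.2, 0.9]),
]
match = most_similar([0.8, 0.2, 0.1], store)
```

Because retrieval works on meaning-vectors rather than exact words, a query about "shipping a new release" could still land near the deployment entry.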
What they do: Knowledge graphs store information as entities and relationships in a graph structure.
How it works:
[User: Alex] --works_on--> [Project: Payments API]
[Project: Payments API] --deployed_on--> [Platform: Cloud Run]
[User: Alex] --prefers--> [Theme: Dark Mode]
Good for: Representing structured relationships between entities. “Who works on the Payments API?” or “What platform is the Payments API deployed on?”
Examples: Neo4j, Amazon Neptune, Google Cloud’s Knowledge Graph
Trade-offs:
- Precise answers for relationship and multi-hop queries
- Requires upfront modeling of entities and relationships
- Weaker at fuzzy, meaning-based matching than vector search
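The graph above can be sketched as a list of (subject, predicate, object) triples with a simple pattern-matching query - a toy stand-in for a real graph database:

```python
# The relationships from the diagram, encoded as triples
triples = [
    ("Alex", "works_on", "Payments API"),
    ("Payments API", "deployed_on", "Cloud Run"),
    ("Alex", "prefers", "Dark Mode"),
]

def query(subject=None, predicate=None, obj=None):
    """Match triples against a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

# "What platform is the Payments API deployed on?"
platforms = [o for _, _, o in query("Payments API", "deployed_on")]
```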
In practice, many systems combine these approaches - vector search, knowledge graphs, and simple key-value stores:
"What does Alex prefer?"
-> Key-value store: {theme: "dark", language: "TypeScript"}
"What did Alex and I discuss about deployments?"
-> Vector search: [similar past conversations about deployments]
"What services does Alex's team own?"
-> Knowledge graph: Alex -> team -> services -> dependencies
| Need | Best approach |
|---|---|
| Semantic search over conversations | Vector database |
| User preferences and settings | Key-value store |
| Entity relationships | Knowledge graph |
| Recent session history | In-memory / session store |
| Document retrieval | Vector database + metadata filters |
| Complex multi-hop queries | Knowledge graph |
Here is how memory and context fit into an agent architecture:
+-------------------+
| User Message |
+--------+----------+
|
v
+-------------+-------------+
| Context Engineering |
| Layer |
| |
| 1. Load system prompt |
| 2. Retrieve relevant |
| memory |
| 3. Load session history |
| (pruned/summarized) |
| 4. Add tool definitions |
| 5. Include current |
| message |
+-------------+-------------+
|
Assembled context
|
v
+--------+----------+
| LLM |
+--------+----------+
|
Agent response
|
v
+-------------+-------------+
| Memory Update Layer |
| |
| 1. Save to session |
| 2. Extract key facts |
| for long-term memory |
| 3. Update user profile |
+---------------------------+
The context engineering layer assembles the right context before the LLM sees it. The memory update layer extracts important information after the LLM responds. Together, they give the agent the ability to remember and stay focused.
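The assembly side of this architecture can be sketched as a single function; the step comments mirror the numbered steps in the diagram, and all names are illustrative rather than any framework's API:

```python
def assemble_context(system_prompt, memory_facts, history, tool_schemas,
                     user_message, history_limit=10):
    """Build the text the LLM will see, one section per step above."""
    pruned = history[-history_limit:]                     # step 3: prune history
    parts = [
        system_prompt,                                    # step 1: system prompt
        "Relevant memory:\n" + "\n".join(memory_facts),   # step 2: memory
        "Conversation so far:\n" + "\n".join(pruned),     # step 3: session history
        "Available tools:\n" + "\n".join(tool_schemas),   # step 4: tool definitions
        "User: " + user_message,                          # step 5: current message
    ]
    return "\n\n".join(parts)

context = assemble_context(
    system_prompt="You are a travel planning assistant.",
    memory_facts=["prefers boutique hotels"],
    history=["User: Tokyo, 5 days", "Agent: Great choice!"],
    tool_schemas=["flight_search(dest: str)"],
    user_message="Find me a hotel",
)
```

The memory update layer would run after the model responds, appending the new turn to `history` and promoting durable facts into whatever backs `memory_facts`.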
The context window is the agent’s working memory. Everything the agent considers has to fit in this space. Managing it well is critical.
Context engineering goes beyond prompt engineering. Agents need dynamic assembly of context, not just static prompts. What goes into the context window should change based on the situation.
Agents need multiple types of memory. Short-term for the current conversation, long-term for persistence across sessions, procedural for knowing how to do things, and declarative for facts.
Sessions provide conversational continuity. They track the history and state of an interaction, giving agents the ability to maintain coherent conversations.
Memory and RAG serve different purposes. Memory is personal and interaction-specific. RAG is about accessing general knowledge. Most agents need both.
Context rot is a real problem. As context grows, important information gets buried. Active management through summarization, pruning, and external storage keeps the agent effective.
Choose your storage based on your access patterns. Vector databases for semantic search, knowledge graphs for relationships, key-value stores for quick lookups. Hybrid approaches often work best.
Next lesson: Planning and Reasoning - How Agents Tackle Complex Tasks