Memory in AI Systems: From Agent Recall to Efficient LLM Caching
Memory is the backbone of intelligent systems. Without it, an AI agent would be like a person with severe amnesia—unable to learn from past interactions, reason across documents, or even maintain context within a conversation. Yet "memory" in AI means vastly different things depending on the context: it could be an agent's ability to recall facts from a knowledge base, an LLM's internal cache of attention states during generation, or a neural network's tendency to memorize training data instead of generalizing from it.
This article explores memory across three critical dimensions in modern AI systems: how agents remember and retrieve information, how language models optimize memory usage for efficient inference, and how we prevent unwanted memorization that leads to poor generalization. By the end, you'll understand both the theoretical foundations and practical implementation patterns that determine whether your AI system recalls accurately, forgets gracefully, or wastes precious memory on redundant computation.
Part 1: Memory in Multi-Agent Systems
Why Agent Memory Matters
Imagine you're building a customer service AI that spans multiple conversations over weeks. Without memory, it would ask "What's your account number?" every single time you contact it. With memory, it learns your preferences, recalls past issues, and even adapts its communication style. This isn't just convenience—it's the difference between a brittle tool and a genuinely helpful system.
Modern multi-agent systems like CrewAI, OpenAI's Agents SDK, and Agno all implement memory, but in strikingly different ways. A recent research framework called MemoryAgentBench attempted to answer a fundamental question: What does it actually mean for an agent to "remember"?
The answer? Memory isn't one capability—it's four:
- Accurate Retrieval (AR): Factual recall and multi-hop reasoning across long contexts
- Test-Time Learning (TTL): In-session learning of new concepts and user preferences
- Long-Range Understanding (LRU): Cross-document reasoning and high-level abstraction
- Selective Forgetting (SF): Intentional memory revision and recency bias handling
Understanding these dimensions is crucial because different applications prioritize them differently. A financial analysis agent needs exceptional accurate retrieval; a chatbot learning user preferences during a session needs strong test-time learning; a research assistant needs long-range understanding across dozens of papers.
The Four Competencies of Agent Memory
Accurate Retrieval: Factual Recall and Multi-Hop Reasoning
At its core, accurate retrieval means: given a query, can the agent find the right information in its memory and reason over it correctly?
This isn't trivial. Consider this example: "What are the carbon emissions from electricity generation in the region where [Company X] has its largest facility?"
This requires:
- Retrieving information about [Company X]'s facilities
- Finding the location of the largest one
- Looking up regional electricity generation emissions
- Aggregating and reasoning over multiple pieces of information
Evaluation uses long-context retrieval tasks where the agent must sift through relevant and irrelevant information to find answers. The key metric: does it return factually correct information, or does it hallucinate?
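To make the hop structure concrete, here is a toy sketch of that chain over an in-memory knowledge base. All keys, figures, and function names are hypothetical illustrations, not any framework's API:

```python
# Toy multi-hop retrieval over an in-memory knowledge base.
KB = {
    "facilities:CompanyX": [
        ("Plant A", "Texas", 1200),   # (name, region, headcount)
        ("Plant B", "Ohio", 800),
    ],
    "grid_emissions": {"Texas": 0.41, "Ohio": 0.55},  # kg CO2 per kWh
}

def emissions_for_largest_facility(company: str) -> float:
    # Hop 1: retrieve the company's facilities
    facilities = KB[f"facilities:{company}"]
    # Hop 2: find the largest facility
    _, region, _ = max(facilities, key=lambda f: f[2])
    # Hop 3: look up the regional emissions intensity
    return KB["grid_emissions"][region]

print(emissions_for_largest_facility("CompanyX"))  # 0.41 (Plant A is in Texas)
```

Each hop depends on the previous one's result, which is exactly what makes multi-hop questions harder than single lookups: an error at any hop corrupts the final answer.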
Test-Time Learning: Learning During Interactions
Unlike traditional machine learning where learning happens offline during training, test-time learning (TTL) means the agent improves during its actual interactions with users.
Example: A user tells your agent "I prefer summaries in bullet points, not paragraphs." A good agent should remember this preference for the remainder of the conversation and future interactions. This is fundamentally different from retrieving facts—it's about the agent adapting to individual user behavior.
TTL is evaluated by:
- Introducing new concepts or preferences mid-conversation
- Checking if the agent applies them correctly to subsequent queries
- Measuring how quickly the agent "learns" these patterns
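The evaluation loop above can be sketched with a toy learner that records a stated preference and applies it to later outputs. The class name and its keyword trigger are illustrative assumptions; a real agent would extract preferences with an LLM rather than string matching:

```python
# Minimal sketch of test-time preference learning.
class PreferenceLearner:
    def __init__(self):
        self.prefs = {}

    def observe(self, message: str) -> None:
        # Toy extraction: key off a phrase for demonstration purposes only.
        if "bullet points" in message.lower():
            self.prefs["format"] = "bullets"

    def render(self, items: list) -> str:
        # Apply the learned preference to all subsequent outputs
        if self.prefs.get("format") == "bullets":
            return "\n".join(f"- {item}" for item in items)
        return " ".join(items)

agent_prefs = PreferenceLearner()
agent_prefs.observe("I prefer summaries in bullet points, not paragraphs")
print(agent_prefs.render(["Revenue up 12%", "Margins flat"]))
```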
Long-Range Understanding: Cross-Document Reasoning
Many real-world problems require synthesizing information across multiple documents. A research agent analyzing 50 papers, a compliance system checking multiple regulatory documents, or an analyst building a competitive intelligence report all need long-range understanding.
This competency tests whether an agent can:
- Identify patterns across documents
- Spot contradictions in multiple sources
- Abstract high-level themes from diverse information
- Reason about relationships between concepts in different documents
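The contradiction-spotting item can be illustrated with a toy checker that flags the same attribute claimed with different values in two sources. The document names and claims are hypothetical; real systems would extract claims with an LLM first:

```python
# Toy cross-document contradiction check over pre-extracted claims.
docs = {
    "report_2023.txt": {"CompanyX_hq": "Austin"},
    "filing_2024.txt": {"CompanyX_hq": "Dallas"},
}

def find_contradictions(documents: dict) -> list:
    seen = {}       # attribute -> (source, value) first seen
    conflicts = []
    for doc, claims in documents.items():
        for key, value in claims.items():
            if key in seen and seen[key][1] != value:
                conflicts.append((key, seen[key], (doc, value)))
            else:
                seen.setdefault(key, (doc, value))
    return conflicts

print(find_contradictions(docs))
# [('CompanyX_hq', ('report_2023.txt', 'Austin'), ('filing_2024.txt', 'Dallas'))]
```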
Selective Forgetting: Intentional Memory Revision
This is the counterintuitive one: good memory systems sometimes need to forget.
Consider: A user corrects themselves mid-conversation—"Actually, I meant X, not Y." A good agent should update its memory, not stack new information on top of old misinformation. Or imagine an agent that knows outdated information; it needs to be able to "unlearn" old facts when better ones arrive.
Selective forgetting is measured through:
- Counterfactual updates: Does the agent correctly revise its understanding when given contradictory information?
- Recency bias: Does it prioritize recent information over stale data?
- Context window truncation: Can it cleanly drop irrelevant old context?
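A minimal last-write-wins store captures the first two behaviors: a correction replaces the old fact instead of stacking on top of it, and stale entries can be deleted outright. Class and method names here are hypothetical:

```python
# Sketch of a revisable memory: counterfactual updates overwrite,
# and forgetting is explicit and targeted.
class RevisableMemory:
    def __init__(self):
        self._facts = {}   # key -> (value, logical timestamp)
        self._clock = 0

    def ingest(self, key, value):
        self._clock += 1
        self._facts[key] = (value, self._clock)   # overwrite, don't stack

    def recall(self, key):
        entry = self._facts.get(key)
        return entry[0] if entry else None

    def forget(self, key):
        self._facts.pop(key, None)   # intentional deletion of stale facts

m = RevisableMemory()
m.ingest("meeting_day", "Tuesday")
m.ingest("meeting_day", "Thursday")   # "Actually, I meant Thursday, not Tuesday"
print(m.recall("meeting_day"))        # Thursday, not Tuesday
```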
How Different Frameworks Implement Memory
Here's where theory meets reality. Different frameworks make radically different architectural choices:
OpenAI Agents SDK: Conversation Accumulation
The OpenAI Agents SDK treats memory as append-only conversation history backed by SQLite:
```python
# Conceptual representation of OpenAI SDK memory
class OpenAIAgentMemory:
    def __init__(self):
        self.db = SQLite("agent_sessions.db")
        self.context_window = 4096   # tokens
        self.context_overlap = 200   # tokens

    def store_interaction(self, user_message, agent_response):
        # Simply append to conversation history
        self.db.insert("interactions", {
            "session_id": current_session,
            "user_msg": user_message,
            "agent_response": agent_response,
            "timestamp": now()
        })

    def get_context(self, current_query):
        # Retrieve recent interactions up to the context limit
        all_history = self.db.query("SELECT * FROM interactions")
        # Chunk with overlap: 4096 tokens max, 200 token overlap
        chunked = chunk_with_overlap(all_history,
                                     max_tokens=4096,
                                     overlap_tokens=200)
        # Return the most recent chunk that fits the context window
        return chunked[-1]
```
Strengths: Simple, no risk of losing information, deterministic.
Weaknesses: As the conversation grows, older context gets dropped; there is no intelligent retrieval; context budgeting becomes the bottleneck.
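The snippet above leans on a `chunk_with_overlap` helper. Here is one plausible implementation, with a plain token list standing in for tokenized history; this is a sketch under those assumptions, not the SDK's actual code:

```python
def chunk_with_overlap(tokens, max_tokens=4096, overlap_tokens=200):
    """Split a token sequence into windows of at most max_tokens, where
    each window repeats the last overlap_tokens of the previous one."""
    if max_tokens <= overlap_tokens:
        raise ValueError("max_tokens must exceed overlap_tokens")
    step = max_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break
    return chunks

# 10,000 "tokens" of history -> three overlapping windows
tokens = list(range(10_000))
chunks = chunk_with_overlap(tokens, max_tokens=4096, overlap_tokens=200)
assert chunks[1][:200] == chunks[0][-200:]   # 200-token overlap between windows
```

The overlap exists so that a sentence cut at a chunk boundary still appears whole in the next window.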
CrewAI: Multi-Tier Memory Architecture
CrewAI implements three distinct memory levels:
```python
from crewai import Agent, Task, Crew
from crewai.memory import ShortTermMemory, LongTermMemory, EntityMemory

# Short-term: current task context (in-memory)
short_term = ShortTermMemory()

# Long-term: persistent storage across tasks
long_term = LongTermMemory()

# Entity memory: specific facts about entities (people, places, etc.)
entity_memory = EntityMemory()

class ResearchAgent(Agent):
    def __init__(self):
        super().__init__(
            role="Research Analyst",
            goal="Analyze market trends",
            short_term_memory=short_term,
            long_term_memory=long_term,
            entity_memory=entity_memory
        )

    def execute_task(self, task: Task):
        # Memory is automatically injected into the agent's context
        # before each decision point
        result = self.think_and_act(task)
        return result
```
Key difference: CrewAI has explicit ingestion and querying agents. When you ask a question, it doesn't just append—it routes through memory infrastructure.
Strength: Sophisticated multi-tier abstraction, memory-aware decision making.
Weakness: Opaque retrieval (you can't easily see why certain memories were selected).
Agno: Automatic SQLite Capture
Agno takes a middle path—automatic capture with less ceremony:
```python
from agno import Agent
from agno.memory import AgentMemory

# Memory is captured automatically during execution
agent = Agent(
    name="data_analyst",
    memory=AgentMemory(
        storage="sqlite://agent_memory.db",
        auto_capture=True  # automatically store all interactions
    )
)

# During agent.run(), all queries, responses, and reasoning are captured
# and available for retrieval in subsequent calls
result = agent.run("What were the Q3 revenue trends?")
```
Strength: Minimal code overhead, automatic persistence.
Weakness: Limited control over what gets captured and how.
Building a Memory Adapter: Unified Interface
Here's a practical pattern for working across different frameworks:
```python
from abc import ABC, abstractmethod
from typing import List, Optional
from dataclasses import dataclass
from datetime import datetime

@dataclass
class MemoryEntry:
    """Unified memory entry across frameworks"""
    content: str
    timestamp: datetime
    tags: List[str]
    metadata: dict  # framework-specific metadata

class MemoryAdapter(ABC):
    """Standard interface for heterogeneous memory backends"""

    @abstractmethod
    def reset(self) -> None:
        """Clear all session memory"""

    @abstractmethod
    def ingest(self, context: str, tags: List[str] = None) -> None:
        """Store context with optional tags for later retrieval"""

    @abstractmethod
    def query(self, question: str, top_k: int = 5) -> List[MemoryEntry]:
        """Retrieve the most relevant memories matching the question"""

    @abstractmethod
    def forget(self, tag: str) -> None:
        """Selectively forget memories matching a tag"""

# Implementation for the OpenAI SDK.
# Conceptual: SQLite, similarity_score, and _embed_batch are stand-ins
# for a real database wrapper and embedding model.
class OpenAIMemoryAdapter(MemoryAdapter):
    def __init__(self, session_id: str):
        self.session_id = session_id
        self.db = SQLite("openai_memory.db")

    def reset(self) -> None:
        self.db.delete("memories", f"session_id = '{self.session_id}'")

    def ingest(self, context: str, tags: List[str] = None) -> None:
        self.db.insert("memories", {
            "session_id": self.session_id,
            "content": context,
            "tags": ",".join(tags or []),
            "timestamp": datetime.now()
        })

    def query(self, question: str, top_k: int = 5) -> List[MemoryEntry]:
        # Use semantic similarity to find relevant memories
        embeddings = self._embed_batch([question])
        memories = self.db.query(f"""
            SELECT content, timestamp, tags FROM memories
            WHERE session_id = '{self.session_id}'
        """)
        # Score by semantic relevance (simplified—use real embeddings)
        scored = [(m, similarity_score(embeddings[0], m["content"]))
                  for m in memories]
        top = sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]
        return [MemoryEntry(
            content=m[0]["content"],
            timestamp=m[0]["timestamp"],
            tags=m[0]["tags"].split(","),
            metadata={}
        ) for m in top]

    def forget(self, tag: str) -> None:
        self.db.delete("memories",
                       f"session_id = '{self.session_id}' AND tags LIKE '%{tag}%'")

# Usage
adapter = OpenAIMemoryAdapter(session_id="user_123")
adapter.ingest("User prefers summaries in bullet points", tags=["preference"])
adapter.ingest("Company X is in the automotive industry", tags=["company_X"])

results = adapter.query("What format does the user prefer?", top_k=1)
print(results[0].content)  # "User prefers summaries in bullet points"
```
This adapter pattern enables you to:
- Swap frameworks without rewriting memory logic
- Evaluate different frameworks fairly
- Add cross-cutting concerns like compliance logging
- Test memory behavior independently
Part 2: Memory Efficiency in Language Models
While agent memory asks "What do we remember?", LLM memory asks a different question: "How do we store and retrieve information efficiently during generation?"
When an LLM generates text, it computes attention over all previous tokens. For long conversations, this becomes a memory and computation bottleneck. A 10-turn conversation with 1,000 tokens per turn creates 10,000 tokens of Key-Value (KV) states that must be stored and re-read on every decoding step. Scale to 100 turns on a large deployment, and you're storing tens of gigabytes of attention states per user.
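A back-of-the-envelope calculator makes the scale concrete. The model shape below is an assumed 7B-class configuration (32 layers, hidden dimension 4096) with full multi-head attention; real models using grouped-query attention cache proportionally less:

```python
def kv_cache_bytes(num_tokens, num_layers, hidden_dim, bytes_per_value):
    # Each token stores one Key and one Value vector (hence the factor 2)
    # of size hidden_dim, at every layer.
    return num_tokens * num_layers * hidden_dim * 2 * bytes_per_value

# 100 turns x 1,000 tokens, 32 layers, hidden_dim 4096, FP16 (2 bytes)
size = kv_cache_bytes(100_000, 32, 4096, 2)
print(f"{size / 1024**3:.1f} GiB")  # 48.8 GiB
```

Note that the cache grows linearly in conversation length but is multiplied across every layer, which is why long-context serving is dominated by KV storage rather than weights.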
The KV Cache Problem
To understand the solution, you need to understand the problem:
```python
# Simplified attention computation
import torch
import torch.nn.functional as F

def attention(Query, Key, Value, mask=None):
    """Standard attention: O(N²) in sequence length"""
    # Scale by sqrt(d_k) as in standard scaled dot-product attention
    scores = torch.matmul(Query, Key.transpose(-2, -1)) / Query.size(-1) ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    output = torch.matmul(weights, Value)
    return output

def autoregressive_generation(model, prompt, max_tokens=100):
    """Token-by-token generation with a KV cache"""
    input_ids = tokenize(prompt)
    cache = {}       # stores Key, Value tensors for each layer
    generated = []

    for step in range(max_tokens):
        # Forward pass reuses cached KV states for all previous tokens
        logits, new_cache = model(input_ids, kv_cache=cache)
        cache = new_cache

        # Sample the next token and append it to the input
        next_token = sample(logits[-1, :])
        input_ids = torch.cat([input_ids, next_token.unsqueeze(0)])
        generated.append(next_token)

    return generated

# Memory cost of the KV cache (per layer):
# batch_size=32, seq_len=4096, hidden_dim=4096
# bytes = batch * seq_len * hidden_dim * 2 (K and V) * bytes_per_value
# FP32 (4 bytes): 32 * 4096 * 4096 * 2 * 4 ≈ 4.3 GB per layer
# BF16 (2 bytes): ≈ 2.1 GB per layer
# FP8  (1 byte):  ≈ 1.1 GB per layer (50% smaller than BF16)
```
The problem: At generation time, you shift from compute-bound (matrix-matrix multiplication) to memory-bound (memory bandwidth limitation). Each new token requires you to read all previous KV states—no way around it.
But you can reduce the size of KV states. Enter three optimization techniques:
Solution 1: Multi-Head Latent Attention (MLA)
Instead of storing full Key and Value vectors for each head, compress them into shared latent representations:
```python
class MultiHeadLatentAttention(torch.nn.Module):
    """Compress KV across heads into shared latent vectors"""
    def __init__(self, hidden_dim, num_heads, latent_dim):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_heads = num_heads
        self.head_dim = hidden_dim // num_heads
        self.latent_dim = latent_dim

        # Project to a latent KV space (shared across heads)
        self.to_latent_kv = torch.nn.Linear(hidden_dim, 2 * latent_dim)
        # Project latent KV back up to per-head keys and values
        self.from_latent_kv = torch.nn.Linear(latent_dim, num_heads * self.head_dim)

    def forward(self, hidden_states, kv_cache=None):
        batch, seq_len, hidden = hidden_states.shape

        # Compress: hidden_dim -> latent_dim. Only these latents need to be
        # cached, shrinking the KV cache by roughly hidden_dim / latent_dim.
        latent_kv = self.to_latent_kv(hidden_states)
        latent_k, latent_v = latent_kv.chunk(2, dim=-1)

        # At attention time, decompress the latents back to per-head K and V.
        # NOTE: a production MLA (e.g., DeepSeek-V2) uses separate up-projections
        # for keys and values; this sketch reuses one for brevity.
        keys = self.from_latent_kv(latent_k).view(batch, seq_len,
                                                  self.num_heads, self.head_dim)
        values = self.from_latent_kv(latent_v).view(batch, seq_len,
                                                    self.num_heads, self.head_dim)
        # ... standard multi-head attention over keys/values follows
```

The key idea: only the compact latents are written to the KV cache, while full per-head keys and values are reconstructed on the fly when attention is computed.