AIPython

Memory in AI Systems: From Agent Recall to Efficient LLM Caching

3/29/2026
10 min read

Memory is the backbone of intelligent systems. Without it, an AI agent would be like a person with severe amnesia—unable to learn from past interactions, reason across documents, or even maintain context within a conversation. Yet "memory" in AI means vastly different things depending on the context: it could be an agent's ability to recall facts from a knowledge base, an LLM's internal cache of attention states during generation, or a neural network's tendency to memorize training data instead of generalizing from it.

This article explores memory across three critical dimensions in modern AI systems: how agents remember and retrieve information, how language models optimize memory usage for efficient inference, and how we prevent unwanted memorization that leads to poor generalization. By the end, you'll understand both the theoretical foundations and practical implementation patterns that determine whether your AI system recalls accurately, forgets gracefully, or wastes precious memory on redundant computation.

Part 1: Memory in Multi-Agent Systems

Why Agent Memory Matters

Imagine you're building a customer service AI that spans multiple conversations over weeks. Without memory, it would ask "What's your account number?" every single time you contact it. With memory, it learns your preferences, recalls past issues, and even adapts its communication style. This isn't just convenience—it's the difference between a brittle tool and a genuinely helpful system.

Modern multi-agent systems like CrewAI, OpenAI's Agents SDK, and Agno all implement memory, but in strikingly different ways. A recent research framework called MemoryAgentBench attempted to answer a fundamental question: What does it actually mean for an agent to "remember"?

The answer? Memory isn't one capability—it's four:

  1. Accurate Retrieval (AR): Factual recall and multi-hop reasoning across long contexts
  2. Test-Time Learning (TTL): In-session learning of new concepts and user preferences
  3. Long-Range Understanding (LRU): Cross-document reasoning and high-level abstraction
  4. Selective Forgetting (SF): Intentional memory revision and recency bias handling

Understanding these dimensions is crucial because different applications prioritize them differently. A financial analysis agent needs exceptional accurate retrieval; a chatbot learning user preferences during a session needs strong test-time learning; a research assistant needs long-range understanding across dozens of papers.

The Four Competencies of Agent Memory

Accurate Retrieval: Factual Recall and Multi-Hop Reasoning

At its core, accurate retrieval means: given a query, can the agent find the right information in its memory and reason over it correctly?

This isn't trivial. Consider this example: "What are the carbon emissions from electricity generation in the region where [Company X] has its largest facility?"

This requires:

  1. Retrieving information about [Company X]'s facilities
  2. Finding the location of the largest one
  3. Looking up regional electricity generation emissions
  4. Aggregating and reasoning over multiple pieces of information
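The four steps above can be sketched as a chain of dependent lookups over a toy knowledge base. Everything here is illustrative: the facts, entity names, and `lookup` helper are hypothetical, standing in for real retrieval calls.

```python
# Toy knowledge base; all facts below are hypothetical illustration data.
facts = {
    ("CompanyX", "facilities"): ["Austin", "Lyon"],
    ("CompanyX", "largest_facility"): "Austin",
    ("Austin", "grid_emissions_gCO2_per_kWh"): 389,
}

def lookup(entity, relation):
    """Single-hop retrieval from the knowledge base."""
    return facts[(entity, relation)]

def multi_hop_emissions(company):
    """Chain three retrievals: facilities -> largest site -> regional emissions."""
    _ = lookup(company, "facilities")                    # hop 1: all facilities
    site = lookup(company, "largest_facility")           # hop 2: pick the largest
    return lookup(site, "grid_emissions_gCO2_per_kWh")   # hop 3: regional data

print(multi_hop_emissions("CompanyX"))  # 389
```

The failure mode this exposes: if any intermediate hop retrieves the wrong entity, the final answer is confidently wrong, which is exactly what long-context evaluations probe for.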

Evaluation uses long-context retrieval tasks where the agent must sift through relevant and irrelevant information to find answers. The key metric: does it return factually correct information, or does it hallucinate?

Test-Time Learning: Learning During Interactions

Unlike traditional machine learning where learning happens offline during training, test-time learning (TTL) means the agent improves during its actual interactions with users.

Example: A user tells your agent "I prefer summaries in bullet points, not paragraphs." A good agent should remember this preference for the remainder of the conversation and future interactions. This is fundamentally different from retrieving facts—it's about the agent adapting to individual user behavior.

TTL is evaluated by:

  • Introducing new concepts or preferences mid-conversation
  • Checking if the agent applies them correctly to subsequent queries
  • Measuring how quickly the agent "learns" these patterns
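A minimal sketch of that evaluation loop, assuming a naive rule-based agent (a real system would use an LLM to extract preferences; the class and its matching rule here are hypothetical):

```python
class PreferenceAwareAgent:
    """Minimal test-time-learning sketch: remember stated preferences
    mid-conversation and apply them to later responses."""

    def __init__(self):
        self.preferences = {}

    def observe(self, message):
        # Naive preference extraction; a real agent would parse this with an LLM.
        if "bullet points" in message.lower():
            self.preferences["format"] = "bullets"

    def respond(self, points):
        # Apply the learned preference to all subsequent output
        if self.preferences.get("format") == "bullets":
            return "\n".join(f"- {p}" for p in points)
        return " ".join(points)

agent = PreferenceAwareAgent()
agent.observe("I prefer summaries in bullet points, not paragraphs.")
print(agent.respond(["Revenue up 12%", "Costs flat"]))
```

The TTL benchmark question is simply: does the second call reflect what the first call taught?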

Long-Range Understanding: Cross-Document Reasoning

Many real-world problems require synthesizing information across multiple documents. A research agent analyzing 50 papers, a compliance system checking multiple regulatory documents, or an analyst building a competitive intelligence report all need long-range understanding.

This competency tests whether an agent can:

  • Identify patterns across documents
  • Spot contradictions in multiple sources
  • Abstract high-level themes from diverse information
  • Reason about relationships between concepts in different documents

Selective Forgetting: Intentional Memory Revision

This is the counterintuitive one: good memory systems sometimes need to forget.

Consider: A user corrects themselves mid-conversation—"Actually, I meant X, not Y." A good agent should update its memory, not stack new information on top of old misinformation. Or imagine an agent that knows outdated information; it needs to be able to "unlearn" old facts when better ones arrive.

Selective forgetting is measured through:

  • Counterfactual updates: Does the agent correctly revise its understanding when given contradictory information?
  • Recency bias: Does it prioritize recent information over stale data?
  • Context window truncation: Can it cleanly drop irrelevant old context?
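The counterfactual-update case reduces to a design choice: key memories so that a correction overwrites the stale fact rather than accumulating beside it. A minimal sketch (the class and keying scheme are hypothetical):

```python
class RevisableMemory:
    """Sketch of selective forgetting: facts are keyed, and a correction
    overwrites the old value instead of stacking on top of it."""

    def __init__(self):
        self.facts = {}

    def remember(self, key, value):
        self.facts[key] = value  # newest value wins (recency bias)

    def recall(self, key):
        return self.facts.get(key)

mem = RevisableMemory()
mem.remember("user_meaning", "Y")
mem.remember("user_meaning", "X")  # "Actually, I meant X, not Y."
print(mem.recall("user_meaning"))  # X
```

An append-only log would return both "Y" and "X" here and leave disambiguation to the model; keyed revision makes the correction authoritative.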

How Different Frameworks Implement Memory

Here's where theory meets reality. Different frameworks make radically different architectural choices:

OpenAI Agents SDK: Conversation Accumulation

The OpenAI Agents SDK treats memory as append-only conversation history backed by SQLite:

python
# Conceptual representation of OpenAI SDK memory
class OpenAIAgentMemory:
    def __init__(self, session_id):
        self.db = SQLite("agent_sessions.db")
        self.session_id = session_id
        self.context_window = 4096  # tokens
        self.context_overlap = 200  # tokens
    
    def store_interaction(self, user_message, agent_response):
        # Simply append to conversation history
        self.db.insert("interactions", {
            "session_id": self.session_id,
            "user_msg": user_message,
            "agent_response": agent_response,
            "timestamp": datetime.now()
        })
    
    def get_context(self, current_query):
        # Retrieve recent interactions up to context limit
        all_history = self.db.query("SELECT * FROM interactions")
        
        # Chunk with overlap: 4096 tokens max, 200 token overlap
        chunked = chunk_with_overlap(all_history, 
                                     max_tokens=4096, 
                                     overlap_tokens=200)
        
        # Return most recent chunk that fits context window
        return chunked[-1]

Strengths: Simple, no risk of losing information, deterministic.
Weaknesses: As conversation grows, older context gets dropped; no intelligent retrieval; context budgeting becomes the bottleneck.
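The `chunk_with_overlap` helper used above is not an SDK function; a minimal sketch of what such a helper might do over a token list (assuming each window shares `overlap_tokens` with its predecessor):

```python
def chunk_with_overlap(tokens, max_tokens=4096, overlap_tokens=200):
    """Split a token list into windows of at most max_tokens,
    where consecutive windows share overlap_tokens (illustrative helper)."""
    chunks = []
    step = max_tokens - overlap_tokens  # how far each window advances
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # last window already covers the tail
    return chunks

chunks = chunk_with_overlap(list(range(10)), max_tokens=4, overlap_tokens=1)
print(chunks)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

The overlap preserves continuity at window boundaries, but note the cost: overlapping tokens are stored and re-read in every adjacent window.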

CrewAI: Multi-Tier Memory Architecture

CrewAI implements three distinct memory levels:

python
from crewai import Agent, Task, Crew
from crewai.memory import ShortTermMemory, LongTermMemory, EntityMemory

# Short-term: Current task context (in-memory)
short_term = ShortTermMemory()

# Long-term: Persistent storage across tasks
long_term = LongTermMemory()

# Entity memory: Specific facts about entities (people, places, etc.)
entity_memory = EntityMemory()

class ResearchAgent(Agent):
    def __init__(self):
        super().__init__(
            role="Research Analyst",
            goal="Analyze market trends",
            short_term_memory=short_term,
            long_term_memory=long_term,
            entity_memory=entity_memory
        )
    
    def execute_task(self, task: Task):
        # Memory is automatically injected into agent's context
        # before each decision point
        result = self.think_and_act(task)
        return result

Key difference: CrewAI has explicit ingestion and querying agents. When you ask a question, it doesn't just append—it routes through memory infrastructure.

Strength: Sophisticated multi-tier abstraction, memory-aware decision making.
Weakness: Opaque retrieval (you can't easily see why certain memories were selected).

Agno: Automatic SQLite Capture

Agno takes a middle path—automatic capture with less ceremony:

python
from agno import Agent
from agno.memory import AgentMemory

# Memory is captured automatically during execution
agent = Agent(
    name="data_analyst",
    memory=AgentMemory(
        storage="sqlite://agent_memory.db",
        auto_capture=True  # Automatically store all interactions
    )
)

# During agent.run(), all queries, responses, and reasoning are captured
# and available for retrieval in subsequent calls
result = agent.run("What were the Q3 revenue trends?")

Strength: Minimal code overhead, automatic persistence.
Weakness: Limited control over what gets captured and how.

Building a Memory Adapter: Unified Interface

Here's a practical pattern for working across different frameworks:

python
from abc import ABC, abstractmethod
from typing import List, Optional
from dataclasses import dataclass
from datetime import datetime

@dataclass
class MemoryEntry:
    """Unified memory entry across frameworks"""
    content: str
    timestamp: datetime
    tags: List[str]
    metadata: dict  # Framework-specific metadata

class MemoryAdapter(ABC):
    """Standard interface for heterogeneous memory backends"""
    
    @abstractmethod
    def reset(self) -> None:
        """Clear all session memory"""
        pass
    
    @abstractmethod
    def ingest(self, context: str, tags: List[str] = None) -> None:
        """Store context with optional tags for later retrieval"""
        pass
    
    @abstractmethod
    def query(self, question: str, top_k: int = 5) -> List[MemoryEntry]:
        """Retrieve most relevant memories matching question"""
        pass
    
    @abstractmethod
    def forget(self, tag: str) -> None:
        """Selectively forget memories matching a tag"""
        pass

# Implementation for OpenAI SDK
class OpenAIMemoryAdapter(MemoryAdapter):
    def __init__(self, session_id: str):
        self.session_id = session_id
        self.db = SQLite("openai_memory.db")
    
    def reset(self) -> None:
        self.db.delete("memories", f"session_id = '{self.session_id}'")
    
    def ingest(self, context: str, tags: List[str] = None) -> None:
        self.db.insert("memories", {
            "session_id": self.session_id,
            "content": context,
            "tags": ",".join(tags or []),
            "timestamp": datetime.now()
        })
    
    def query(self, question: str, top_k: int = 5) -> List[MemoryEntry]:
        # Use semantic similarity to find relevant memories
        embeddings = self._embed_batch([question])
        memories = self.db.query(f"""
            SELECT content, timestamp, tags 
            FROM memories 
            WHERE session_id = '{self.session_id}'
        """)
        
        # Score by semantic relevance (simplified—use real embeddings)
        scored = [(m, similarity_score(embeddings[0], m['content'])) 
                  for m in memories]
        top = sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]
        
        return [MemoryEntry(
            content=m[0]['content'],
            timestamp=m[0]['timestamp'],
            tags=m[0]['tags'].split(","),
            metadata={}
        ) for m in top]
    
    def forget(self, tag: str) -> None:
        self.db.delete("memories", 
                      f"session_id = '{self.session_id}' AND tags LIKE '%{tag}%'")

# Usage
adapter = OpenAIMemoryAdapter(session_id="user_123")
adapter.ingest("User prefers summaries in bullet points", tags=["preference"])
adapter.ingest("Company X is in the automotive industry", tags=["company_X"])

results = adapter.query("What format does the user prefer?", top_k=1)
print(results[0].content)  # "User prefers summaries in bullet points"

This adapter pattern enables you to:

  • Swap frameworks without rewriting memory logic
  • Evaluate different frameworks fairly
  • Add cross-cutting concerns like compliance logging
  • Test memory behavior independently
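To make the last point concrete, a backend-free test double that satisfies the same reset/ingest/query/forget surface lets you exercise memory logic without a database. This is a hypothetical sketch (keyword overlap stands in for embedding similarity, and `Entry` is a simplified `MemoryEntry`):

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class Entry:
    """Simplified stand-in for MemoryEntry."""
    content: str
    tags: List[str]
    timestamp: datetime = field(default_factory=datetime.now)

class InMemoryAdapter:
    """Test double exposing the same interface as MemoryAdapter,
    with a plain list instead of a database."""

    def __init__(self):
        self.entries: List[Entry] = []

    def reset(self) -> None:
        self.entries.clear()

    def ingest(self, context: str, tags: List[str] = None) -> None:
        self.entries.append(Entry(context, tags or []))

    def query(self, question: str, top_k: int = 5) -> List[Entry]:
        # Keyword overlap stands in for embedding similarity
        q = set(question.lower().split())
        scored = sorted(self.entries,
                        key=lambda e: len(q & set(e.content.lower().split())),
                        reverse=True)
        return scored[:top_k]

    def forget(self, tag: str) -> None:
        self.entries = [e for e in self.entries if tag not in e.tags]

mem = InMemoryAdapter()
mem.ingest("User prefers summaries in bullet points", tags=["preference"])
mem.ingest("Company X is in the automotive industry", tags=["company_X"])
mem.forget("company_X")
print(mem.query("bullet points format", top_k=1)[0].content)
```

Unit tests written against this double then run unchanged against any real adapter implementation.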

Part 2: Memory Efficiency in Language Models

While agent memory asks "What do we remember?", LLM memory asks a different question: "How do we store and retrieve information efficiently during generation?"

When an LLM generates text, it computes attention over all previous tokens. For long conversations, this becomes a memory and computation bottleneck. A 10-turn conversation with 1,000 tokens per turn creates 10,000 tokens of Key-Value (KV) states that must be stored and repeatedly accessed. Scale to 100 turns on a large deployment, and you're storing gigabytes of attention states per user.
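For concreteness, that footprint can be estimated directly. The dimensions below are assumed (roughly 7B-scale, not tied to a specific model); the estimate is per user, across all layers:

```python
# KV cache size estimate; all dimensions are assumed, not from a specific model.
tokens = 100 * 1000   # 100 turns x 1,000 tokens per turn
hidden_dim = 4096     # num_heads * head_dim
num_layers = 32
dtype_bytes = 2       # BF16
kv_factor = 2         # one K vector and one V vector per token per layer

bytes_total = tokens * hidden_dim * num_layers * dtype_bytes * kv_factor
print(f"{bytes_total / 2**30:.1f} GiB")  # 48.8 GiB
```

At tens of gigabytes per long-running user session, the cache itself, not the model weights, becomes the scaling constraint.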

The KV Cache Problem

To understand the solution, you need to understand the problem:

python
# Simplified attention computation
import torch
import torch.nn.functional as F

def attention(Query, Key, Value, mask=None):
    """Standard attention: O(N²) in sequence length"""
    scores = torch.matmul(Query, Key.transpose(-2, -1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    output = torch.matmul(weights, Value)
    return output

def autoregressive_generation(model, prompt, max_tokens=100):
    """Token-by-token generation with KV cache"""
    
    # Initialize
    input_ids = tokenize(prompt)
    cache = {}  # Stores Key, Value for each layer
    
    generated = []
    
    for step in range(max_tokens):
        # With a warm cache, only the newest token needs a forward pass;
        # K/V states for earlier tokens are read back from the cache
        step_input = input_ids if step == 0 else next_token.unsqueeze(0)
        logits, cache = model(step_input, kv_cache=cache)
        
        # Sample next token
        next_token = sample(logits[-1, :])
        input_ids = torch.cat([input_ids, next_token.unsqueeze(0)])
        generated.append(next_token)
    
    return generated

# Memory cost of KV cache (per layer):
# batch_size=32, seq_len=4096, hidden_dim=4096 (num_heads * head_dim)
# Bytes = batch * seq_len * hidden_dim * 2 (K and V) * bytes_per_value
# FP32 (4 bytes): 32 * 4096 * 4096 * 2 * 4 = 4 GiB per layer
# BF16 (2 bytes): 2 GiB per layer
# FP8  (1 byte):  1 GiB per layer (50% reduction vs. BF16)

The problem: At generation time, you shift from compute-bound (matrix-matrix multiplication) to memory-bound (memory bandwidth limitation). Each new token requires you to read all previous KV states—no way around it.

But you can reduce the size of KV states. Enter three optimization techniques:

Solution 1: Multi-Head Latent Attention (MLA)

Instead of storing full Key and Value vectors for each head, compress them into shared latent representations:

python
class MultiHeadLatentAttention(torch.nn.Module):
    """Compress KV across heads into latent vectors"""
    
    def __init__(self, hidden_dim, num_heads, latent_dim):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_heads = num_heads
        self.head_dim = hidden_dim // num_heads
        self.latent_dim = latent_dim
        
        # Project to latent KV space (shared across heads)
        self.to_latent_kv = torch.nn.Linear(hidden_dim, 2 * latent_dim)
        # Project latent KV back to heads
        self.from_latent_kv = torch.nn.Linear(latent_dim, num_heads * self.head_dim)
    
    def forward(self, hidden_states, kv_cache=None):
        batch, seq_len, hidden = hidden_states.shape
        
        # Compress: hidden_dim -> latent_dim; only the latent K/V are cached
        latent_k, latent_v = self.to_latent_kv(hidden_states).chunk(2, dim=-1)
        
        if kv_cache is not None:
            latent_k = torch.cat([kv_cache["k"], latent_k], dim=1)
            latent_v = torch.cat([kv_cache["v"], latent_v], dim=1)
        new_cache = {"k": latent_k, "v": latent_v}
        
        # Expand: latent_dim -> num_heads * head_dim at attention time
        # (simplified: real MLA uses separate up-projections for K and V)
        keys = self.from_latent_kv(latent_k)
        values = self.from_latent_kv(latent_v)
        return keys, values, new_cache
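The payoff of caching shared latent vectors instead of per-head K/V is easy to quantify. With the dimensions assumed below (hidden_dim=4096, latent_dim=512, neither taken from a specific model), per-token cache per layer shrinks 8x:

```python
# Per-token, per-layer KV cache bytes in BF16 (2 bytes); dimensions assumed.
hidden_dim, latent_dim, dtype_bytes = 4096, 512, 2

full_kv = 2 * hidden_dim * dtype_bytes    # standard cache: K and V, all heads
latent_kv = 2 * latent_dim * dtype_bytes  # MLA cache: one shared latent K and V

print(full_kv, latent_kv, full_kv // latent_kv)  # 16384 2048 8
```

The trade-off is an extra up-projection at attention time: you spend a little compute per step to read and write far less memory, which is the right direction for a memory-bandwidth-bound workload.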


Chalamaiah Chinnam


AI Engineer & Senior Software Engineer

15+ years of enterprise software experience, specializing in applied AI systems, multi-agent architectures, and RAG pipelines. Currently building AI-powered automation at LinkedIn.