How to Add Memory and Context to Your AI Agent in 2026

Dev Nakamura 12 min read Updated May 25, 2026

TL;DR

  • Build an AI agent that remembers conversation history across multiple interactions
  • Implement three memory strategies: short-term (conversation buffer), working memory (summarization), and long-term (vector retrieval)
  • Use Redis for persistent storage and ChromaDB for semantic memory search
  • By the end, you’ll have a customer support agent that recalls previous conversations, user preferences, and contextual details from weeks ago

Prerequisites

Before we start, make sure you have:

Required:

  • Python 3.11 or higher
  • An OpenAI API key (get one at platform.openai.com)
  • Docker Desktop installed (for Redis)
  • 2GB free disk space

Knowledge:

  • Basic Python and async/await patterns
  • Understanding of API calls and JSON
  • Familiarity with environment variables

Time: ~45 minutes to complete

Cost: OpenAI API usage (~$0.10-0.50 for testing), Redis and ChromaDB are free locally

What We’re Building

We’re building a customer support AI agent that maintains three types of memory:

  1. Short-term memory: Recent conversation turns (last 10 messages)
  2. Working memory: Compressed summaries of longer conversations
  3. Long-term memory: Semantic search over all past interactions

Architecture flow:

User message → Load conversation history (Redis) → Retrieve relevant past context (ChromaDB) → Build prompt with context → LLM generates response → Save to memory stores → Return response

This pattern is essential for production AI agents—without memory, every interaction starts from zero, frustrating users and wasting context windows with repeated information.

Step 1: Set Up the Project Environment

Create a new project directory and set up a Python virtual environment:

mkdir ai-agent-memory
cd ai-agent-memory
python3.11 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install required dependencies:

pip install langchain==0.1.20 \
            langchain-openai==0.1.8 \
            langchain-community==0.0.38 \
            redis==5.0.4 \
            chromadb==0.4.24 \
            python-dotenv==1.0.1 \
            tiktoken==0.7.0

Create a .env file for your API key:

echo "OPENAI_API_KEY=your_actual_key_here" > .env

Expected output: No errors. Verify with pip list | grep langchain and you should see the installed packages.

Step 2: Start Redis for Conversation Storage

Redis will store our conversation history with fast key-value lookups by user ID.

Start a Redis container:

docker run -d \
  --name redis-agent-memory \
  -p 6379:6379 \
  redis:7.2-alpine \
  redis-server --appendonly yes

Verify Redis is running:

docker exec redis-agent-memory redis-cli ping

Expected output: PONG

This gives us a persistent store where conversation history survives application restarts. The --appendonly yes flag enables persistence to disk.

Step 3: Build the Short-Term Memory Manager

Create memory_manager.py to handle conversation buffers:

import json
import redis
from datetime import datetime, timedelta
from typing import List, Dict
import os

class ConversationMemory:
    def __init__(self, redis_host="localhost", redis_port=6379, max_messages=10):
        self.redis_client = redis.Redis(
            host=redis_host,
            port=redis_port,
            decode_responses=True
        )
        self.max_messages = max_messages
        self.ttl = 60 * 60 * 24 * 30  # 30 days in seconds
    
    def add_message(self, user_id: str, role: str, content: str):
        """Add a message to the conversation history."""
        key = f"conversation:{user_id}"
        message = {
            "role": role,
            "content": content,
            "timestamp": datetime.now().isoformat()
        }
        
        # Add message to the list
        self.redis_client.lpush(key, json.dumps(message))
        
        # Trim to max_messages (keep most recent)
        self.redis_client.ltrim(key, 0, self.max_messages - 1)
        
        # Set expiration
        self.redis_client.expire(key, self.ttl)
    
    def get_conversation(self, user_id: str) -> List[Dict]:
        """Retrieve conversation history for a user."""
        key = f"conversation:{user_id}"
        messages = self.redis_client.lrange(key, 0, -1)
        
        # Redis returns most recent first, so reverse for chronological order
        return [json.loads(msg) for msg in reversed(messages)]
    
    def clear_conversation(self, user_id: str):
        """Clear conversation history for a user."""
        key = f"conversation:{user_id}"
        self.redis_client.delete(key)

This class handles our short-term memory:

  • lpush adds messages to the front of a list
  • ltrim automatically removes old messages beyond our limit
  • Messages expire after 30 days to comply with data retention policies
  • We reverse the list when retrieving so the LLM sees chronological order

Step 4: Create the Vector Store for Long-Term Memory

Create vector_memory.py for semantic retrieval:

import chromadb
from chromadb.config import Settings
from langchain_openai import OpenAIEmbeddings
from typing import List, Dict
import hashlib

class VectorMemory:
    def __init__(self, collection_name="agent_memory", persist_directory="./chroma_db"):
        self.client = chromadb.Client(Settings(
            persist_directory=persist_directory,
            anonymized_telemetry=False
        ))
        
        # Get or create collection
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            metadata={"hnsw:space": "cosine"}
        )
        
        self.embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    
    def add_interaction(self, user_id: str, interaction: str, metadata: Dict = None):
        """Store an interaction for semantic retrieval."""
        # Create unique ID from content hash
        doc_id = hashlib.md5(f"{user_id}:{interaction}".encode()).hexdigest()
        
        # Generate embedding
        embedding = self.embeddings.embed_query(interaction)
        
        # Prepare metadata
        meta = metadata or {}
        meta["user_id"] = user_id
        meta["timestamp"] = meta.get("timestamp", "")
        
        # Store in ChromaDB
        self.collection.add(
            ids=[doc_id],
            embeddings=[embedding],
            documents=[interaction],
            metadatas=[meta]
        )
    
    def retrieve_relevant_context(self, user_id: str, query: str, n_results: int = 3) -> List[Dict]:
        """Retrieve semantically similar past interactions."""
        query_embedding = self.embeddings.embed_query(query)
        
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=n_results,
            where={"user_id": user_id}
        )
        
        if not results["documents"]:
            return []
        
        # Format results
        context_items = []
        for doc, metadata in zip(results["documents"][0], results["metadatas"][0]):
            context_items.append({
                "content": doc,
                "metadata": metadata
            })
        
        return context_items

Key concepts here:

  • text-embedding-3-small generates 1536-dimensional vectors for semantic search
  • ChromaDB uses HNSW (Hierarchical Navigable Small World) for fast approximate nearest neighbor search
  • We filter by user_id to ensure users only retrieve their own memories
  • The hash-based ID prevents duplicate storage of identical interactions

Step 5: Implement Working Memory with Summarization

Create working_memory.py for conversation summarization:

from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from typing import List, Dict

class WorkingMemory:
    def __init__(self, model="gpt-4o-mini"):
        self.llm = ChatOpenAI(model=model, temperature=0.3)
        
        self.summarize_prompt = ChatPromptTemplate.from_messages([
            ("system", "You are an expert at summarizing conversations. Extract key facts, preferences, and important details."),
            ("user", "Summarize this conversation history, focusing on facts about the user and context that would be helpful for future interactions:\n\n{conversation}")
        ])
    
    def summarize_conversation(self, messages: List[Dict]) -> str:
        """Create a compressed summary of conversation history."""
        if len(messages) < 5:
            return ""  # Not enough to summarize
        
        # Format messages for summarization
        conversation_text = "\n".join([
            f"{msg['role'].upper()}: {msg['content']}"
            for msg in messages
        ])
        
        chain = self.summarize_prompt | self.llm
        response = chain.invoke({"conversation": conversation_text})
        
        return response.content
    
    def should_summarize(self, messages: List[Dict], threshold: int = 15) -> bool:
        """Determine if conversation should be summarized."""
        return len(messages) >= threshold

Working memory compresses long conversations into summaries, preventing context window overflow. We use:

  • gpt-4o-mini for cost-effective summarization
  • A threshold of 15 messages before triggering summarization
  • Low temperature (0.3) for consistent, factual summaries

Step 6: Build the Agent with Integrated Memory

Create agent.py to tie everything together:

import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.schema import HumanMessage, AIMessage, SystemMessage
from memory_manager import ConversationMemory
from vector_memory import VectorMemory
from working_memory import WorkingMemory

load_dotenv()

class MemoryAgent:
    def __init__(self):
        self.conversation_memory = ConversationMemory(max_messages=10)
        self.vector_memory = VectorMemory()
        self.working_memory = WorkingMemory()
        self.llm = ChatOpenAI(model="gpt-4o", temperature=0.7)
        
        self.system_prompt = """You are a helpful customer support agent. 
        Use the conversation history and relevant context to provide personalized assistance.
        If you recall previous interactions or user preferences, mention them naturally.
        Be concise but friendly."""
    
    def chat(self, user_id: str, message: str) -> str:
        """Process a user message with full memory integration."""
        
        # 1. Retrieve short-term conversation history
        conversation_history = self.conversation_memory.get_conversation(user_id)
        
        # 2. Check if we need to summarize
        summary = ""
        if self.working_memory.should_summarize(conversation_history):
            summary = self.working_memory.summarize_conversation(conversation_history)
            # Store summary in vector memory
            self.vector_memory.add_interaction(
                user_id=user_id,
                interaction=summary,
                metadata={"type": "summary"}
            )
            # Clear old history and start fresh with summary
            self.conversation_memory.clear_conversation(user_id)
            conversation_history = []
        
        # 3. Retrieve relevant long-term context
        relevant_context = self.vector_memory.retrieve_relevant_context(
            user_id=user_id,
            query=message,
            n_results=3
        )
        
        # 4. Build prompt with all memory types
        context_str = ""
        if relevant_context:
            context_str = "\n\nRelevant past context:\n" + "\n".join([
                f"- {item['content']}"
                for item in relevant_context
            ])
        
        if summary:
            context_str += f"\n\nRecent conversation summary: {summary}"
        
        # 5. Format conversation history
        messages = [SystemMessage(content=self.system_prompt + context_str)]
        
        for msg in conversation_history:
            if msg["role"] == "user":
                messages.append(HumanMessage(content=msg["content"]))
            else:
                messages.append(AIMessage(content=msg["content"]))
        
        messages.append(HumanMessage(content=message))
        
        # 6. Generate response
        response = self.llm.invoke(messages)
        response_text = response.content
        
        # 7. Save to memory stores
        self.conversation_memory.add_message(user_id, "user", message)
        self.conversation_memory.add_message(user_id, "assistant", response_text)
        
        # Store significant interactions in vector memory
        interaction = f"User: {message}\nAgent: {response_text}"
        self.vector_memory.add_interaction(user_id, interaction)
        
        return response_text

This agent orchestrates all three memory systems:

  1. Loads recent conversation from Redis
  2. Checks if summarization is needed
  3. Retrieves semantically relevant past interactions
  4. Builds a context-rich prompt
  5. Generates a response with full context awareness
  6. Saves the new interaction to all memory stores

Step 7: Create a Test Interface

Create main.py to interact with your agent:

from agent import MemoryAgent
import sys

def main():
    agent = MemoryAgent()
    user_id = "test_user_123"
    
    print("AI Agent with Memory - Type 'quit' to exit\n")
    print("Try having a conversation, then restart and reference earlier topics!\n")
    
    while True:
        try:
            user_input = input("You: ").strip()
            
            if user_input.lower() in ["quit", "exit", "q"]:
                print("Goodbye!")
                break
            
            if not user_input:
                continue
            
            response = agent.chat(user_id, user_input)
            print(f"\nAgent: {response}\n")
            
        except KeyboardInterrupt:
            print("\n\nGoodbye!")
            sys.exit(0)
        except Exception as e:
            print(f"\nError: {e}\n")

if __name__ == "__main__":
    main()

Run your agent:

python main.py

Try this conversation flow to test memory:

You: Hi, my name is Alex and I'm interested in your premium plan
Agent: [responds and remembers name]

You: What are the pricing options?
Agent: [provides pricing]

You: I need to think about it
Agent: [acknowledges]

[Exit and restart the program]

You: I'm back, still considering the premium plan
Agent: [Should reference your name Alex and the previous pricing discussion]

Expected behavior: The agent recalls your name, the premium plan interest, and previous pricing discussion across sessions.

Testing Your Implementation

Create test_memory.py to verify all memory types work:

from agent import MemoryAgent
import time

def test_short_term_memory():
    """Test that agent remembers within a conversation."""
    agent = MemoryAgent()
    user_id = "test_short_term"
    
    # Set a preference
    agent.chat(user_id, "I prefer morning appointments")
    
    # Reference it immediately
    response = agent.chat(user_id, "What time should we schedule?")
    
    assert "morning" in response.lower(), "Agent didn't recall morning preference"
    print("✓ Short-term memory test passed")

def test_long_term_memory():
    """Test that agent retrieves semantically relevant past context."""
    agent = MemoryAgent()
    user_id = "test_long_term"
    
    # Create some history
    agent.chat(user_id, "I have a dog named Max")
    agent.chat(user_id, "Thanks")
    
    # Clear short-term but keep vector memory
    agent.conversation_memory.clear_conversation(user_id)
    
    # Wait for embedding to process
    time.sleep(1)
    
    # Ask related question
    response = agent.chat(user_id, "Do you remember my pet?")
    
    assert "max" in response.lower() or "dog" in response.lower(), "Agent didn't retrieve long-term memory"
    print("✓ Long-term memory test passed")

def test_conversation_persistence():
    """Test that conversations persist across agent instances."""
    user_id = "test_persistence"
    
    # First agent instance
    agent1 = MemoryAgent()
    agent1.chat(user_id, "My account number is 12345")
    
    # New agent instance (simulates restart)
    agent2 = MemoryAgent()
    response = agent2.chat(user_id, "What's my account number?")
    
    assert "12345" in response, "Agent didn't persist conversation to Redis"
    print("✓ Persistence test passed")

if __name__ == "__main__":
    print("Running memory tests...\n")
    test_short_term_memory()
    test_long_term_memory()
    test_conversation_persistence()
    print("\nAll tests passed!")

Run the tests:

python test_memory.py

Expected output:

Running memory tests...

✓ Short-term memory test passed
✓ Long-term memory test passed
✓ Persistence test passed

All tests passed!

Common Issues & Fixes

Problem: redis.exceptions.ConnectionError: Error connecting to Redis Cause: Redis container isn’t running or wrong port Fix: Check Docker and restart Redis:

docker ps | grep redis
docker start redis-agent-memory

Problem: openai.error.AuthenticationError: Incorrect API key Cause: Missing or invalid OpenAI API key Fix: Verify your .env file and reload:

cat .env  # Should show OPENAI_API_KEY=sk-...
source venv/bin/activate  # Reload environment

Problem: Agent doesn’t recall previous conversations after restart Cause: Using different user IDs or Redis data cleared Fix: Ensure consistent user IDs and check Redis has data:

docker exec redis-agent-memory redis-cli KEYS "conversation:*"

Problem: chromadb.errors.InvalidDimensionException Cause: Embedding model mismatch between sessions Fix: Delete the ChromaDB directory and restart:

rm -rf ./chroma_db
python main.py  # Will recreate with correct dimensions

Problem: Agent responses are too slow (>5 seconds) Cause: Vector search with large databases Fix: Reduce n_results in retrieval and add result caching:

relevant_context = self.vector_memory.retrieve_relevant_context(
    user_id=user_id,
    query=message,
    n_results=2  # Reduced from 3
)

Next Steps

You now have a production-ready memory system for AI agents. Here’s how to extend it:

Add user preference extraction: Parse conversations for explicit preferences (timezone, communication style) and store them as structured metadata:

def extract_preferences(self, message: str) -> Dict:
    # Use GPT to extract structured preferences
    # Store in Redis hash for fast lookup

Implement memory decay: Weight older memories less in retrieval by adding time-based scoring:

# In vector_memory.py
from datetime import datetime

def retrieve_with_decay(self, query, decay_factor=0.95):
    results = self.collection.query(...)
    # Apply exponential decay based on timestamp
    # More recent = higher score

Add multi-user memory sharing: For team contexts, allow agents to access relevant interactions from other team members while respecting privacy:

def retrieve_team_context(self, team_id: str, query: str):
    # Filter by team_id instead of user_id
    # Exclude PII-tagged interactions

Monitor memory costs: Track embedding and storage costs by adding telemetry:

import logging

class VectorMemory:
    def __init__(self, ...):
        self.embeddings_count = 0
        
    def add_interaction(self, ...):
        self.embeddings_count += 1
        logging.info(f"Total embeddings: {self.embeddings_count}")

Challenge: Implement “memory importance scoring” where the agent decides which interactions are worth storing long-term vs. ephemeral chitchat. Use a classifier to filter out low-value exchanges before adding to vector memory.

For more on AI agent patterns, check out our tutorial on “Building Production-Ready RAG Pipelines” and “Streaming Responses with LangChain Agents”.

Share:

Related Posts

tutorial 18 min read

How to Build a ReAct Agent with Claude and Tool Use in 2026

Learn to build a ReAct (Reasoning + Acting) agent that thinks through problems step-by-step using Claude's tool calling capabilities. This tutorial walks you through creating an agent that can use web search, perform calculations, and read files to answer complex questions.

Dev Nakamura