Why Context Windows Are Not Enough
Even 200k-token context windows don't solve the AI memory problem. Filling a context window costs money (at typical API pricing of a few dollars per million input tokens, a fully packed 200k-token prompt costs tens of cents per request, which compounds quickly at scale) and degrades output quality as the model struggles to attend to relevant information buried in a massive context. More fundamentally, context windows reset with every new conversation — they're working memory, not long-term memory.
External Memory Architectures
External memory architectures move AI memory outside the model's context window and into a dedicated retrieval system. Instead of stuffing the full history into the prompt, the system retrieves only the most relevant pieces on each query and injects a compressed summary. This allows arbitrarily large memory stores without context window limitations and dramatically reduces per-query cost.
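A minimal sketch of this retrieve-then-inject loop, with a toy keyword-overlap scorer standing in for a real retriever (the `MemoryStore` class and its methods are illustrative, not a real library):

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    """Toy external memory: stores entries, retrieves by keyword overlap."""
    entries: list[str] = field(default_factory=list)

    def add(self, text: str) -> None:
        self.entries.append(text)

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        # Score each stored entry by word overlap with the query.
        q = set(query.lower().split())
        scored = sorted(self.entries,
                        key=lambda e: len(q & set(e.lower().split())),
                        reverse=True)
        return scored[:k]

    def build_prompt(self, query: str) -> str:
        # Inject only the retrieved memories, never the whole history.
        context = "\n".join(self.retrieve(query))
        return f"Relevant memory:\n{context}\n\nUser: {query}"

store = MemoryStore()
store.add("Q3 sales target is $2M")
store.add("Team offsite scheduled for October")
print(store.build_prompt("what is our sales target?"))
```

The prompt stays a fixed size no matter how many entries the store accumulates, which is the core economic argument for external memory.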
The simplest external memory is a flat text database with full-text search. More sophisticated systems use dense vector embeddings for semantic retrieval. State-of-the-art systems combine sparse (BM25) and dense (vector) retrieval with re-ranking to maximize relevance at each query.
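A common way to combine sparse and dense rankings is reciprocal rank fusion (RRF), which rewards documents that rank well in either list. A sketch with the two retrievers stubbed out as pre-ranked lists:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each doc scores sum of 1/(k + rank) across lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

sparse = ["doc_a", "doc_b", "doc_c"]   # e.g. BM25 order
dense = ["doc_b", "doc_d", "doc_a"]    # e.g. vector-similarity order
print(reciprocal_rank_fusion([sparse, dense]))
# → ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Note that `doc_b` wins by appearing near the top of both lists; in production the fused list is usually passed to a cross-encoder re-ranker as a final step.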
Knowledge Graphs for AI Memory
Knowledge graphs represent AI memory as a network of entities and relationships rather than a bag of text chunks. When an AI conversation mentions "our Q3 sales target of $2M," a knowledge graph system extracts the entities (Q3, sales target, $2M), creates nodes, and links them with the appropriate relationship. Subsequent queries about "revenue goals" or "quarterly targets" traverse the graph rather than searching raw text.
Knowledge graphs excel at tracking facts, people, and their relationships over time. They can represent change — "the budget was $500k in Q2 and increased to $750k in Q3" — in a way that vector search over flat text cannot reliably surface.
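The budget example above can be sketched as timestamped triples, so the graph answers both current-value and history queries. The schema and class names here are illustrative, not a real graph database API:

```python
class KnowledgeGraph:
    """Store (subject, relation, object) triples tagged with a validity period."""

    def __init__(self) -> None:
        self.triples: list[tuple[str, str, str, str]] = []

    def add(self, subj: str, rel: str, obj: str, period: str) -> None:
        self.triples.append((subj, rel, obj, period))

    def query(self, subj: str, rel: str) -> list[tuple[str, str]]:
        # Return the full history for a subject/relation pair.
        return [(obj, period) for s, r, obj, period in self.triples
                if s == subj and r == rel]

kg = KnowledgeGraph()
kg.add("budget", "has_value", "$500k", "Q2")
kg.add("budget", "has_value", "$750k", "Q3")
print(kg.query("budget", "has_value"))
# → [('$500k', 'Q2'), ('$750k', 'Q3')]
```

Because each fact carries its period, the change from $500k to $750k is explicit in the data rather than something a retriever must infer from two loosely related text chunks.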
RAG-Based Memory Systems
Retrieval-Augmented Generation (RAG) for AI memory works by embedding every conversation chunk, storing vectors in a database, and retrieving top-k relevant chunks at query time. The retrieved chunks are added to the system prompt as dynamic context. This approach is straightforward to implement, has excellent tooling support (LangChain, LlamaIndex), and scales to millions of chunks with appropriate infrastructure.
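The embed–store–retrieve loop can be sketched end to end with a toy bag-of-words embedding standing in for a real embedding model (which, unlike this toy, would also match paraphrases such as "revenue goals"):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system calls an embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class VectorStore:
    def __init__(self) -> None:
        self.chunks: list[tuple[str, Counter]] = []

    def add(self, chunk: str) -> None:
        self.chunks.append((chunk, embed(chunk)))

    def top_k(self, query: str, k: int = 2) -> list[str]:
        qv = embed(query)
        ranked = sorted(self.chunks, key=lambda c: cosine(qv, c[1]),
                        reverse=True)
        return [chunk for chunk, _ in ranked[:k]]

store = VectorStore()
store.add("Q3 sales target is $2M")
store.add("The office plant needs watering")

# Retrieved chunks become dynamic context in the system prompt.
retrieved = store.top_k("sales target for Q3", k=1)
prompt = "Context:\n" + "\n".join(retrieved) + "\n\nAnswer the user's question."
print(prompt)
```

Swapping `embed` for a real model and `VectorStore` for a vector database is exactly what frameworks like LangChain and LlamaIndex package up.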
Hybrid Approaches
Production AI memory systems at scale typically combine multiple retrieval strategies: episodic memory (recent conversations retrieved with recency weighting), semantic memory (factual knowledge stored in a knowledge graph), and associative memory (vector search over all historical content). The combination covers the failure modes of each individual approach.
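A sketch of how the three tiers might be fused into one candidate list, with each tier stubbed out as pre-scored results and an exponential recency decay applied to the episodic tier (all names and the decay scheme are illustrative assumptions):

```python
import time

def recency_weight(timestamp: float, half_life_s: float = 86400.0) -> float:
    """Exponential decay: a memory loses half its weight every half-life."""
    age = time.time() - timestamp
    return 0.5 ** (age / half_life_s)

def fuse(episodic, semantic, associative, k: int = 3) -> list[str]:
    """Each tier yields (text, score) pairs; episodic scores decay with age."""
    candidates = [(text, score * recency_weight(ts))
                  for text, score, ts in episodic]
    candidates += semantic + associative
    candidates.sort(key=lambda c: c[1], reverse=True)
    return [text for text, _ in candidates[:k]]

now = time.time()
episodic = [("Yesterday we discussed the Q3 target", 0.9, now - 86400)]
semantic = [("Q3 sales target is $2M", 0.8)]
associative = [("2022 targets were missed by 10%", 0.4)]
print(fuse(episodic, semantic, associative))
```

Here the day-old episodic memory decays from 0.9 to roughly 0.45, so the knowledge-graph fact outranks it; a still-fresh episodic memory would win instead, which is the behavior the tiering is meant to produce.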
Choosing the Right Architecture
For personal use (an individual's AI conversation history), a well-indexed vector database with semantic search is typically sufficient. For team use (shared knowledge across a department), a knowledge graph adds significant value for tracking facts and decisions. For enterprise use (organization-wide AI memory at scale), a hybrid system with separate tiers for recent, mid-term, and archival memory is the appropriate architecture.