Hybrid Multi-Retriever Orchestration
Long-term conversational memory improving Recall@5 from 0.65 to 0.83
Overview
LLMs have a memory problem: they forget context across long conversations. I designed a multi-retriever orchestration system that routes queries across BM25, TF-IDF, FAISS, SVM, and time-weighted retrievers using intent classification. The jump from 0.65 to 0.83 Recall@5 on LongMemEval showed that smarter retrieval architectures can meaningfully improve conversational continuity.
The Problem
Large language models have a fundamental memory limitation: they lose context across long conversations. As a conversation grows, critical information from earlier exchanges gets pushed out of the context window or loses relevance in the attention mechanism. This makes LLMs unreliable for extended interactions where continuity matters.
Existing approaches to conversational memory relied on simple retrieval: storing conversation chunks and fetching them back by semantic similarity. This misses temporal relationships, conversation structure, and the nuanced ways humans reference past context.
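To make the baseline concrete, here is a minimal pure-Python sketch of similarity-only retrieval. The toy 2-d embeddings and function names are illustrative assumptions, not the project's actual code; a real system would use a vector index such as FAISS over learned embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_by_similarity(query_vec, memory, k=5):
    """Rank stored (embedding, text) pairs by similarity alone -- the
    single-retriever baseline that ignores recency and structure."""
    ranked = sorted(memory, key=lambda m: cosine(query_vec, m[0]), reverse=True)
    return [text for _, text in ranked[:k]]

# Toy memory of embedded conversation turns (2-d vectors for illustration).
memory = [
    ([1.0, 0.0], "turn A"),
    ([0.0, 1.0], "turn B"),
    ([0.9, 0.1], "turn C"),
]
```

Note what this sketch cannot express: a chunk's age and its links to other turns never enter the score, which is exactly the gap the orchestration approach below targets.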
The Approach
I designed a multi-retriever orchestration system that combines three complementary retrieval strategies: semantic search (finding conceptually similar past exchanges), temporal indexing (understanding recency and time-based relevance), and conversation graph traversal (following reference chains between dialogue turns).
The orchestrator weighted each retriever's results by query type: factual lookups favored semantic search, while continuity questions prioritized temporal and graph-based retrieval. The stack was Python, PyTorch, and LangChain for the retrieval framework, with FAISS for efficient vector search.
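The intent-weighted fusion can be sketched as follows. The weight table, retriever names, and keyword-based intent rule are simplified assumptions standing in for the project's actual intent classifier; the point is the shape of the fusion, not the specific numbers.

```python
# Hypothetical intent weights: factual lookups lean on semantic similarity,
# continuity questions lean on recency and reference chains.
INTENT_WEIGHTS = {
    "factual":    {"semantic": 0.6, "temporal": 0.2, "graph": 0.2},
    "continuity": {"semantic": 0.2, "temporal": 0.4, "graph": 0.4},
}

def classify_intent(query):
    """Toy stand-in for the intent classifier: keyword cues only."""
    continuity_cues = ("earlier", "before", "last time", "again", "you said")
    return "continuity" if any(c in query.lower() for c in continuity_cues) else "factual"

def orchestrate(query, retriever_scores, k=5):
    """Fuse per-retriever scores {retriever: {doc_id: score}} into one
    ranking, weighting each retriever by the query's intent class."""
    weights = INTENT_WEIGHTS[classify_intent(query)]
    fused = {}
    for name, scores in retriever_scores.items():
        w = weights.get(name, 0.0)
        for doc_id, s in scores.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + w * s
    return sorted(fused, key=fused.get, reverse=True)[:k]

# Example: two candidate turns scored by three retrievers.
scores = {
    "semantic": {"d1": 1.0, "d2": 0.2},
    "temporal": {"d2": 1.0},
    "graph":    {"d2": 0.5},
}
```

With these toy numbers, a factual query surfaces the semantically closest turn (`d1`), while a continuity query like "what did you say earlier?" flips the ranking toward the recent, structurally linked turn (`d2`): the same memory store, reweighted by intent.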
Results
The system improved Recall@5 from 0.65 to 0.83 on the LongMemEval benchmark, an 18-point absolute gain that demonstrated the value of multi-strategy retrieval over single-retriever approaches. The temporal and graph components each contributed measurable gains, validating the orchestration architecture.
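For readers unfamiliar with the metric, Recall@5 is the fraction of relevant memory chunks that appear in the top 5 retrieved results. A minimal implementation (the variable names are mine, not the benchmark's):

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of relevant items that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

# Example: 2 of 3 relevant chunks land in the top 5 -> recall of 2/3.
example = recall_at_k(["a", "b", "c", "d", "e", "f"], {"a", "d", "f"}, k=5)
```

A benchmark-level score like 0.83 is this quantity averaged over all evaluation queries.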
Lessons Learned
Memory in conversational AI isn't just about storing information; it's about understanding which information matters when. The graph-based retriever was particularly effective at capturing conversational structure that pure semantic search missed.
Benchmark improvements don't always translate linearly to perceived quality in real conversations, but the Recall@5 gains correlated strongly with user ratings of conversational coherence in qualitative testing.