Research · Completed · Sep 2025 – Dec 2025

HyArg

Hybrid multi-retriever orchestration for long-term conversational memory - Recall@5 0.65 → 0.83

Role: Researcher · Duration: 4 months

Overview

LLMs forget context across long conversations, and no single retriever is good at everything - BM25 nails exact phrases, FAISS handles semantics, time-weighted methods catch recency. HyArg is the system I co-built that uses an LLM orchestrator to read each query, extract signals (temporal cues, quoted phrases, entity mentions, query length), and dynamically route to the best of five retrievers. We tested it on LongMemEval and Locomo across LLaMA 3.1 8B, Mistral 7B, and Qwen2.5 7B, and the jump from 0.65 → 0.83 Recall@5 on LongMemEval validated that dynamic retriever selection beats any fixed single-retriever baseline.

The Problem

Large language model conversational systems degrade sharply on long conversation histories - accuracy can drop 30–60% on conversations spanning ~115k tokens. The root cause isn't just context window size; it's that no single retriever is good at every kind of question. BM25 nails exact phrase and entity lookups. Dense retrievers like FAISS capture semantic similarity but miss lexical matches. Time-weighted methods optimize for recency but mishandle non-temporal queries. Conversational memory queries cut across all of these - information extraction, temporal reasoning, multi-session aggregation, and knowledge updates - so any fixed single-retriever baseline leaves recall on the table.

The Approach

HyArg is a hybrid multi-retriever orchestration system built around an LLM-based selector. The pipeline starts with a Signal Extractor that pulls features from each query - temporal cues ("recent", "yesterday", "last"), quoted phrases, entity mentions, query length, and structural patterns. A Prompt Builder then assembles those signals together with an orchestrator guide (decision rubric, tie-breakers, anti-patterns, and few-shot examples) and asks the LLM to choose which retriever should handle this specific query.
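A minimal sketch of what such a signal extractor might look like. The cue list, regexes, and field names here are illustrative assumptions for this write-up, not the actual HyArg implementation:

```python
import re

# Illustrative temporal cue vocabulary; the real system's list may differ.
TEMPORAL_CUES = {"recent", "recently", "yesterday", "last", "ago", "earlier"}

def extract_signals(query: str) -> dict:
    """Pull routing features from a raw query string."""
    tokens = re.findall(r"[a-z']+", query.lower())
    words = query.split()
    return {
        "temporal_cues": sorted(TEMPORAL_CUES.intersection(tokens)),
        "quoted_phrases": re.findall(r'"([^"]+)"', query),
        # Cheap heuristic: capitalized non-initial words as entity mentions.
        "entity_mentions": [w.strip('?.,!"') for w in words[1:]
                            if len(w) > 1 and w[0].isupper()],
        "query_length": len(tokens),
    }

signals = extract_signals('What did I say "last Tuesday" about Alice?')
# temporal_cues -> ['last'], quoted_phrases -> ['last Tuesday'],
# 'Alice' appears in entity_mentions, query_length -> 8
```

These features then go into the prompt alongside the orchestrator guide, so the LLM selector sees both the raw query and its extracted structure.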

The Retriever Pool has five specialized retrievers - BM25 (sparse lexical, k1=1.5, b=0.75), TF-IDF (sparse statistical), FAISS (dense neural with sentence-transformers/MiniLM-L6-v2), SVM (linear-kernel semantic), and Time-Weighted BM25 with exponential decay. The selector outputs structured JSON of the form {retriever, CoT reasoning}, the router dispatches accordingly, and the chosen retriever pulls the top-k documents from session-based indexes.
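The routing step can be sketched as JSON parsing plus a dictionary dispatch. The exact field names, the BM25 fallback, and the stub retrievers are assumptions for illustration:

```python
import json

def route(selector_output: str, pool: dict, query: str, k: int = 5):
    """Parse the selector's JSON ({"retriever": ..., "reasoning": ...})
    and dispatch to the matching retriever; fall back to BM25 when the
    JSON is malformed or names an unknown retriever."""
    try:
        name = json.loads(selector_output).get("retriever", "bm25")
    except json.JSONDecodeError:
        name = "bm25"
    retriever = pool.get(name, pool["bm25"])
    return name, retriever(query, k)

# Stub pool standing in for the five real retrievers; each callable
# takes (query, k) and returns the top-k documents.
pool = {
    "bm25": lambda q, k: [f"bm25 hit {i}" for i in range(k)],
    "faiss": lambda q, k: [f"faiss hit {i}" for i in range(k)],
}
name, docs = route('{"retriever": "faiss", "reasoning": "semantic query"}',
                   pool, "what hobby did I mention?", k=3)
# name -> 'faiss', docs -> ['faiss hit 0', 'faiss hit 1', 'faiss hit 2']
```

Keeping the dispatch this thin means new retrievers can be added to the pool without touching the router, only the orchestrator guide.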

The data pipeline deliberately preserves session boundaries and temporal metadata instead of consolidating dialogue into a single bag - that's what makes time-weighted retrieval and true cross-session synthesis possible.
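For the time-weighted path, this is roughly what exponential-decay re-scoring over session-stamped documents looks like. The decay rate and the multiplicative blend are illustrative choices, not HyArg's actual constants:

```python
import math

def time_weighted_score(base_score: float, age_hours: float,
                        decay_rate: float = 0.01) -> float:
    """Blend a lexical relevance score with exponential recency decay:
    documents from newer sessions keep more of their score, older
    ones fade smoothly rather than being cut off."""
    return base_score * math.exp(-decay_rate * age_hours)

# At decay_rate=0.01, a doc from 24h ago keeps ~79% of its score;
# one from 30 days (720h) ago keeps well under 1%.
```

This only works if each document still carries its session timestamp, which is exactly why the pipeline avoids flattening the dialogue into a single undated bag of text.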

Results

We evaluated HyArg on two benchmarks: LongMemEval (500 questions across 7 categories) and Locomo (1538 questions across 5 categories), using LLaMA 3.1 8B Instruct, Mistral 7B Instruct, and Qwen2.5 7B Instruct as base models.

On LongMemEval, HyArg with Qwen2.5 7B reached Recall@5 of 0.83 - an absolute gain of +18 percentage points over the best single retriever (0.65 → 0.83). On Locomo, HyArg with LLaMA 3.1 8B Instruct reached Recall@5 of 0.36 (+4 points over the best single-retriever baseline at 0.32). Both results validate the central thesis: dynamic retriever selection based on query characteristics outperforms any fixed retriever across diverse conversational memory tasks.

Lessons Learned

Memory in conversational AI isn't just about storing information - it's about routing the right kind of question to the right kind of retriever. The orchestrator guide (rules + few-shot examples) ended up being as important as the retrievers themselves: most of the gains came from getting the routing right.

Preserving session boundaries and temporal metadata instead of pre-consolidating dialogue mattered more than expected - it's what unlocked the time-weighted and multi-session aggregation paths in the first place.

Technology Stack

Python · PyTorch · LangChain · FAISS · Qwen2.5-7B · LLaMA-3.1-8B