Benchmarking agent memory retrieval on LongMemEval‑S — 98% Recall@5, 100% recall by R@23, local embeddings only (all-MiniLM-L6-v2), no LLM, no API key

Reddit r/AI_Agents Tools

Summary

The author shares benchmark results for memweave, a Python library for agent memory, which reaches 98% Recall@5 on LongMemEval-S using only local embeddings and no LLM calls. The post details the methodology, compares the results against mempalace, and reports scores that stay stable across five cross-validation splits, along with a breakdown by question type.

I’ve been working on memweave — a Python library for persistent agent memory backed by plain Markdown files and SQLite. I wanted to share benchmark results on LongMemEval‑S and the methodology behind them.

---

## The benchmark

LongMemEval‑S is a 500‑question retrieval benchmark (Wu et al., 2024). Each question comes with a haystack of ~53 multi‑session conversations. The task: retrieve the session(s) containing the answer.

The benchmark defines 6 question types:

- single‑session (user turn)
- single‑session (assistant turn)
- implicit preference
- multi‑session
- knowledge‑update
- temporal‑reasoning

**Setup**

- Embeddings: `all-MiniLM-L6-v2` (local)
- Indexed content: user turns only
- No LLM calls, no API key, no cloud services at any stage
- Parameters tuned on a 50‑question dev set only; the 450‑question held‑out split is evaluated once with no post‑hoc adjustments

---

## Results — held‑out split (450 questions)

**Single run (best heuristic pipeline: ECR + IDF + CAATB)**

| K | Recall@K | NDCG@K |
|----|----------|--------|
| 1 | 90.00% | 90.00% |
| 3 | 96.44% | 93.45% |
| **5** | **98.00%** | **93.75%** |
| 10 | 99.11% | 93.76% |
| 25 | **100.00%** | 93.83% |

100% recall is reached by **R@23**.

**5‑seed cross‑validated (5 independent stratified splits, each with its own dev sweep)**

| Metric | Mean | ±Std |
|--------|----------|---------|
| R@5 | 97.24% | ±0.12% |
| R@10 | 98.76% | ±0.12% |
| R@25 | 100.00% | ±0.00% |
| NDCG@5 | 92.28% | ±0.69% |

The ±0.12% std on R@5 suggests the result is stable across splits rather than a lucky dev/held‑out partition.

---

## Comparison with mempalace

Mempalace is the closest comparable system — same benchmark, same embedding model, same “user‑turns‑only” indexing. Their best published result on this setup is Hybrid v4.

| System | R@5 | R@10 | NDCG@5 | 100% recall at |
|------------------------------|--------|--------|--------|----------------|
| memweave (ECR + IDF + CAATB) | 98.00% | 99.11% | 93.75% | R@23 |
| mempalace Hybrid v4 | 98.44% | 99.78% | — | R@30 |

Mempalace scores slightly higher on R@5 and R@10. Memweave reaches 100% recall 7 ranks earlier (R@23 vs R@30). For pipelines that retrieve a fixed top‑K and then feed that into a re‑ranker or LLM, a smaller K that still guarantees full coverage can matter in practice.

One methodological difference: mempalace Hybrid v4 injects synthetic “preference” documents at ingestion time — heuristic regex patterns generate additional index entries per session. Memweave reaches 98.00% without any ingestion‑time augmentation: only the original session text is indexed.

---

## How the scores were achieved

The pipeline uses three post‑processors built on memweave’s plugin API (`mem.register_postprocessor(...)`). None of these lives in the core library (for now); they sit on top of a vanilla memweave memory.

**ECR — EntityConfidenceReranker**

Confidence‑adaptive entity boost. Additive, only fires where the vector model is relatively uncertain, and skips preference‑type queries where entity matching is unreliable. It never overrides very high‑confidence matches.

**IDF — IDFKeywordBooster**

Per‑question, corpus‑relative keyword boost. IDF is computed from the 200 retrieved candidates for that specific question, so terms that are common in that haystack score low. It’s multiplicative, so it preserves the relative ordering among strong vector hits while nudging up candidates with rare but important tokens.

**CAATB — ConfidenceAdaptiveTemporalBooster**

Temporal proximity boost for queries expressing time offsets (“4 weeks ago”, “last month”, “a couple of days ago”). No lexical gate — temporal proximity alone fires the boost. The boost is additive and confidence‑adaptive, so it mainly helps medium‑confidence candidates whose dates line up with the query, without pushing already top‑ranked sessions further ahead.

---

## Per question type (held‑out)

| Question type | n | R@5 | NDCG@5 |
|---------------------------|-----|--------|--------|
| single‑session‑user | 63 | 100% | 98.62% |
| knowledge‑update | 69 | 98.55% | 97.25% |
| single‑session‑assistant | 54 | 98.15% | 97.01% |
| multi‑session | 115 | 99.13% | 94.57% |
| temporal‑reasoning | 124 | 97.58% | 90.51% |
| single‑session‑preference | 25 | 88.00% | 77.12% |

A few notes:

- **single‑session‑preference** is the hardest type. Preferences in LongMemEval are often implicit, and the question phrasing frequently doesn’t share vocabulary with the original session. That’s a fundamental challenge for retrieval that operates only on session content.
- **single‑session‑assistant** has a structural ceiling in this setup: only user turns are indexed, so answers that exist *only* in assistant turns can’t be retrieved by any embedding strategy here.

---

## Reproduction

Full pipeline, strategy sources, and step‑by‑step commands are in the first comment.

Happy to answer questions about the methodology, limitations, or any of the strategies above.
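For readers who want to sanity-check the Recall@K and NDCG@K tables, here is a minimal sketch of how these two metrics are commonly computed for session retrieval. It is not memweave's evaluation code. It assumes "any-hit" recall (a question counts as recalled if at least one gold session appears in the top K) and binary relevance for NDCG, and the names `recall_any_at_k`, `ndcg_at_k`, and `mean_metric` are hypothetical. Under those assumptions Recall@1 and NDCG@1 coincide, which is consistent with the matching 90.00% values in the K=1 row.

```python
import math
from typing import Callable, List, Sequence, Set, Tuple


def recall_any_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int) -> float:
    """Return 1.0 if any gold session is in the top-k results, else 0.0.

    Assumption: "any-hit" recall; the post does not state which variant it uses.
    """
    return 1.0 if any(sid in gold_ids for sid in ranked_ids[:k]) else 0.0


def ndcg_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int) -> float:
    """Binary-relevance NDCG@k: DCG of this ranking divided by the ideal DCG."""
    # rank is 0-based, so the first position contributes 1 / log2(2) = 1.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, sid in enumerate(ranked_ids[:k])
        if sid in gold_ids
    )
    # The ideal ranking places every gold session at the top.
    ideal_hits = min(len(gold_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


Run = Tuple[Sequence[str], Set[str]]  # (ranked session ids, gold session ids)


def mean_metric(metric: Callable[[Sequence[str], Set[str], int], float],
                runs: List[Run], k: int) -> float:
    """Average a per-question metric over all questions."""
    return sum(metric(ranked, gold, k) for ranked, gold in runs) / len(runs)


if __name__ == "__main__":
    # Toy data, not benchmark output: three questions with made-up session ids.
    runs = [
        (["s3", "s1", "s7"], {"s1"}),        # gold session at rank 2
        (["s2", "s9", "s4"], {"s2", "s4"}),  # gold sessions at ranks 1 and 3
        (["s5", "s6", "s8"], {"s0"}),        # gold session never retrieved
    ]
    for k in (1, 3):
        print(k, mean_metric(recall_any_at_k, runs, k), mean_metric(ndcg_at_k, runs, k))
```

If the benchmark is scored with a stricter variant (for example, requiring all gold sessions in the top K), only `recall_any_at_k` needs to change; the NDCG function stays the same.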
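The IDF step described in the post can also be illustrated with a small sketch. This is not the `IDFKeywordBooster` source: the tokenizer, the smoothed IDF formula, the normalization, and the `alpha` weight are all assumptions chosen for illustration, and `idf_keyword_boost` is a hypothetical name. It only demonstrates the two properties the post states: IDF is computed over the candidates retrieved for a single question, and the boost multiplies the vector score instead of adding to it.

```python
import math
import re
from typing import Dict, List, Set, Tuple

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> Set[str]:
    """Lowercase and split into a set of alphanumeric tokens (assumed tokenizer)."""
    return set(TOKEN_RE.findall(text.lower()))


def idf_keyword_boost(
    query: str,
    candidates: List[Tuple[str, float]],  # (candidate text, vector score)
    alpha: float = 0.3,                   # assumed boost strength
) -> List[Tuple[str, float]]:
    """Re-score candidates with a multiplicative, per-question IDF boost."""
    docs = [tokenize(text) for text, _ in candidates]
    n = len(docs)
    query_terms = tokenize(query)

    # Smoothed IDF over this question's candidate pool only:
    # a term present in every candidate gets log(1) = 0 and adds no boost.
    idf: Dict[str, float] = {}
    for term in query_terms:
        df = sum(1 for doc in docs if term in doc)
        idf[term] = math.log((n + 1) / (df + 1))
    total_idf = sum(idf.values())

    rescored = []
    for (text, score), doc in zip(candidates, docs):
        matched = sum(idf[t] for t in query_terms if t in doc)
        overlap = matched / total_idf if total_idf > 0 else 0.0
        # Multiplicative: the factor is 1.0 when nothing rare matches.
        rescored.append((text, score * (1.0 + alpha * overlap)))
    return sorted(rescored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy candidates, not benchmark data.
    candidates = [
        ("I took my dog to the park today", 0.62),
        ("My dog Biscuit is a corgi and loves the park", 0.60),
        ("My dog needs a new leash", 0.58),
    ]
    for text, score in idf_keyword_boost("What breed is my dog Biscuit?", candidates):
        print(f"{score:.3f}  {text}")
```

In the toy example, "my" and "dog" appear in every candidate and so carry zero weight, while "Biscuit" is rare within the pool and lifts the second candidate above the first. Candidates that match no rare term keep their original vector score.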