The author shares benchmark results for memweave, a Python library for agent memory, achieving 98% Recall@5 on LongMemEval-S using only local embeddings without LLM calls. The post details the methodology and compares performance against mempalace, highlighting stable retrieval across different question types.
I've been working on memweave, a Python library for persistent agent memory backed by plain Markdown files and SQLite. I wanted to share benchmark results on LongMemEval-S and the methodology behind them.

---

## The benchmark

LongMemEval-S is a 500-question retrieval benchmark (Wu et al., 2024). Each question comes with a haystack of ~53 conversation sessions. The task: retrieve the session(s) containing the answer.

The benchmark defines 6 question types:

- single-session (user turn)
- single-session (assistant turn)
- implicit preference
- multi-session
- knowledge-update
- temporal-reasoning

**Setup**

- Embeddings: `all-MiniLM-L6-v2` (local)
- Indexed content: user turns only
- No LLM calls, no API key, no cloud services at any stage
- Parameters tuned on a 50-question dev set only; the 450-question held-out split is evaluated once with no post-hoc adjustments

---

## Results: held-out split (450 questions)

**Single run (best heuristic pipeline: ECR + IDF + CAATB)**

| K | Recall@K | NDCG@K |
|----|----------|--------|
| 1 | 90.00% | 90.00% |
| 3 | 96.44% | 93.45% |
| **5** | **98.00%** | **93.75%** |
| 10 | 99.11% | 93.76% |
| 25 | **100.00%** | 93.83% |

100% recall is first reached at **R@23**.

**5-seed cross-validated (5 independent stratified splits, each with its own dev sweep)**

| Metric | Mean | ±Std |
|--------|----------|---------|
| R@5 | 97.24% | ±0.12% |
| R@10 | 98.76% | ±0.12% |
| R@25 | 100.00% | ±0.00% |
| NDCG@5 | 92.28% | ±0.69% |

The ±0.12% std on R@5 suggests the result is stable across splits rather than the product of a lucky dev/held-out partition.

---

## Comparison with mempalace

Mempalace is the closest comparable system: same benchmark, same embedding model, same "user-turns-only" indexing. Their best published result on this setup is Hybrid v4.

| System | R@5 | R@10 | NDCG@5 | 100% recall at |
|------------------------------|--------|--------|--------|----------------|
| memweave (ECR + IDF + CAATB) | 98.00% | 99.11% | 93.75% | R@23 |
| mempalace Hybrid v4 | 98.44% | 99.78% | n/a | R@30 |

Mempalace scores slightly higher on R@5 and R@10 (by 0.44 and 0.67 points). Memweave reaches 100% recall 7 ranks earlier (R@23 vs R@30). For pipelines that retrieve a fixed top-K and feed it into a re-ranker or LLM, a smaller K that still gives full coverage can matter in practice.

One methodological difference: mempalace Hybrid v4 injects synthetic "preference" documents at ingestion time, using heuristic regex patterns to generate additional index entries per session. Memweave reaches 98.00% R@5 without any ingestion-time augmentation; only the original session text is indexed.

---

## How the scores were achieved

The pipeline uses three post-processors built on memweave's plugin API (`mem.register_postprocessor(...)`). None of them lives in the core library (for now); they sit on top of a vanilla memweave memory.

**ECR: EntityConfidenceReranker**

Confidence-adaptive entity boost. It is additive, fires only where the vector model is relatively uncertain, and skips preference-type queries, where entity matching is unreliable. It never overrides very high-confidence matches.

**IDF: IDFKeywordBooster**

Per-question, corpus-relative keyword boost. IDF is computed from the 200 retrieved candidates for that specific question, so terms that are common in that haystack score low. It's multiplicative, so it preserves the relative ordering among strong vector hits while nudging up candidates with rare but important tokens.

**CAATB: ConfidenceAdaptiveTemporalBooster**

Temporal proximity boost for queries expressing time offsets ("4 weeks ago", "last month", "a couple of days ago"). There is no lexical gate: temporal proximity alone fires the boost.
The boost is additive and confidence-adaptive, so it mainly helps medium-confidence candidates whose dates line up with the query, without pushing already top-ranked sessions further ahead.

---

## Per question type (held-out)

| Question type | n | R@5 | NDCG@5 |
|---------------------------|-----|--------|--------|
| single-session-user | 63 | 100% | 98.62% |
| knowledge-update | 69 | 98.55% | 97.25% |
| single-session-assistant | 54 | 98.15% | 97.01% |
| multi-session | 115 | 99.13% | 94.57% |
| temporal-reasoning | 124 | 97.58% | 90.51% |
| single-session-preference | 25 | 88.00% | 77.12% |

A few notes:

- **single-session-preference** is the hardest type. Preferences in LongMemEval are often implicit, and the question phrasing frequently doesn't share vocabulary with the original session. That's a fundamental challenge for retrieval that operates only on session content.
- **single-session-assistant** has a structural ceiling in this setup: only user turns are indexed, so answers that exist *only* in assistant turns can't be retrieved by any embedding strategy here.

---

## Reproduction

Full pipeline, strategy sources, and step-by-step commands are in the first comment.

Happy to answer questions about the methodology, limitations, or any of the strategies above.
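
For anyone who wants to sanity-check the Recall@K and NDCG@K columns, here is a minimal sketch of how such metrics are commonly computed for session retrieval. It is not the evaluation script behind the numbers above. It assumes an "any-hit" recall (a question counts as recalled if at least one gold session is in the top K) and binary-relevance NDCG with the usual log2 discount; that reading would be consistent with R@1 and NDCG@1 matching in the results table, but it remains an assumption. The function names and session ids are hypothetical.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int) -> float:
    """Any-hit recall (assumed convention): 1.0 if a gold session is in the top k."""
    return 1.0 if any(sid in gold_ids for sid in ranked_ids[:k]) else 0.0


def ndcg_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int) -> float:
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 gets log2(2) = 1
        for rank, sid in enumerate(ranked_ids[:k])
        if sid in gold_ids
    )
    ideal_hits = min(len(gold_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical ranking for one question whose answer lives in session "s1".
    ranked = ["s3", "s1", "s7"]
    gold = {"s1"}
    for k in (1, 2, 3):
        print(k, recall_at_k(ranked, gold, k), round(ndcg_at_k(ranked, gold, k), 4))
```

Averaging these per-question values over the 450 held-out questions would give table-style percentages under the stated assumptions.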
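
To make the IDF idea from the strategy section more concrete, here is a rough sketch of a per-question, multiplicative keyword boost. It is an illustration under stated assumptions, not the `IDFKeywordBooster` source: the tokenizer, the smoothed IDF formula, the `weight` parameter, and the `(id, text, score)` candidate shape are all hypothetical choices made for the example.

```python
import math
import re
from typing import Dict, List, Set, Tuple

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens, de-duplicated (assumed tokenizer)."""
    return set(TOKEN_RE.findall(text.lower()))


def idf_boost(
    query: str,
    candidates: List[Tuple[str, str, float]],  # (session_id, text, vector_score)
    weight: float = 0.1,  # hypothetical boost strength
) -> List[Tuple[str, float]]:
    """Re-rank candidates with a multiplicative boost from query-term IDF overlap."""
    n = len(candidates)
    cand_tokens = [tokenize(text) for _, text, _ in candidates]

    # Document frequency is counted over this question's candidate pool only,
    # so terms that are common in this particular haystack get a low weight.
    df: Dict[str, int] = {}
    for toks in cand_tokens:
        for tok in toks:
            df[tok] = df.get(tok, 0) + 1

    q_tokens = tokenize(query)
    # Smoothed IDF: a term present in every candidate scores exactly 0.
    idf = {t: math.log((n + 1) / (df.get(t, 0) + 1)) for t in q_tokens}
    total = sum(idf.values())

    boosted = []
    for (cid, _, score), toks in zip(candidates, cand_tokens):
        overlap = sum(idf[t] for t in q_tokens & toks) / total if total > 0 else 0.0
        boosted.append((cid, score * (1.0 + weight * overlap)))
    boosted.sort(key=lambda item: item[1], reverse=True)
    return boosted


if __name__ == "__main__":
    pool = [
        ("a", "I went hiking in the mountains", 0.50),
        ("b", "I bought a kayak for the lake trip", 0.49),
        ("c", "We talked about the weather", 0.30),
    ]
    for cid, score in idf_boost("Which kayak did I buy for the trip?", pool):
        print(cid, round(score, 4))
```

Because the boost multiplies the vector score, a weak candidate cannot leapfrog a strong one on keywords alone; it only separates candidates whose vector scores are already close.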
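
The temporal strategy can be pictured in two steps: turn a phrase like "4 weeks ago" into a target date, then add a boost that shrinks as a candidate's score approaches 1. The sketch below is only an illustration of that shape and not the `ConfidenceAdaptiveTemporalBooster` code. The regexes, the 30-day month, `max_boost`, `window_days`, and the `(1 - score)` headroom term are all assumptions, and scores are assumed to lie in [0, 1].

```python
import re
from datetime import date, timedelta
from typing import Optional

UNIT_DAYS = {"day": 1, "week": 7, "month": 30, "year": 365}  # coarse approximations
WORD_NUMBERS = {"a": 1, "an": 1, "one": 1, "two": 2, "three": 3,
                "a couple of": 2, "a few": 3}

OFFSET_RE = re.compile(
    r"\b(\d+|a couple of|a few|an?|one|two|three)\s+(day|week|month|year)s?\s+ago\b"
)
LAST_RE = re.compile(r"\blast\s+(week|month|year)\b")


def target_date(query: str, asked_on: date) -> Optional[date]:
    """Resolve a relative time expression to a date, or None if there is none."""
    q = query.lower()
    m = OFFSET_RE.search(q)
    if m:
        raw, unit = m.groups()
        count = int(raw) if raw.isdigit() else WORD_NUMBERS[raw]
        return asked_on - timedelta(days=count * UNIT_DAYS[unit])
    m = LAST_RE.search(q)
    if m:
        return asked_on - timedelta(days=UNIT_DAYS[m.group(1)])
    return None


def temporal_boost(score: float, session_date: date, target: date,
                   max_boost: float = 0.15, window_days: int = 14) -> float:
    """Additive boost that decays with date distance and with existing confidence."""
    gap = abs((session_date - target).days)
    proximity = max(0.0, 1.0 - gap / window_days)  # 1 on the target day, 0 outside the window
    headroom = max(0.0, 1.0 - score)               # near-certain hits gain almost nothing
    return score + max_boost * proximity * headroom


if __name__ == "__main__":
    asked = date(2024, 5, 29)
    target = target_date("What did I cook 4 weeks ago?", asked)
    print(target)
    if target is not None:
        print(temporal_boost(0.50, date(2024, 5, 1), target))
        print(temporal_boost(0.95, date(2024, 5, 1), target))
```

With these assumed parameters, a medium-confidence session dated on the target day gains far more than one that is already near the top, which is the behavior the strategy description aims for.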