A local attention-based retrieval engine with SOTA results on LongMemEval, LoCoMo, and code search benchmarks

Reddit r/AI_Agents 06/16/26, 08:35 AM Tools

local-memory retrieval attention-based kv-cache open-source long-context ai-agents

Summary

Attemory is an open-source local memory retrieval engine that uses attention-based retrieval over KV cache instead of traditional embedding or BM25 methods, achieving state-of-the-art results on LongMemEval, LoCoMo, and code search benchmarks.

I built an open-source local memory retrieval engine for AI agents: Attemory.

Attemory is built around a different retrieval path. Instead of retrieving with embeddings, BM25, or a graph layer, it turns memory into a reusable KV cache and lets a local model retrieve by attending over that memory directly.

The reason I started building it is that retrieval often needs reasoning. A good memory system needs to follow constraints, connect entities, use dates and context, and understand what evidence would actually answer the user’s question. Traditional keyword/vector retrieval is useful, but it often lacks that reasoning path. Attemory uses the model’s attention path for retrieval, so relevance can be judged in the same context format that an LLM already understands.

The benchmark results are **SOTA**:

- LongMemEval-S: about 40 sessions / 115k tokens, **98.72% session Recall_any@5, 92.77% session Recall_all@5, 98.94% message Recall_all@50**
- LongMemEval-M: about 500 sessions / 1.5M tokens / 5k messages, **94.89% session Recall_any@5, 83.62% session Recall_all@5, 92.55% message Recall_all@50**
- LoCoMo: 10 long conversations / 1,540 QA items, **94.52% accuracy** with GPT-4.1-mini as answer model and GPT-4o-mini as judge
- Semble: 63 repos / 19 languages / largest repo about 5M tokens, **0.9055 file-level NDCG@10**

The retrieval benchmarks are reproducible locally.

The retrieval path is decode-free: it uses partial prefill, KV-cache reuse, and attention-based ranking, so search does not require token-by-token generation. The goal is to make LLM-based retrieval practical for multi-million-token workflows.

This is still early software, and I’d especially like feedback from people building local agents or long-context memory systems.

Happy to answer technical questions about the approach, benchmarks, packaging, or limitations.
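To make "ranking by attention over cached keys" and the Recall_any@k / Recall_all@k metrics concrete, here is a toy sketch in plain Python. It is not Attemory's implementation: the function names, the two-dimensional vectors, and the choice to score each chunk by summed attention mass are all assumptions made for illustration. The two recall functions follow the usual reading of those metrics (any gold item in the top k versus every gold item in the top k).

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def rank_chunks(query_vecs, chunk_keys, top_k):
    """Rank memory chunks by the attention mass the query places on them.

    query_vecs: list of query-token vectors (hypothetical stand-ins for
                the model's query projections).
    chunk_keys: dict of chunk_id -> list of cached key vectors.
    """
    # Flatten all cached keys, remembering which chunk owns each key.
    owners, keys = [], []
    for chunk_id, chunk in chunk_keys.items():
        for key in chunk:
            owners.append(chunk_id)
            keys.append(key)

    dim = len(keys[0])
    mass = {chunk_id: 0.0 for chunk_id in chunk_keys}
    for q in query_vecs:
        # Scaled dot-product attention of one query token over all keys.
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        for owner, weight in zip(owners, softmax(logits)):
            mass[owner] += weight  # assumption: aggregate by summed mass

    return sorted(mass, key=mass.get, reverse=True)[:top_k]


def recall_any_at_k(ranked, gold, k):
    """1.0 if at least one gold item appears in the top k, else 0.0."""
    return float(any(item in gold for item in ranked[:k]))


def recall_all_at_k(ranked, gold, k):
    """1.0 only if every gold item appears in the top k, else 0.0."""
    return float(all(item in ranked[:k] for item in gold))


# Toy memory: three "sessions", each with two cached key vectors.
chunk_keys = {
    "s1": [[1.0, 0.0], [0.9, 0.1]],
    "s2": [[0.0, 1.0], [0.1, 0.9]],
    "s3": [[-1.0, 0.0], [-0.9, -0.1]],
}
query_vecs = [[4.0, 0.0]]  # one query token pointing toward s1

ranked = rank_chunks(query_vecs, chunk_keys, top_k=2)
gold = {"s1", "s3"}
any_hit = recall_any_at_k(ranked, gold, k=2)
all_hit = recall_all_at_k(ranked, gold, k=2)
```

Nothing here generates tokens: the ranking comes from a single attention pass of the query over stored keys, which is the sense in which such a path can be decode-free. Summed mass favors longer chunks, so a real system would likely normalize or aggregate differently across heads and layers.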

Similar Articles

Benchmarking agent memory retrieval on LongMemEval‑S — 98% Recall@5, 100% recall by R@23, local embeddings only (all-MiniLM-L6-v2), no LLM, no API key

T-Mem: Memory That Anticipates, Not Archives

Cognis: Context-Aware Memory for Conversational AI Agents

SuperLocalMemory V3.3: The Living Brain -- Biologically-Inspired Forgetting, Cognitive Quantization, and Multi-Channel Retrieval for Zero-LLM Agent Memory Systems

MemoryOS – AI agent memory with temporal knowledge graph and 9ms ingest and 78ms retrieval
