A local attention-based retrieval engine with SOTA results on LongMemEval, LoCoMo, and code search benchmarks

Reddit r/AI_Agents Tools

Summary

Attemory is an open-source local memory retrieval engine that uses attention-based retrieval over KV cache instead of traditional embedding or BM25 methods, achieving state-of-the-art results on LongMemEval, LoCoMo, and code search benchmarks.

I built an open-source local memory retrieval engine for AI agents: Attemory.

Attemory is built around a different retrieval path. Instead of retrieving with embeddings, BM25, or a graph layer, it turns memory into reusable KV cache and lets a local model retrieve by attending over that memory directly.

The reason I started building it is that retrieval often needs reasoning. A good memory system needs to follow constraints, connect entities, use dates and context, and understand what evidence would actually answer the user’s question. Traditional keyword/vector retrieval is useful, but it often does not have that reasoning path. Attemory uses the model’s attention path for retrieval, so relevance can be judged in the same context format that an LLM already understands.

The benchmark results are clearly **SOTA**:

- LongMemEval-S: about 40 sessions / 115k tokens, **98.72% session Recall_any@5, 92.77% session Recall_all@5, 98.94% message Recall_all@50**
- LongMemEval-M: about 500 sessions / 1.5M tokens / 5k messages, **94.89% session Recall_any@5, 83.62% session Recall_all@5, 92.55% message Recall_all@50**
- LoCoMo: 10 long conversations / 1,540 QA items, **94.52% accuracy** with GPT-4.1-mini as answer model and GPT-4o-mini as judge
- Semble: 63 repos / 19 languages / largest repo about 5M tokens, **0.9055 file-level NDCG@10**

The retrieval benchmarks are reproducible locally.

The retrieval path is decode-free: it uses partial prefill, KV-cache reuse, and attention-based ranking, so search does not require token-by-token generation. The goal is to make LLM-based retrieval practical for multi-million-token workflows.

This is still early software, and I’d especially like feedback from people building local agents or long-context memory systems.

Happy to answer technical questions about the approach, benchmarks, packaging, or limitations.
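
For anyone who wants a feel for what "attention-based ranking over cached keys" can mean, here is a toy sketch in plain Python. It is not Attemory's code or API: the function name `rank_chunks_by_attention`, the single-head setup, and the hand-written vectors are all assumptions made for illustration. A real system would take query and key states from a model's layers and heads. The idea shown is only this: compute softmax attention from the query tokens over the stored memory keys, sum the attention mass per memory chunk, and sort. No tokens are generated at any point.

```python
import math
from collections import defaultdict


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def rank_chunks_by_attention(query_vecs, cached_keys, chunk_ids, top_k=5):
    """Rank memory chunks by the attention mass that query tokens place on them.

    query_vecs:  one vector per query token (hypothetical stand-in for the
                 query states produced by prefilling only the query).
    cached_keys: one vector per memory token (stand-in for keys stored in a
                 reusable KV cache).
    chunk_ids:   chunk id for each memory token, same length as cached_keys.
    """
    dim = len(cached_keys[0])
    scale = 1.0 / math.sqrt(dim)  # standard scaled dot-product attention
    chunk_scores = defaultdict(float)
    for q in query_vecs:
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in cached_keys]
        weights = softmax(logits)
        # Average over query tokens so the chunk scores sum to 1.
        for w, cid in zip(weights, chunk_ids):
            chunk_scores[cid] += w / len(query_vecs)
    ranked = sorted(chunk_scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


# Toy memory: three chunks, two tokens each, in a 4-dimensional space.
cached_keys = [
    [4.0, 0.0, 0.0, 0.0], [3.0, 0.0, 0.0, 0.0],  # chunk 0
    [0.0, 4.0, 0.0, 0.0], [0.0, 3.0, 0.0, 0.0],  # chunk 1
    [0.0, 0.0, 4.0, 0.0], [0.0, 0.0, 3.0, 0.0],  # chunk 2
]
chunk_ids = [0, 0, 1, 1, 2, 2]

# Toy query: mostly aligned with chunk 1, slightly with chunk 2.
query_vecs = [[0.0, 4.0, 0.0, 0.0], [0.0, 3.0, 1.0, 0.0]]

ranked = rank_chunks_by_attention(query_vecs, cached_keys, chunk_ids, top_k=3)
for cid, score in ranked:
    print(f"chunk {cid}: attention mass {score:.4f}")
```

The design point the sketch tries to capture is that the memory keys are computed once and reused, so each search only pays for processing the query and one attention pass over the cache.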