A novel memory retrieval system inspired by episodic memory theory achieves a state-of-the-art 96.4% top-50 accuracy on the LongMemEval benchmark with Gemini 3 Flash as the answering model, above the reported scores of Gemini 3 Pro-based baselines. The smaller answering model was chosen deliberately to isolate retrieval quality from model capability.
Disclosure: first author. Evaluation of an experimental memory retrieval system against LongMemEval (Wang et al., 2024). Figured the results might be of interest here, particularly the deliberate use of a smaller answering model to isolate retrieval quality from model capability.

96.4% at top-50 with Gemini 3 Flash. Comparative reported scores (all Gemini 3 Pro): Mem0 94.8%, Honcho 92.6%, HydraDB 90.79%, Supermemory 85.2%.

Retrieval architecture draws on episodic memory theory (Tulving, 1972), reconstructive recall (Bartlett, 1932), and temporal context models (Howard & Kahana, 2002). Three design choices we think mattered:

* **Query decomposition**: parallel retrieval passes targeting distinct information needs. Critical for multi-session questions where no single query surfaces all relevant fragments.
* **Temporal salience scoring**: candidates scored on semantic similarity, lexical precision, and temporal salience, reflecting associative and recency factors in human recall (Polyn et al., 2009).
* **Coherence re-ranking**: re-ranked for cross-memory coherence and temporal chain resolution before presentation to the answering model.

Methodology: forked Mem0's open-source benchmarking script, replaced storage and retrieval with our system, stripped all question-specific prompt templates. Single generic prompt, 500 questions.

Category results at top-50: single-session (user) 98.6%, assistant 100%, preferences 96.7%, knowledge update 97.4%, multi-session 94.0%, temporal reasoning 95.5%.

Limitations: single benchmark evaluation; architecture details intentionally limited; single model configuration, no ablations; production conditions (adversarial inputs, privacy, contradictory information) not tested.

Above ~96% we hit evaluation ceiling effects: ambiguous questions, narrow expected answers, dataset inconsistencies. Some benchmark errors identified, which we reported upstream.
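
For anyone who wants something concrete to poke at: a rough sketch of how query decomposition and three-signal scoring could fit together. This is not the system evaluated above, since architecture details are intentionally limited. Every name here (`Memory`, `score`, `retrieve`), the weights, and the 30-day half-life are hypothetical. Temporal salience is reduced to a plain exponential recency decay, lexical precision to token overlap, and coherence re-ranking is left out entirely.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    """Hypothetical stored memory; frozen so it can be used as a dict key."""
    text: str
    embedding: tuple[float, ...]
    age_days: float  # time elapsed since the memory was stored


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexical_overlap(query, text):
    """Fraction of query tokens that also appear in the memory text."""
    q_tokens = set(query.lower().split())
    t_tokens = set(text.lower().split())
    return len(q_tokens & t_tokens) / len(q_tokens) if q_tokens else 0.0


def recency(age_days, half_life_days=30.0):
    """Exponential decay: 1.0 for a brand-new memory, 0.5 after one half-life."""
    return 0.5 ** (age_days / half_life_days)


def score(query, query_emb, mem, weights=(0.6, 0.25, 0.15)):
    """Weighted sum of semantic, lexical, and temporal signals (weights are made up)."""
    w_sem, w_lex, w_time = weights
    return (
        w_sem * cosine(query_emb, mem.embedding)
        + w_lex * lexical_overlap(query, mem.text)
        + w_time * recency(mem.age_days)
    )


def retrieve(sub_queries, memories, top_k=50):
    """Run one scoring pass per sub-query in parallel, then merge the passes.

    sub_queries is a list of (text, embedding) pairs, one per information need.
    Each memory keeps its best score across all passes.
    """
    def one_pass(sub_query):
        text, emb = sub_query
        return [(score(text, emb, m), m) for m in memories]

    with ThreadPoolExecutor() as pool:
        passes = list(pool.map(one_pass, sub_queries))

    best = {}
    for scored in passes:
        for s, m in scored:
            if s > best.get(m, float("-inf")):
                best[m] = s

    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return [m for m, _ in ranked[:top_k]]
```

Taking the max across passes means a fragment only needs to match one sub-query well to make the cut, which is the point of decomposing multi-session questions in the first place. A real system would presumably get the sub-queries and embeddings from a model; here they are passed in by hand.
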
[Paper](https://exabase.io/research/exabase-achieves-state-of-the-art-on-longmemeval-benchmark) | [Results](https://fabric.so/p/longmemevalresults20260514125510-48Wipq9gyX8ZRzG3jHnWY9?expandedFdocId=b06b9bab-a50c-4bcf-935b-7f86c118aa9b) | [Answerer prompt](https://fabric.so/p/longmemeval-answer-generation-prompt-6qoaWxIa5BCrPH3DDdQbdS?expandedFdocId=0fda4cbc-438d-4546-bd06-69bcedb1e566)

Curious if others have explored similar cognitive-science-informed retrieval architectures for conversational memory.
The author shares benchmark results for memweave, a Python library for agent memory, achieving 98% Recall@5 on LongMemEval-S using only local embeddings without LLM calls. The post details the methodology and compares performance against mempalace, highlighting stable retrieval across different question types.
The author claims GBrain outperforms MemPalace on the LongMemEval benchmark and has released the evaluation repository as open source to validate the results.
MemoryOS is an open-source, self-hosted AI agent memory tool using a temporal knowledge graph, achieving 86.2% accuracy on LongMemEval-S with 78 ms retrieval latency.
Google announces Gemini 2.5 Flash, a new hybrid reasoning model available in preview through the Gemini API. The model features toggleable thinking capabilities, fine-grained thinking budgets for quality-cost-latency tradeoffs, and maintains fast inference speeds while improving performance over 2.0 Flash.
Google has released Gemini 3 Flash, a fast, cost-effective AI model that combines Pro-grade reasoning with Flash-level speed for tasks like coding, complex analysis, and agentic workflows.