#1 on memory benchmark LongMemEval with Gemini Flash, not Pro [R]

Reddit r/MachineLearning Papers

Summary

A memory retrieval system inspired by episodic memory theory scores 96.4% at top-50 on the LongMemEval benchmark using Gemini 3 Flash as the answering model, ahead of the reported scores of systems evaluated with Gemini 3 Pro. The smaller answering model was chosen deliberately to isolate retrieval quality from model capability.

Disclosure: first author. Evaluation of an experimental memory retrieval system against LongMemEval (Wang et al., 2024). Figured the results might be of interest here, particularly the deliberate use of a smaller answering model to isolate retrieval quality from model capability.

96.4% at top-50 with Gemini 3 Flash. Comparative reported scores (all Gemini 3 Pro): Mem0 94.8%, Honcho 92.6%, HydraDB 90.79%, Supermemory 85.2%.

Retrieval architecture draws on episodic memory theory (Tulving, 1972), reconstructive recall (Bartlett, 1932), and temporal context models (Howard & Kahana, 2002).

Three design choices we think mattered:

* **Query decomposition**: parallel retrieval passes targeting distinct information needs. Critical for multi-session questions where no single query surfaces all relevant fragments.
* **Temporal salience scoring**: candidates scored on semantic similarity, lexical precision, and temporal salience, reflecting associative and recency factors in human recall (Polyn et al., 2009).
* **Coherence re-ranking**: re-ranked for cross-memory coherence and temporal chain resolution before presentation to the answering model.

Methodology: forked Mem0's open-source benchmarking script, replaced storage and retrieval with our system, stripped all question-specific prompt templates. Single generic prompt, 500 questions.

Category results at top-50: single-session (user) 98.6%, assistant 100%, preferences 96.7%, knowledge update 97.4%, multi-session 94.0%, temporal reasoning 95.5%.

Limitations: single benchmark evaluation; architecture details intentionally limited; single model configuration, no ablations; production conditions (adversarial inputs, privacy, contradictory information) not tested.

Above ~96% we hit evaluation ceiling effects: ambiguous questions, narrow expected answers, dataset inconsistencies. Some benchmark errors identified, which we reported upstream.
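For readers who want something concrete, here is a minimal Python sketch of how the first two design choices (query decomposition and temporal salience scoring) could fit together. It is not the system evaluated here, whose architecture details are intentionally limited. Everything in it is an assumption made for illustration: the `Memory` fields, the toy 3-d embeddings, the exponential recency decay with a 30-day half-life, the weights `(0.6, 0.25, 0.15)`, and merging parallel passes by taking each memory's best score. Sub-queries are supplied by hand, since the decomposition step itself is not described, and coherence re-ranking is left out for the same reason.

```python
from __future__ import annotations

import math
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    """Hypothetical stored memory: text, toy embedding, and age in days."""
    text: str
    embedding: tuple[float, ...]
    age_days: float


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def lexical_overlap(query: str, text: str) -> float:
    """Fraction of query tokens that also occur in the memory text."""
    q_tokens = set(query.lower().split())
    t_tokens = set(text.lower().split())
    return len(q_tokens & t_tokens) / len(q_tokens) if q_tokens else 0.0


def temporal_salience(age_days: float, half_life_days: float = 30.0) -> float:
    """Assumed recency model: the score halves every `half_life_days`."""
    return 0.5 ** (age_days / half_life_days)


def score(query: str, query_emb: tuple[float, ...], m: Memory,
          weights: tuple[float, float, float] = (0.6, 0.25, 0.15)) -> float:
    """Weighted blend of semantic, lexical, and temporal signals (weights assumed)."""
    w_sem, w_lex, w_time = weights
    return (w_sem * cosine(query_emb, m.embedding)
            + w_lex * lexical_overlap(query, m.text)
            + w_time * temporal_salience(m.age_days))


def retrieve(sub_queries: list[tuple[str, tuple[float, ...]]],
             memories: list[Memory], k: int) -> list[tuple[Memory, float]]:
    """Run one scoring pass per sub-query in parallel, then merge by best score."""
    def one_pass(sub_query: tuple[str, tuple[float, ...]]) -> list[float]:
        text, emb = sub_query
        return [score(text, emb, m) for m in memories]

    with ThreadPoolExecutor() as pool:
        per_query_scores = list(pool.map(one_pass, sub_queries))

    # For each memory, keep the highest score it earned across all passes.
    best = [max(column) for column in zip(*per_query_scores)]
    ranked = sorted(zip(memories, best), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy data: two relevant memories from different sessions plus a recent distractor.
memories = [
    Memory("user adopted a dog named Biscuit", (1.0, 0.0, 0.0), age_days=10),
    Memory("user moved to Lisbon in March", (0.0, 1.0, 0.0), age_days=200),
    Memory("user likes espresso", (0.0, 0.0, 1.0), age_days=1),
]
sub_queries = [
    ("dog named", (1.0, 0.0, 0.0)),
    ("moved to Lisbon", (0.0, 1.0, 0.0)),
]

for memory, s in retrieve(sub_queries, memories, k=2):
    print(f"{s:.3f}  {memory.text}")
```

With both sub-queries, the top two results are the dog and Lisbon memories. With only the first sub-query, the recent but irrelevant espresso memory takes the second slot on recency alone, which is the multi-session failure mode that decomposition is meant to avoid.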
[Paper](https://exabase.io/research/exabase-achieves-state-of-the-art-on-longmemeval-benchmark) | [Results](https://fabric.so/p/longmemevalresults20260514125510-48Wipq9gyX8ZRzG3jHnWY9?expandedFdocId=b06b9bab-a50c-4bcf-935b-7f86c118aa9b) | [Answerer prompt](https://fabric.so/p/longmemeval-answer-generation-prompt-6qoaWxIa5BCrPH3DDdQbdS?expandedFdocId=0fda4cbc-438d-4546-bd06-69bcedb1e566)

Curious if others have explored similar cognitive-science-informed retrieval architectures for conversational memory.