#1 on memory benchmark LongMemEval with Gemini Flash, not Pro [R]

Reddit r/MachineLearning 05/17/26, 05:44 PM Papers

memory-retrieval longmemeval gemini-flash episodic-memory retrieval-augmented benchmark conversational-memory

Summary

An experimental memory retrieval system inspired by episodic memory theory reports 96.4% top-50 accuracy on the LongMemEval benchmark with Gemini 3 Flash as the answering model, above the reported scores of systems evaluated with Gemini 3 Pro. The smaller answering model was chosen deliberately to isolate retrieval quality from model capability.

Disclosure: first author.

Evaluation of an experimental memory retrieval system against LongMemEval (Wang et al., 2024). Figured the results might be of interest here, particularly the deliberate use of a smaller answering model to isolate retrieval quality from model capability.

96.4% at top-50 with Gemini 3 Flash. Comparative reported scores (all Gemini 3 Pro): Mem0 94.8%, Honcho 92.6%, HydraDB 90.79%, Supermemory 85.2%.

Retrieval architecture draws on episodic memory theory (Tulving, 1972), reconstructive recall (Bartlett, 1932), and temporal context models (Howard & Kahana, 2002). Three design choices we think mattered:

* **Query decomposition**: parallel retrieval passes targeting distinct information needs. Critical for multi-session questions where no single query surfaces all relevant fragments.
* **Temporal salience scoring**: candidates scored on semantic similarity, lexical precision, and temporal salience, reflecting associative and recency factors in human recall (Polyn et al., 2009).
* **Coherence re-ranking**: re-ranked for cross-memory coherence and temporal chain resolution before presentation to the answering model.

Methodology: forked Mem0's open-source benchmarking script, replaced storage and retrieval with our system, stripped all question-specific prompt templates. Single generic prompt, 500 questions.

Category results at top-50: single-session (user) 98.6%, assistant 100%, preferences 96.7%, knowledge update 97.4%, multi-session 94.0%, temporal reasoning 95.5%.

Limitations: single benchmark evaluation; architecture details intentionally limited; single model configuration, no ablations; production conditions (adversarial inputs, privacy, contradictory information) not tested. Above ~96% we hit evaluation ceiling effects: ambiguous questions, narrow expected answers, dataset inconsistencies. Some benchmark errors identified, which we reported upstream.
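For readers who want something concrete, here is a minimal sketch of how the first two design choices above (merging parallel sub-query passes, then scoring on semantic similarity, lexical precision, and temporal salience) could fit together. The post deliberately withholds architecture details, so everything here is an assumption for illustration, not the system's actual method: the `Candidate` fields, the exponential recency decay, the 30-day half-life, and the fusion weights are all hypothetical. Coherence re-ranking is not shown.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    semantic: float   # assumed: embedding similarity scaled to [0, 1]
    lexical: float    # assumed: keyword-overlap score scaled to [0, 1]
    age_days: float   # assumed: time since the memory was written


def temporal_salience(age_days: float, half_life_days: float = 30.0) -> float:
    """Hypothetical recency term: 1.0 for a brand-new memory, 0.5 after one half-life."""
    return 0.5 ** (age_days / half_life_days)


def score(c: Candidate, w_sem: float = 0.6, w_lex: float = 0.25, w_time: float = 0.15) -> float:
    """Weighted sum of the three signals; the weights are illustrative, not tuned."""
    return w_sem * c.semantic + w_lex * c.lexical + w_time * temporal_salience(c.age_days)


def retrieve(sub_query_results: list[list[Candidate]], k: int = 50) -> list[Candidate]:
    """Merge candidates from parallel sub-query passes, dedupe by text, keep the top k."""
    best: dict[str, tuple[float, Candidate]] = {}
    for results in sub_query_results:
        for c in results:
            s = score(c)
            # Keep only the highest-scoring copy of a memory surfaced by several passes.
            if c.text not in best or s > best[c.text][0]:
                best[c.text] = (s, c)
    ranked = sorted(best.values(), key=lambda pair: pair[0], reverse=True)
    return [c for _, c in ranked[:k]]
```

Each inner list passed to `retrieve` stands for the output of one decomposed sub-query. A memory that several passes surface is kept once, at its best score, so it does not crowd out other fragments in the top-k window.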
[Paper](https://exabase.io/research/exabase-achieves-state-of-the-art-on-longmemeval-benchmark) | [Results](https://fabric.so/p/longmemevalresults20260514125510-48Wipq9gyX8ZRzG3jHnWY9?expandedFdocId=b06b9bab-a50c-4bcf-935b-7f86c118aa9b) | [Answerer prompt](https://fabric.so/p/longmemeval-answer-generation-prompt-6qoaWxIa5BCrPH3DDdQbdS?expandedFdocId=0fda4cbc-438d-4546-bd06-69bcedb1e566)

Curious if others have explored similar cognitive-science-informed retrieval architectures for conversational memory.


Similar Articles

Benchmarking agent memory retrieval on LongMemEval‑S — 98% Recall@5, 100% recall by R@23, local embeddings only (all-MiniLM-L6-v2), no LLM, no API key

@garrytan: GBrain beats MemPalace on LongMemEval And I published the benchmarks and open source eval repo to prove it

MemoryOS – AI agent memory with temporal knowledge graph and 9ms ingest and 78ms retrieval

Introducing Gemini 2.5 Flash

Gemini 3 Flash: frontier intelligence built for speed
