MemTrace: Probing What Final Accuracy Misses in Long-Term Memory

arXiv cs.AI 06/17/26, 04:00 AM Papers
llm long-term-memory benchmark evaluation knowledge-point probing
Summary
MemTrace is a benchmark that evaluates LLM agent memory at the knowledge point level, probing how facts behave under varying memory age, question type, and evidence conditions. It reveals that pooled accuracy hides distinct failure modes, and that the main bottleneck is evidence use rather than retrieval.
arXiv:2606.17328v1 Announce Type: new Abstract: LLM agents increasingly maintain long-term memory of user facts across sessions. Yet such memory is usually evaluated by aggregating accuracy over question rows or episodes. Because this approach scores question rows independently, even when several questions probe the same fact, it cannot show how that fact behaves as conditions change. We introduce MemTrace, a benchmark whose unit of measurement is the knowledge point: a single typed fact about the user, rather than an individual question. MemTrace probes each fact along three controlled dimensions: memory age, defined by how many sessions ago the fact appeared in the history; question type, covering current state, earlier state, and trajectory of change; and evidence condition, covering present, missing, and contradicted-by-false-premise settings. Evaluating 13 memory-system configurations across four paradigms, we find that similar pooled accuracy hides different failures: recovering a fact's current and earlier states does not imply tracking how it changed, and safe abstention does not imply correcting a false premise. The dominant bottleneck is evidence use, not retrieval: when systems fail, the evidence was retrievable 10 times more often than it was missing. These results suggest that improving long-term memory requires better use of reachable evidence, not simply more storage or retrieval.
Cached at: 06/17/26, 05:36 AM
# MemTrace: Probing What Final Accuracy Misses in Long-Term Memory
Source: [https://arxiv.org/html/2606.17328](https://arxiv.org/html/2606.17328)
Xianxuan Long¹, Zhikai Chen¹, Shenglai Zeng¹, Shouren Wang², Kai Guo¹, Jiliang Tang¹

¹Michigan State University, ²Case Western Reserve University

{longxia2, chenzh85, zengshe1, guokai1, tangjili}@msu.edu, sxw992@case.edu

###### Abstract

LLM agents increasingly maintain long-term memory of user facts across sessions. Yet such memory is usually evaluated by aggregating accuracy over question rows or episodes. Because this approach scores question rows independently, even when several probe one fact, it cannot show how that fact behaves as conditions change. We introduce MemTrace, a benchmark whose unit of measurement is the knowledge point: a single typed fact about the user, not an individual question. MemTrace probes each fact along three controlled dimensions: memory age (how many sessions ago it appeared in the history), question type (current, earlier, or trajectory of change), and evidence condition (present, missing, or contradicted by a false premise). Evaluating 13 memory-system configurations across four paradigms, we find that similar pooled accuracy hides different failures: recovering a fact's current and earlier states does not imply tracking how it changed, and safe abstention does not imply correcting a false premise. The dominant bottleneck is evidence use, not retrieval: when systems fail, the evidence was retrievable $10\times$ more often than it was missing, so improving memory depends on using reachable evidence, not on more storage or retrieval.


## 1 Introduction

Large language models (LLMs) are moving from single-turn assistants to persistent agents that interact with users across many sessions (Deshpande et al., [2025](https://arxiv.org/html/2606.17328#bib.bib24); Shen et al., [2026](https://arxiv.org/html/2606.17328#bib.bib7); Deng et al., [2026](https://arxiv.org/html/2606.17328#bib.bib55)). In this setting, memory is not only a matter of recalling isolated facts. A useful system (Hu et al., [2026](https://arxiv.org/html/2606.17328#bib.bib1); Huang et al., [2026](https://arxiv.org/html/2606.17328#bib.bib2)) is expected to remember user-specific information, update it when the user changes state, and keep its answers coherent as goals and preferences evolve (Deshpande et al., [2025](https://arxiv.org/html/2606.17328#bib.bib24); Zhang et al., [2026](https://arxiv.org/html/2606.17328#bib.bib39); Chen et al., [2026b](https://arxiv.org/html/2606.17328#bib.bib58)). This need is now being addressed by several lines of work, including long-context models that keep more history in the input (Gemini Team, Google DeepMind, [2025](https://arxiv.org/html/2606.17328#bib.bib30); OpenAI, [2025](https://arxiv.org/html/2606.17328#bib.bib33); Qwen Team, [2025b](https://arxiv.org/html/2606.17328#bib.bib28)), retrieval-augmented systems that retrieve evidence at inference time (Asai et al., [2024](https://arxiv.org/html/2606.17328#bib.bib47); Jimenez Gutierrez et al., [2024](https://arxiv.org/html/2606.17328#bib.bib19)), external-memory systems that maintain persistent memory stores (Liu et al., [2026](https://arxiv.org/html/2606.17328#bib.bib27); Zhong et al., [2023](https://arxiv.org/html/2606.17328#bib.bib53)), and agentic-memory architectures that use policies or agents to manage memories across interactions (Wang and Chen, [2025](https://arxiv.org/html/2606.17328#bib.bib20); Yue et al., [2026](https://arxiv.org/html/2606.17328#bib.bib41)).

![Refer to caption](https://arxiv.org/html/2606.17328v1/x1.png)

Figure 1: A pooled QA view can mark one final answer as correct while hiding failures on the same knowledge point. In this example, a system answers the current role correctly but fails when the same role fact is probed with aged historical and trajectory questions.

Many current long-term memory benchmarks aggregate accuracy over question rows or interaction episodes (Yen et al., [2025](https://arxiv.org/html/2606.17328#bib.bib22); Tavakoli et al., [2026](https://arxiv.org/html/2606.17328#bib.bib9)). This aggregation hides where a system fails, but it is only the surface problem. The deeper one is the unit of measurement: accuracy is scored per question row or per interaction episode, so questions that probe one underlying fact are treated as independent items. As a result, no score over that unit can hold a fact fixed and ask how it behaves as conditions change. These conditions matter in practice: a system may recall a fact when recent but forget it after many sessions, answer its current state but not how it changed, or refuse an unmentioned fact yet accept a false premise about it. Capturing such contrasts requires fixing the fact and varying the conditions around it. [Figure 1](https://arxiv.org/html/2606.17328#S1.F1) gives an example.

Distinguishing these failures is crucial because they reflect how memory systems are used in practice\. Users ask about the current state of a fact, refer back to earlier states, and ask how a fact has changed over time\. They also ask questions whose evidence is absent, where the system should abstain, or whose premise conflicts with remembered information, where the system should reject the premise\. Benchmarks that aggregate across these dimensions cannot show which behavior breaks, even when two systems have similar pooled scores\.

We introduce MemTrace, a benchmark whose unit of measurement is the *knowledge point*: a single typed fact about the user, rather than the individual question. For each fact, MemTrace constructs repeated probes along controlled dimensions. *Memory age* measures how many sessions have passed since the fact first appeared in history. *Question type* asks for the current state, an earlier state, or the trajectory of change over time. *Evidence condition* controls whether the relevant evidence is present, missing, or contradicted by a false premise. Together, these dimensions instantiate MemTrace as a benchmark of 835 typed knowledge points from 20 users, expanded into 15,422 question rows and over 200,000 scored answers.

Each user fact in MemTrace is repeatedly probed, which lets us ask whether memory persists as sessions accumulate, whether systems track the state and evolution of a fact, and whether they behave safely when evidence is missing or conflicting. We also diagnose whether remaining errors come from unreachable evidence or from reachable evidence that is not used. We evaluate 13 memory-system configurations across four paradigms on MemTrace; our key findings are as follows:

(1) Performance varies systematically across memory age, question type, and evidence condition. Long-context systems answer recent facts well, but lose accuracy as facts age, especially on trajectory questions. RAG systems, including graph-based retrieval, handle current and earlier-state questions better than questions that require tracking change over time. Some external-memory systems decline almost all questions about facts that were never mentioned, yet rarely answer correctly when the prompt contains a false premise.

(2) Across systems, the dominant remaining gap is evidence use rather than retrieval: when systems fail, the evidence was already retrievable about $10\times$ more often than it was missing.

Our contributions are:

- We introduce MemTrace, a knowledge-point benchmark built around three probing dimensions central to long-term memory evaluation: memory age, question type, and evidence condition. These dimensions test retention, varied fact queries, and safe behavior under missing or conflicting evidence.
- We evaluate 13 memory-system configurations across four paradigms and show that systems with similar pooled scores fail differently. In particular, trajectory questions expose a broad weakness: systems that recover current or earlier states of a fact can still fail when asked how it changes over time.
- We provide diagnostic analysis of where memory failures arise. Across systems, the main bottleneck is evidence use rather than retrieval: failures come $10\times$ more often from unused evidence than from unreachable evidence.

## 2 Related Work

#### Memory Architectures\.

Persistent LLM agents have motivated several memory architectures. Long-context models read prior interactions directly from the prompt. Retrieval-augmented systems index and retrieve external evidence before generation (Lewis et al., [2020](https://arxiv.org/html/2606.17328#bib.bib46); Robertson and Zaragoza, [2009](https://arxiv.org/html/2606.17328#bib.bib31); Jimenez Gutierrez et al., [2024](https://arxiv.org/html/2606.17328#bib.bib19)). Explicit or agentic memory systems maintain dedicated stores and policies for writing, updating, and retrieving memories. Some systems organize memory as an explicit agent state (Park et al., [2023](https://arxiv.org/html/2606.17328#bib.bib48); Packer et al., [2023](https://arxiv.org/html/2606.17328#bib.bib49); Chhikara et al., [2025](https://arxiv.org/html/2606.17328#bib.bib18)); others study lightweight memory stores (Liu et al., [2026](https://arxiv.org/html/2606.17328#bib.bib27); Shu et al., [2026](https://arxiv.org/html/2606.17328#bib.bib26); Xu et al., [2025](https://arxiv.org/html/2606.17328#bib.bib25)) or memory-management agents (Wang and Chen, [2025](https://arxiv.org/html/2606.17328#bib.bib20); Yue et al., [2026](https://arxiv.org/html/2606.17328#bib.bib41)). This architectural diversity makes final-answer accuracy an incomplete way to compare memory systems.

![Refer to caption](https://arxiv.org/html/2606.17328v1/x2.png)

Figure 2: Construction and evaluation schematic for MemTrace. (A) Source sessions are converted into typed knowledge points with session anchors and quality checks. (B) Probe construction pairs each knowledge point with a memory window and evidence condition, then expands the base probe by question type into question rows. (C) Memory-system responses are scored and summarized into diagnostic views of memory maintenance, evidence-condition behavior, and failure attribution.

#### Memory Benchmarks.

Memory evaluation spans long-context stress tests (Bai et al., [2024](https://arxiv.org/html/2606.17328#bib.bib42); Zhang et al., [2024](https://arxiv.org/html/2606.17328#bib.bib43); Hsieh et al., [2024](https://arxiv.org/html/2606.17328#bib.bib44)), newer long-context suites (Bai et al., [2025](https://arxiv.org/html/2606.17328#bib.bib45)), and long-term conversational memory (Xu et al., [2022](https://arxiv.org/html/2606.17328#bib.bib56); Maharana et al., [2024](https://arxiv.org/html/2606.17328#bib.bib10); Wu et al., [2025](https://arxiv.org/html/2606.17328#bib.bib11)). Other benchmarks focus on personalization (Jiang et al., [2025](https://arxiv.org/html/2606.17328#bib.bib13); Bian et al., [2026](https://arxiv.org/html/2606.17328#bib.bib54)), agent memory (Hu et al., [2025](https://arxiv.org/html/2606.17328#bib.bib15); Tavakoli et al., [2026](https://arxiv.org/html/2606.17328#bib.bib9)), dynamic profiles and evolving preferences (Li et al., [2026](https://arxiv.org/html/2606.17328#bib.bib36); Uddin et al., [2026](https://arxiv.org/html/2606.17328#bib.bib6)), stale or hallucinated memories (Chao et al., [2026](https://arxiv.org/html/2606.17328#bib.bib34); Tan et al., [2025](https://arxiv.org/html/2606.17328#bib.bib16)), and missing or conflicting evidence (Chen et al., [2026a](https://arxiv.org/html/2606.17328#bib.bib3); Ai et al., [2026](https://arxiv.org/html/2606.17328#bib.bib17)). These benchmarks broaden memory-system evaluation, but usually score question rows or interaction outcomes and aggregate across them. MemTrace uses knowledge points as the analysis unit. Each fact is probed across windows, question types, and evidence conditions, so behavior is measured on the knowledge point rather than averaged across independent question rows.

## 3 The MemTrace Benchmark

Unlike existing benchmarks, which mainly evaluate isolated question rows or aggregate corpus-level accuracy, MemTrace uses the *knowledge point* as the unit of measurement. A knowledge point is a typed fact about the user. For each knowledge point, we construct repeated probes that vary memory age, question type, and evidence condition while keeping the underlying fact fixed. This design allows us to test three behaviors that pooled QA scores usually merge: whether a fact remains usable as sessions accumulate, whether the system can answer current, historical, and trajectory questions about the same fact, and whether it behaves appropriately when evidence is present, missing, or contradicted. [Figure 2](https://arxiv.org/html/2606.17328#S2.F2) shows the full construction and evaluation flow: source sessions are converted into knowledge points, expanded into controlled probes, scored through memory-system responses, and summarized as diagnostic views. [Table 1](https://arxiv.org/html/2606.17328#S3.T1) positions this protocol against representative long-term and personalized-memory benchmarks (Maharana et al., [2024](https://arxiv.org/html/2606.17328#bib.bib10); Wu et al., [2025](https://arxiv.org/html/2606.17328#bib.bib11); Jiang et al., [2025](https://arxiv.org/html/2606.17328#bib.bib13); Chen et al., [2026a](https://arxiv.org/html/2606.17328#bib.bib3); Hu et al., [2025](https://arxiv.org/html/2606.17328#bib.bib15)).

Table 1: Benchmark-design comparison. Rows are protocol axes, not quality judgments; ✓ = central; $\circ$ = related; – = not part of the main protocol.

### 3.1 Data Source and Knowledge Points

To evaluate a system along controlled dimensions, a per-fact protocol requires source data with two properties. First, facts must be grounded in multi-session user histories and anchored to specific sessions, so memory age can be defined. Second, the data must include distractor items for constructing missing-evidence and conflict probes, so evidence conditions can be tested. HaluMem-Medium (Chen et al., [2026a](https://arxiv.org/html/2606.17328#bib.bib3)) provides both properties. HaluMem evaluates whether memory systems hallucinate during extraction, updating, and question answering. However, its examples are not organized around a fixed knowledge point queried under multiple conditions. We therefore use HaluMem-Medium as source data and transform its histories, memory points, and diagnostic questions into knowledge-point probes. Substantive knowledge points are static facts, dynamic facts with earlier and updated states, or preference facts. Conflict and boundary distractors come from HaluMem's diagnostic questions and are reformulated for the per-fact protocol.

The benchmark covers 20 users and 835 typed knowledge points, expanded into 5,677 base probes, 15,422 question rows, and 200,453 scored answers across 13 memory-system configurations spanning four paradigms (Appendix [A](https://arxiv.org/html/2606.17328#A1)). As shown in [Table 2](https://arxiv.org/html/2606.17328#S3.T2), 635 knowledge points are substantive user facts and 200 are distractor knowledge points used for conflict and boundary probes. Each user contributes 35–48 knowledge points (mean 41.8).

Table 2: Distribution of knowledge points in MemTrace: static, dynamic, and preference are substantive facts; conflict and boundary are distractors.

### 3.2 Probe Construction

Each probe in MemTrace reflects how long\-term memory systems are queried in multi\-session conversations\. Rather than treating evaluation as a flat question\-answering task, we fix the knowledge point and evaluate it along three dimensions that aggregate QA scores usually merge: how long the system can retain it \(*Memory age*\), whether the system can flexibly utilize it across different question contexts \(*Question type*\), and whether the system can handle it safely under unmentioned or misleading conditions \(*Evidence condition*\)\.

Formally, a *base probe* pairs a knowledge point, a memory window $W_w$, and an evidence condition, then expands into question rows. A memory window is an evaluation checkpoint; its *prefix* is the conversation history shown to the system. If $W_w$ ends after session $t_{\mathrm{eval}}$, the system receives sessions with indices $< t_{\mathrm{eval}}$. This differs from boundary probes, which query unmentioned facts.

#### Memory age\.

To track how memory persists as the conversation grows, we evaluate each user's history at eight chronological checkpoints (rather than on disjoint session blocks) to map a continuous retention trace. The evaluation checkpoints are chosen to span from early to saturated histories, with extra windows anchored near dynamic updates to probe pre- and post-update states. A knowledge point is only probed after its initial appearance. If it first appears in session $t_{\mathrm{source}}$, its memory age at $W_w$ is $t_{\mathrm{eval}} - t_{\mathrm{source}}$ sessions.
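The prefix rule and the memory-age definition can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the 0-based session indexing, and the choice to return `None` for a fact whose source session is not yet inside the prefix are all assumptions made here.

```python
from typing import List, Optional


def visible_prefix(sessions: List[str], t_eval: int) -> List[str]:
    """Sessions shown to the system: those with index < t_eval.

    Assumes 0-based session indices (an assumption of this sketch).
    """
    if t_eval < 0:
        raise ValueError("t_eval must be non-negative")
    return sessions[:t_eval]


def memory_age(t_eval: int, t_source: int) -> Optional[int]:
    """Memory age in sessions: t_eval - t_source.

    Returns None when the source session is not yet in the prefix,
    reflecting the rule that a fact is probed only after it appears.
    """
    if t_source >= t_eval:
        return None
    return t_eval - t_source
```

For example, a fact first stated in session 2 and probed at a checkpoint with `t_eval = 7` has a memory age of 5 sessions under these conventions.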

#### Question type\.

Question type tests whether the same fact supports different uses, because users ask about more than the current state of a fact. Each substantive knowledge point is instantiated as a Current question (what is the current state?), a Historical question (what was the state at an earlier point?), and a Trajectory question (how has the state changed over time?); distractor knowledge points, which drive the evidence-condition probes described next, carry only Current and Historical questions.
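The expansion from a probed knowledge point into question rows can be sketched as follows. The labels `"substantive"` and `"distractor"`, the question-type strings, and the row layout are hypothetical names chosen for this sketch; any filtering the benchmark applies beyond this rule is not modeled.

```python
from typing import List, Tuple

# Question types per knowledge-point kind, as described in the text.
QUESTION_TYPES = {
    "substantive": ("current", "historical", "trajectory"),
    "distractor": ("current", "historical"),
}


def expand_to_question_rows(
    kp_id: str, kp_kind: str, window: int
) -> List[Tuple[str, int, str]]:
    """Expand one (knowledge point, window) pair into question rows."""
    if kp_kind not in QUESTION_TYPES:
        raise ValueError(f"unknown knowledge-point kind: {kp_kind}")
    return [(kp_id, window, q_type) for q_type in QUESTION_TYPES[kp_kind]]
```

Keeping the knowledge-point identifier in every row is what later allows scores to be grouped by fact instead of by question.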

#### Evidence condition\.

Evidence condition tests how systems respond when evidence is absent or conflicting\. Real conversations include queries about unmentioned facts \(missing evidence\) and queries whose premise contradicts memory \(conflicting evidence\)\. The standard probes provide the evidence\-present case\. We therefore expand distractor knowledge points into two additional probe families:*boundary probes*, which query unmentioned facts, and*conflict probes*, which assert a false premise that contradicts memory\. Both families are adapted from HaluMem’s diagnostic taxonomy and aligned with the per\-fact unit\.

Table 3: Memory maintenance by question type. Fresh averages Gist over checkpoints W1 and W2, Saturated averages Gist over checkpoints W7 and W8, and $\Delta$Forget is their gap computed before rounding, all in percentage points. Score cells are shaded within each column; bold marks each Fresh/Saturated leader among the main rows. Gray rows are paper-native backbone sensitivity rows and are not counted as additional main systems.

### 3.3 Metrics

A single accuracy metric would conflate distinct response behaviors. Each response is therefore assigned a score tuple $(g, v, r)$. Here, $g$ is binary *Gist accuracy*, which captures semantic correctness; $v$ is continuous *Verbatim completeness* in $[0, 1]$, which checks whether canonical answer tokens appear; and $r$ is *Response type*, which supports diagnostics over correct answers, abstentions, and hallucinations. GPT-4o is the primary judge; Gemini-3-Flash checks reliability on a stratified 200-probe sample under the same rubric (Appendix [D](https://arxiv.org/html/2606.17328#A4)).
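As a rough illustration of the score tuple, the sketch below stores $(g, v, r)$ in a small record and computes a token-overlap version of Verbatim completeness. The text only says that $v$ checks whether canonical answer tokens appear; the exact formula used here (fraction of canonical tokens found in the answer, case-insensitive) is an assumption, as are the class and function names. Gist accuracy and Response type come from an LLM judge in the paper and are not reproduced here.

```python
import re
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Score:
    gist: int           # g: 1 if semantically correct, else 0 (judge-assigned)
    verbatim: float     # v: completeness in [0, 1]
    response_type: str  # r: e.g. "correct", "abstain", "hallucinate" (hypothetical labels)


def verbatim_completeness(answer: str, canonical_tokens: Sequence[str]) -> float:
    """Fraction of canonical tokens that occur in the answer (assumed formula)."""
    if not canonical_tokens:
        return 0.0
    answer_tokens = set(re.findall(r"\w+", answer.lower()))
    hits = sum(1 for tok in canonical_tokens if tok.lower() in answer_tokens)
    return hits / len(canonical_tokens)
```

Under this assumed formula, an answer that mentions three of four canonical tokens scores $v = 0.75$ even if the judge marks its gist as correct, which is why the two signals are kept separate.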

### 3.4 Diagnostic Views

We report three diagnostic views instead of one pooled score: memory maintenance, evidence\-condition behavior, and reach/use attribution\.

#### Memory maintenance\.

For each knowledge point and question type, we aggregate Gist accuracy across all eight windows to ask whether memory persists as the conversation grows. We report this in two ways. The first is the full retention trace $W_1, \dots, W_8$, visualized as a per-system retention curve. The second is three scalar summaries: Fresh accuracy ($W_1, W_2$), measuring recent access; Saturated accuracy ($W_7, W_8$), measuring aged-memory retention; and $\Delta$Forget, the gap between the two. $\Delta$Forget is a diagnostic gap, not a standalone leaderboard: a small value can mean either stable retention or uniformly low endpoint accuracy.
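The three scalar summaries can be sketched directly from an eight-point retention trace. The function name is hypothetical, and the sign convention (Fresh minus Saturated, so that forgetting gives a positive gap) is an assumption; the text only calls it the gap between the two. The trace values in the usage note are invented for illustration and are not benchmark results.

```python
from statistics import mean
from typing import Dict, Sequence


def maintenance_summary(trace: Sequence[float]) -> Dict[str, float]:
    """Summarize a Gist trace over W1..W8 (values in percentage points)."""
    if len(trace) != 8:
        raise ValueError("expected one Gist value per window W1..W8")
    fresh = mean(trace[0:2])      # W1, W2
    saturated = mean(trace[6:8])  # W7, W8
    return {
        "fresh": fresh,
        "saturated": saturated,
        "delta_forget": fresh - saturated,  # assumed sign convention
    }
```

Two made-up traces show why the gap must be read together with its endpoints: `[80, 80, 79, 79, 78, 78, 78, 78]` and `[12, 12, 11, 11, 10, 10, 10, 10]` both give a gap of 2 points, but the first reflects stable retention and the second uniformly low accuracy.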

#### Evidence\-condition behavior\.

For each system, we aggregate response\-type rates over distractor probes: correct boundary refusal, correct conflict resolution, and hallucination on both\. Safe abstention and false\-premise rejection differ, but pooled accuracy treats both as wrong\.
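A minimal sketch of this aggregation is shown below, assuming each scored distractor probe is reduced to a `(probe_family, response_type)` pair. The family names and response labels are hypothetical. On boundary probes the desired label is an abstention, while on conflict probes it is a correct answer that rejects the false premise, so the two families are kept apart instead of being pooled.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

LABELS = ("correct", "abstain", "hallucinate")  # hypothetical label set


def evidence_profile(rows: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Per-family response-type rates for distractor probes."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for family, label in rows:
        if label not in LABELS:
            raise ValueError(f"unknown response type: {label}")
        counts[family][label] += 1
    profile = {}
    for family, c in counts.items():
        total = sum(c.values())
        profile[family] = {label: c[label] / total for label in LABELS}
    return profile
```

A system that abstains on most probes in both families would look safe on the boundary row and weak on the conflict row, which is the contrast a single pooled score cannot express.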

#### Failure attribution\.

When a system answers incorrectly, we ask whether the gold evidence was unreachable or reachable but unused\. This diagnostic runs on dedicated samples rather than the full benchmark and has three components\.

*Oracle on hard failures.* We supply gold memory evidence directly to probes that all systems answer incorrectly and then re-score. The lift over the production baseline tests whether these probes can be solved when the needed evidence is explicit. This open-book check is not a retrieval setting; it only checks whether the hard probes are unanswerable.

*Reach-vs.-use replay.* On a broader sample, a simple Text-emb-3-small retriever checks whether it can reach the source session containing the gold evidence. We combine this reach signal with the original production answer to separate reach misses from reached-but-unsolved cases.

*Oracle on retriever-reached but unsolved probes.* We then rerun the open-book oracle on the retriever-reached but unsolved cases. If accuracy rises, the error is recoverable with explicit evidence; if not, the case may remain unsolved even with evidence.
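The reach-vs.-use split can be sketched as a small classifier over two boolean signals per probe: whether the retriever reached the gold session and whether the production answer solved the probe. The bucket names are hypothetical, and treating any solved probe as "solved" regardless of reach is an assumption of this sketch.

```python
from collections import Counter
from typing import Iterable, Tuple


def attribute(reached: bool, solved: bool) -> str:
    """Assign one probe to a failure-attribution bucket."""
    if solved:
        return "solved"
    return "reached_but_unsolved" if reached else "reach_miss"


def tally(signals: Iterable[Tuple[bool, bool]]) -> Counter:
    """Count buckets over (reached, solved) pairs."""
    return Counter(attribute(reached, solved) for reached, solved in signals)
```

Only the `reached_but_unsolved` bucket is passed on to the second oracle step, since those are the cases where evidence use, not retrieval reach, is the open question.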

## 4 Experiment and Findings

This section evaluates MemTrace along the diagnostic views defined in [Section 3.4](https://arxiv.org/html/2606.17328#S3.SS4). We first describe the experiment setting, then report memory maintenance across sessions and question types, behavior under missing and conflicting evidence, and failure attribution by retrieval reach and evidence use. We close with the main implications for long-term memory system design.

![Refer to caption](https://arxiv.org/html/2606.17328v1/x3.png)

Figure 3: Memory-age traces. Gist traces from checkpoint W1 through W8 for the main configurations. Each panel fixes one question type: current state, earlier state, or trajectory of change. Gray lines show the remaining systems, and colored lines highlight representative configurations used in the text. Endpoints are summarized in [Table 3](https://arxiv.org/html/2606.17328#S3.T3).

### 4.1 Experiment Setting

We evaluate 13 memory-system configurations across four paradigms.

- (a) Long-context configurations include Qwen3.5-35B (Qwen Team, [2025b](https://arxiv.org/html/2606.17328#bib.bib28)), Gemini-3-Flash (Gemini Team, Google DeepMind, [2025](https://arxiv.org/html/2606.17328#bib.bib30)), and GPT-5-nano (OpenAI, [2025](https://arxiv.org/html/2606.17328#bib.bib33)); they read visible history directly in the model context.
- (b) RAG configurations include BM25 (Robertson and Zaragoza, [2009](https://arxiv.org/html/2606.17328#bib.bib31)), Text-emb-3-small (OpenAI, [2024](https://arxiv.org/html/2606.17328#bib.bib32)), Qwen3-Emb (Qwen Team, [2025a](https://arxiv.org/html/2606.17328#bib.bib29)), and HippoRAG-v2 (Jimenez Gutierrez et al., [2024](https://arxiv.org/html/2606.17328#bib.bib19)); they retrieve memory evidence before generation.
- (c) External-memory configurations include Mem0 (Chhikara et al., [2025](https://arxiv.org/html/2606.17328#bib.bib18)), SimpleMem (Liu et al., [2026](https://arxiv.org/html/2606.17328#bib.bib27)), REMem (Shu et al., [2026](https://arxiv.org/html/2606.17328#bib.bib26)), and AMem (Xu et al., [2025](https://arxiv.org/html/2606.17328#bib.bib25)); they maintain memory stores.
- (d) Agentic-memory configurations include MIRIX (Wang and Chen, [2025](https://arxiv.org/html/2606.17328#bib.bib20)) and Mem-T (Yue et al., [2026](https://arxiv.org/html/2606.17328#bib.bib41)); they use policy-driven or multi-agent memory management.

Appendix [B](https://arxiv.org/html/2606.17328#A2) gives the full configuration for all 13 systems.

For long-context models, each probe is answered by prompting the model with the visible session prefix for that window. For the other configurations, the memory state for each checkpoint is built or updated only from the same visible prefix; future sessions are not available to storage, retrieval, or generation. The system then returns retrieved or stored memory evidence for the probe. We pass that evidence to a shared gpt-4o-mini answer generator with the same answer prompt, so the main comparison changes the memory mechanism while holding the final generator fixed. We also report paper-native gpt-4.1-mini sensitivity rows for SimpleMem and MIRIX where the original systems specify that backbone. Unless otherwise stated, results use Gist accuracy as the correctness signal. The following subsections use this setup to test the three diagnostic views introduced in [Section 3.4](https://arxiv.org/html/2606.17328#S3.SS4).

### 4.2 Memory Maintenance Across Sessions and Question Types

**Pooled accuracy hides when and how memory fails.** A pooled final-accuracy score removes the two variables that define long-term memory use: when a knowledge point is queried and what the question asks the system to do with it. MemTrace therefore reports Fresh access, averaged over W1 and W2; Saturated retention, averaged over W7 and W8; and $\Delta$Forget, their gap. We compute these summaries separately for current, historical, and trajectory questions. [Table 3](https://arxiv.org/html/2606.17328#S3.T3) reports these summaries for all main configurations, and [Figure 3](https://arxiv.org/html/2606.17328#S4.F3) shows the traces from W1 through W8.

These results show why a single leaderboard is lossy. On overall Saturated Gist, HippoRAG-v2 has the strongest endpoint (36.5%), but the leader changes by question type: HippoRAG-v2 leads saturated current and historical questions (45.4% and 50.9%), while Mem-T leads saturated trajectory questions (19.8%). Trajectory is therefore not just harder lookup; it asks for a different memory behavior. Long-context models fail differently: Qwen3.5-35B and GPT-5-nano have high Fresh trajectory scores (49.0% and 38.4%), but fall to 6.7% and 6.5% when saturated. Appendix [C](https://arxiv.org/html/2606.17328#A3) reports the corresponding bootstrap rank checks in [Table 11](https://arxiv.org/html/2606.17328#A3.T11).

This trajectory drop suggests that relevant evidence alone is not enough. A trajectory answer must connect multiple states of the same knowledge point, identify their temporal order, and express the update. Long-context models receive the visible conversation history, but as it grows, old and new mentions compete with later sessions; the model may answer a local current or historical query while failing to organize the temporal relation. RAG systems show a complementary limitation: HippoRAG-v2 leads saturated current and historical questions, but reaches only 13.4% on saturated trajectory. In both cases, the hard part is using multiple states as a temporal update trace, not recalling one state. A small $\Delta$Forget can still hide low performance at both endpoints, so we interpret the gap together with Fresh and Saturated.

### 4.3 Behavior Under Missing and Conflicting Evidence

![Refer to caption](https://arxiv.org/html/2606.17328v1/x4.png)

Figure 4: Conflict Gist versus boundary abstention by system. The two axes separate missing-evidence refusal from false-premise resolution; dashed guides mark the high-boundary and high-conflict regions.

Table 4: Conflict and boundary profile. For both conflict and boundary probes we report Gist accuracy, abstention rate (Abst.), and hallucination rate (Hallu.). On conflict probes, Gist measures whether the system resolves a false premise using memory; on boundary probes, abstention measures safe refusal when the requested fact is missing. All values are percentages.

**Abstaining is not resolving.** Reliable memory systems must handle missing evidence and false premises, not only recall supported facts. Boundary probes ask about facts absent from the history and should be refused. Conflict probes contain a false premise and require correction from memory. [Table 4](https://arxiv.org/html/2606.17328#S4.T4) and [Figure 4](https://arxiv.org/html/2606.17328#S4.F4) compare the two axes. These two behaviors are empirically separate. Mem0, AMem, and REMem are boundary-safe: their boundary abstention rates are 99.3%, 97.4%, and 94.0%, but they rarely resolve conflict probes. Their conflict Gist scores are 14.6%, 20.1%, and 35.1%. Their conflict failures are mostly abstentions rather than fabrications: conflict hallucination is 2.7%, 1.1%, and 4.9%. This pattern suggests an evidence-use failure at the memory-to-answer interface, not a lack of stored memory. Under the main answer backbone, boundary probes can be handled by abstention when support is missing, while conflict probes require the generator to use memory against the user's false premise. Relevant memory evidence is present, but the system must first detect that the premise contradicts memory and then convert the memory into a correction. Thus, these systems appear safe on missing evidence while often treating contradicted evidence as if it were absent. Appendix [E.2](https://arxiv.org/html/2606.17328#A5.SS2) reports answer-backbone ablations showing that conflict resolution is sensitive to how memory evidence is passed to the generator.

### 4.4 Failure Attribution: Retrieval Reach vs. Evidence Use

**Reaching evidence is not the same as using it.** We analyze failure origin in three steps. First, we check whether hard probes can be answered when gold evidence is given directly. Second, we check whether a simple retriever can reach the gold evidence. Third, for cases where evidence is reachable but the system still fails, we give the gold evidence and test whether the answer can be recovered.

![Refer to caption](https://arxiv.org/html/2606.17328v1/x5.png)

Figure 5: Failure\-attribution decomposition\. The oracle component gives the gold evidence directly to test whether hard probes are answerable; it is not a retrieval method\. The retrieval replay splits failures into reach vs\. reached\-but\-unsolved\.

We first analyze 120 hard probes: 40 all\-systems\-fail trajectory probes, 40 dynamic\-changed trajectory probes, and 40 hard current/historical control probes\. We supply gold memory evidence directly to the generator, an open\-book setting that tests answerability rather than retrieval\. Oracle Gist rises to 80–85% in every bucket, while production baselines stay between 0% and 33\.8% \([Figure 5](https://arxiv.org/html/2606.17328#S4.F5)\)\. Thus many hard probes are answerable with the right evidence; we next ask whether that evidence was unreachable or reachable but unused\.

To check whether evidence is reachable, we run a 300\-probe replay with a simple Text\-emb\-3\-small retriever\. We record two signals per probe: $R=1$ if the retriever reaches the gold session, and $U=1$ if the original production answer solves the probe\. Of the 300 probes, 21 are reach misses \(7\.0%\), 220 are retriever\-reached but unsolved \(73\.3%\), and 59 are solved \(19\.7%\)\. Thus the dominant failure case is $R=1, U=0$ \([Table 5](https://arxiv.org/html/2606.17328#S4.T5)\)\. Extending this reach/use replay to all 13 main configurations gives the same picture: $P(U=0 \mid R=1)$ ranges from 69\.2% for HippoRAG\-v2 to 88\.2% for MIRIX\. We then ask whether these reached but unsolved probes become answerable with explicit evidence\.
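The reach/use bookkeeping can be sketched as follows. This is an illustrative tally, not the authors' replay code: the `bucket` and `reach_use_summary` helpers are hypothetical, and the record list is synthetic, built only to match the three counts reported above (21, 220, 59). Following the 279-probe denominator used for the reached subset, the sketch treats every solved probe as retriever-reached.

```python
from collections import Counter


def bucket(reached: bool, solved: bool) -> str:
    """Assign one probe to a reach/use bucket from its R and U signals."""
    if not reached:
        return "reach_miss"          # R = 0
    return "solved" if solved else "reached_unsolved"  # R = 1, U = 1 or 0


def reach_use_summary(records):
    """Summarize (R, U) records into bucket shares and conditional rates."""
    counts = Counter(bucket(r, u) for r, u in records)
    total = sum(counts.values())
    reached = counts["reached_unsolved"] + counts["solved"]
    misses = counts["reach_miss"]
    return {
        "share_pct": {k: 100.0 * v / total for k, v in counts.items()},
        # P(U=0 | R=1): unsolved fraction among retriever-reached probes.
        "p_unsolved_given_reached_pct": (
            100.0 * counts["reached_unsolved"] / reached if reached else None
        ),
        # How many reached-but-unsolved failures occur per reach miss.
        "reached_unsolved_per_miss": (
            counts["reached_unsolved"] / misses if misses else None
        ),
    }


# Synthetic records reproducing the reported counts of the 300-probe replay.
RECORDS = (
    [(False, False)] * 21    # reach misses
    + [(True, False)] * 220  # retriever-reached but unsolved
    + [(True, True)] * 59    # solved
)

if __name__ == "__main__":
    summary = reach_use_summary(RECORDS)
    print(summary["share_pct"])
    print(summary["p_unsolved_given_reached_pct"])
    print(summary["reached_unsolved_per_miss"])
```

With these counts, reached-but-unsolved failures outnumber reach misses by 220 to 21, which is the roughly tenfold imbalance the paper reports, and 220 of the 279 reached probes remain unsolved.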

Table 5: Retriever\-reached but unsolved failures\. The $R=1, U=0$ column gives the count of retriever\-reached but unsolved probes over the 279 retriever\-reached probes; $P(U=0 \mid R=1)$ and Oracle Gist are percentages\. Oracle Gist re\-supplies the gold memory evidence on each system’s $R=1, U=0$ subset \(per\-row denominator\) and re\-scores with gpt\-4o\-mini backbone\.

Finally, we rerun the open\-book oracle on each main system’s $R=1, U=0$ probes \(Oracle Gist column, [Table 5](https://arxiv.org/html/2606.17328#S4.T5)\)\. All 13 systems recover to 80\.4%–83\.9%, a pooled lift of 81\.8 percentage points over the 0% baseline\. This confirms that most errors are not caused by unreachable evidence\. Reaching the gold session is only a weak precondition: it does not guarantee that the relevant span, temporal relation, or conflict signal is exposed to the generator\. The harder step is selecting and presenting reachable evidence in a form the generator can use, especially when the answer requires comparing states over time or correcting a false premise\.

### 4\.5 Summary and Implications

Taken together, the results suggest that long\-term memory failure is not merely a storage or retrieval failure, but a temporally grounded evidence\-use failure\. Long\-context systems read the visible history directly, but their aged\-trajectory collapse is consistent with attention dilution and weak temporal organization: old and new mentions must compete with many irrelevant sessions before the model can form an update\. RAG systems recover current and historical evidence more effectively, but retrieval exposes local evidence pieces rather than a coherent update path across states\. External\-memory systems can appear safe on missing evidence because conservative abstention is sufficient there, but conflict probes require active correction: rejecting the false premise while committing to the remembered fact\. The reach/use audit connects these patterns: many wrong answers occur after relevant evidence is already reachable\. Personalized memory systems therefore need more than larger context windows, stronger retrievers, or safer abstention; they need mechanisms that expose reachable evidence with temporal and conflict structure for multi\-session queries\.

## 5 Conclusion

MemTrace recasts long\-term memory evaluation around the knowledge point: rather than pooling question rows, it holds a single user fact fixed and varies its memory age, the question asked about it, and the evidence available\. This makes visible what an aggregate score cannot: when a fact decays over sessions, when a system recalls a state but not how it changed, and when it abstains safely yet fails to correct a false premise\. Across 13 memory\-system configurations and four paradigms, systems with similar pooled accuracy diverge along these axes, and most errors persist even when the relevant evidence is already reachable\. Long\-term memory is therefore limited less by storing or retrieving facts than by using reachable evidence during inference\.
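A toy illustration of why pooling question rows hides what a knowledge-point view exposes: the rows below are entirely hypothetical (two made-up systems, two made-up knowledge points) and the helper names are illustrative, not part of the MemTrace artifact. Both systems reach the same pooled accuracy, yet one never tracks the trajectory of change while the other does.

```python
from collections import defaultdict

# Hypothetical scored rows: (system, knowledge_point_id, question_type, correct)
ROWS = [
    ("sys_a", "kp1", "current", 1), ("sys_a", "kp1", "earlier", 1),
    ("sys_a", "kp1", "trajectory", 0),
    ("sys_a", "kp2", "current", 1), ("sys_a", "kp2", "earlier", 1),
    ("sys_a", "kp2", "trajectory", 0),
    ("sys_b", "kp1", "current", 1), ("sys_b", "kp1", "earlier", 0),
    ("sys_b", "kp1", "trajectory", 1),
    ("sys_b", "kp2", "current", 0), ("sys_b", "kp2", "earlier", 1),
    ("sys_b", "kp2", "trajectory", 1),
]


def pooled_accuracy(rows, system):
    """Accuracy over all question rows of one system, ignoring structure."""
    scores = [c for s, _, _, c in rows if s == system]
    return sum(scores) / len(scores)


def accuracy_by_question_type(rows, system):
    """Accuracy of one system split by question type."""
    grouped = defaultdict(list)
    for s, _, qtype, correct in rows:
        if s == system:
            grouped[qtype].append(correct)
    return {q: sum(v) / len(v) for q, v in grouped.items()}


def knowledge_point_profile(rows, system):
    """Per-fact view: knowledge point -> {question type: correct}."""
    profile = defaultdict(dict)
    for s, kp, qtype, correct in rows:
        if s == system:
            profile[kp][qtype] = bool(correct)
    return dict(profile)


if __name__ == "__main__":
    for system in ("sys_a", "sys_b"):
        print(system, pooled_accuracy(ROWS, system),
              accuracy_by_question_type(ROWS, system))
```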

## Limitations

MemTrace is derived from HaluMem\-Medium and covers 20 users from a single source distribution, so its results should not be read as spanning all domains or interaction styles\. Our main numbers compare end\-to\-end configurations rather than isolated memory mechanisms: long\-context rows are provider\-native models, while most other rows share a gpt\-4o\-mini generator but differ in their retrievers and embeddings \(AMem, for instance, keeps its original all\-MiniLM\-L6\-v2 embedding\)\. Because backbone choice alone can shift a system across much of the leaderboard, cross\-paradigm rankings should be read at the configuration level\. The reach/use attribution further relies on a single Text\-emb\-3\-small retriever and an open\-book oracle over hard probes, so the roughly 10× imbalance is a directional finding whose magnitude depends on the retriever\. Lastly, scoring relies on an LLM judge, with reliability checked by a second judge on a stratified sample\.

## References

- MemoryBench: a benchmark for memory and continual learning in LLM systems\.External Links:2510\.17281,[Document](https://dx.doi.org/10.48550/arXiv.2510.17281)Cited by:[§2](https://arxiv.org/html/2606.17328#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Asai, Z\. Wu, Y\. Wang, A\. Sil, and H\. Hajishirzi \(2024\)Self\-RAG: learning to retrieve, generate, and critique through self\-reflection\.Note:ICLR 2024External Links:2310\.11511Cited by:[§1](https://arxiv.org/html/2606.17328#S1.p1.1)\.
- Y\. Bai, X\. Lv, J\. Zhang, H\. Lyu, J\. Tang, Z\. Huang, Z\. Du, X\. Liu, A\. Zeng, L\. Hou, Y\. Dong, J\. Tang, and J\. Li \(2024\)LongBench: a bilingual, multitask benchmark for long context understanding\.Note:ACL 2024External Links:2308\.14508Cited by:[§2](https://arxiv.org/html/2606.17328#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Bai, S\. Tu, J\. Zhang, H\. Peng, X\. Cui, X\. Wang, X\. Lv, S\. Cao, J\. Xu, L\. Liu, Z\. Wang, C\. Lv, Y\. Zhang, X\. Liu, X\. Liu, Y\. Wang, G\. Zhang, K\. Wong, P\. Han, C\. Wang, W\. Chen, J\. Nie, J\. Tang, J\. Li, L\. Hou, A\. Yuille, and S\. Lin \(2025\)LongBench v2: towards deeper understanding and reasoning on realistic long\-context multitasks\.Note:ACL 2025External Links:2412\.15204Cited by:[§2](https://arxiv.org/html/2606.17328#S2.SS0.SSS0.Px2.p1.1)\.
- H\. Bian, Z\. Yao, S\. Hu, Z\. Xu, S\. Zhang, Y\. Guo, Z\. Yang, X\. Han, H\. Wang, and R\. Chen \(2026\)RealMem: benchmarking LLMs in real\-world memory\-driven interaction\.External Links:2601\.06966Cited by:[§2](https://arxiv.org/html/2606.17328#S2.SS0.SSS0.Px2.p1.1)\.
- H\. Chao, Y\. Bai, R\. Sheng, T\. Li, and Y\. Sun \(2026\)STALE: can LLM agents know when their memories are no longer valid?\.External Links:2605\.06527Cited by:[§2](https://arxiv.org/html/2606.17328#S2.SS0.SSS0.Px2.p1.1)\.
- D\. Chen, S\. Niu, K\. Li, P\. Liu, X\. Zheng, B\. Tang, X\. Li, F\. Xiong, and Z\. Li \(2026a\)HaluMem: evaluating hallucinations in memory systems of agents\.External Links:2511\.03506Cited by:[§2](https://arxiv.org/html/2606.17328#S2.SS0.SSS0.Px2.p1.1),[§3\.1](https://arxiv.org/html/2606.17328#S3.SS1.p1.1),[§3](https://arxiv.org/html/2606.17328#S3.p1.1)\.
- T\. Chen, J\. Lu, Y\. Shen, and L\. Zhang \(2026b\)ES\-MemEval: benchmarking conversational agents on personalized long\-term emotional support\.External Links:2602\.01885Cited by:[§1](https://arxiv.org/html/2606.17328#S1.p1.1)\.
- P\. Chhikara, D\. Khant, S\. Aryan, T\. Singh, and D\. Yadav \(2025\)Mem0: building production\-ready AI agents with scalable long\-term memory\.External Links:2504\.19413,[Document](https://dx.doi.org/10.48550/arXiv.2504.19413)Cited by:[§2](https://arxiv.org/html/2606.17328#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.17328#S4.SS1.p1.1)\.
- X\. Deng, Y\. Xue, Y\. Chen, M\. Mao, R\. Zhong, B\. Xu, J\. Fang, H\. Xu, T\. Wu, Y\. Xu, S\. Deng, H\. Wang, H\. Chen, and N\. Zhang \(2026\)MobileMem: evaluating long\-horizon memory for language agents in real\-world mobile environments\.Note:OpenReviewICLR 2026 Lifelong Agent WorkshopExternal Links:[Link](https://openreview.net/forum?id=w5I11HrMgJ)Cited by:[§1](https://arxiv.org/html/2606.17328#S1.p1.1)\.
- D\. Deshpande, V\. Gangal, H\. Mehta, A\. Kannappan, R\. Qian, and P\. Wang \(2025\)MEMTRACK: evaluating long\-term memory and state tracking in multi\-platform dynamic agent environments\.Note:NeurIPS 2025 SEA WorkshopExternal Links:2510\.01353,[Document](https://dx.doi.org/10.48550/arXiv.2510.01353)Cited by:[§1](https://arxiv.org/html/2606.17328#S1.p1.1)\.
- Gemini Team, Google DeepMind \(2025\)Gemini 2\.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities\.External Links:2507\.06261Cited by:[§1](https://arxiv.org/html/2606.17328#S1.p1.1),[§4\.1](https://arxiv.org/html/2606.17328#S4.SS1.p1.1)\.
- C\. Hsieh, S\. Sun, S\. Kriman, S\. Acharya, D\. Rekesh, F\. Jia, Y\. Zhang, and B\. Ginsburg \(2024\)RULER: what’s the real context size of your long\-context language models?\.External Links:2404\.06654Cited by:[§2](https://arxiv.org/html/2606.17328#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Hu, Y\. Wang, and J\. McAuley \(2025\)Evaluating memory in LLM agents via incremental multi\-turn interactions\.External Links:2507\.05257,[Document](https://dx.doi.org/10.48550/arXiv.2507.05257)Cited by:[§2](https://arxiv.org/html/2606.17328#S2.SS0.SSS0.Px2.p1.1),[§3](https://arxiv.org/html/2606.17328#S3.p1.1)\.
- Y\. Hu, S\. Liu, Y\. Yue, G\. Zhang, B\. Liu, F\. Zhu, J\. Lin, H\. Guo, S\. Dou, Z\. Xi, S\. Jin, J\. Tan, Y\. Yin, J\. Liu, Z\. Zhang, Z\. Sun, Y\. Zhu, H\. Sun, B\. Peng, Z\. Cheng, X\. Fan, J\. Guo, X\. Yu, Z\. Zhou, Z\. Hu, J\. Huo, J\. Wang, Y\. Niu, Y\. Wang, Z\. Yin, X\. Hu, Y\. Liao, Q\. Li, K\. Wang, W\. Zhou, Y\. Liu, D\. Cheng, Q\. Zhang, T\. Gui, S\. Pan, Y\. Zhang, P\. Torr, Z\. Dou, J\. Wen, X\. Huang, Y\. Jiang, and S\. Yan \(2026\)Memory in the age of ai agents\.External Links:2512\.13564,[Link](https://arxiv.org/abs/2512.13564)Cited by:[§1](https://arxiv.org/html/2606.17328#S1.p1.1)\.
- W\. Huang, W\. Zhang, Y\. Liang, Y\. Bei, Y\. Chen, T\. Feng, X\. Pan, Z\. Tan, Y\. Wang, T\. Wei, S\. Wu, R\. Xu, L\. Yang, R\. Yang, W\. Yang, C\. Yeh, H\. Zhang, H\. Zhang, S\. Zhu, H\. P\. Zou, W\. Zhao, S\. Wang, W\. Xu, Z\. Ke, Z\. Hui, D\. Li, Y\. Wu, L\. He, C\. Wang, X\. Xu, B\. Huang, J\. Tan, S\. Heinecke, H\. Wang, C\. Xiong, A\. A\. Metwally, J\. Yan, C\. Lee, H\. Zeng, Y\. Xia, X\. Wei, A\. Payani, Y\. Wang, H\. Ma, W\. Wang, C\. Wang, Y\. Zhang, X\. Wang, Y\. Zhang, J\. You, H\. Tong, X\. Luo, X\. Liu, Y\. Sun, W\. Wang, J\. McAuley, J\. Zou, J\. Han, P\. S\. Yu, and K\. Shu \(2026\)Rethinking memory mechanisms of foundation agents in the second half: a survey\.External Links:2602\.06052,[Link](https://arxiv.org/abs/2602.06052)Cited by:[§1](https://arxiv.org/html/2606.17328#S1.p1.1)\.
- B\. Jiang, Z\. Hao, Y\. Cho, B\. Li, Y\. Yuan, S\. Chen, L\. Ungar, C\. J\. Taylor, and D\. Roth \(2025\)Know me, respond to me: benchmarking LLMs for dynamic user profiling and personalized responses at scale\.External Links:2504\.14225Cited by:[§2](https://arxiv.org/html/2606.17328#S2.SS0.SSS0.Px2.p1.1),[§3](https://arxiv.org/html/2606.17328#S3.p1.1)\.
- B\. Jimenez Gutierrez, Y\. Shu, Y\. Gu, M\. Yasunaga, and Y\. Su \(2024\)HippoRAG: neurobiologically inspired long\-term memory for large language models\.External Links:2405\.14831,[Document](https://dx.doi.org/10.48550/arXiv.2405.14831)Cited by:[§1](https://arxiv.org/html/2606.17328#S1.p1.1),[§2](https://arxiv.org/html/2606.17328#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.17328#S4.SS1.p1.1)\.
- P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Kuttler, M\. Lewis, W\. Yih, T\. Rocktäschel, S\. Riedel, and D\. Kiela \(2020\)Retrieval\-augmented generation for knowledge\-intensive NLP tasks\.Note:NeurIPS 2020External Links:2005\.11401Cited by:[§2](https://arxiv.org/html/2606.17328#S2.SS0.SSS0.Px1.p1.1)\.
- S\. S\. Li, B\. Paranjape, K\. Oktar, Z\. Ma, G\. Zhou, L\. Guan, N\. Zhang, S\. Park, L\. Chen, D\. Yang, Y\. Tsvetkov, and A\. Celikyilmaz \(2026\)HorizonBench: long\-horizon personalization with evolving preferences\.External Links:2604\.17283Cited by:[§2](https://arxiv.org/html/2606.17328#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Liu, Y\. Su, P\. Xia, S\. Han, Z\. Zheng, C\. Xie, M\. Ding, and H\. Yao \(2026\)SimpleMem: efficient lifelong memory for LLM agents\.External Links:2601\.02553Cited by:[§1](https://arxiv.org/html/2606.17328#S1.p1.1),[§2](https://arxiv.org/html/2606.17328#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.17328#S4.SS1.p1.1)\.
- A\. Maharana, D\. Lee, S\. Tulyakov, M\. Bansal, F\. Barbieri, and Y\. Fang \(2024\)Evaluating very long\-term conversational memory of LLM agents\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 13851–13870\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.747)Cited by:[§2](https://arxiv.org/html/2606.17328#S2.SS0.SSS0.Px2.p1.1),[§3](https://arxiv.org/html/2606.17328#S3.p1.1)\.
- OpenAI \(2024\)New embedding models and API updates \(text\-embedding\-3\-small\)\.Note:[https://openai\.com/index/new\-embedding\-models\-and\-api\-updates/](https://openai.com/index/new-embedding-models-and-api-updates/)Model card and API documentationCited by:[§4\.1](https://arxiv.org/html/2606.17328#S4.SS1.p1.1)\.
- OpenAI \(2025\)Introducing GPT\-5 \(gpt\-5\-nano variant\)\.Note:[https://openai\.com/index/introducing\-gpt\-5/](https://openai.com/index/introducing-gpt-5/)Model release announcement and system cardCited by:[§1](https://arxiv.org/html/2606.17328#S1.p1.1),[§4\.1](https://arxiv.org/html/2606.17328#S4.SS1.p1.1)\.
- C\. Packer, V\. Fang, S\. G\. Patil, K\. Lin, S\. Wooders, and J\. E\. Gonzalez \(2023\)MemGPT: towards LLMs as operating systems\.External Links:2310\.08560Cited by:[§2](https://arxiv.org/html/2606.17328#S2.SS0.SSS0.Px1.p1.1)\.
- J\. S\. Park, J\. C\. O’Brien, C\. J\. Cai, M\. R\. Morris, P\. Liang, and M\. S\. Bernstein \(2023\)Generative agents: interactive simulacra of human behavior\.Note:UIST 2023External Links:2304\.03442Cited by:[§2](https://arxiv.org/html/2606.17328#S2.SS0.SSS0.Px1.p1.1)\.
- Qwen Team \(2025a\)Qwen3 embedding: advancing text embedding and reranking through foundation models\.External Links:2506\.05176Cited by:[§4\.1](https://arxiv.org/html/2606.17328#S4.SS1.p1.1)\.
- Qwen Team \(2025b\)Qwen3 technical report\.External Links:2505\.09388Cited by:[§1](https://arxiv.org/html/2606.17328#S1.p1.1),[§4\.1](https://arxiv.org/html/2606.17328#S4.SS1.p1.1)\.
- S\. Robertson and H\. Zaragoza \(2009\)The probabilistic relevance framework: BM25 and beyond\.Foundations and Trends in Information Retrieval3\(4\),pp\. 333–389\.External Links:[Document](https://dx.doi.org/10.1561/1500000019)Cited by:[§2](https://arxiv.org/html/2606.17328#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.17328#S4.SS1.p1.1)\.
- Y\. Shen, K\. Li, W\. Zhou, and S\. Hu \(2026\)Mem2ActBench: a benchmark for evaluating long\-term memory utilization in task\-oriented autonomous agents\.External Links:2601\.19935Cited by:[§1](https://arxiv.org/html/2606.17328#S1.p1.1)\.
- Y\. Shu, S\. P\. Jonnalagedda, X\. Gao, B\. Jiménez Gutiérrez, W\. Qi, K\. Das, H\. Sun, and Y\. Su \(2026\)REMem: reasoning with episodic memory in language agent\.Note:ICLR 2026External Links:2602\.13530Cited by:[§2](https://arxiv.org/html/2606.17328#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.17328#S4.SS1.p1.1)\.
- H\. Tan, Z\. Zhang, C\. Ma, X\. Chen, Q\. Dai, and Z\. Dong \(2025\)MemBench: towards more comprehensive evaluation on the memory of LLM\-based agents\.InFindings of the Association for Computational Linguistics: ACL 2025,pp\. 19336–19352\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.989),[Link](https://aclanthology.org/2025.findings-acl.989/)Cited by:[§2](https://arxiv.org/html/2606.17328#S2.SS0.SSS0.Px2.p1.1)\.
- M\. Tavakoli, A\. Salemi, C\. Ye, M\. Abdalla, H\. Zamani, and J\. R\. Mitchell \(2026\)Beyond a million tokens: benchmarking and enhancing long\-term memory in llms\.External Links:2510\.27246Cited by:[§1](https://arxiv.org/html/2606.17328#S1.p2.1),[§2](https://arxiv.org/html/2606.17328#S2.SS0.SSS0.Px2.p1.1)\.
- M\. N\. Uddin, K\. Shubham, E\. Blanco, C\. Baral, and G\. Wang \(2026\)From recall to forgetting: benchmarking long\-term memory for personalized agents\.External Links:2604\.20006Cited by:[§2](https://arxiv.org/html/2606.17328#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Wang and X\. Chen \(2025\)MIRIX: multi\-agent memory system for LLM\-based agents\.External Links:2507\.07957,[Document](https://dx.doi.org/10.48550/arXiv.2507.07957)Cited by:[§1](https://arxiv.org/html/2606.17328#S1.p1.1),[§2](https://arxiv.org/html/2606.17328#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.17328#S4.SS1.p1.1)\.
- D\. Wu, H\. Wang, W\. Yu, Y\. Zhang, K\. Chang, and D\. Yu \(2025\)LongMemEval: benchmarking chat assistants on long\-term interactive memory\.InInternational Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2606.17328#S2.SS0.SSS0.Px2.p1.1),[§3](https://arxiv.org/html/2606.17328#S3.p1.1)\.
- J\. Xu, A\. Szlam, and J\. Weston \(2022\)Beyond goldfish memory: long\-term open\-domain conversation\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 5180–5197\.External Links:[Document](https://dx.doi.org/10.18653/v1/2022.acl-long.356),[Link](https://aclanthology.org/2022.acl-long.356/)Cited by:[§2](https://arxiv.org/html/2606.17328#S2.SS0.SSS0.Px2.p1.1)\.
- W\. Xu, Z\. Liang, K\. Mei, H\. Gao, J\. Tan, and Y\. Zhang \(2025\)A\-mem: agentic memory for LLM agents\.Note:NeurIPS 2025External Links:2502\.12110,[Document](https://dx.doi.org/10.48550/arXiv.2502.12110)Cited by:[§2](https://arxiv.org/html/2606.17328#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.17328#S4.SS1.p1.1)\.
- H\. Yen, T\. Gao, M\. Hou, K\. Ding, D\. Fleischer, P\. Izsak, M\. Wasserblat, and D\. Chen \(2025\)HELMET: how to evaluate long\-context language models effectively and thoroughly\.Note:ICLR 2025External Links:2410\.02694,[Document](https://dx.doi.org/10.48550/arXiv.2410.02694)Cited by:[§1](https://arxiv.org/html/2606.17328#S1.p2.1)\.
- Y\. Yue, G\. Zhang, B\. Peng, X\. Fan, J\. Guo, Q\. Li, and Y\. Zhang \(2026\)Mem\-T: densifying rewards for long\-horizon memory agents\.External Links:2601\.23014Cited by:[§1](https://arxiv.org/html/2606.17328#S1.p1.1),[§2](https://arxiv.org/html/2606.17328#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.17328#S4.SS1.p1.1)\.
- W\. Zhang, X\. Wei, W\. Huang, Z\. Hui, C\. Wang, M\. Gong, and P\. S\. Yu \(2026\)MemoryCD: benchmarking long\-context user memory of LLM agents for lifelong cross\-domain personalization\.External Links:2603\.25973Cited by:[§1](https://arxiv.org/html/2606.17328#S1.p1.1)\.
- X\. Zhang, Y\. Chen, S\. Hu, Z\. Xu, J\. Chen, M\. Hao, X\. Han, Z\. Thai, S\. Wang, Z\. Liu, and M\. Sun \(2024\)InfiniteBench: extending long context evaluation beyond 100k tokens\.External Links:2402\.13718Cited by:[§2](https://arxiv.org/html/2606.17328#S2.SS0.SSS0.Px2.p1.1)\.
- W\. Zhong, L\. Guo, Q\. Gao, H\. Ye, and Y\. Wang \(2023\)MemoryBank: enhancing large language models with long\-term memory\.External Links:2305\.10250Cited by:[§1](https://arxiv.org/html/2606.17328#S1.p1.1)\.

## 6 Use of AI Writing Assistance

Consistent with the policy on AI writing assistance, we used AI assistants only for code development and for polishing the wording and grammar of the manuscript\. All scientific content—the claims, experimental design, results, and analyses—was conceived, written, reviewed, and verified by the human authors, who take full responsibility for it\.

## Appendix A Benchmark Details

The final benchmark index listed in the artifact manifest contains 20 users, 835 knowledge points, 5,677 base probes, and 15,422 question rows\. Paper\-facing loaders use the scoring protocol listed in the same manifest\.

### A\.1 Scale and Workload

This subsection reports the benchmark scale and scoring workload referenced from [Section 3\.1](https://arxiv.org/html/2606.17328#S3.SS1)\.

Table 6: Benchmark scale and scoring workload\. Base probes are knowledge\-point/window probes; question rows are produced after expanding probes by question type and evidence condition\. Scored system\-answer rows sum the completed 13 main configurations under the paper\-facing scoring pass\.

### A\.2 Quality\-Control Pipeline

MemTrace treats quality control as a construction check rather than a result metric\. The pipeline first preserves grounded source sessions and source memory evidence, then transforms them into typed knowledge points and repeated window\-level probes\. Automated checks flag candidate grounding, leakage, temporal\-anchor, type, and semantic\-consistency defects; flagged cases are manually reviewed and corrected before the benchmark index is frozen\. The check summary is reported below, and the independent second\-judge scoring check is reported in Appendix D\.

Table 7: Quality\-control check applied during MemTrace construction\. Pre\-fix shows defects inherited from the HaluMem\-Medium\-derived draft; Post\-fix is after manual review and correction\.

## Appendix B System Configurations

[Table 8](https://arxiv.org/html/2606.17328#A2.T8) lists the 13 main memory\-system configurations used in the controlled comparison\. The comparison unit is a complete configuration: memory construction, retrieval or access path, update policy, answer prompt, and answer generator\. Long\-context rows are provider\-native; the remaining main rows use the shared gpt\-4o\-mini answer generator described in [Section 4\.1](https://arxiv.org/html/2606.17328#S4.SS1)\.

| System | Paradigm | Backbone | Memory unit | Retrieval | Update mech\. | Conflict path | Temporal path |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3\.5\-35B | Long\-Context | native | full context | – | – | partial | partial |
| Gemini\-3\-Flash | Long\-Context | native | full context | – | – | partial | partial |
| GPT\-5\-nano | Long\-Context | native | full context | – | – | partial | partial |
| BM25 | RAG | 4o\-mini | chunk | sparse \(BM25\) | append | no | no |
| Text\-emb\-3\-small | RAG | 4o\-mini | chunk | dense | append | no | no |
| Qwen3\-Emb | RAG | 4o\-mini | chunk | dense | append | no | no |
| HippoRAG\-v2 | RAG | 4o\-mini | event graph \(KG\) | hybrid \(graph\+dense\) | append | partial | no |
| Mem0 | External\-Memory | 4o\-mini | fact triple | dense | consolidate | partial | no |
| SimpleMem | External\-Memory | 4o\-mini | compressed note | dense | consolidate | partial | partial |
| REMem | External\-Memory | 4o\-mini | event graph \(gist\+fact\) | hybrid \(agentic\) | consolidate | yes | yes |
| AMem | External\-Memory | 4o\-mini | note | dense \(Chroma\) | append | partial | no |
| MIRIX | Agentic\-Memory | 4o\-mini | 6\-store scratchpad | hybrid | consolidate | partial | partial |
| Mem\-T | Agentic\-Memory | 4o\-mini | note \(Chroma\) | dense \(RL policy\) | consolidate | partial | partial |

Table 8: System\-design matrix for the 13 evaluated main configurations\. Backbone disclosures are used when interpreting cross\-system comparisons\. Conflict path and temporal path are descriptive implementation labels: “yes” denotes an explicit mechanism or prompt path for that axis, “partial” denotes an indirect or general mechanism, and “no” denotes no dedicated axis\-specific mechanism in our implementation\. They are not standalone capability claims\.

Mem\-T is integrated from the released Mem\-T\-4B checkpoint with the published MOT\-GRPO retrieval policy and hindsight\-credit\-assignment memory construction\. Its hierarchical memory contains working, factual, experiential, and raw stores, all backed by ChromaDB\. Mem\-T’s published in\-domain training corpus is LoCoMo; MemTrace is derived from HaluMem\-Medium with LoCoMo session lineage, so the released policy may have soft distribution overlap with our evaluation data\. We use the frozen checkpoint and do not fine\-tune on MemTrace probes\.

## Appendix C Full Results and Failure\-Origin Checks

This appendix provides full numeric companions to the compact main\-text tables and figures\. Values use the main MemTrace adapter configurations unless stated otherwise\. It also includes supporting views for endpoints, hallucination, conflict\-by\-window behavior, abstention precision, and qualitative examples\.

### C\.1 Full Memory\-Age and Evidence\-Condition Results

These tables expand the compact main\-text results without changing the aggregation convention\. [Table 9](https://arxiv.org/html/2606.17328#A3.T9) reports the full Fresh/Saturated/$\Delta$Forget profile, [Table 10](https://arxiv.org/html/2606.17328#A3.T10) adds confidence intervals for conflict and boundary probes, and [Table 11](https://arxiv.org/html/2606.17328#A3.T11) shows the rank stability behind the question\-type comparison\.
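As a rough illustration of the rank-stability check, the sketch below computes a Spearman rank correlation between two per-system score lists. This is an assumption on our part: the text does not say which rank-correlation statistic is used, and the scores below are hypothetical placeholders rather than values from the tables. The helper names are illustrative.

```python
def average_ranks(values):
    """Rank values in descending order (1 = best); ties share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of positions tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-system Saturated Gist scores (percent), one entry per system.
current_historical = [80.0, 70.0, 60.0, 50.0]
trajectory = [30.0, 60.0, 20.0, 50.0]

if __name__ == "__main__":
    print(spearman(current_historical, trajectory))
```

A correlation near zero, as in this toy input, would mean that a system's rank on current/historical questions says little about its rank on trajectory questions.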

Table 9: Full main\-adapter Fresh, Saturated, and $\Delta$Forget table by question type\. Fresh/Saturated Gist is shown as percentages and $\Delta$Forget as percentage points\. Systems are ordered by mean Saturated Gist in the main leaderboard\.

Table 10: Main\-adapter 13\-system conflict and boundary table with bootstrap 95% confidence intervals\. Cells report percentage \[lower, upper\]\. This is the confidence\-interval companion to [Table 4](https://arxiv.org/html/2606.17328#S4.T4)\.

Table 11: Per\-system bootstrap 95% confidence intervals on main\-adapter Saturated Gist by question type, shown as percentages, and rank\-correlation distribution between current/historical and trajectory ranks\.

### C\.2 Additional Supporting Views

The following figures and tables preserve supporting results that are too large for the main text\. [Figure 6](https://arxiv.org/html/2606.17328#A3.F6) shows the Fresh\-to\-Saturated endpoints by question type, [Figure 7](https://arxiv.org/html/2606.17328#A3.F7) shows additional memory\-window diagnostics, a companion table reports abstention precision, and a further table gives representative qualitative probes\.

![Refer to caption](https://arxiv.org/html/2606.17328v1/figures/archive_unused/fig17_replacement_fresh_sat_dumbbell_by_qtype.png)

Figure 6: Per\-system Fresh\-to\-Saturated endpoints by question type for the main MemTrace adapter configurations\. Filled markers denote Fresh W1–W2 and hollow markers denote Saturated W7–W8\. This is the endpoint\-level companion to the main memory\-age trace in [Figure 3](https://arxiv.org/html/2606.17328#S4.F3)\.

![Refer to caption](https://arxiv.org/html/2606.17328v1/x6.png)

![Refer to caption](https://arxiv.org/html/2606.17328v1/x7.png)

Figure 7: Additional memory\-window diagnostics\. Top: main\-adapter per\-system hallucination rate across W1–W8\. Bottom: main\-adapter Conflict Gist accuracy by system and memory window, expanding the conflict/boundary profile and its confidence\-interval companion table\.

### C\.3 Failure\-Origin Details

The failure\-origin analysis has two parts\. The oracle block asks whether the answer generator can recover when the relevant memory evidence is supplied directly\. The replay block then checks, with a transparent Text\-emb\-3\-small proxy, whether the same evidence was reachable in the original production setting\.

Table 12: Numerical companion to the main oracle/recovery decomposition figure\. *Base* is the mean production\-system solve rate; *Oracle* is Gist accuracy when relevant memory evidence is supplied directly\. The replay rows use the transparent Text\-emb\-3\-small reach proxy: $R$ marks whether the proxy reaches the benchmark source session for the gold answer, and $U$ marks whether the original production answer solves the probe\. Lift and range values are percentage points\.

## Appendix D Judge Reliability

MemTrace uses GPT\-4o as the primary scoring judge\. As a reliability check, we rescore a stratified 200\-probe sample with a second LLM judge using the same rubric and JSON schema, then compute Cohen’s $\kappa$ with bootstrap confidence intervals\.
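For readers unfamiliar with the agreement statistic, here is a minimal standard-library sketch of Cohen's kappa with a percentile bootstrap interval. It is not the authors' scoring code: the resampling scheme (probe-level resampling with replacement, percentile interval), the number of resamples, and the toy labels are all assumptions made for illustration.

```python
import random


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences from two judges."""
    n = len(a)
    labels = set(a) | set(b)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each judge's marginal label frequencies.
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both judges used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def bootstrap_kappa_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for kappa, resampling probes with replacement."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohen_kappa([a[i] for i in idx], [b[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy binary Gist labels from two hypothetical judges (not benchmark data).
primary = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
second = [1, 0, 0, 0, 1, 0, 1, 1, 0, 1]

if __name__ == "__main__":
    print(cohen_kappa(primary, second))
    print(bootstrap_kappa_ci(primary, second))
```

On a stratified sample, one might prefer to resample within strata; the sketch above ignores stratification for brevity.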

Table 13: Second\-judge reliability on a stratified 200\-probe sample\. Agreement is computed between the primary GPT\-4o judge and a Gemini\-3\-Flash cross\-judge using the same scoring rubric\.

Table 14: Second\-judge disagreement breakdown on the same 200\-probe sample\. “Any disagreement” counts probes where the judges disagree on either binary Gist accuracy or response type\.

## Appendix E Sensitivity Checks

The main paper reports controlled MemTrace adapter configurations\. The checks below vary prompts, paper\-native settings, answer backbones, or diagnostic setup to quantify sensitivity around those main rows\. They are not counted as additional benchmark systems\.

### E\.1 Configuration and Prompt Sensitivity

This section keeps the main benchmark rows fixed and reports variants that stress prompts, paper\-native settings, and matched results\. The goal is to make clear which conclusions are stable and which numbers should be read only as configuration checks rather than main\-system replacements\.

### E\.2 Backbone and Oracle Sensitivity

We also check whether the conflict and oracle conclusions depend on the answer backbone\. These checks swap only the answer\-time generator while keeping the rest of the diagnostic setup fixed\.

### E\.3 Evidence\-Interface and Actionability Checks

The following checks are diagnostic interventions, not additional benchmark systems\. They ask whether changing the evidence interface or using oracle benchmark labels can change the observed failure pattern\.

Table 15: Configuration sensitivity checks\. Values are overall Gist under the core\-KP user\-macro mean\-over\-question\-types convention from the ablation inventory, shown as percentages\. These variants are not mixed into the main leaderboard\.

Table 16: Distractor prompt\-control checks\. Arrows report main adapter → unified where available\. The table shows configuration sensitivity for the conflict/boundary results, with rates shown as percentages\.

![Refer to caption](https://arxiv.org/html/2606.17328v1/figures/prompt_confound_horizon_shape.png)

Figure 8: Prompt\-control ablation by horizon and question type\. This sensitivity check asks whether horizon and question\-form changes alone explain the retention patterns; it does not replace the main adapter Fresh\-to\-Saturated trace\.

Table 17: Diagnostic actionability check using benchmark labels as oracle routing signals on top of the transparent dense\-RAG baseline\. Variants are not deployable memory methods and are not counted as benchmarked systems; they only test whether the diagnostic axes can guide targeted interventions\.

![Refer to caption](https://arxiv.org/html/2606.17328v1/x8.png)

Figure 9: Backbone\-sensitivity panel on conflict probes\. The main\-adapter gpt\-4o\-mini condition is the audit baseline, while the answer\-time generator is swapped to Gemini\-3\-Flash\.

Table 18: Backbone\-sensitivity numeric checks\. Left: Conflict Gist shifts when the answer generator is swapped from gpt\-4o\-mini to Gemini\-3\-Flash; AMem and Mem\-T are exploratory 3\-user subsets, while the other rows are full 20\-user runs\. Right: oracle\-evidence recoverability on the same 120\-probe balanced sample, showing large lifts under both oracle backbones\. Deltas and lifts are percentage points\.

Table 19: Retrieval\-interface audit on 40 user\-0 conflict probes under the audited system pipelines\. The audit illustrates differences in retrieved evidence form and should be read as an interface check, not as a substitute for the main rows\.

Table 20: Mem0 score\-visibility intervention on paired users \{0,1,2\}\. The vanilla gpt\-4o\-mini value is a paired\-subset baseline, not the full 20\-user Mem0 main\-row value\. Surfacing retrieval similarity scores reduces Mem0’s backbone\-coupling on conflict Gist, with side costs in the same audit: Gemini\-3\-Flash conflict hallucination rises from 7\.1% to 12\.1% and boundary abstention falls from 88\.7% to 68\.8%\.