WhenLoss: Diagnosing Write and Retrieval Bottlenecks in Long-Context Memory Systems

arXiv cs.CL Papers

Summary

Introduces a four-condition diagnostic protocol to identify whether failures in long-context memory systems stem from write-side compression discarding evidence or retrieval-side missing stored information. The analysis reveals write-side gaps dominate for most baselines, motivating the proposed Expected Predictive Compression (EPC) method that improves preservation of relevant evidence.

arXiv:2605.24579v1 Announce Type: new Abstract: Long-context memory systems often fail under fixed budgets, but end-to-end evaluation does not reveal whether evidence was discarded during compression or preserved but never retrieved. We introduce a four-condition diagnostic protocol that evaluates a fixed reader under truncated full context (TFC), oracle evidence (OE), complete stored memory (CSM), and retrieved memory (RM). Under this fixed-budget LongMemEval setup, write-side gaps exceed retrieval-side gaps for most tested baselines, with four of six baselines robustly write-dominant under our default diagnosis margin. Motivated by this diagnosis, we propose Expected Predictive Compression (EPC), which moves the key decision--what information to retain--to write time by using an LLM to anticipate likely future questions and preserve the minimal supporting evidence under the token budget, while leaving retrieval unchanged at question time. Across all 500 LongMemEval questions with three readers (GPT-5.2, Claude Sonnet 4, Gemini 2.5 Pro), EPC achieves the highest CSM scores among all systems (0.49 vs. 0.44 for Summary (LLM), the strongest baseline), reducing Delta_write to 0.04 while leaving Delta_retr comparable to other LLM-based systems. These results suggest that, on this benchmark and evaluation setup, improving what the write stage preserves is a key avenue for performance gains in the tested systems.

Cached at: 05/26/26, 09:04 AM

# Diagnosing Write and Retrieval Bottlenecks in Long-Context Memory Systems
Source: [https://arxiv.org/html/2605.24579](https://arxiv.org/html/2605.24579)
###### Abstract

Long\-context memory systems often fail under fixed budgets, but end\-to\-end evaluation does not reveal whether evidence was discarded during compression or preserved but never retrieved\. We introduce a four\-condition diagnostic protocol that evaluates a fixed reader under truncated full context \(TFC\), oracle evidence \(OE\), complete stored memory \(CSM\), and retrieved memory \(RM\)\. Under this fixed\-budget LongMemEval setup, write\-side gaps exceed retrieval\-side gaps for most tested baselines, with four of six baselines robustly *write\-dominant* under our default diagnosis margin\. Motivated by this diagnosis, we propose Expected Predictive Compression \(EPC\), which moves the key decision \(what information to retain\) to write time by using an LLM to anticipate likely future questions and preserve the minimal supporting evidence under the token budget, while leaving retrieval unchanged at question time\. Across all 500 LongMemEval questions with three readers \(GPT\-5\.2, Claude Sonnet 4, Gemini 2\.5 Pro\), EPC achieves the highest CSM scores among all systems \(0\.49 vs\. 0\.44 for Summary \(LLM\), the strongest baseline\), reducing $\Delta_{\text{write}}$ to 0\.04 while leaving $\Delta_{\text{retr}}$ comparable to other LLM\-based systems\. These results suggest that, on this benchmark and evaluation setup, improving what the write stage preserves is a key avenue for performance gains in the tested systems\.

Jiangnan Yu, Kisson Songqi Lin, and Jilong Wu

## 1 Introduction

A user chats with an AI assistant for three months—sharing restaurant preferences, travel plans, project deadlines\. One day they ask: “What Thai place did I mention liking in April?” The system fails\. But*why*it fails matters: did the memory system discard that detail during compression, or is the detail stored in memory but missed by retrieval? The first case calls for better compression; the second, better retrieval\. End\-to\-end accuracy—the only metric most benchmarks report—cannot distinguish the two\.

This is not a hypothetical problem\. Memory\-augmented LLMs now use a wide range of write strategies, including chunked storage \(Lewis et al\., [2020](https://arxiv.org/html/2605.24579#bib.bib5)\), session summarization \(Wang et al\., [2024](https://arxiv.org/html/2605.24579#bib.bib8)\), summary trees \(Chen et al\., [2024](https://arxiv.org/html/2605.24579#bib.bib2)\), gist pages \(Lee et al\., [2024](https://arxiv.org/html/2605.24579#bib.bib3)\), and importance\-scored banks \(Zhong et al\., [2024](https://arxiv.org/html/2605.24579#bib.bib4)\), but evaluations usually report only end\-to\-end performance\. When a system scores poorly, practitioners cannot tell whether to invest in compression or retrieval\.

We expose these stage\-level effects by evaluating the same reader under four controlled inputs that replace or bypass different parts of the memory pipeline\. The resulting four\-condition diagnostic protocol \(§[3](https://arxiv.org/html/2605.24579#S3)\) compares oracle evidence \(OE\) against complete stored memory \(CSM\) to estimate the write\-side gap, and CSM against retrieved memory \(RM\) for the retrieval\-side gap\. The protocol is operational: it localizes where performance degrades, without claiming to identify root causes\.

Applying this protocol to six baseline systems on LongMemEval \(Wu et al\., [2025](https://arxiv.org/html/2605.24579#bib.bib1)\), we find a consistent asymmetry: write\-side gaps exceed retrieval\-side gaps for most baselines, with four baselines robustly write\-dominant under our default diagnosis margin\. For these baseline operating points, more performance is lost during the write stage than during retrieval\.

This finding motivates using anticipated future questions to guide compression at write time\. The challenge is that compression happens *before* the question arrives\. A standard summarizer does not know which facts will matter later\. Expected Predictive Compression \(EPC\) \(§[4](https://arxiv.org/html/2605.24579#S4)\) addresses this by prompting the LLM to generate likely future questions about the conversation\. Those self\-generated probe questions then guide evidence selection under a fixed token budget, preserving the specific facts, dates, and preferences most likely to be needed rather than producing a generically readable summary\.

Across 500 LongMemEval questions with three readers \(GPT\-5\.2, Claude Sonnet 4, Gemini 2\.5 Pro\), EPC achieves the highest CSM score among all tested systems, with the lowest write\-side gap \($\Delta_{\text{write}} = 0.04$\)\. On a second benchmark \(LoCoMo\), EPC again has the lowest write\-side gap \($\Delta_{\text{write}} = 0.06$\)\. A cost\-matched comparison separates the gain from an additional LLM call, and a budget sweep shows the advantage is largest under tight budgets, precisely where choosing the right evidence matters most\.

## 2 Related Work

#### Long\-Context Memory Systems\.

Prior work explores a wide range of memory systems for long interactions\. MemGPT \(Packer et al\., [2024](https://arxiv.org/html/2605.24579#bib.bib10)\) introduces a virtual memory hierarchy; MemWalker \(Chen et al\., [2024](https://arxiv.org/html/2605.24579#bib.bib2)\) organizes long documents as a summary tree; ReadAgent \(Lee et al\., [2024](https://arxiv.org/html/2605.24579#bib.bib3)\) stores gists and expands them at read time; MemoryBank \(Zhong et al\., [2024](https://arxiv.org/html/2605.24579#bib.bib4)\) scores memory by importance and temporal decay; and Wang et al\. \([2024](https://arxiv.org/html/2605.24579#bib.bib8)\) train a long\-term memory module jointly with the model\. Zhang et al\. \([2024](https://arxiv.org/html/2605.24579#bib.bib9)\) survey these memory mechanisms more broadly\. Our focus is complementary: rather than proposing another end\-to\-end system, we ask how to *localize where performance degrades* in a memory pipeline\.

#### Memory Evaluation and Error Analysis\.

LongMemEval \(Wu et al\., [2025](https://arxiv.org/html/2605.24579#bib.bib1)\) provides multi\-session conversations with gold turn\-level evidence annotations, making it possible to evaluate memory systems against explicit supporting evidence\. In retrieval\-augmented generation, Xu et al\. \([2024b](https://arxiv.org/html/2605.24579#bib.bib7)\) compare retrieval\-based and long\-context approaches, but they evaluate end\-to\-end performance without separating write\-side from retrieval\-side degradation\. Our protocol fills that gap\.

#### Compression for Downstream Tasks\.

Prompt and context compression methods \(Jiang et al\., [2023](https://arxiv.org/html/2605.24579#bib.bib15); Pan et al\., [2024](https://arxiv.org/html/2605.24579#bib.bib18)\) reduce token counts while preserving general utility\. Query\-aware variants, such as RECOMP \(Xu et al\., [2024a](https://arxiv.org/html/2605.24579#bib.bib19)\) and the broader query\-focused summarization tradition \(Daumé III and Marcu, [2006](https://arxiv.org/html/2605.24579#bib.bib20); Baumel et al\., [2018](https://arxiv.org/html/2605.24579#bib.bib21)\), go further by conditioning compression on a known downstream query\. EPC instead operates before the downstream query is available\. It approximates the future\-question distribution via LLM self\-questioning and minimizes expected answer loss over likely future questions rather than optimizing generic readability or a response to a known query\.

## 3 Diagnostic Protocol

This section defines the four controlled input conditions, the resulting write\-side and retrieval\-side indicators, and the rule used to label the dominant bottleneck\.

### 3\.1 Problem Formulation

Consider a memory\-augmented QA system operating over a conversation history $H = \{s_1, s_2, \ldots, s_n\}$ consisting of $n$ sessions\. Given a question $q$ with gold answer $y$ and gold evidence turns $E_q$ drawn from $H$, the system must: \(1\) write the history into a budget\-limited memory store $M$ with capacity $B$ tokens, and \(2\) read from $M$ by retrieving content $R \subseteq M$ to provide context for answering $q$\. Here, the write stage denotes the pre\-question processing that stores, indexes, evicts, or compresses history into $M$; retrieval is the question\-time selection of content from that store\.

### 3\.2 Four Conditions

We define four evaluation conditions, each providing a different input to a fixed reader, to isolate reference performance and memory\-pipeline effects \(Figure [1](https://arxiv.org/html/2605.24579#S3.F1)\):

TFC \(Truncated Full Context\): The reader receives $H$ truncated to a fixed 32K\-token context budget\. No memory system is involved\.

OE \(Oracle Evidence\): The reader receives only the gold evidence turns $E_q$, with distractors removed and no truncation\.

CSM \(Complete Stored Memory\): The reader receives all content in memory $M$ after the write stage\.

RM \(Retrieved Memory\): The reader receives the retrieved subset $R \subseteq M$\.

Two gaps localize where performance drops within the memory pipeline: OE $\to$ CSM defines the write\-side gap, and CSM $\to$ RM defines the retrieval\-side gap\. Two caveats apply\. First, the OE–TFC gap mixes distractor removal with the imposed context budget, so OE $>$ TFC does not isolate either effect\. Second, the OE $\to$ CSM gap should be read as an upper bound on write\-side degradation: it can also reflect format mismatch between stored memory and reader expectations, or loss of contextual cues\. These gaps are therefore *operational* indicators that localize where performance degrades, not causal attributions to a single mechanism\. While each condition is individually simple, prior evaluations report only end\-to\-end scores \(RM or equivalent\), making it impossible to determine whether failures originate in compression or retrieval\. The protocol’s contribution is combining these conditions into a reusable diagnostic that any memory system can run without architectural changes\.
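The four conditions differ only in what is placed in the reader's context. The sketch below makes that concrete under simplifying assumptions: the history is a flat list of tokens, the other inputs are lists of text units, and the names `truncate_tail` and `build_conditions` are hypothetical rather than taken from the paper's code. The tail truncation for TFC follows the implementation details given later in the paper (the most recent 32K tokens are kept).

```python
from typing import Dict, List


def truncate_tail(tokens: List[str], budget: int) -> List[str]:
    """Keep only the most recent `budget` tokens (the tail of the history)."""
    # tokens[-0:] would return the whole list, so guard the zero-budget case.
    return tokens[-budget:] if budget > 0 else []


def build_conditions(
    history_tokens: List[str],   # full history H, already tokenized (assumption)
    gold_evidence: List[str],    # gold evidence turns E_q
    memory: List[str],           # everything stored in M after the write stage
    retrieved: List[str],        # retrieved subset R of M
    tfc_budget: int = 32_000,
) -> Dict[str, List[str]]:
    """Return the reader input for each of the four diagnostic conditions."""
    return {
        "TFC": truncate_tail(history_tokens, tfc_budget),
        "OE": list(gold_evidence),
        "CSM": list(memory),
        "RM": list(retrieved),
    }
```

Each returned context would then be passed, one at a time, to the same fixed reader and scored with the same metric.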

### 3\.3 Bottleneck Indicators

For a scoring metric $\phi$, we use the two gaps as *operational bottleneck indicators*:

$$\Delta_{\text{write}} = \phi(\text{OE}) - \phi(\text{CSM}) \tag{1}$$

$$\Delta_{\text{retr}} = \phi(\text{CSM}) - \phi(\text{RM}) \tag{2}$$

These are additive: $\phi(\text{OE}) - \phi(\text{RM}) = \Delta_{\text{write}} + \Delta_{\text{retr}}$\. For compact notation, we write $\Delta_{w} = \Delta_{\text{write}}$ and $\Delta_{r} = \Delta_{\text{retr}}$ below\.

### 3\.4 Diagnosis Rule

With margin $\epsilon = 0.02$:

$$\text{diagnosis} = \begin{cases} \text{Write} & \text{if } \Delta_{w} > \Delta_{r} + \epsilon \\ \text{Retrieval} & \text{if } \Delta_{r} > \Delta_{w} + \epsilon \\ \text{Mixed} & \text{otherwise} \end{cases} \tag{3}$$

The margin only affects near\-balanced systems; four baselines remain unambiguously Write under $\epsilon \in [0, 0.05]$, while Summary \(LLM\) and ReadAgent lie near the boundary\.
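As a worked illustration, the two indicators and the diagnosis rule can be written in a few lines. This is a minimal sketch; the function names are hypothetical. The example values use the Verbatim Chunk numbers reported in the results (OE = 0.53, write-side gap 0.31, retrieval-side gap 0.04), from which CSM = 0.22 and RM = 0.18 are derived here by subtraction.

```python
from typing import Tuple


def bottleneck_gaps(oe: float, csm: float, rm: float) -> Tuple[float, float]:
    """Return (delta_write, delta_retr) from the three condition scores."""
    delta_write = oe - csm   # Eq. (1): loss between oracle evidence and stored memory
    delta_retr = csm - rm    # Eq. (2): loss between stored memory and retrieved memory
    return delta_write, delta_retr


def diagnose(delta_w: float, delta_r: float, eps: float = 0.02) -> str:
    """Label the dominant bottleneck with margin eps, following Eq. (3)."""
    if delta_w > delta_r + eps:
        return "Write"
    if delta_r > delta_w + eps:
        return "Retrieval"
    return "Mixed"


# Verbatim Chunk: CSM and RM are derived from the reported gaps (see lead-in).
dw, dr = bottleneck_gaps(oe=0.53, csm=0.22, rm=0.18)
label = diagnose(dw, dr)  # write-side gap far exceeds retrieval-side gap
```

Because the two gaps sum to the total OE-to-RM drop, the diagnosis only says which stage accounts for more of that drop, not why.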

Figure 1: The four\-condition diagnostic protocol\. The history $H$ is written into memory $M$, from which $R$ is retrieved; gold evidence $E_q$ is drawn from $H$\. Each condition \(TFC, OE, CSM, RM\) is evaluated one at a time by the same reader; $\Delta_{\text{write}} = \phi(\text{OE}) - \phi(\text{CSM})$ and $\Delta_{\text{retr}} = \phi(\text{CSM}) - \phi(\text{RM})$\.

## 4 Expected Predictive Compression

Consider LLM summarization, the strongest query\-agnostic compression baseline among our tested systems \(Table [3](https://arxiv.org/html/2605.24579#S6.T3)\)\. It compresses 121K tokens of conversation into 5K tokens of fluent text, yet even when the reader has access to all stored memory, fewer than half the questions are answered correctly \(CSM = \.44\)\. Which information is lost? The summarizer preserves broadly salient content \(topics discussed, decisions made\) but omits specific dates, entity names, and preference details that downstream questions target\. This motivates a compression objective that prioritizes supporting evidence, not just readable summaries\.

We propose Expected Predictive Compression \(EPC\), which shifts the compression objective accordingly\.

### 4\.1 Formulation

Given a conversation segment $x$ \(a session in our experiments\) and a token budget $B$, let $Q(x)$ denote a set of possible future questions that may require information from $x$, with $w(q)$ as each question’s estimated likelihood\. Let $A(c, q)$ denote the reader’s answer to question $q$ given context $c$, and $\mathcal{L}$ a loss function measuring answer degradation\. EPC seeks the compressed memory $m^{*}$ that minimizes expected answer loss:

$$m^{*} = \operatorname*{arg\,min}_{|m| \le B} \sum_{q \in Q(x)} w(q) \cdot \mathcal{L}\big(A(x, q),\; A(m, q)\big) \tag{4}$$

where the weighted sum approximates the expectation over the future\-question distribution\. This is loosely motivated by rate–distortion theory \(Cover and Thomas, [2006](https://arxiv.org/html/2605.24579#bib.bib14)\), where the budget $B$ plays the role of a rate constraint and the distortion is future\-question answer loss rather than surface\-level reconstruction error\. The actual implementation uses a greedy heuristic \(Eq\. [5](https://arxiv.org/html/2605.24579#S4.E5)\) rather than formal rate–distortion optimization; the connection is conceptual rather than algorithmic\.

### 4\.2 LLM Self\-Questioning EPC

Since $Q(x)$ is unknown at write time, we approximate it via LLM self\-questioning: the LLM first generates probe questions that the segment is likely to be queried about, then uses them to guide evidence selection\. The implementation treats those generated probe questions as an unweighted approximation to $Q(x)$\. Figure [2](https://arxiv.org/html/2605.24579#S4.F2) illustrates the procedure:

Figure 2: The EPC write pipeline, from conversation segment $x$ to the memory $m^{*}$ passed to retrieval\. \(1\) Self\-question: generate likely downstream probe questions $\{q_1, \ldots, q_k\}$ before compression begins\. \(2\) Evidence identification: trace the minimal supporting evidence \(exact spans, entities, and evidence units $\{e_i\}$\)\. \(3\) Merge and select: write structured \[Q\]\[E\]\[S\] entries within the budget of $B$ tokens\.

#### Step 1: Generate probe questions\.

Given segment $x$, prompt the LLM to generate $k = 5$ likely future questions targeting factual details, preferences, plans, temporal information, and state changes\.

#### Step 2: Identify supporting evidence\.

For each probe question $q_i$, the LLM identifies the minimal supporting evidence: specific turns, spans, and entities\.

#### Step 3: Merge, score, and select\.

Overlapping evidence spans are merged, and each evidence unit $e$ receives a utility score:

$$u(e) = \alpha \cdot \text{coverage}(e) + \beta \cdot \text{specificity}(e) - \lambda \cdot \text{redundancy}(e) \tag{5}$$

where coverage counts how many probe questions $e$ supports; specificity rewards named entities, dates, and numbers; and redundancy penalizes overlap with already\-selected evidence\. At each greedy selection step, units are ranked by the current $u(e)$ until the budget $B$ is exhausted, and each selected unit is written as a structured memory entry:

```
[Q] What food does the user prefer?
[E] User prefers Thai over Italian.
[S] session_12_turn_3
```
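The greedy selection in Step 3 can be sketched as follows. The paper defines the three utility terms only at a high level, so the concrete scoring functions here are assumptions made for illustration: specificity is the fraction of tokens that contain a digit or start with an uppercase letter, redundancy is the maximum Jaccard overlap with already-selected units, and token counts use whitespace splitting. Skipping a unit that no longer fits the budget, rather than stopping, is also an assumption. All names (`EvidenceUnit`, `greedy_select`, and so on) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class EvidenceUnit:
    text: str
    supported_probes: Set[int]  # indices of the probe questions this unit supports


def n_tokens(text: str) -> int:
    return len(text.split())  # whitespace tokens (assumption)


def specificity(text: str) -> float:
    # Assumed proxy for "named entities, dates, and numbers".
    toks = text.split()
    hits = sum(1 for t in toks if any(c.isdigit() for c in t) or t[:1].isupper())
    return hits / max(len(toks), 1)


def redundancy(text: str, selected: List[EvidenceUnit]) -> float:
    # Assumed proxy: highest Jaccard overlap with any already-selected unit.
    a = set(text.lower().split())
    best = 0.0
    for s in selected:
        b = set(s.text.lower().split())
        union = a | b
        if union:
            best = max(best, len(a & b) / len(union))
    return best


def utility(e, selected, alpha=1.0, beta=0.5, lam=0.3) -> float:
    # Eq. (5) with the pilot-selected weights reported in the paper.
    coverage = len(e.supported_probes)
    return alpha * coverage + beta * specificity(e.text) - lam * redundancy(e.text, selected)


def greedy_select(units: List[EvidenceUnit], budget: int) -> List[EvidenceUnit]:
    selected: List[EvidenceUnit] = []
    remaining = list(units)
    used = 0
    while remaining:
        # Re-rank every step: redundancy depends on what has been selected so far.
        remaining.sort(key=lambda e: utility(e, selected), reverse=True)
        best = remaining.pop(0)
        cost = n_tokens(best.text)
        if used + cost > budget:
            continue  # does not fit; try the next-best unit (assumption)
        selected.append(best)
        used += cost
    return selected
```

With a tight budget, a near-duplicate of an already-selected unit is penalized by the redundancy term and tends to lose out to a unit covering a different probe question.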

#### Design rationale\.

The design assumes that LLM\-generated probe questions can cover common future question targets such as preferences, facts, and temporal events, so even an approximate $Q(x)$ can direct compression toward useful evidence\. Unlike conventional summarization, which optimizes for readability, EPC prioritizes supporting evidence, producing structured, information\-dense output that preserves exact entities, dates, and numbers\.

## 5 Experimental Setup

This section fixes the dataset, memory systems, model configurations, retrieval setup, and metrics used for the main LongMemEval experiments\.

### 5\.1 Dataset

We evaluate on all 500 questions from LongMemEval \(Wu et al\., [2025](https://arxiv.org/html/2605.24579#bib.bib1)\)\. The benchmark is well suited to our setting because it pairs multi\-session conversations averaging 121K tokens \($\sim$50 sessions per conversation\) with gold turn\-level evidence annotations\. We keep the dataset’s original session boundaries throughout\.

### 5\.2 Memory Systems

Table [1](https://arxiv.org/html/2605.24579#S5.T1) lists the seven memory systems we compare\. The baselines span raw storage, heuristic summarization, LLM\-based compression, hierarchical memory, and importance\-weighted memory to cover diverse write and retrieval designs\. For controlled comparison, EPC shares the same budgets and embedding top\-$k$ retriever as the Summary systems and the same writer model as Summary \(LLM\)\. Because CSM removes retrieval, EPC’s CSM gains over Summary \(LLM\) reflect the write\-side compression strategy rather than retriever choice\.

#### Baseline provenance\.

Verbatim Chunk and the two Summary variants are implemented directly in our codebase\. MemWalker, ReadAgent, and MemoryBank were reimplemented following the procedures described in the original papers \(Chen et al\., [2024](https://arxiv.org/html/2605.24579#bib.bib2); Lee et al\., [2024](https://arxiv.org/html/2605.24579#bib.bib3); Zhong et al\., [2024](https://arxiv.org/html/2605.24579#bib.bib4)\) and adapted to the session\-based LongMemEval format: MemWalker constructs a summary tree over sessions using the same writer model \(GPT\-5\.2\); ReadAgent stores gist pages per session and expands them at read time; MemoryBank scores memories by importance with temporal decay\. None of these systems provide official LongMemEval evaluation code, so reimplementation differences may affect absolute scores; however, all systems use the same dataset, budget, and readers, so relative comparisons within our evaluation are controlled\. Full prompts, hyperparameters, baseline adaptation details, and significance\-testing procedures are provided in Appendix [A](https://arxiv.org/html/2605.24579#A1)\.

Table 1: Seven memory systems\. EPC shares the same budgets and embedding retriever with the Summary systems, and shares the writer model with Summary \(LLM\); CSM comparisons remove retrieval and isolate the write\-side compression strategy\.

### 5\.3 Implementation Details

#### Readers\.

We use three readers: GPT\-5\.2 \(gpt\-5\.2\), Claude Sonnet 4 \(claude\-sonnet\-4\-20250514\), and Gemini 2\.5 Pro \(gemini\-2\.5\-pro\)\. All readers use temperature 0, max output tokens 200, and a simple prompt: “Based on the following context, answer the question\. If the answer cannot be determined, say ‘I don’t know’\.” The context field is populated with the truncated full history \(TFC, fixed 32K\-token budget\), gold evidence turns \(OE\), complete stored memory \(CSM\), or retrieved memory \(RM\)\. For TFC, truncation retains the *most recent* 32K tokens \(tail of the conversation\), discarding earlier sessions\. This is a budget\-limited long\-context baseline, not an unconstrained full\-context reference\.

#### Memory budgets and chunking\.

Write budget $B$ = 5,000 tokens; read budget 5,000 tokens\. Token counts use the cl100k\_base tokenizer\. For Verbatim Chunk, conversations are split into 200\-token chunks with no overlap; FIFO eviction drops the oldest chunks when exceeding $B$\. For Summary systems and EPC, each session is compressed independently; per\-session budgets are allocated proportionally to the square root of session length\.
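Both budget mechanisms described above are simple enough to sketch. The code below is an illustration under assumptions, not the authors' implementation: the function names are hypothetical, per-session budgets are rounded down to whole tokens, and the input to the FIFO store is an already-tokenized list.

```python
import math
from collections import deque
from typing import List, Sequence


def allocate_budgets(session_lengths: Sequence[int], total_budget: int = 5000) -> List[int]:
    """Split the write budget across sessions in proportion to sqrt(session length)."""
    roots = [math.sqrt(n) for n in session_lengths]
    total = sum(roots)
    if total == 0:
        return [0] * len(session_lengths)
    # Rounding down is an assumption; it guarantees the total is never exceeded.
    return [int(total_budget * r / total) for r in roots]


def fifo_store(tokens: Sequence, chunk_size: int = 200, budget: int = 5000) -> List[list]:
    """Verbatim Chunk storage: fixed-size chunks, oldest evicted once over budget."""
    store: deque = deque()
    used = 0
    for start in range(0, len(tokens), chunk_size):
        chunk = list(tokens[start:start + chunk_size])
        store.append(chunk)
        used += len(chunk)
        while used > budget and store:
            used -= len(store.popleft())  # drop the oldest chunk first
    return list(store)
```

Square-root allocation gives long sessions more room than short ones while growing more slowly than a strictly proportional split, so short sessions are not starved. FIFO eviction, by contrast, keeps only the most recent part of the conversation.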

#### Retrieval\.

Verbatim Chunk, Summary \(Extractive\), Summary \(LLM\), and EPC all use the same retriever: all\-MiniLM\-L6\-v2 sentence embeddings \(Reimers and Gurevych, [2019](https://arxiv.org/html/2605.24579#bib.bib13); Wang et al\., [2020](https://arxiv.org/html/2605.24579#bib.bib16)\) with cosine similarity, returning the top $k = 5$ chunks/entries\. MemWalker uses its own tree navigation; ReadAgent uses lookup expansion; MemoryBank uses importance\-weighted re\-ranking over the same embedding scores\.
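The shared retriever reduces to ranking memory entries by cosine similarity to the question embedding and keeping the top five. The sketch below shows only that ranking step on precomputed vectors; in the paper the vectors come from all-MiniLM-L6-v2, whereas here they are plain lists of floats and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_u * norm_v)


def top_k(query_vec: Sequence[float], entry_vecs: List[Sequence[float]], k: int = 5) -> List[int]:
    """Return indices of the k entries most similar to the query, best first."""
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(entry_vecs)]
    # Sort by descending similarity; break ties by original position.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [i for _, i in scored[:k]]
```

The RM condition then passes the entries at the returned indices to the reader, while CSM bypasses this step and passes everything in memory.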

#### Write\-side LLM configuration\.

The directly compared LLM compression methods \(Summary \(LLM\), EPC, and the 2\-pass baseline\) use the same model \(GPT\-5\.2, temperature 0\) for compression, ensuring that CSM differences among them reflect the compression *strategy*, not the model\. Summary \(LLM\) uses a single call per session instructing the model to summarize under the token budget while preserving names, dates, numbers, preferences, and decisions, without guessing future questions\. The 2\-pass baseline uses the same first pass, then a second call asking the model to improve information density by replacing vague references with specific facts\. EPC uses two calls: Step 1 generates 5 probe questions; Step 2 identifies supporting evidence \(see §[4](https://arxiv.org/html/2605.24579#S4)\)\. Utility weights in Eq\. [5](https://arxiv.org/html/2605.24579#S4.E5): $\alpha = 1.0$, $\beta = 0.5$, $\lambda = 0.3$, selected in a 20\-example pilot before the final evaluation\. Full prompt templates for all three methods are provided in Appendix [A](https://arxiv.org/html/2605.24579#A1)\.

### 5\.4 Metrics

We report Contains Match \(CM\): whether the generated answer contains the gold answer substring\. We also report Token F1: token\-level precision/recall between generated and gold answers\. Both are standard LongMemEval metrics \(Wu et al\., [2025](https://arxiv.org/html/2605.24579#bib.bib1)\)\. All scores are averaged over 500 questions\. For the 3\-reader tables, we report the mean across readers\. TFC and OE are shared reference conditions, computed once per reader and reused across all systems\.

CM is a strict substring test; we verify in Appendix [A](https://arxiv.org/html/2605.24579#A1) that CM and F1 agree directionally in 95\.5% of pairwise comparisons, and system rankings are identical under both metrics in all tables\.

## 6 Results
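For reference, both metrics can be sketched in a few lines. The normalization choices here (lowercasing and whitespace tokenization) are assumptions, since the exact normalization is not stated in this section, and the function names are hypothetical.

```python
from collections import Counter


def contains_match(generated: str, gold: str) -> bool:
    """CM: True if the gold answer appears as a substring of the generated answer."""
    # Case-insensitive comparison is an assumption.
    return gold.strip().lower() in generated.lower()


def token_f1(generated: str, gold: str) -> float:
    """Token F1: harmonic mean of token-level precision and recall."""
    pred = generated.lower().split()
    ref = gold.lower().split()
    if not pred or not ref:
        return float(pred == ref)  # 1.0 only when both are empty
    overlap = sum((Counter(pred) & Counter(ref)).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

CM rewards any answer that includes the gold string, however verbose, while Token F1 penalizes extra tokens through precision; this is one reason the two metrics can disagree on individual answers even when system rankings match.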

We first verify that the reference conditions make the diagnostic meaningful, then report the main write–retrieval decomposition\. The remaining subsections test whether the diagnosis is supported by reader\-independent evidence preservation, controlled perturbations, component ablations, cost matching, cross\-benchmark generalization, and budget sensitivity\.

### 6\.1 Reader\-Level Reference Conditions

Table 2: Reader\-level reference conditions on LongMemEval\. OE $\gg$ TFC across all readers\.

Table [2](https://arxiv.org/html/2605.24579#S6.T2) shows that all three readers exhibit OE $\gg$ TFC: CM rises from 0\.18 to 0\.53 and F1 from 0\.29 to 0\.63\. This establishes a strong oracle\-evidence reference and a budget\-limited truncated\-context reference for the diagnostic\. It does not decompose the OE–TFC difference, which still mixes distractor effects with the reader’s finite context window\.

### 6\.2 Diagnostic Results

Table 3: Diagnostic results \(CM, 3\-reader avg, $B$ = 5K\)\. Refs: TFC = 0\.18, OE = 0\.53\. EPC CSM CI: \[\.46, \.52\], $p < .01$ vs Summary \(LLM\)\. Token F1 yields identical system ordering\.

Table [3](https://arxiv.org/html/2605.24579#S6.T3) and Figure [3](https://arxiv.org/html/2605.24579#S6.F3) together reveal a clear spectrum of write\-side gaps\. Verbatim Chunk shows the largest write\-side gap \($\Delta_{w}$ = \.31\), with a much smaller retrieval\-side gap \($\Delta_{r}$ = \.04\)\. Progressing from weaker to stronger write strategies \(Summary \(Extractive\), MemoryBank, and MemWalker\), the write\-side gap gradually shrinks but remains dominant\. Summary \(LLM\) and ReadAgent substantially reduce the write\-side gap, but still lose 0\.09–0\.10 absolute CM points relative to OE during the write stage\. EPC further reduces $\Delta_{w}$ to \.04, leaving retrieval as the larger remaining gap\. Appendix [A\.6](https://arxiv.org/html/2605.24579#A1.SS6) illustrates this pattern with concrete examples showing which entities, dates, and negations are lost under query\-agnostic summarization and preserved by EPC\.

Under the diagnosis rule \(§[3](https://arxiv.org/html/2605.24579#S3)\), four baselines are robustly write\-dominant; Summary \(LLM\) is write\-dominant in the aggregate but near the margin, and ReadAgent is mixed/near\-boundary\. EPC is the only system diagnosed as retrieval\-dominant\. EPC’s CSM of \.49 CM \(\.62 F1\) has a 95% bootstrap CI of \[\.46, \.52\] and significantly exceeds Summary \(LLM\) under paired bootstrap \($p < 0.01$\)\.
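The paper's exact significance-testing procedure is deferred to its appendix, so the following is only a generic sketch of a percentile bootstrap confidence interval and a one-sided paired bootstrap test over per-question scores. The resample count, the seed, and the function names are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_ci(scores: Sequence[float], n_resamples: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean of per-question scores."""
    rng = random.Random(seed)
    n = len(scores)
    means: List[float] = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


def paired_bootstrap_p(a: Sequence[float], b: Sequence[float],
                       n_resamples: int = 1000, seed: int = 0) -> float:
    """Fraction of resamples in which system A does not beat system B."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]  # paired per-question differences
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        if sum(rng.choices(diffs, k=n)) <= 0:
            not_better += 1
    return not_better / n_resamples
```

Resampling the paired differences, rather than each system independently, keeps question difficulty matched between the two systems, which is what makes the comparison sensitive to small but consistent gains.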

![Refer to caption](https://arxiv.org/html/2605.24579v1/x1.png)

Figure 3: Write\-side gap \($\Delta_{w}$ = OE − CSM, coral\) vs\. retrieval\-side gap \($\Delta_{r}$ = CSM − RM, blue\) for all seven systems \(CM, 3\-reader avg, $B$ = 5K\)\. Numbers at right: total OE $\to$ RM drop\. EPC is highlighted in green, with hatching marking its retrieval\-side gap, and has the smallest write\-side gap\.

EPC remains the highest\-CSM system for each reader individually \(GPT\-5\.2, Claude Sonnet 4, and Gemini 2\.5 Pro\), so the aggregate result is not driven by a single reader\.

### 6\.3 Reader\-Independent Evidence Preservation

The diagnostic indicators \($\Delta_{\text{write}}$, $\Delta_{\text{retr}}$\) are based on *answer correctness*, which can mix evidence preservation, memory format, and reader ability\. As a reader\-independent check on the write\-side diagnosis, we measure whether gold evidence survives in memory\.

For each question, LongMemEval provides gold evidence turns\. We compute two recall metrics over complete stored memory \(CSM\) and retrieved memory \(RM\)\. Turn recall: fraction of gold evidence turns preserved in memory \(token\-level Jaccard overlap $> 0.5$ with the closest memory segment\)\. Span recall: fraction of gold answer entities \(names, dates, numbers\) found as exact substrings in memory\. Both metrics are heuristic and intended as a reader\-independent complement to the diagnostic indicators, not as a standalone evaluation\.
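The two recall metrics can be sketched directly from their definitions. Lowercased whitespace tokenization for the Jaccard overlap is an assumption, as are the function names; span recall uses a case-sensitive exact substring test, following the "exact substrings" wording.

```python
from typing import Sequence


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard overlap between two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def turn_recall(gold_turns: Sequence[str], memory_segments: Sequence[str],
                threshold: float = 0.5) -> float:
    """Fraction of gold turns whose closest memory segment overlaps above the threshold."""
    if not gold_turns:
        return 0.0
    kept = sum(
        1 for turn in gold_turns
        if any(jaccard(turn, seg) > threshold for seg in memory_segments)
    )
    return kept / len(gold_turns)


def span_recall(gold_entities: Sequence[str], memory_segments: Sequence[str]) -> float:
    """Fraction of gold answer entities found verbatim somewhere in memory."""
    if not gold_entities:
        return 0.0
    memory_text = "\n".join(memory_segments)
    found = sum(1 for entity in gold_entities if entity in memory_text)
    return found / len(gold_entities)
```

Running the same functions on the full store and on the retrieved subset gives the CSM and RM recall values, respectively, without involving any reader.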

![Refer to caption](https://arxiv.org/html/2605.24579v1/x2.png)

Figure 4: Evidence recall \(CSM and RM\)\. EPC preserves \+\.15 more gold answer entities than Summary \(LLM\) in CSM span recall\.

Figure [4](https://arxiv.org/html/2605.24579#S6.F4) provides the reader\-independent check: CSM span recall mirrors the write\-side ranking in Table [3](https://arxiv.org/html/2605.24579#S6.T3), and EPC preserves 70% of gold answer entities versus 55% for Summary \(LLM\)\. Its CSM $\to$ RM span\-recall drop \(\.70 $\to$ \.60\) also matches the retrieval\-side diagnosis, indicating that EPC’s remaining gap is associated with losing preserved spans during retrieval\.

### 6\.4 Protocol Validation via Controlled Degradation

As a validation check, we evaluate Summary \(LLM\) under five controlled settings: the unmodified system, mild and severe write\-side degradation \(randomly dropping 25% or 50% of memory entries after the write stage\), and mild and severe retrieval\-side degradation \(replacing top\-ranked retrieved entries with entries from the bottom\-ranked 25% or 50% of the ranked candidate list\)\. We run each setting with all three readers, yielding 15 setting–reader runs\. Figure [5](https://arxiv.org/html/2605.24579#S6.F5) shows that $\Delta_{w}$ responds selectively to write\-side degradation and $\Delta_{r}$ to retrieval\-side degradation; the degraded variants are diagnosed correctly\.
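The two perturbations can be sketched as follows. Details the text leaves open are filled with assumptions: the retrieval-side variant replaces all of the top-$k$ entries (the paper does not say how many are replaced), sampling is uniform with a fixed seed, and the function names are hypothetical.

```python
import random
from typing import List, Sequence


def degrade_write(memory: Sequence[str], drop_frac: float, seed: int = 0) -> List[str]:
    """Randomly drop a fraction of stored entries after the write stage."""
    rng = random.Random(seed)
    n_drop = int(round(drop_frac * len(memory)))
    keep_idx = sorted(rng.sample(range(len(memory)), len(memory) - n_drop))
    return [memory[i] for i in keep_idx]  # surviving entries keep their order


def degrade_retrieval(ranked: Sequence[str], k: int = 5,
                      bottom_frac: float = 0.25, seed: int = 0) -> List[str]:
    """Return k entries drawn from the bottom-ranked fraction of the candidate list."""
    rng = random.Random(seed)
    n_bottom = max(1, int(round(bottom_frac * len(ranked))))
    pool = list(ranked[-n_bottom:])  # ranked is ordered best first
    return rng.sample(pool, min(k, len(pool)))
```

Write-side degradation changes what CSM and RM both see, so it should move the write-side gap; retrieval-side degradation leaves the store untouched and changes only RM, so it should move the retrieval-side gap.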

![Refer to caption](https://arxiv.org/html/2605.24579v1/x3.png)

Figure 5: Controlled degradation \(3 readers\)\. $\Delta_{w}$ \(write\-side gap = OE − CSM\) responds selectively to write\-side degradation; $\Delta_{r}$ \(retrieval\-side gap = CSM − RM\) responds selectively to retrieval\-side degradation\.

### 6\.5 EPC Breakdown: Question Type and Probe Alignment

We next ask where EPC helps most: across question types, and as a function of how closely its probe questions match the held\-out test question\.

![Refer to caption](https://arxiv.org/html/2605.24579v1/x4.png)

Figure 6: EPC vs\. Summary \(LLM\) CSM \(CM, 3\-reader avg, $B$ = 5K\) by question type \(left\) and probe–question alignment \(right\)\. EPC gains are shown above bars; the largest gain \(\+\.10\) occurs when probe questions align with test questions\.

Figure [6](https://arxiv.org/html/2605.24579#S6.F6) shows that EPC improves CSM on single\-session, multi\-session, and temporal questions, with the largest gain on single\-session questions \(\+\.06\)\. We measure probe–question alignment as the maximum cosine similarity between the test question and EPC’s generated probe questions, using the same all\-MiniLM\-L6\-v2 embeddings as retrieval\. Splitting questions at the median alignment score shows a larger gain when probes align with the test question: \+\.10 CSM when probe questions align and \+\.04 when they do not\. One explanation for the remaining gain on unaligned questions is that self\-questioning encourages broader coverage of high\-value information types \(entities, dates, preferences\), which can benefit even unanticipated questions\.

### 6\.6 Component Ablation and Weight Sensitivity

We then isolate which EPC components account for the gain and check sensitivity to probe count and utility weights\.

![Refer to caption](https://arxiv.org/html/2605.24579v1/x5.png)

Figure 7: Component ablations and probe-count sensitivity (CSM CM, $B$=5K). Removing self-questioning (−.05) is the largest drop; utility-weight sensitivity is reported below.

Figure [7](https://arxiv.org/html/2605.24579#S6.F7) identifies self-questioning as the largest contributor among the tested components: removing it drops CSM by .05, while random probe questions recover only +.01 over Summary (LLM). $k$=5 probe questions suffice; $k$=10 provides no additional benefit.

#### Utility weight sensitivity\.

The utility function (Eq. [5](https://arxiv.org/html/2605.24579#S4.E5)) uses weights $\alpha$=1.0, $\beta$=0.5, $\lambda$=0.3 selected in a 20-example pilot. Across six alternative configurations, CSM stays within .45–.50; reducing the specificity weight $\beta$ hurts most (−.04), but all variants match or exceed Summary (LLM), suggesting that the self-questioning mechanism contributes more to EPC’s gains than the particular weight settings. This low sensitivity to weight configurations also reduces the risk that the reported gains depend on this small pilot choice.

### 6.7 Cost-Matched Comparison

EPC uses two LLM calls per session versus one for Summary (LLM). To separate the effect of an additional LLM call from the effect of probe-question generation, we introduce a 2-pass baseline that also uses two calls but without self-questioning: Pass 1 compresses normally, and Pass 2 performs query-agnostic refinement for information density.

Table 4: Cost-matched comparison (CM, GPT-5.2). The 2-pass baseline recovers +.02 over 1-pass Summary (LLM); EPC adds a further +.03 under the same two-call budget.

Table [4](https://arxiv.org/html/2605.24579#S6.T4) shows that a second query-agnostic refinement pass improves CSM by +.02, while EPC gains +.05 over 1-pass Summary (LLM). The remaining +.03 separates EPC from the cost-matched 2-pass baseline on this reader. The same ordering (EPC > 2-pass > 1-pass) holds for the other two readers. Relative to the 1-pass Summary (LLM) baseline, EPC roughly doubles write latency (∼30s → ∼60s per session in our implementation) while roughly halving $\Delta_{\text{write}}$.

### 6.8 Cross-Benchmark Generalization: LoCoMo

To test generalization, we evaluate on LoCoMo (Maharana et al., [2024](https://arxiv.org/html/2605.24579#bib.bib22)), a multi-session benchmark with 348 QA pairs across 10 conversations, including two question categories absent from LongMemEval (elaboration and adversarial). For LoCoMo, we report a two-reader average using GPT-5.2 and Claude Sonnet 4.

LoCoMo operates in a *different regime*: conversations average ∼9K tokens, fitting entirely within the 32K context budget. Consequently, the TFC condition effectively becomes a *full-context* baseline (no truncation), and the OE–TFC gap reflects distractor interference alone, not information loss from truncation. We set the write budget to $B$=2,000 tokens to create meaningful compression pressure (22% retention), while noting that this is less aggressive compression than $B$=5,000 on LongMemEval’s 121K-token contexts.
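
The retention figures implied by these budgets follow from dividing the write budget by the average context length. The LoCoMo ratio reproduces the 22% quoted in the text; the LongMemEval ratio is not stated explicitly and is computed here from the 121K-token figure, so it should be read as an approximation.

```latex
\[
\text{retention}_{\text{LoCoMo}} \approx \frac{2{,}000}{9{,}000} \approx 0.22,
\qquad
\text{retention}_{\text{LongMemEval}} \approx \frac{5{,}000}{121{,}000} \approx 0.04 .
\]
```

Under these numbers the LoCoMo writer keeps roughly five times as large a share of the input as the LongMemEval writer, which is the sense in which its compression is less aggressive.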

Table 5: LoCoMo results (CM, $B$=2K, 2-reader avg). Refs: TFC=0.46, OE=0.61. The CM ranking pattern matches LongMemEval.

Table [5](https://arxiv.org/html/2605.24579#S6.T5) shows a similar ranking pattern in this no-truncation reference setting: Verbatim Chunk through MemWalker have $\Delta_{\text{w}} > \Delta_{\text{r}}$, EPC has the lowest $\Delta_{\text{w}}$ and is the only system where $\Delta_{\text{r}} > \Delta_{\text{w}}$, and the EPC–Summary (LLM) CSM gap is +.04. The smaller OE–TFC gap (.15 CM vs. .35 on LongMemEval) is consistent with truncation contributing to the LongMemEval reference gap, though the protocol remains informative without truncation. By category, EPC gains most on factual questions (+.06 CSM), moderately on temporal/elaboration (+.04), and least on inferential/adversarial cases (+.02).

### 6.9 Budget Sensitivity on LongMemEval

One question raised by the rate–distortion framing (Section [4](https://arxiv.org/html/2605.24579#S4)) is whether self-questioning matters more under tight budget constraints, where choosing the right evidence is critical, than under relaxed budgets. We test this by sweeping the write budget $B \in \{2\text{K}, 5\text{K}, 10\text{K}, 20\text{K}\}$ on all 500 LongMemEval questions with three readers, while keeping the read budget fixed at 5K tokens.

Table 6: Write-budget sweep (CM, 3-reader avg; read budget fixed at 5K). Refs: TFC=.18, OE=.53. W/R denote write-/retrieval-dominant diagnoses. EPC’s advantage shrinks from +.11 (2K) to +.01 (20K).

Table [6](https://arxiv.org/html/2605.24579#S6.T6) shows that EPC’s CSM advantage over Summary (LLM) shrinks from +.11 at 2K to +.01 at 20K as both systems approach the OE upper bound (.53). The relative size of the two gaps also shifts with budget: EPC reaches $\Delta_{\text{r}} > \Delta_{\text{w}}$ at 5K, Summary (LLM) at 10K, while Verbatim Chunk and Summary (Extractive) have $\Delta_{\text{w}} > \Delta_{\text{r}}$ throughout.

## 7 Analysis

#### Operating point matters\.

The budget sweep shows that the diagnosis is not an intrinsic property of an architecture alone: as $B$ grows, EPC becomes retrieval-dominant at 5K and Summary (LLM) at 10K, while Verbatim Chunk and Summary (Extractive) remain write-dominant throughout. Thus the protocol should be run under the target budget, reader, and question distribution rather than inferred from the system family.
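
Running the protocol at a given operating point reduces to computing two gaps and comparing them. The sketch below uses the gap definitions from the Figure 5 caption (write-side gap = OE − CSM, retrieval-side gap = CSM − RM). The exact form of the margin rule, the function name, and the `eps` value a practitioner would pass are assumptions for illustration, not taken from the paper.

```python
def diagnose(oe, csm, rm, eps):
    """Classify a system at one operating point (budget, reader, question set).

    oe, csm, rm: scores of the same reader under oracle evidence,
    complete stored memory, and retrieved memory.
    eps: diagnosis margin (assumed rule: the larger gap must exceed the
    smaller one by more than eps to count as dominant).
    """
    delta_w = oe - csm   # evidence lost at write time
    delta_r = csm - rm   # evidence stored but not retrieved
    if delta_w - delta_r > eps:
        label = "write-dominant"
    elif delta_r - delta_w > eps:
        label = "retrieval-dominant"
    else:
        label = "near-margin"
    return delta_w, delta_r, label
```

Because all three inputs depend on the budget and reader, the same system can receive different labels at different budgets, which is the behavior the sweep reports for EPC and Summary (LLM).
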

#### Benefits and limits of self\-questioning\.

EPC’s gains are largest under aggressive compression (+.11 at $B$=2K), on single-session LongMemEval questions (+.06), and on factual LoCoMo questions (+.06). They shrink when budgets are generous (+.01 at $B$=20K) or questions require compositional inference (+.02 on LoCoMo’s inferential category). This suggests that self-questioning is most useful when the writer must choose a small set of answer-relevant details before the question is known.

#### Relation to full\-context reading\.

Because our TFC condition imposes a fixed 32K-token budget, full-context reading could narrow the OE–TFC gap. The diagnostic protocol is complementary: it evaluates fixed-budget memory pipelines and helps determine whether compression or retrieval is the limiting stage within that design.

## 8 Conclusion

This paper makes three main points\.

First, the diagnostic protocol decomposes end\-to\-end memory failures into write\-side and retrieval\-side gaps\. It requires no architectural changes—just four evaluations of the same reader under controlled inputs—and we have shown it to be informative across two benchmarks \(LongMemEval and LoCoMo\), multiple readers, and a range of budgets\.

Second, EPC uses likely future questions to decide what information to retain during compression, substantially reducing the write\-side gap on both benchmarks and shifting the remaining bottleneck toward retrieval\.

Third, the write\-retrieval balance is not fixed\. It depends on the system, the budget, and the question distribution\. Most systems we tested operate in a regime where improving the write stage would help more than improving retrieval—but this changes as budgets grow or as compression methods improve\. The diagnostic protocol makes this regime visible and reduces the risk of optimizing a non\-limiting stage\.

Future work includes reducing reliance on human evidence annotations via LLM\-generated silver evidence, exploring adaptive compression that adjusts strategy based on per\-session diagnostic signals, and evaluating on longer conversations and more diverse domains\.

## Limitations

The diagnostic indicators are conditioned on the chosen reader and metric. EPC’s self-questioning step adds write-time latency. While we evaluate on two benchmarks (LongMemEval and LoCoMo), both rely on gold evidence annotations for the OE condition. This is an experimental choice rather than a protocol limitation: the four-condition structure is agnostic to how evidence is obtained. In practice, a practitioner could approximate OE with silver annotations (e.g., prompting an LLM to identify the minimal supporting turns for each question), which would introduce annotation noise but may preserve the directional diagnosis (write-dominant vs. retrieval-dominant), since the diagnosis rule already tolerates small perturbations via the margin $\epsilon$. Empirically validating this silver-annotation path remains future work. LoCoMo’s shorter context (∼9K tokens) means the compression challenge is milder than in LongMemEval; evaluating on benchmarks with longer contexts and more diverse domains would further strengthen generalizability claims.

## Ethics and Reproducibility

Long\-term memory benchmarks and systems necessarily involve storing user\-specific facts, preferences, dates, and other potentially sensitive information\. Because EPC explicitly tries to preserve supporting evidence, a deployed version could also preserve sensitive information more reliably than a generic summary unless retention, deletion, access control, and redaction policies are carefully designed\. More broadly, the work studies diagnostic structure rather than downstream social impact, so it should not be interpreted as an argument for indiscriminate memory retention in real applications\. Reproducibility is also limited by our use of proprietary LLM readers and writers, whose behavior may change over time\.

## References

- T. Baumel, M. Eyal, and M. Elhadad (2018). Query focused abstractive summarization: incorporating query relevance, multi-document coverage, and summary length constraints into seq2seq models. arXiv preprint arXiv:1801.07704. External Links: [Link](https://arxiv.org/abs/1801.07704). Cited by: [§2](https://arxiv.org/html/2605.24579#S2.SS0.SSS0.Px3.p1.1).
- H. Chen, R. Pasunuru, J. Weston, and A. Celikyilmaz (2024). Walking down the memory maze: beyond context limit through interactive reading. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. External Links: [Link](https://arxiv.org/abs/2310.05029), [Document](https://dx.doi.org/10.48550/arXiv.2310.05029). Cited by: [§1](https://arxiv.org/html/2605.24579#S1.p2.1), [§2](https://arxiv.org/html/2605.24579#S2.SS0.SSS0.Px1.p1.1), [§5.2](https://arxiv.org/html/2605.24579#S5.SS2.SSS0.Px1.p1.1).
- T. M. Cover and J. A. Thomas (2006). Elements of information theory. 2nd edition, Wiley-Interscience. External Links: [Link](https://onlinelibrary.wiley.com/doi/book/10.1002/047174882X), [Document](https://dx.doi.org/10.1002/047174882X). Cited by: [§4.1](https://arxiv.org/html/2605.24579#S4.SS1.p3.1).
- H. Daumé III and D. Marcu (2006). Bayesian query-focused summarization. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pp. 305–312. External Links: [Link](https://aclanthology.org/P06-1039/), [Document](https://dx.doi.org/10.3115/1220175.1220214). Cited by: [§2](https://arxiv.org/html/2605.24579#S2.SS0.SSS0.Px3.p1.1).
- H. Jiang, Q. Wu, X. Luo, S. Ahn, Z. Ma, Z. Liu, C. Lin, and Y. Yang (2023). LLMLingua: compressing prompts for accelerated inference of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. External Links: [Link](https://aclanthology.org/2023.emnlp-main.825/), [Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.825). Cited by: [§2](https://arxiv.org/html/2605.24579#S2.SS0.SSS0.Px3.p1.1).
- K. Lee, D. Ippolito, A. Nishi, Y. Ren, and Z. Parekh (2024). A human-inspired reading agent with gist memory of very long contexts. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. External Links: [Link](https://arxiv.org/abs/2402.09727), [Document](https://dx.doi.org/10.48550/arXiv.2402.09727). Cited by: [§1](https://arxiv.org/html/2605.24579#S1.p2.1), [§2](https://arxiv.org/html/2605.24579#S2.SS0.SSS0.Px1.p1.1), [§5.2](https://arxiv.org/html/2605.24579#S5.SS2.SSS0.Px1.p1.1).
- P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, Vol. 33. External Links: [Link](https://arxiv.org/abs/2005.11401), [Document](https://dx.doi.org/10.48550/arXiv.2005.11401). Cited by: [§1](https://arxiv.org/html/2605.24579#S1.p2.1).
- A. Maharana, D. Lee, S. Tuber, M. Bansal, F. Barbieri, Y. Fang, R. Damber, and S. Yi (2024). Evaluating very long-term conversational memory of LLM agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. External Links: [Link](https://aclanthology.org/2024.acl-long.747/). Cited by: [§6.8](https://arxiv.org/html/2605.24579#S6.SS8.p1.1).
- C. Packer, S. Wooders, K. Lin, V. Fang, S. G. Patil, I. Stoica, and J. E. Gonzalez (2024). MemGPT: towards LLMs as operating systems. In Proceedings of the International Conference on Learning Representations (ICLR). External Links: [Link](https://arxiv.org/abs/2310.08560), [Document](https://dx.doi.org/10.48550/arXiv.2310.08560). Cited by: [§2](https://arxiv.org/html/2605.24579#S2.SS0.SSS0.Px1.p1.1).
- Z. Pan, Q. Wu, H. Jiang, M. Xia, X. Luo, J. Zhang, Q. Lin, V. Ruhle, Y. Yang, C. Lin, et al. (2024). LLMLingua-2: data distillation for efficient and faithful task-agnostic prompt compression. In Findings of the Association for Computational Linguistics: ACL 2024. External Links: [Link](https://aclanthology.org/2024.findings-acl.57/). Cited by: [§2](https://arxiv.org/html/2605.24579#S2.SS0.SSS0.Px3.p1.1).
- N. Reimers and I. Gurevych (2019). Sentence-BERT: sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. External Links: [Link](https://aclanthology.org/D19-1410/), [Document](https://dx.doi.org/10.18653/v1/D19-1410). Cited by: [§5.3](https://arxiv.org/html/2605.24579#S5.SS3.SSS0.Px3.p1.1).
- W. Wang, L. Dong, H. Cheng, X. Liu, X. Yan, J. Gao, and F. Wei (2024). Augmenting language models with long-term memory. In Advances in Neural Information Processing Systems. External Links: [Link](https://arxiv.org/abs/2306.07174), [Document](https://dx.doi.org/10.48550/arXiv.2306.07174). Cited by: [§1](https://arxiv.org/html/2605.24579#S1.p2.1), [§2](https://arxiv.org/html/2605.24579#S2.SS0.SSS0.Px1.p1.1).
- W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, and M. Zhou (2020). MiniLM: deep self-attention distillation for task-agnostic compression of pre-trained transformers. In Advances in Neural Information Processing Systems, Vol. 33. External Links: [Link](https://arxiv.org/abs/2002.10957), [Document](https://dx.doi.org/10.48550/arXiv.2002.10957). Cited by: [§5.3](https://arxiv.org/html/2605.24579#S5.SS3.SSS0.Px3.p1.1).
- D. Wu, H. Wang, W. Yu, Y. He, and K. Zhu (2025). LongMemEval: benchmarking chat assistants on long-term interactive memory. In Proceedings of the International Conference on Learning Representations (ICLR). External Links: [Link](https://openreview.net/forum?id=pZiyCaVuti). Cited by: [§1](https://arxiv.org/html/2605.24579#S1.p4.1), [§2](https://arxiv.org/html/2605.24579#S2.SS0.SSS0.Px2.p1.1), [§5.1](https://arxiv.org/html/2605.24579#S5.SS1.p1.1), [§5.4](https://arxiv.org/html/2605.24579#S5.SS4.p1.1).
- F. Xu, W. Shi, and E. Choi (2024a). RECOMP: improving retrieval-augmented LMs with compression and selective augmentation. In Proceedings of the International Conference on Learning Representations (ICLR). External Links: [Link](https://arxiv.org/abs/2310.04408). Cited by: [§2](https://arxiv.org/html/2605.24579#S2.SS0.SSS0.Px3.p1.1).
- P. Xu, W. Ping, X. Wu, L. McAfee, C. Zhu, Z. Liu, S. Subramanian, E. Bakhturina, M. Shoeybi, and B. Catanzaro (2024b). Retrieval meets long context large language models. In Proceedings of the International Conference on Learning Representations (ICLR). External Links: [Link](https://openreview.net/forum?id=xw5nxFWMlo). Cited by: [§2](https://arxiv.org/html/2605.24579#S2.SS0.SSS0.Px2.p1.1).
- Z. Zhang, X. B. Zhang, C. Jiang, J. Liu, J. Zou, et al. (2024). A survey on the memory mechanism of large language model based agents. arXiv preprint arXiv:2404.13501. External Links: [Link](https://arxiv.org/abs/2404.13501), [Document](https://dx.doi.org/10.48550/arXiv.2404.13501). Cited by: [§2](https://arxiv.org/html/2605.24579#S2.SS0.SSS0.Px1.p1.1).
- W. Zhong, L. Guo, Q. Gao, H. Ye, and Y. Wang (2024). MemoryBank: enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence. External Links: [Link](https://arxiv.org/abs/2305.10250), [Document](https://dx.doi.org/10.48550/arXiv.2305.10250). Cited by: [§1](https://arxiv.org/html/2605.24579#S1.p2.1), [§2](https://arxiv.org/html/2605.24579#S2.SS0.SSS0.Px1.p1.1), [§5.2](https://arxiv.org/html/2605.24579#S5.SS2.SSS0.Px1.p1.1).

## Appendix A Reproducibility Details

This appendix provides the prompt templates, hyperparameters, metric agreement checks, baseline adaptation details, per\-reader results, qualitative examples, and bootstrap procedure used in the main experiments\.

### A.1 Prompt Templates

All LLM calls use temperature 0\. We use the following templates, with bracketed fields replaced by the corresponding instance\-specific content\.

#### Reader prompt\.

```
Based on the following context,
answer the question.
If the answer cannot be determined,
say "I don’t know".

Context:
{context}

Question:
{question}
```

#### Summary \(LLM\) write prompt\.

```
You are compressing conversation memory
under a hard budget.
Summarize this session from {date}
in under {max_tokens} tokens.
Preserve names, dates, numbers,
preferences, negations, state changes,
decisions, and event relations.
Stay query-agnostic: do not guess
future questions.
Return only the compressed memory text.

{session_turns}
```

#### EPC Step 1: probe generation\.

```
You are writing long-term memory before
the future question is known.
Given this conversation session,
generate 5 likely future questions
that may require information from it.
Prefer questions about factual details,
names, dates, numbers, preferences,
plans, temporal changes, decisions,
and negations.

Return only a numbered list.

{session_turns}
```

#### EPC Step 2: evidence selection and memory writing\.

```
You are compressing a conversation
session into long-term memory under
a hard budget of {max_tokens} tokens.

Use the probe questions to identify
minimal supporting evidence.
Keep exact entities, dates, numbers,
preferences, decisions, negations,
and temporal relations.
Avoid generic summaries.

For each selected item, use this format:
[Q] likely future question
[E] minimal supporting evidence
[S] session_{session_id}_turn_{turn_id}

Probe questions:
{probe_questions}

Conversation:
{session_turns}
```
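
The two EPC prompts above chain into a two-call write step whose output follows the `[Q]`/`[E]`/`[S]` format. The sketch below shows one way to wire the calls together and parse the result. `call_llm`, `render_probe_prompt`, and `render_select_prompt` are hypothetical callables supplied by the caller (no specific LLM client is assumed), and the regular expression is an assumption about how strictly the model follows the format.

```python
import re

# One memory entry: [Q] question, [E] evidence, [S] source tag (no spaces).
ENTRY_RE = re.compile(
    r"\[Q\]\s*(?P<question>.*?)\s*"
    r"\[E\]\s*(?P<evidence>.*?)\s*"
    r"\[S\]\s*(?P<source>\S+)",
    re.DOTALL,
)


def epc_write(session_turns, call_llm, render_probe_prompt, render_select_prompt):
    """Two LLM calls per session: probe generation, then evidence selection."""
    probes = call_llm(render_probe_prompt(session_turns))
    return call_llm(render_select_prompt(session_turns, probes))


def parse_memory(text):
    """Split the Step 2 output into a list of {question, evidence, source} dicts."""
    return [m.groupdict() for m in ENTRY_RE.finditer(text)]
```

Keeping the parsed `source` tag alongside each evidence string makes it possible to trace a stored entry back to its session and turn when inspecting write-side losses.
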

#### Two\-pass baseline refinement prompt\.

```
You are improving an existing compressed
memory under the same hard budget of
{max_tokens} tokens.
Increase density by replacing
vague references with specific names,
dates, numbers, preferences, decisions,
negations, and event relations.
Do not guess future questions.
Return only the revised memory.

Original session:
{session_turns}

Current compressed memory:
{summary}
```

### A.2 Hyperparameters

Table [7](https://arxiv.org/html/2605.24579#A1.T7) lists the main hyperparameters used in the LongMemEval experiments.

Table 7: Main hyperparameters used in the LongMemEval experiments unless otherwise specified.

### A.3 CM and F1 Agreement

Contains Match (CM) is the primary metric in the main tables, but we also compute Token F1 for every system and condition. Across pairwise comparisons among LongMemEval system–condition scores, CM and F1 agree directionally in 95.5% of cases, and the system rankings reported in the main tables are unchanged under Token F1. For LoCoMo, Token F1 yields the same system ranking as CM in Table [5](https://arxiv.org/html/2605.24579#S6.T5).
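
The two metrics and the directional-agreement check can be sketched as follows. The precise CM definition is not given in this section, so the sketch assumes CM is 1 when the normalized gold answer appears as a token span in the normalized prediction; Token F1 follows the common bag-of-tokens formulation. The normalization and the treatment of ties in `directional_agreement` are also assumptions.

```python
import re
from collections import Counter
from itertools import combinations


def normalize(text):
    """Lowercase, replace punctuation with spaces, split into tokens."""
    return re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()


def contains_match(pred, gold):
    """1.0 if the gold answer occurs as a token span in the prediction."""
    p = " ".join(normalize(pred))
    g = " ".join(normalize(gold))
    return float(f" {g} " in f" {p} ")


def token_f1(pred, gold):
    """Harmonic mean of token precision and recall."""
    p, g = normalize(pred), normalize(gold)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def directional_agreement(cm_scores, f1_scores):
    """Fraction of score pairs that CM and F1 order the same way."""
    sign = lambda x: (x > 0) - (x < 0)
    pairs = list(combinations(range(len(cm_scores)), 2))
    agree = sum(
        sign(cm_scores[i] - cm_scores[j]) == sign(f1_scores[i] - f1_scores[j])
        for i, j in pairs
    )
    return agree / len(pairs)
```

Here each element of `cm_scores` and `f1_scores` would be the aggregate score of one system–condition cell, so the agreement rate is computed over cells rather than over individual questions.
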

### A.4 Baseline Adaptations

Table [8](https://arxiv.org/html/2605.24579#A1.T8) summarizes how each baseline system was adapted to the session-based LongMemEval setting.

Table 8: Baseline adaptations to the session-based LongMemEval setting.

Note that MemWalker, ReadAgent, and MemoryBank were reimplemented from their paper descriptions, not from official code. To mitigate reimplementation risk, we verified two properties: (1) the qualitative behavior of each system matches the original (e.g., MemWalker’s tree depth scales logarithmically with session count; MemoryBank’s importance scores decay with time as described); (2) all systems are compared under identical data, budgets, and readers. We therefore interpret absolute comparisons to reimplemented systems cautiously, while emphasizing controlled comparisons within our evaluation setup.

### A.5 Per-Reader Diagnostic Results

Table [9](https://arxiv.org/html/2605.24579#A1.T9) reports the diagnostic results for each reader individually. The system ranking is consistent across all three readers: EPC achieves the highest CSM for each reader. The robust write-dominant pattern holds for Verbatim Chunk, Summary (Extractive), MemoryBank, and MemWalker under each reader, while Summary (LLM) and ReadAgent lie near the margin.

Table 9: Per-reader diagnostic results (CM, $B$=5K). System ranking is consistent across all three readers.

### A.6 Qualitative Case Studies

To illustrate how the diagnostic indicators correspond to concrete information loss, we present three representative LongMemEval examples, lightly paraphrased for brevity and anonymization, where $\Delta_{\text{write}}$ is large for Summary (LLM) but small for EPC.

#### Case 1: Dropped date\.

- Question: “When did the user say they were moving to Seattle?”
- Gold evidence: *session_7, turn_14*: “I’m planning to move to Seattle around mid-March.”
- Summary (LLM) CSM: The summary mentions “the user discussed relocation plans” but drops “mid-March”, and “Seattle” appears only as a general topic. Reader answers “I don’t know.”
- EPC CSM: A probe question “When is the user relocating?” preserves the entry `[E] Moving to Seattle around mid-March` → reader answers correctly.

#### Case 2: Merged preferences\.

- Question: “Does the user prefer Thai or Italian food?”
- Gold evidence: *session_12, turn_3*: “I definitely prefer Thai over Italian.”
- Summary (LLM) CSM: The summary states “the user discussed food preferences” without specifying the preference direction. Reader answers incorrectly.
- EPC CSM: A probe question “What food does the user prefer?” preserves `[E] User prefers Thai over Italian` → reader answers correctly.

#### Case 3: Lost negation\.

- Question: “Is the user still taking the advanced Python course?”
- Gold evidence: *session_22, turn_8*: “I dropped the advanced Python course last week.”
- Summary (LLM) CSM: The summary mentions “the user is taking programming courses”; the negation (dropping the course) is lost. Reader answers “Yes.”
- EPC CSM: A probe question “What courses is the user currently enrolled in?” preserves `[E] Dropped advanced Python course last week` → reader answers correctly.

These cases illustrate the pattern behind $\Delta_{\text{write}}$: query-agnostic summarization tends to preserve topics but omit the specific facts (dates, preference directions, negations) that questions target. EPC’s self-questioning improves coverage of these high-value details, explaining its lower $\Delta_{\text{write}}$.

### A.7 Bootstrap Significance Testing

For significance testing, we use paired bootstrap resampling over questions. Each bootstrap sample resamples the 500 LongMemEval questions with replacement and computes the CSM difference between EPC and Summary (LLM) on the resampled set. We use 10,000 bootstrap samples. The reported $p$-value is the fraction of samples in which the paired EPC–Summary (LLM) difference is less than or equal to zero. Confidence intervals are percentile intervals over bootstrap samples.
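
A minimal sketch of this procedure, using only the standard library, is given below. It assumes per-question CSM scores for both systems are available as equal-length lists in the same question order; the function name, the seed, and the 95% default interval are assumptions.

```python
import random


def paired_bootstrap(epc_scores, base_scores, n_boot=10_000, alpha=0.05, seed=0):
    """Paired bootstrap over questions.

    Returns (p_value, (ci_low, ci_high)), where p_value is the fraction of
    resamples whose mean EPC-minus-baseline difference is <= 0 and the
    interval is a percentile interval over the resampled differences.
    """
    rng = random.Random(seed)
    # Pairing: resample per-question differences, not the two systems separately.
    diffs = [e - b for e, b in zip(epc_scores, base_scores)]
    n = len(diffs)
    means = []
    for _ in range(n_boot):
        sample = rng.choices(diffs, k=n)  # n questions drawn with replacement
        means.append(sum(sample) / n)
    p_value = sum(m <= 0 for m in means) / n_boot
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return p_value, (lo, hi)
```

Resampling the per-question differences keeps each question's two scores together, which is what makes the test paired; resampling the two systems independently would discard that correlation.
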
