InfoMem: Training Long-Context Memory Agents with Answer-Conditioned Information Gain
Summary
InfoMem introduces a reward mechanism for training chunk-wise memory agents that evaluates final-memory utility using answer-conditioned information gain, improving long-context memory-agent performance under the same RL framework.
# InfoMem: Training Long-Context Memory Agents with Answer-Conditioned Information Gain
Source: [https://arxiv.org/html/2606.03329](https://arxiv.org/html/2606.03329)
Tiancheng Han¹,², Yong Li¹, Wuzhou Yu¹, Qiaosheng Zhang²,³,†, Wenqi Shao²,³,†

¹Tongji University, ²Shanghai Innovation Institute, ³Shanghai AI Laboratory

zhangqiaosheng@pjlab.org.cn, weqish@link.cuhk.edu.hk
###### Abstract
Long-context tasks require LLMs to identify and preserve answer-relevant information from large contexts. Chunk-wise memory agents address this issue by sequentially reading document chunks, updating a compact memory, and generating the final answer from the accumulated memory. However, existing RL-based chunk-wise agents either rely on sparse final-answer rewards or use lexical intermediate rewards for memory and retrieval actions. These signals supervise task success or local overlap, but do not directly evaluate whether the final memory supports the ground-truth answer. We propose InfoMem, a reward mechanism for training chunk-wise memory agents that evaluates final-memory utility using answer-conditioned information gain. InfoMem measures how much the final memory increases the model’s per-token log-likelihood of the ground-truth answer. To stabilize RL optimization, InfoMem applies this signal only to successful trajectories and normalizes it before reward composition. Under the same GRPO framework and training budget, InfoMem improves long-context memory-agent performance over comparable memory-agent RL baselines. Analyses show that effective final-memory rewards should operate on successful trajectories, be normalized before reward composition, and be conditioned on the answer rather than the query. Our code is available at [https://github.com/GenSouKa1/InfoMem](https://github.com/GenSouKa1/InfoMem).
†Corresponding authors.
## 1 Introduction
Long\-context understanding has become a central capability for large language models \(LLMs\), with applications ranging from long\-document question answering to corpus\-level evidence aggregation\(OpenAI,[2025](https://arxiv.org/html/2606.03329#bib.bib15);[Wuet al\.,](https://arxiv.org/html/2606.03329#bib.bib17);[Hsiehet al\.,](https://arxiv.org/html/2606.03329#bib.bib13); Luet al\.,[2026](https://arxiv.org/html/2606.03329#bib.bib16)\)\. Prior work has improved long\-context processing through extended context windows\(Shenet al\.,[2025](https://arxiv.org/html/2606.03329#bib.bib19)\), attention or positional modifications\(Munkhdalaiet al\.,[2024](https://arxiv.org/html/2606.03329#bib.bib20);[Presset al\.,](https://arxiv.org/html/2606.03329#bib.bib21)\), retrieval augmentation\(Zhaoet al\.,[2024](https://arxiv.org/html/2606.03329#bib.bib4)\), and memory\-based or agentic pipelines\(Packeret al\.,[2024](https://arxiv.org/html/2606.03329#bib.bib22); Zhouet al\.,[2025](https://arxiv.org/html/2606.03329#bib.bib23)\)\. However, effectively using long inputs remains challenging when relevant evidence is sparse, distributed across distant segments, or must be preserved throughout a long reading process\.
Among these approaches, chunk\-wise memory agents provide a simple and effective paradigm for long\-context reasoning\. Instead of processing the entire document at once, the model sequentially reads shorter chunks, updates a compact memory state, and generates the final answer from the accumulated memory\. This paradigm appears in training\-free reading agents\(Leeet al\.,[2024](https://arxiv.org/html/2606.03329#bib.bib24)\), recurrent or memory\-augmented model architectures\(Daiet al\.,[2019](https://arxiv.org/html/2606.03329#bib.bib25)\), and post\-training methods\(Yuet al\.,[2025](https://arxiv.org/html/2606.03329#bib.bib1); Shiet al\.,[2026](https://arxiv.org/html/2606.03329#bib.bib32)\)\. Crucially, their explicit memory state makes memory quality an observable optimization target\.
Despite their effectiveness, existing chunk\-wise long\-context systems still lack a scalable way to supervise memory formation\. Training\-free methods often rely on human\-designed memory\-update prompts, summarization heuristics, or fixed traversal workflows\(Leeet al\.,[2024](https://arxiv.org/html/2606.03329#bib.bib24)\)\. Architecture\-level methods may improve long\-context capability more fundamentally\(Daiet al\.,[2019](https://arxiv.org/html/2606.03329#bib.bib25)\), but typically require costly pretraining\. Reinforcement learning \(RL\)\-based memory agents can also improve long\-context behavior through task feedback, but existing methods mainly rely on sparse answer rewards\(Yuet al\.,[2025](https://arxiv.org/html/2606.03329#bib.bib1)\)or lexical intermediate rewards for memory and retrieval actions\(Shiet al\.,[2026](https://arxiv.org/html/2606.03329#bib.bib32)\)\. These rewards supervise task success or local word overlap, without directly evaluating whether the final memory semantically supports the ground\-truth answer\.
This limitation is especially pronounced within successful trajectories: sparse outcome rewards cannot distinguish whether the final memory contains focused answer\-supporting evidence or redundant distracting information, while lexical rewards may not capture semantic support for the final answer\. This motivates a memory\-specific reward signal for chunk\-wise long\-context reinforcement learning\.
We propose InfoMem, a reward\-shaping method for chunk\-wise long\-context memory agents based on answer\-conditioned information gain\. The core intuition is that a useful final memory should increase the model’s support for the ground\-truth answer\. Instead of estimating distribution\-level mutual information, InfoMem uses a pointwise information\-gain surrogate by comparing the model’s per\-token log\-likelihood of the ground\-truth answer with and without the final memory\. InfoMem further improves training stability by applying this signal only to successful trajectories and normalizing it before reward composition\.
Experiments show that InfoMem consistently improves long\-context memory\-agent performance over outcome\-only GRPO and a comparable memory\-agent RL baseline ReMemR1\(Shiet al\.,[2026](https://arxiv.org/html/2606.03329#bib.bib32)\)\. Further analyses show that effective final\-memory rewards should operate on successful trajectories, be normalized before reward composition, and be conditioned on the ground\-truth answer rather than on the query alone\. These findings suggest that answer\-conditioned information gain provides a principled framework for final\-memory supervision in chunk\-wise long\-context RL\.
Our contributions are threefold: \(1\) We formulate final\-memory utility through an information\-theoretic perspective, where useful memories should reduce the model’s uncertainty about the ground\-truth answer\. \(2\) We introduce InfoMem, an answer\-conditioned information\-gain reward for direct final\-memory shaping over successful trajectories\. \(3\) We show that InfoMem consistently improves chunk\-wise long\-context memory agents over comparable memory\-agent RL baselines, and further identify three key properties of effective final\-memory rewards: success\-side supervision, pre\-composition normalization, and answer conditioning\.
## 2 Related Work
### 2.1 Long-context LLMs and Chunk-wise Memory Agents
Long\-context LLM research has improved long\-input processing through context extension\(Shenet al\.,[2025](https://arxiv.org/html/2606.03329#bib.bib19)\), attention or positional modifications\(Munkhdalaiet al\.,[2024](https://arxiv.org/html/2606.03329#bib.bib20);[Presset al\.,](https://arxiv.org/html/2606.03329#bib.bib21)\), and retrieval augmentation\(Zhaoet al\.,[2024](https://arxiv.org/html/2606.03329#bib.bib4)\)\.
Complementary to these approaches, chunk\-wise memory agents maintain an explicit memory state during sequential long\-document processing\. This paradigm includes training\-free reading workflows\(Leeet al\.,[2024](https://arxiv.org/html/2606.03329#bib.bib24)\), segment\-level recurrent architectures\(Daiet al\.,[2019](https://arxiv.org/html/2606.03329#bib.bib25); Hutchinset al\.,[2022](https://arxiv.org/html/2606.03329#bib.bib26); Dinget al\.,[2021](https://arxiv.org/html/2606.03329#bib.bib27)\), and post\-training memory agents\(Yuet al\.,[2025](https://arxiv.org/html/2606.03329#bib.bib1); Shiet al\.,[2026](https://arxiv.org/html/2606.03329#bib.bib32)\)\. Existing methods rely on manually designed workflows, costly recurrent architectures, or RL objectives based on sparse final\-answer rewards and intermediate memory heuristics\. As a result, direct supervision of answer\-conditioned final\-memory utility remains relatively underexplored\.
### 2.2 RL for Long-context QA
DeepSeek\-R1\(Guoet al\.,[2025](https://arxiv.org/html/2606.03329#bib.bib28)\)demonstrates that reinforcement learning can substantially enhance LLM capabilities in specialized domains\. Recent work further shows that RL can improve long\-context question answering and reasoning\. Early methods mainly optimize verifiable end\-task outcomes, such as final\-answer correctness or verifier\-based response quality\(Shenet al\.,[2025](https://arxiv.org/html/2606.03329#bib.bib19); Yuet al\.,[2025](https://arxiv.org/html/2606.03329#bib.bib1)\)\. More recent studies have introduced denser supervision signals for grounding, evidence extraction, and contextual reasoning\([Chenet al\.,](https://arxiv.org/html/2606.03329#bib.bib29); Guanet al\.,[2026](https://arxiv.org/html/2606.03329#bib.bib30); Pinget al\.,[2026](https://arxiv.org/html/2606.03329#bib.bib31); Shiet al\.,[2026](https://arxiv.org/html/2606.03329#bib.bib32)\)\. Together, these results suggest that reward\-based optimization is a promising direction for improving long\-context reasoning\.
Despite these advances, existing rewards mainly supervise grounding quality, evidence selection, reading utility, or intermediate memory\-update behavior\. ReMemR1\(Shiet al\.,[2026](https://arxiv.org/html/2606.03329#bib.bib32)\)introduces information\-style rewards for memory and callback actions, but these signals are based on word\-level recall rather than answer\-conditioned final\-memory utility\. Direct supervision of the final memory representation itself remains relatively underexplored\. In particular, existing methods rarely evaluate whether the resulting final memory directly supports the ground\-truth answer, which is the focus of our work\.
## 3 Problem Setup and Motivation
### 3.1 Chunk-wise Long-context Memory Agent
We focus on chunk-wise memory agents as a practical paradigm for reinforcement learning in long-context settings. Formally, let $x$ denote the query, $D$ the long document, and $y^{*}$ the ground-truth answer. Given a pre-defined chunk size $C$, the document is divided into $K$ chunks:

$$D=\{c_{1},c_{2},\ldots,c_{K}\}. \tag{1}$$

Conditioned on the query, the model sequentially reads the chunks and maintains a memory state:

$$M_{t}=\pi_{\theta}(M_{t-1},c_{t},x),\quad t=1,\ldots,K. \tag{2}$$

After processing all chunks, the model obtains the final memory $M_{K}$ and generates the final answer based on the query and this memory:

$$\hat{y}=\pi_{\theta}(x,M_{K}). \tag{3}$$
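The read-update-answer loop of Eqs. (1)-(3) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `update_memory` and `generate_answer` are hypothetical callables standing in for the two uses of the policy $\pi_{\theta}$, the document is assumed to be pre-tokenized, and the initial memory $M_{0}$ is assumed to be empty.

```python
from typing import Callable, List, Tuple


def split_into_chunks(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most `chunk_size` tokens (Eq. 1)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def run_memory_agent(
    query: str,
    tokens: List[str],
    chunk_size: int,
    update_memory: Callable[[str, List[str], str], str],  # hypothetical stand-in for Eq. 2
    generate_answer: Callable[[str, str], str],           # hypothetical stand-in for Eq. 3
    initial_memory: str = "",                             # assumption: M_0 is empty
) -> Tuple[str, str]:
    """Read the chunks in order, fold them into one memory, then answer from it."""
    memory = initial_memory
    for chunk in split_into_chunks(tokens, chunk_size):
        memory = update_memory(memory, chunk, query)       # M_t from (M_{t-1}, c_t, x)
    return generate_answer(query, memory), memory          # answer from (x, M_K)
```

The function returns the final memory alongside the answer because the final memory $M_{K}$ is the object that a memory-specific reward would score.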
### 3.2 Why Outcome Reward Is Insufficient for Memory Learning
The outcome reward directly supervises final answer correctness, but provides only sparse and indirect supervision for memory utility. In chunk-wise memory agents, the final prediction is generated based on the final memory $M_{K}$, which is expected to retain the information necessary for supporting the correct answer. However, the binary outcome reward evaluates only whether the generated answer matches $y^{*}$, without directly distinguishing the quality of different final memories.
This limitation is particularly evident among successful trajectories\. Multiple rollouts may generate the same correct answer and therefore receive identical outcome rewards, while their final memories can differ substantially in utility\. Some memories may retain only the critical supporting evidence, whereas others may preserve the same evidence together with substantial redundant or distracting information while still yielding the correct prediction\. Consequently, outcome reward alone cannot differentiate memory utility within successful trajectories, motivating the need for a direct reward signal for final\-memory utility in chunk\-wise long\-context reinforcement learning\.
### 3.3 From Mutual Information to Model-induced Pointwise Surrogate
Long-context question answering can be viewed as extracting answer-relevant information from a large context. Under the chunk-wise memory-agent formulation, the final memory should therefore reduce the uncertainty of the answer conditioned on the query. Ideally, this utility can be characterized by the conditional mutual information $I(M;Y\mid X)$, which measures how much additional information the memory $M$ provides about the answer $Y$ given the query $X$.
However, distribution\-level mutual information depends on the full joint distribution over queries, memories, and answers, and is difficult to estimate reliably in the high\-dimensional semantic space of LLMs\(Qianet al\.,[2025](https://arxiv.org/html/2606.03329#bib.bib5)\)\. This motivates the InfoMem reward introduced in Section[4](https://arxiv.org/html/2606.03329#S4), where we instantiate the mutual\-information objective with a single\-sample pointwise surrogate\. This surrogate measures whether the final memory increases the model’s support for the ground\-truth answer on the current instance\.
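As a sketch of the step from the distribution-level objective to a pointwise quantity, assuming only the standard definition of conditional mutual information (the symbol $p$ for the true joint distribution is introduced here for illustration), the objective is an expectation of a log-ratio, and evaluating that log-ratio on a single instance gives a pointwise gain:

```latex
I(M;Y\mid X)
  = \mathbb{E}_{(x,m,y)\sim p(X,M,Y)}
    \left[\log\frac{p(y\mid x,m)}{p(y\mid x)}\right],
\qquad
\underbrace{\log p(y^{*}\mid x,M)-\log p(y^{*}\mid x)}_{\text{pointwise gain on one instance }(x,M,y^{*})}
```

The reward in Section 4 can be read as a model-induced version of the second expression: the unknown distribution $p$ is replaced by the policy model's likelihood, the memory-free term is evaluated with a null memory, and both terms are averaged per answer token.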
## 4 Method: InfoMem
Figure 1: Overview of InfoMem for chunk-wise long-context RL. InfoMem measures final-memory utility using answer-conditioned information gain by comparing the teacher-forced per-token average log-likelihood of the ground-truth answer $y^{*}$ with and without the final memory. During GRPO training, information-gain supervision is applied only to successful trajectories, normalized across successful rollouts, and combined with the sparse outcome reward for policy optimization.

### 4.1 Reward Definition
We train the chunk-wise memory agent with Group Relative Policy Optimization (GRPO) (Shao et al., [2024](https://arxiv.org/html/2606.03329#bib.bib11)). For each prompt, GRPO samples a group of $n$ rollouts,

$$\mathcal{G}=\{1,2,\ldots,n\}. \tag{4}$$

Each rollout $i\in\mathcal{G}$ produces a final memory $M_{i}$ and a final answer $\hat{y}_{i}$. The base outcome reward is defined as

$$R_{\mathrm{outcome},i}=\mathbb{1}[\hat{y}_{i}=y^{*}], \tag{5}$$

where answer correctness is evaluated by normalized string matching in our main experiments.

Given a query $x$, a final memory $M$, and the ground-truth answer $y^{*}$, we define the answer-conditioned information-gain reward $r_{\mathrm{gain}}$ as

$$r_{\mathrm{gain}}(x,M,y^{*})=\frac{1}{|y^{*}|}\log P_{\theta}(y^{*}\mid x,M)-\frac{1}{|y^{*}|}\log P_{\theta}(y^{*}\mid x,\emptyset), \tag{6}$$

where $\emptyset$ denotes null memory. The sequence likelihood $P_{\theta}$ of the LLM on the ground-truth tokens is computed under teacher forcing:

$$\log P_{\theta}(y^{*}\mid x,M)=\sum_{j=1}^{|y^{*}|}\log p_{\theta}(y^{*}_{j}\mid y^{*}_{<j},x,M), \tag{7}$$

which corresponds to the summed log-probability of the ground-truth tokens. The first term in Eq. ([6](https://arxiv.org/html/2606.03329#S4.E6)) measures the likelihood assigned to the ground-truth answer when conditioned on the final memory, while the second term measures the corresponding likelihood without memory conditioning. Consequently, higher values of $r_{\mathrm{gain}}$ indicate that the final memory increases the likelihood assigned to the correct answer.
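Once the teacher-forced token log-probabilities are available, Eqs. (6) and (7) reduce to a difference of two means. The sketch below assumes those log-probabilities have already been obtained from two forward passes (one with the final memory in the prompt, one with a null memory); how the prompts are built and how the model is queried is not shown, and the function names are illustrative rather than taken from the released code.

```python
from typing import Sequence


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Per-token average of teacher-forced log-probabilities, i.e. (1/|y*|) * Eq. 7."""
    if not token_logprobs:
        raise ValueError("the ground-truth answer must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def info_gain(
    logprobs_with_memory: Sequence[float],   # log p(y*_j | y*_<j, x, M) for each answer token
    logprobs_null_memory: Sequence[float],   # log p(y*_j | y*_<j, x, null memory)
) -> float:
    """Answer-conditioned information gain r_gain of Eq. 6."""
    if len(logprobs_with_memory) != len(logprobs_null_memory):
        raise ValueError("both passes must score the same ground-truth answer tokens")
    return mean_logprob(logprobs_with_memory) - mean_logprob(logprobs_null_memory)


# A memory that makes the answer tokens more likely yields a positive gain.
helpful = info_gain([-0.1, -0.3], [-2.0, -1.0])
# A memory that makes them less likely yields a negative gain.
harmful = info_gain([-3.0, -2.0], [-2.0, -1.0])
```

Because both terms are divided by the same answer length, the score is comparable across answers of different lengths, which is the role of the $1/|y^{*}|$ factor in Eq. (6).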
### 4.2 Using Successful Trajectories as Positive Memory Signals
Given the rollout group $\mathcal{G}$ and the outcome rewards defined above, we define the set of successful rollouts as

$$\mathcal{S}=\{i\in\mathcal{G}:R_{\mathrm{outcome},i}=1\}. \tag{8}$$

InfoMem applies the information-gain reward only to rollouts in $\mathcal{S}$. Specifically, successful trajectories are further differentiated by $r_{\mathrm{gain}}(x,M_{i},y^{*})$, whereas failed trajectories are supervised with $R_{\mathrm{outcome}}$ only.

This design restricts information-gain optimization to trajectories whose final answers are already validated by the outcome reward. Within this subset, differences in $r_{\mathrm{gain}}$ more directly reflect the extent to which the final memory supports the ground-truth answer $y^{*}$, thereby providing a more stable signal for memory shaping.

By contrast, failed trajectories may entangle memory quality with incorrect evidence selection or answer generation, making $r_{\mathrm{gain}}$ less reliable as a memory-utility signal. We empirically analyze this issue through wrong-only and both-side supervision variants in Section [6](https://arxiv.org/html/2606.03329#S6).
### 4.3 Memory Reward Normalization
The scale of raw information-gain rewards can vary substantially across prompts due to differences in answer uncertainty. For relatively easy questions, the model may already assign high likelihood to the ground-truth answer without memory conditioning, resulting in limited likelihood improvement from the final memory. For more difficult questions, an informative final memory can produce a substantially larger increase in ground-truth likelihood. Consequently, directly combining raw $r_{\mathrm{gain}}$ with the binary outcome reward may introduce severe reward-scale imbalance, causing the information-gain term to dominate the sparse outcome signal for some prompts while remaining negligible for others.

Following the reward-decoupled normalization setting in GDPO (Liu et al., [2026](https://arxiv.org/html/2606.03329#bib.bib35)), InfoMem controls this scale before reward composition by normalizing information-gain values within the successful trajectories of the same rollout group. Let $r_{i}=r_{\mathrm{gain}}(x,M_{i},y^{*})$ for $i\in\mathcal{S}$. We compute

$$\tilde{r}_{i}=\frac{r_{i}-\mu_{\mathcal{S}}}{\sigma_{\mathcal{S}}+\epsilon},\quad i\in\mathcal{S}, \tag{9}$$

where $\mu_{\mathcal{S}}$ and $\sigma_{\mathcal{S}}$ denote the mean and standard deviation of $\{r_{i}:i\in\mathcal{S}\}$, and $\epsilon$ is a small constant for numerical stability. When $|\mathcal{S}|=0$, no information-gain reward is applied. When $|\mathcal{S}|=1$, the original value is retained as a fallback strategy.

The final reward for rollout $i$ is defined as

$$R_{i}=\begin{cases}R_{\mathrm{outcome},i}+\beta\tilde{r}_{i},&i\in\mathcal{S},\\ R_{\mathrm{outcome},i},&i\notin\mathcal{S},\end{cases} \tag{10}$$

where $\beta$ controls the strength of the normalized information-gain term.

This normalization differs from the group-relative normalization in GRPO. InfoMem normalizes $r_{\mathrm{gain}}$ before reward composition, controlling the relative scale between the information-gain and outcome rewards.
Although InfoMem is computed from the final memory and final answer, the resulting trajectory\-level advantage is propagated to all generated tokens, including both memory\-update and final\-answer tokens\. The complete reward\-construction procedure is summarized in Appendix[A\.2](https://arxiv.org/html/2606.03329#A1.SS2)\.
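The reward construction of Eqs. (8)-(10) for a single rollout group can be sketched as follows. This is an illustrative reading of the equations, not the procedure from Appendix A.2: the population standard deviation and the value `eps=1e-6` are assumptions (the text only says "standard deviation" and "a small constant"), and the subsequent GRPO group-relative advantage computation is not shown.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def compose_rewards(
    outcomes: Sequence[int],    # R_outcome,i in {0, 1} for each rollout in the group
    gains: Sequence[float],     # raw r_gain for each rollout (ignored for failed rollouts)
    beta: float = 0.2,          # InfoMem coefficient reported in the training setup
    eps: float = 1e-6,          # assumption: value of the stabilizing constant
) -> List[float]:
    """Final per-rollout rewards R_i for one rollout group (Eq. 10)."""
    if len(outcomes) != len(gains):
        raise ValueError("outcomes and gains must have one entry per rollout")
    rewards = [float(r) for r in outcomes]
    success = [i for i, r in enumerate(outcomes) if r == 1]  # the set S of Eq. 8

    if not success:             # |S| = 0: no information-gain reward is applied
        return rewards
    if len(success) == 1:       # |S| = 1: keep the raw value as the fallback
        i = success[0]
        rewards[i] += beta * gains[i]
        return rewards

    values = [gains[i] for i in success]
    mu = mean(values)
    sigma = pstdev(values)      # assumption: population standard deviation
    for i in success:           # Eq. 9, then Eq. 10 for successful rollouts
        rewards[i] += beta * (gains[i] - mu) / (sigma + eps)
    return rewards
```

Note that the gain of a failed rollout never enters the mean or standard deviation, so a large raw gain on a wrong answer cannot shift the rewards of the successful rollouts.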
## 5 Experiments
### 5.1 Synthetic Context Discrimination with Hallucinated Evidence
We first evaluate whether answer\-conditioned information gain can distinguish genuinely answer\-supporting evidence from surface\-similar but factually misleading contexts\. We construct a synthetic diagnostic using the SQuAD dataset\(Rajpurkaret al\.,[2016](https://arxiv.org/html/2606.03329#bib.bib6)\)\. Each example contains a question, a supporting context, and a ground\-truth answer\.
Figure 2: Example of synthetic hallucinated evidence.

For each retained QA pair, we generate two hallucinated contexts using Gemini 3 Flash preview (Google, [2026](https://arxiv.org/html/2606.03329#bib.bib12)). The hallucinated contexts preserve the overall structure and semantic style of the original passage while replacing answer-critical facts so that they no longer support the ground-truth answer, as illustrated in Figure [2](https://arxiv.org/html/2606.03329#S5.F2).

We compare three categories of evidence scores: (1) Information gain. We compute the $r_{\mathrm{gain}}$ score using the same Qwen2.5-1.5B-Instruct model (Qwen et al., [2025](https://arxiv.org/html/2606.03329#bib.bib7)) as in training, comparing the per-token average log-likelihood of the ground-truth answer with and without the candidate context. (2) Embedding similarity. We compute cosine similarity between the embeddings of the candidate context and a question-answer template, using mainstream open-source embedding models (Zhang et al., [2025](https://arxiv.org/html/2606.03329#bib.bib8); Chen et al., [2024](https://arxiv.org/html/2606.03329#bib.bib9); Wang et al., [2024](https://arxiv.org/html/2606.03329#bib.bib10)). (3) Attention-based scores. Using the same Qwen2.5-1.5B-Instruct model, we evaluate two attention metrics. Attn-Mass averages the total attention mass assigned from answer tokens to context tokens, while Attn-Top1 averages the maximum attention score assigned to any context token for each answer token.
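The two attention metrics can be illustrated on a single attention matrix. The sketch below assumes the attention weights have already been reduced to one matrix with a row per answer token and a column per key position; how layers and heads are aggregated is not specified in this section, so that reduction is left outside the function, and the function name is illustrative.

```python
from typing import Iterable, List, Sequence, Tuple


def attention_scores(
    attn: Sequence[Sequence[float]],   # attn[a][k]: weight from answer token a to key position k
    context_positions: Iterable[int],  # key positions that belong to the candidate context
) -> Tuple[float, float]:
    """Return (Attn-Mass, Attn-Top1) averaged over answer tokens."""
    ctx: List[int] = list(context_positions)
    if not attn or not ctx:
        raise ValueError("need at least one answer token and one context position")
    # Attn-Mass: total weight each answer token places on the context.
    masses = [sum(row[k] for k in ctx) for row in attn]
    # Attn-Top1: largest single weight each answer token places on a context token.
    top1 = [max(row[k] for k in ctx) for row in attn]
    return sum(masses) / len(attn), sum(top1) / len(attn)
```

Both scores depend only on where the model attends, not on whether the attended text supports the answer, which is one plausible reason they could rank contexts reasonably while giving a noisier signal than a likelihood-based score.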
We evaluate each score from two complementary perspectives\. First, we compute the mean reciprocal rank \(MRR\) of the true supporting context among the three candidates\. Second, we report a Z\-score signal\-to\-noise ratio \(SNR\) to assess the stability of this discrimination signal\. A higher SNR indicates that the score not only separates true contexts from hallucinated ones, but also does so with lower relative variation across samples, making it more suitable as a training reward\.
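The ranking metric can be made concrete with a short sketch. It assumes that a higher score means stronger support and that ties are counted against the true context; the tie rule is an assumption, since the text does not state one. The Z-score SNR is not implemented here because its exact definition is not given in this section.

```python
from typing import Sequence, Tuple


def reciprocal_rank(scores: Sequence[float], true_index: int) -> float:
    """1 / rank of the true context, where rank 1 is the highest score."""
    true_score = scores[true_index]
    # Assumption: a candidate that ties with the true context is ranked ahead of it.
    better_or_equal = sum(
        1 for j, s in enumerate(scores) if j != true_index and s >= true_score
    )
    return 1.0 / (1 + better_or_equal)


def mean_reciprocal_rank(examples: Sequence[Tuple[Sequence[float], int]]) -> float:
    """Average reciprocal rank over (candidate_scores, true_index) pairs."""
    if not examples:
        raise ValueError("need at least one example")
    return sum(reciprocal_rank(s, t) for s, t in examples) / len(examples)


# Three candidates per example: the true context plus two hallucinated ones.
examples = [
    ([0.9, 0.1, 0.2], 0),   # true context ranked first
    ([0.3, 0.5, 0.1], 0),   # true context ranked second
    ([0.1, 0.2, 0.3], 0),   # true context ranked third
]
mrr = mean_reciprocal_rank(examples)
```

With three candidates, the reciprocal rank of a single example can only be 1, 1/2, or 1/3, so an MRR close to 1 means the true context is almost always ranked first.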
Table 1: Synthetic context discrimination results. The $r_{\mathrm{gain}}$ score corresponds to the answer-conditioned information-gain score in Eq. ([6](https://arxiv.org/html/2606.03329#S4.E6)).

Table 2: Main evaluation results on long-context benchmarks. All scores are reported as percentages. The best score for each benchmark is shown in bold.

Table [1](https://arxiv.org/html/2606.03329#S5.T1) shows that $r_{\mathrm{gain}}$ achieves the highest MRR and Z-score SNR among all compared scores. The high MRR indicates strong discrimination between true and hallucinated contexts, while the substantially larger SNR suggests that the resulting signal is also more stable across samples. Embedding-based similarities achieve moderate ranking performance but exhibit much weaker SNR. Attention-based scores, especially Attn-Top1, improve ranking quality but remain substantially below $r_{\mathrm{gain}}$ in signal stability. These results suggest that answer-conditioned likelihood gain provides both stronger evidence discrimination and a cleaner reward signal for training. Details of the discrimination experiment are provided in Appendix [B](https://arxiv.org/html/2606.03329#A2).
### 5.2 Training Setup
We train on the RULER\-HotpotQA dataset\(Yuet al\.,[2025](https://arxiv.org/html/2606.03329#bib.bib1)\), which applies the RULER long\-context construction paradigm\([Hsiehet al\.,](https://arxiv.org/html/2606.03329#bib.bib13)\)to HotpotQA\(Yanget al\.,[2018](https://arxiv.org/html/2606.03329#bib.bib14)\)by mixing answer\-relevant documents with distractors\. Each example contains 200 documents\. The original training split has 32,768 examples and the validation split has 128 examples; for controlled experimentation, we downsample the training split to 512 examples and keep the validation split unchanged\.
All experiments use Qwen2.5-1.5B-Instruct as the base model and GRPO as the algorithm. For each prompt, we sample $n=8$ rollouts and train for 120 steps with learning rate $1\times 10^{-6}$. The chunk size is 5000 tokens, the maximum memory length is 1024 tokens, and the InfoMem coefficient is $\beta=0.2$. For controlled comparison, all methods use the same data, model, rollout number, decoding configuration, and training budget. Our primary baseline is Outcome-only GRPO (Yu et al., [2025](https://arxiv.org/html/2606.03329#bib.bib1)), which removes the information-gain term while keeping the same training pipeline. We also report the initial model before RL optimization where appropriate. Since ReMemR1 (Shi et al., [2026](https://arxiv.org/html/2606.03329#bib.bib32)) augments the chunk-wise memory-agent framework with callback retrieval, we align all shared training parameters with our setting for fair reproduction, while keeping ReMemR1-specific parameters consistent with the original paper. Details are provided in Appendix [A](https://arxiv.org/html/2606.03329#A1).
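For reference, the hyperparameters stated in this subsection can be collected in one place. The sketch below is only a convenience summary: the class and field names are hypothetical and do not correspond to the configuration format of the released code, and settings not stated here (batch size, decoding parameters, and so on) are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InfoMemTrainConfig:
    """Values reported in the training setup; all field names are illustrative."""
    base_model: str = "Qwen2.5-1.5B-Instruct"
    algorithm: str = "GRPO"
    train_examples: int = 512        # downsampled from 32,768
    val_examples: int = 128
    docs_per_example: int = 200
    rollouts_per_prompt: int = 8     # n
    train_steps: int = 120
    learning_rate: float = 1e-6
    chunk_size_tokens: int = 5000
    max_memory_tokens: int = 1024
    beta: float = 0.2                # InfoMem coefficient


config = InfoMemTrainConfig()
# The Outcome-only GRPO baseline drops the information-gain term; setting beta to 0
# is one way to express that in this sketch, with everything else unchanged.
outcome_only = InfoMemTrainConfig(beta=0.0)
```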
### 5.3 Evaluation
#### 5.3.1 Benchmarks
Under the same chunk-wise memory framework, we evaluate the trained models on a set of out-of-distribution long-context benchmarks covering complementary forms of evidence use: (1) MRCR-8needle (OpenAI, [2025](https://arxiv.org/html/2606.03329#bib.bib15)) evaluates multi-needle retrieval; (2) RULER synthetic QA (Yu et al., [2025](https://arxiv.org/html/2606.03329#bib.bib1)) evaluates sparse retrieval-style question answering, using a different corpus source from the training data; (3) CorpusQA (Lu et al., [2026](https://arxiv.org/html/2606.03329#bib.bib16)) evaluates corpus-level evidence aggregation across document collections; and (4) LongMemEval ([Wu et al.,](https://arxiv.org/html/2606.03329#bib.bib17)) evaluates long-horizon dialogue memory and conversational information tracking.

Across benchmarks, we use the metric specified by each benchmark and apply the same evaluation protocol to all compared methods, except that ReMemR1 is evaluated under its original chunk-wise framework with callback retrieval for train-test consistency. CorpusQA and LongMemEval are evaluated with LLM-as-judge using Kimi-K2.6 (Moonshot AI, [2026](https://arxiv.org/html/2606.03329#bib.bib18)), MRCR-8needle with sequence match, and RULER synthetic QA with F1 score. (Footnote 1: Manual verification shows high similarity to human judgments: 99.5% on CorpusQA and 96.8% on LongMemEval.) For evaluation efficiency, we downsample the original 800 MRCR-8needle examples to 100 examples, use the 128K-token subset of CorpusQA, and use the 115K-token LongMemEval-S subset. Details for evaluation are provided in Appendix [C](https://arxiv.org/html/2606.03329#A3).
#### 5.3.2 Main Results
Table [2](https://arxiv.org/html/2606.03329#S5.T2) shows that InfoMem achieves the best overall performance across all four long-context benchmarks, outperforming both Outcome-only GRPO and ReMemR1. Compared with the initial model, InfoMem consistently improves both retrieval-oriented tasks and more memory-intensive settings. These results suggest that answer-conditioned information gain provides an effective supervision signal for final-memory formation beyond sparse final-answer correctness.
In contrast, Outcome\-only GRPO produces less consistent improvements\. Although it improves over the initial model on CorpusQA, LongMemEval, and RULER synthetic QA, it substantially degrades performance on MRCR\-8needle, falling below the initial model\. This result suggests that optimizing only sparse outcome rewards may lead to degraded long\-context retrieval behavior\. InfoMem avoids this degradation and achieves the best overall performance, supporting the need for a memory\-specific reward signal in chunk\-wise long\-context RL\.
ReMemR1 also fails to match InfoMem under our sample\-aligned evaluation\. One possible explanation is that its intermediate reward design is primarily based on word\-level recall against the ground\-truth answer, which may encourage lexical overlap rather than preserving semantically supporting evidence\.
## 6 Analysis and Ablation
We next analyze the key design choices underlying InfoMem\. Specifically, we study which trajectories should receive information\-gain supervision, why information\-gain rewards must be normalized before reward composition, and why the reward should be conditioned on the ground\-truth answer rather than on the query\.
### 6.1 Which Side Should Receive Information-gain Supervision?
We analyze which trajectories should receive information-gain supervision. Specifically, we compare three supervision-side variants: Success (the default InfoMem setting), which applies $r_{\mathrm{gain}}$ only to successful trajectories; Wrong, which applies $r_{\mathrm{gain}}$ only to failed trajectories; and Both, which applies it to all trajectories. All variants use the same training setup, differing only in the supervision side.

Figure 3: Effect of information-gain supervision side. Top: with outcome reward, validation accuracy measures generalization. Bottom: without explicit outcome reward, training accuracy diagnoses whether each supervision side provides a usable learning signal. Thick lines show sliding-window smoothing with window size 5, and light lines show raw values.

Table 3: Ablation results on long-context benchmarks. All scores are reported as percentages. The best score for each benchmark is shown in bold.

#### 6.1.1 Main Setting with Outcome Reward
We retain the outcome reward and vary only the supervision side of the information-gain term. As shown in the top panel of Figure [3](https://arxiv.org/html/2606.03329#S6.F3), success-only supervision produces the most stable validation accuracy. This behavior supports the role of successful trajectories as positive memory examples: their final memories are already associated with correct final answers, and $r_{\mathrm{gain}}$ can further distinguish which successful memories provide stronger support for the ground-truth answer.
By contrast, wrong\-side and both\-side supervision exhibit unstable training dynamics\. Both variants initially improve but collapse during mid\-stage training, with the both\-side variant remaining near zero afterward\. This suggests that failed trajectories may provide weak useful signals early in training, but their associated memories become increasingly noisy as optimization proceeds\. Mixing successful and failed memories within the same reward objective further weakens the positive memory\-shaping effect of success\-only supervision\.
#### 6.1.2 No-outcome Diagnostic
We further conduct a diagnostic experiment without explicit outcome reward. Rather than serving as the main training objective, this experiment tests whether $r_{\mathrm{gain}}$ alone can provide a usable learning signal under different supervision sides. Specifically, we remove $R_{\mathrm{outcome}}$ and optimize using only the information-gain reward under success-only, wrong-only, and both-side supervision.

The bottom panel of Figure [3](https://arxiv.org/html/2606.03329#S6.F3) shows that only success-only supervision produces sustained improvement in training accuracy, whereas wrong-only and both-side supervision fail to produce stable learning behavior. This result suggests that success-side information gain can provide meaningful memory shaping even without explicit outcome reward, while wrong-side supervision does not provide a reliable optimization signal. Overall, these findings support using information-gain reward as positive memory shaping over successful trajectories rather than as uniform supervision across all trajectories.
### 6\.2 Why Is Normalization Necessary Before Reward Composition?
We evaluate the role of pre\-composition normalization by removing the normalization of $r_{\mathrm{gain}}$ before reward composition while keeping all other settings unchanged\. As shown in Table [3](https://arxiv.org/html/2606.03329#S6.T3), removing this normalization consistently degrades performance across all evaluated benchmarks\. The degradation is especially pronounced on retrieval\-oriented tasks, suggesting that raw likelihood\-gain rewards are not directly comparable across rollout groups\.
These results support the need for normalization before reward composition\. Since raw $r_{\mathrm{gain}}$ magnitudes can vary substantially across prompts, directly combining them with sparse outcome rewards may introduce unstable reward scaling\. Normalizing $r_{\mathrm{gain}}$ within successful trajectories converts it into a relative memory\-quality signal before reward composition, producing a more stable reward\-shaping signal during training\.
### 6\.3 Why Must the Reward Be Answer\-conditioned Rather Than Query\-conditioned?
Figure 4: Fraction of rollouts whose final memory repeats the query during training\.

We next examine whether query relevance alone is sufficient for final\-memory reward design\. To this end, we construct a query\-conditioned variant, QueryPMI, which replaces the ground\-truth answer in $r_{\mathrm{gain}}$ with the query itself\. Specifically, given a final memory $M$ and the query $x$, we use the template “Based on the previous memory \{M\}, we can answer the query \{x\}\.” and compute the corresponding per\-token log\-likelihood gain on the query\. This formulation is intuitively appealing because useful memories are often related to the query\. However, query relevance is not equivalent to answer support: a memory can increase query likelihood simply by copying or paraphrasing the query without preserving answer\-supporting evidence\.
Table [3](https://arxiv.org/html/2606.03329#S6.T3) shows that QueryPMI performs consistently worse than InfoMem across all evaluated benchmarks, indicating that query\-conditioned likelihood gain is a substantially weaker final\-memory reward than answer\-conditioned information gain\. Figure [4](https://arxiv.org/html/2606.03329#S6.F4) further explains this degradation\. During training, QueryPMI rapidly increases the proportion of rollouts whose final memories repeat the query, whereas InfoMem keeps this proportion close to zero\. This behavior suggests that the model exploits the query\-conditioned objective by making the query easier to predict from memory rather than learning to preserve answer\-supporting evidence\. These results highlight the importance of conditioning the reward on the ground\-truth answer rather than on the query alone\. Additional training and validation curves are provided in Appendix [E](https://arxiv.org/html/2606.03329#A5)\.
## 7 Conclusion
We studied final\-memory reward design for chunk\-wise long\-context memory agents\. While outcome\-only GRPO supervises memory formation only indirectly through final\-answer correctness, InfoMem introduces answer\-conditioned information gain as a direct reward signal for final\-memory utility\. Experiments show that InfoMem consistently improves long\-context memory\-agent performance over outcome\-only GRPO and a comparable memory\-agent RL baseline, ReMemR1\.
Further analyses reveal three key properties of effective final\-memory rewards: they should operate as positive memory shaping over successful trajectories, be normalized before reward composition, and be conditioned on the ground\-truth answer rather than on the query alone\. Overall, our results suggest that answer\-conditioned information gain provides a more direct and effective supervision signal for learning useful final memories in chunk\-wise long\-context RL\.
## Limitations
Despite these promising results, our current study still has several limitations\.
First, our experiments use a limited training subset and a relatively small base model\. This design reflects both the high computational cost of long\-context reinforcement learning and our focus on controlled evaluation of the proposed reward design rather than large\-scale benchmark optimization\. Scaling InfoMem to larger models and substantially larger training corpora remains an important direction for future study\.
Second, this work focuses specifically on chunk\-wise long\-context memory agents\. The proposed reward is designed for settings in which the model sequentially processes context chunks, maintains an explicit memory state, and generates the final answer from the resulting final memory\. Its applicability to other long\-context paradigms, such as retrieval\-only systems or full\-context single\-pass models, remains unexplored\.
Third, the current reward is defined only at the final step\. Although GRPO propagates the resulting trajectory\-level advantage to all generated tokens, the reward itself evaluates only the final memory and final answer\. Extending answer\-conditioned information gain toward intermediate memory states and step\-wise process supervision remains future work\.
Potential risks should also be considered\. Since InfoMem encourages final memories that increase support for a target answer, erroneous answers or biased source documents may lead the model to preserve and amplify misleading evidence\. The answer\-conditioned reward may also be over\-optimized toward memories that are highly associated with the expected answer while omitting important qualifications from the original context\. These risks are especially relevant in high\-stakes long\-document applications, such as legal, medical, or financial analysis, where compressed memory states should not replace source\-document verification or human review\.
## References
- \[1\]LongRLVR: long\-context reinforcement learning requires verifiable context rewards\.InThe Fourteenth International Conference on Learning Representations,Cited by:[§2\.2](https://arxiv.org/html/2606.03329#S2.SS2.p1.1)\.
- J\. Chen, S\. Xiao, P\. Zhang, K\. Luo, D\. Lian, and Z\. Liu \(2024\)M3\-embedding: multi\-linguality, multi\-functionality, multi\-granularity text embeddings through self\-knowledge distillation\.InFindings of the Association for Computational Linguistics: ACL 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 2318–2335\.External Links:[Link](https://aclanthology.org/2024.findings-acl.137/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.137)Cited by:[§B\.3\.2](https://arxiv.org/html/2606.03329#A2.SS3.SSS2.p1.1),[§5\.1](https://arxiv.org/html/2606.03329#S5.SS1.p3.1)\.
- Z\. Dai, Z\. Yang, Y\. Yang, J\. Carbonell, Q\. Le, and R\. Salakhutdinov \(2019\)Transformer\-XL: attentive language models beyond a fixed\-length context\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics,A\. Korhonen, D\. Traum, and L\. Màrquez \(Eds\.\),Florence, Italy,pp\. 2978–2988\.External Links:[Link](https://aclanthology.org/P19-1285/),[Document](https://dx.doi.org/10.18653/v1/P19-1285)Cited by:[§1](https://arxiv.org/html/2606.03329#S1.p2.1),[§1](https://arxiv.org/html/2606.03329#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.03329#S2.SS1.p2.1)\.
- S\. Ding, J\. Shang, S\. Wang, Y\. Sun, H\. Tian, H\. Wu, and H\. Wang \(2021\)ERNIE\-doc: a retrospective long\-document modeling transformer\.InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\),pp\. 2914–2927\.Cited by:[§2\.1](https://arxiv.org/html/2606.03329#S2.SS1.p2.1)\.
- L\. Gao, J\. Tow, B\. Abbasi, S\. Biderman, S\. Black, A\. DiPofi, C\. Foster, L\. Golding, J\. Hsu, A\. Le Noac’h, H\. Li, K\. McDonell, N\. Muennighoff, C\. Ociepa, J\. Phang, L\. Reynolds, H\. Schoelkopf, A\. Skowron, L\. Sutawika, E\. Tang, A\. Thite, B\. Wang, K\. Wang, and A\. Zou \(2024\)The language model evaluation harness\.Zenodo\.External Links:[Document](https://dx.doi.org/10.5281/zenodo.12608602),[Link](https://zenodo.org/records/12608602)Cited by:[§A\.3](https://arxiv.org/html/2606.03329#A1.SS3.p2.1)\.
- Google \(2026\)Gemini 3 flash preview\.Note:[https://ai\.google\.dev/gemini\-api/docs/models/gemini\-3\-flash\-preview](https://ai.google.dev/gemini-api/docs/models/gemini-3-flash-preview)Google AI for Developers\. Last updated 2026\-04\-28 UTC\. Accessed: 2026\-05\-19Cited by:[§B\.1](https://arxiv.org/html/2606.03329#A2.SS1.p2.1),[§5\.1](https://arxiv.org/html/2606.03329#S5.SS1.p2.1)\.
- X\. Guan, Z\. Li, S\. Huang, P\. Xie, J\. Zhou, and J\. Cao \(2026\)Evidence\-augmented policy optimization with reward co\-evolution for long\-context reasoning\.External Links:2601\.10306,[Link](https://arxiv.org/abs/2601.10306)Cited by:[§2\.2](https://arxiv.org/html/2606.03329#S2.SS2.p1.1)\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi,et al\.\(2025\)DeepSeek\-r1 incentivizes reasoning in llms through reinforcement learning\.Nature645\(8081\),pp\. 633–638\.Cited by:[§2\.2](https://arxiv.org/html/2606.03329#S2.SS2.p1.1)\.
- \[9\]C\. Hsieh, S\. Sun, S\. Kriman, S\. Acharya, D\. Rekesh, F\. Jia, and B\. GinsburgRULER: what’s the real context size of your long\-context language models?\.InFirst Conference on Language Modeling,Cited by:[§A\.1](https://arxiv.org/html/2606.03329#A1.SS1.p2.1),[§1](https://arxiv.org/html/2606.03329#S1.p1.1),[§5\.2](https://arxiv.org/html/2606.03329#S5.SS2.p1.1)\.
- D\. Hutchins, I\. Schlag, Y\. Wu, E\. Dyer, and B\. Neyshabur \(2022\)Block\-recurrent transformers\.InAdvances in Neural Information Processing Systems,S\. Koyejo, S\. Mohamed, A\. Agarwal, D\. Belgrave, K\. Cho, and A\. Oh \(Eds\.\),Vol\.35,pp\. 33248–33261\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2022/file/d6e0bbb9fc3f4c10950052ec2359355c-Paper-Conference.pdf)Cited by:[§2\.1](https://arxiv.org/html/2606.03329#S2.SS1.p2.1)\.
- W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. Gonzalez, H\. Zhang, and I\. Stoica \(2023\)Efficient memory management for large language model serving with pagedattention\.InProceedings of the 29th Symposium on Operating Systems Principles, SOSP 2023, Koblenz, Germany, October 23\-26, 2023,J\. Flinn, M\. I\. Seltzer, P\. Druschel, A\. Kaufmann, and J\. Mace \(Eds\.\),pp\. 611–626\.External Links:[Link](https://doi.org/10.1145/3600006.3613165),[Document](https://dx.doi.org/10.1145/3600006.3613165)Cited by:[§C\.1](https://arxiv.org/html/2606.03329#A3.SS1.p3.1)\.
- K\. Lee, X\. Chen, H\. Furuta, J\. Canny, and I\. Fischer \(2024\)A human\-inspired reading agent with gist memory of very long contexts\.InInternational Conference on Machine Learning,pp\. 26396–26415\.Cited by:[§1](https://arxiv.org/html/2606.03329#S1.p2.1),[§1](https://arxiv.org/html/2606.03329#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.03329#S2.SS1.p2.1)\.
- S\. Liu, X\. Dong, X\. Lu, S\. Diao, P\. Belcak, M\. Liu, M\. Chen, H\. Yin, Y\. F\. Wang, K\. Cheng, Y\. Choi, J\. Kautz, and P\. Molchanov \(2026\)GDPO: group reward\-decoupled normalization policy optimization for multi\-reward rl optimization\.External Links:2601\.05242,[Link](https://arxiv.org/abs/2601.05242)Cited by:[§4\.3](https://arxiv.org/html/2606.03329#S4.SS3.p2.2)\.
- Z\. Lu, C\. Li, Y\. Shi, W\. Shen, M\. Yan, and F\. Huang \(2026\)CorpusQA: a 10 million token benchmark for corpus\-level analysis and reasoning\.External Links:2601\.14952,[Link](https://arxiv.org/abs/2601.14952)Cited by:[§C\.1](https://arxiv.org/html/2606.03329#A3.SS1.p3.1),[§1](https://arxiv.org/html/2606.03329#S1.p1.1),[§5\.3\.1](https://arxiv.org/html/2606.03329#S5.SS3.SSS1.p1.1)\.
- Moonshot AI \(2026\)Kimi k2\.6: advancing open\-source coding\.Note:Kimi Tech Blog\. Accessed: 2026\-05\-19External Links:[Link](https://www.kimi.com/blog/kimi-k2-6)Cited by:[§5\.3\.1](https://arxiv.org/html/2606.03329#S5.SS3.SSS1.p2.1)\.
- T\. Munkhdalai, M\. Faruqui, and S\. Gopal \(2024\)Leave no context behind: efficient infinite context transformers with infini\-attention\.External Links:2404\.07143,[Link](https://arxiv.org/abs/2404.07143)Cited by:[§1](https://arxiv.org/html/2606.03329#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.03329#S2.SS1.p1.1)\.
- OpenAI \(2025\)OpenAI mrcr: long context multiple needle in a haystack benchmark\.Note:[https://huggingface\.co/datasets/openai/mrcr](https://huggingface.co/datasets/openai/mrcr)Hugging Face dataset repository\. Accessed: 2026\-05\-19Cited by:[§C\.1](https://arxiv.org/html/2606.03329#A3.SS1.p2.1),[§1](https://arxiv.org/html/2606.03329#S1.p1.1),[§5\.3\.1](https://arxiv.org/html/2606.03329#S5.SS3.SSS1.p1.1)\.
- C\. Packer, S\. Wooders, K\. Lin, V\. Fang, S\. G\. Patil, I\. Stoica, and J\. E\. Gonzalez \(2024\)MemGPT: towards llms as operating systems\.External Links:2310\.08560,[Link](https://arxiv.org/abs/2310.08560)Cited by:[§1](https://arxiv.org/html/2606.03329#S1.p1.1)\.
- B\. Ping, Z\. Chen, Y\. Yu, T\. Hui, J\. Yan, and B\. Chang \(2026\)LongR: unleashing long\-context reasoning via reinforcement learning with dense utility rewards\.External Links:2602\.05758,[Link](https://arxiv.org/abs/2602.05758)Cited by:[§2\.2](https://arxiv.org/html/2606.03329#S2.SS2.p1.1)\.
- \[20\]O\. Press, N\. Smith, and M\. LewisTrain short, test long: attention with linear biases enables input length extrapolation\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.03329#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.03329#S2.SS1.p1.1)\.
- C\. Qian, D\. Liu, H\. Wen, Z\. Bai, Y\. Liu, and J\. Shao \(2025\)Demystifying reasoning dynamics with mutual information: thinking tokens are information peaks in llm reasoning\.External Links:2506\.02867,[Link](https://arxiv.org/abs/2506.02867)Cited by:[§3\.3](https://arxiv.org/html/2606.03329#S3.SS3.p2.1)\.
- Qwen, :, A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Tang, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2025\)Qwen2\.5 technical report\.External Links:2412\.15115,[Link](https://arxiv.org/abs/2412.15115)Cited by:[§A\.1](https://arxiv.org/html/2606.03329#A1.SS1.p1.1),[§5\.1](https://arxiv.org/html/2606.03329#S5.SS1.p3.1)\.
- P\. Rajpurkar, J\. Zhang, K\. Lopyrev, and P\. Liang \(2016\)SQuAD: 100,000\+ questions for machine comprehension of text\.InProceedings of the 2016 Conference on Empirical Methods in Natural Language Processing,J\. Su, K\. Duh, and X\. Carreras \(Eds\.\),Austin, Texas,pp\. 2383–2392\.External Links:[Link](https://aclanthology.org/D16-1264/),[Document](https://dx.doi.org/10.18653/v1/D16-1264)Cited by:[§B\.1](https://arxiv.org/html/2606.03329#A2.SS1.p1.1),[§5\.1](https://arxiv.org/html/2606.03329#S5.SS1.p1.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. K\. Li, Y\. Wu, and D\. Guo \(2024\)DeepSeekMath: pushing the limits of mathematical reasoning in open language models\.External Links:2402\.03300,[Link](https://arxiv.org/abs/2402.03300)Cited by:[§4\.1](https://arxiv.org/html/2606.03329#S4.SS1.p1.1)\.
- W\. Shen, Z\. Yang, C\. Li, Z\. Lu, M\. Peng, H\. Sun, Y\. Shi, S\. Liao, S\. Lai, B\. Zhang, D\. Liu, F\. Huang, J\. Zhou, and M\. Yan \(2025\)QwenLong\-l1\.5: post\-training recipe for long\-context reasoning and memory management\.External Links:2512\.12967,[Link](https://arxiv.org/abs/2512.12967)Cited by:[§1](https://arxiv.org/html/2606.03329#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.03329#S2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2606.03329#S2.SS2.p1.1)\.
- G\. Sheng, C\. Zhang, Z\. Ye, X\. Wu, W\. Zhang, R\. Zhang, Y\. Peng, H\. Lin, and C\. Wu \(2025\)Hybridflow: a flexible and efficient rlhf framework\.InProceedings of the Twentieth European Conference on Computer Systems,pp\. 1279–1297\.Cited by:[§A\.1](https://arxiv.org/html/2606.03329#A1.SS1.p1.1),[§A\.3](https://arxiv.org/html/2606.03329#A1.SS3.p1.1)\.
- Y\. Shi, Y\. Chen, S\. Wang, S\. Li, H\. Cai, Q\. Gu, X\. Wang, and A\. Zhang \(2026\)Look back to reason forward: revisitable memory for long\-context llm agents\.External Links:2509\.23040,[Link](https://arxiv.org/abs/2509.23040)Cited by:[§1](https://arxiv.org/html/2606.03329#S1.p2.1),[§1](https://arxiv.org/html/2606.03329#S1.p3.1),[§1](https://arxiv.org/html/2606.03329#S1.p6.1),[§2\.1](https://arxiv.org/html/2606.03329#S2.SS1.p2.1),[§2\.2](https://arxiv.org/html/2606.03329#S2.SS2.p1.1),[§2\.2](https://arxiv.org/html/2606.03329#S2.SS2.p2.1),[§5\.2](https://arxiv.org/html/2606.03329#S5.SS2.p2.3),[Table 2](https://arxiv.org/html/2606.03329#S5.T2.1.5.4.1)\.
- L\. Wang, N\. Yang, X\. Huang, L\. Yang, R\. Majumder, and F\. Wei \(2024\)Multilingual e5 text embeddings: a technical report\.Technical Report MSR\-TR\-2024\-45,Microsoft\.External Links:[Link](https://www.microsoft.com/en-us/research/publication/multilingual-e5-text-embeddings-a-technical-report/)Cited by:[§B\.3\.2](https://arxiv.org/html/2606.03329#A2.SS3.SSS2.p1.1),[§5\.1](https://arxiv.org/html/2606.03329#S5.SS1.p3.1)\.
- \[29\]D\. Wu, H\. Wang, W\. Yu, Y\. Zhang, K\. Chang, and D\. YuLongMemEval: benchmarking chat assistants on long\-term interactive memory\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§C\.1](https://arxiv.org/html/2606.03329#A3.SS1.p3.1),[§1](https://arxiv.org/html/2606.03329#S1.p1.1),[§5\.3\.1](https://arxiv.org/html/2606.03329#S5.SS3.SSS1.p1.1)\.
- Z\. Yang, P\. Qi, S\. Zhang, Y\. Bengio, W\. Cohen, R\. Salakhutdinov, and C\. D\. Manning \(2018\)HotpotQA: a dataset for diverse, explainable multi\-hop question answering\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,E\. Riloff, D\. Chiang, J\. Hockenmaier, and J\. Tsujii \(Eds\.\),Brussels, Belgium,pp\. 2369–2380\.External Links:[Link](https://aclanthology.org/D18-1259/),[Document](https://dx.doi.org/10.18653/v1/D18-1259)Cited by:[§A\.1](https://arxiv.org/html/2606.03329#A1.SS1.p2.1),[§5\.2](https://arxiv.org/html/2606.03329#S5.SS2.p1.1)\.
- H\. Yu, T\. Chen, J\. Feng, J\. Chen, W\. Dai, Q\. Yu, Y\. Zhang, W\. Ma, J\. Liu, M\. Wang, and H\. Zhou \(2025\)MemAgent: reshaping long\-context llm with multi\-conv rl\-based memory agent\.External Links:2507\.02259,[Link](https://arxiv.org/abs/2507.02259)Cited by:[§A\.1](https://arxiv.org/html/2606.03329#A1.SS1.p2.1),[§C\.1](https://arxiv.org/html/2606.03329#A3.SS1.p3.1),[Appendix D](https://arxiv.org/html/2606.03329#A4.p1.1),[§1](https://arxiv.org/html/2606.03329#S1.p2.1),[§1](https://arxiv.org/html/2606.03329#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.03329#S2.SS1.p2.1),[§2\.2](https://arxiv.org/html/2606.03329#S2.SS2.p1.1),[§5\.2](https://arxiv.org/html/2606.03329#S5.SS2.p1.1),[§5\.2](https://arxiv.org/html/2606.03329#S5.SS2.p2.3),[§5\.3\.1](https://arxiv.org/html/2606.03329#S5.SS3.SSS1.p1.1)\.
- Y\. Zhang, M\. Li, D\. Long, X\. Zhang, H\. Lin, B\. Yang, P\. Xie, A\. Yang, D\. Liu, J\. Lin, F\. Huang, and J\. Zhou \(2025\)Qwen3 embedding: advancing text embedding and reranking through foundation models\.External Links:2506\.05176,[Link](https://arxiv.org/abs/2506.05176)Cited by:[§B\.3\.2](https://arxiv.org/html/2606.03329#A2.SS3.SSS2.p1.1),[§5\.1](https://arxiv.org/html/2606.03329#S5.SS1.p3.1)\.
- Q\. Zhao, R\. Wang, Y\. Cen, D\. Zha, S\. Tan, Y\. Dong, and J\. Tang \(2024\)LongRAG: a dual\-perspective retrieval\-augmented generation paradigm for long\-context question answering\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 22600–22632\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.1259/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.1259)Cited by:[§1](https://arxiv.org/html/2606.03329#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.03329#S2.SS1.p1.1)\.
- Z\. Zhou, A\. Qu, Z\. Wu, S\. Kim, A\. Prakash, D\. Rus, J\. Zhao, B\. K\. H\. Low, and P\. P\. Liang \(2025\)MEM1: learning to synergize memory and reasoning for efficient long\-horizon agents\.External Links:2506\.15841,[Link](https://arxiv.org/abs/2506.15841)Cited by:[§1](https://arxiv.org/html/2606.03329#S1.p1.1)\.
## Appendix A Implementation Details
### A\.1 Model, Training Data, and Compute
All training runs use Qwen2\.5\-1\.5B\-Instruct \(Qwen et al\., [2025](https://arxiv.org/html/2606.03329#bib.bib7)\) as the base model and the corresponding Qwen2\.5 tokenizer\. All compared methods are initialized from the same base checkpoint and are trained under the same chunk\-wise memory\-agent framework\. We implement RL fine\-tuning with the veRL framework \(Sheng et al\., [2025](https://arxiv.org/html/2606.03329#bib.bib33)\)\.
We train on RULER\-HotpotQA \(Yu et al\., [2025](https://arxiv.org/html/2606.03329#bib.bib1)\), a long\-document QA dataset constructed by applying the RULER \([Hsieh et al\.](https://arxiv.org/html/2606.03329#bib.bib13)\) long\-context generation paradigm to HotpotQA \(Yang et al\., [2018](https://arxiv.org/html/2606.03329#bib.bib14)\)\. Each example contains a query, a ground\-truth answer, and a long context consisting of 200 documents\. The original training split contains 32,768 examples, totaling approximately 973M tokens, and the validation split contains 128 examples, totaling approximately 4\.1M tokens\. To focus on controlled reward\-design evaluation rather than large\-scale benchmark optimization, we downsample the training split to 512 examples, corresponding to approximately 15\.2M tokens, while keeping the validation split unchanged\.
Table [4](https://arxiv.org/html/2606.03329#A1.T4) summarizes the main training configuration\. All main comparison methods use the same data, base model, rollout number, optimization budget, decoding configuration, and compute budget\. Each training run requires approximately 440 GPU\-hours on 16 NVIDIA H20 GPUs\.
Table 4: Training configuration used for the main comparison experiments\. Decoding, maximum input/output length, and KL\-related settings are kept fixed across compared methods\.

ReMemR1 additionally uses method\-specific advantage\-composition parameters\. Since ReMemR1 retrieves previous memories and inserts them back into the recurrent prompt, we set its maximum prompt length to 2048 tokens\. These ReMemR1\-specific settings are kept consistent with the original ReMemR1 setting and are summarized in Table [5](https://arxiv.org/html/2606.03329#A1.T5)\.

Table 5: ReMemR1\-specific reward and advantage parameters\.
### A\.2 GRPO and Reward Implementation
Algorithm [1](https://arxiv.org/html/2606.03329#alg1) summarizes the reward\-construction procedure used to interface InfoMem with GRPO\. The procedure only changes the scalar reward assigned to each rollout; the GRPO optimizer itself is unchanged\.
Algorithm 1: InfoMem Reward Construction

1. Input: query $x$, long context $D$, ground\-truth answer $y^{*}$, rollout group size $n$
2. Output: rollout rewards $\{R_i\}_{i=1}^{n}$ for GRPO
3. $\{(M_i, \hat{y}_i)\}_{i=1}^{n} \leftarrow \textsc{Rollout}(\pi_{\theta}, x, D, n)$
4. for $i = 1, \ldots, n$ do
5. $R_{\mathrm{outcome},i} \leftarrow \mathbbm{1}\left[\mathrm{Norm}(\hat{y}_i) = \mathrm{Norm}(y^{*})\right]$
6. $R_i \leftarrow R_{\mathrm{outcome},i}$
7. end for
8. $\mathcal{S} \leftarrow \{i : R_{\mathrm{outcome},i} = 1\}$
9. if $|\mathcal{S}| > 0$ then
10. for $i \in \mathcal{S}$ do
11. $r_i \leftarrow r_{\mathrm{gain}}(x, M_i, y^{*})$ $\triangleright$ teacher\-forced scoring
12. end for
13. if $|\mathcal{S}| = 1$ then
14. $\tilde{r}_i \leftarrow r_i$ for the only $i \in \mathcal{S}$
15. else
16. $\mu_{\mathcal{S}} \leftarrow \mathrm{mean}(\{r_i : i \in \mathcal{S}\})$
17. $\sigma_{\mathcal{S}} \leftarrow \mathrm{std}(\{r_i : i \in \mathcal{S}\})$
18. for $i \in \mathcal{S}$ do
19. $\tilde{r}_i \leftarrow (r_i - \mu_{\mathcal{S}}) / (\sigma_{\mathcal{S}} + \epsilon)$
20. end for
21. end if
22. for $i \in \mathcal{S}$ do
23. $R_i \leftarrow R_i + \beta \tilde{r}_i$
24. end for
25. end if
26. return $\{R_i\}_{i=1}^{n}$
The reward computation is detached from the policy\-gradient path\. After reward composition, the resulting trajectory\-level advantage is assigned to all generated tokens in the rollout, including both memory\-update tokens and final\-answer tokens\. The normalized matching rule used to compute $R_{\mathrm{outcome}}$ in Algorithm [1](https://arxiv.org/html/2606.03329#alg1) is used only for training\-time rewards; benchmark evaluation uses the task\-specific metrics in Section [5\.3\.1](https://arxiv.org/html/2606.03329#S5.SS3.SSS1)\.
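As a rough illustration of Algorithm 1, the following Python sketch composes rollout rewards from outcome labels and raw information\-gain scores that are assumed to be precomputed\. The function name `compose_rewards`, the default values of `beta` and `eps`, and the use of the population standard deviation are assumptions made for this example; they are not taken from the released code\.

```python
from statistics import mean, pstdev


def compose_rewards(outcomes, gains, beta=0.1, eps=1e-6):
    """Compose per-rollout rewards for one rollout group.

    outcomes: list of 0/1 outcome rewards, one per rollout.
    gains:    list of raw information-gain scores, one per rollout
              (entries for failed rollouts are ignored).
    beta, eps: illustrative defaults, not values from the paper.
    """
    # Start from the sparse outcome reward.
    rewards = [float(o) for o in outcomes]

    # Indices of successful rollouts (the set S in Algorithm 1).
    success = [i for i, o in enumerate(outcomes) if o == 1]
    if not success:
        return rewards

    raw = [gains[i] for i in success]
    if len(success) == 1:
        # A single successful rollout keeps its raw gain.
        shaped = raw
    else:
        # Standardize gains within the successful subset only.
        mu = mean(raw)
        sigma = pstdev(raw)  # assumption: population std
        shaped = [(r - mu) / (sigma + eps) for r in raw]

    # Add the scaled shaping term to successful rollouts only.
    for i, r in zip(success, shaped):
        rewards[i] += beta * r
    return rewards


if __name__ == "__main__":
    print(compose_rewards([1, 0, 1, 0], [0.2, 0.9, 0.6, -0.3], beta=0.5))
```

With two successful rollouts, the one with the larger raw gain ends above 1 and the other below 1, while failed rollouts stay at 0 regardless of their gain values\.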
### A\.3 Training\-time Outcome Reward
During training, the sparse outcome reward is computed by normalized boxed\-answer matching following the default settings in the veRL framework \(Sheng et al\., [2025](https://arxiv.org/html/2606.03329#bib.bib33)\)\. For each rollout, we first keep the last 300 characters of the generated solution and convert them to lowercase\. We then extract the answer from the last occurrence of either \\boxed\{\.\.\.\} or \\boxed in this suffix, as the training prompts explicitly require the model to place the final answer inside a \\boxed\{\} expression\. If no boxed answer is found, the rollout is assigned zero outcome reward\.
The extracted answer is compared against a list of ground\-truth answers\. A rollout is marked successful if the extracted answer matches any ground\-truth answer after normalization:
$$R_{\mathrm{outcome}} = \mathbbm{1}\left[\exists y \in \mathcal{Y}^{*} : \operatorname{Norm}(\hat{y}) = \operatorname{Norm}(y)\right],$$

where $\mathcal{Y}^{*}$ denotes the set of acceptable ground\-truth answers\. The normalization follows the string normalization used in the Hendrycks MATH evaluation script from EleutherAI’s lm\-evaluation\-harness \(Gao et al\., [2024](https://arxiv.org/html/2606.03329#bib.bib34)\)\. It removes line breaks, inverse spaces, dollar signs, percentage symbols, \\left/\\right, degree markers, and whitespace; normalizes tfrac/dfrac to frac; normalizes escaped backslashes; and canonicalizes decimal forms such as \.5 to 0\.5\. The normalized strings are then compared by exact matching\.
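The matching rule above can be sketched in Python as follows\. This is a simplified illustration, not the veRL or lm\-evaluation\-harness implementation: it handles only the braced \\boxed\{\.\.\.\} form, applies a reduced subset of the normalization steps, and lowercases the reference answers so that they are comparable with the lowercased suffix, which is an assumption of this sketch\. All function names are hypothetical\.

```python
def extract_last_boxed(solution, tail_chars=300):
    """Return the content of the last \\boxed{...} in the lowercased tail, or None."""
    tail = solution[-tail_chars:].lower()
    marker = "\\boxed{"
    start = tail.rfind(marker)
    if start == -1:
        return None
    depth = 1
    content = []
    # Walk forward, tracking nested braces until the box closes.
    for ch in tail[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(content)
        content.append(ch)
    return None  # unbalanced braces


def simple_norm(s):
    """Reduced normalization (a subset of the steps listed in the text)."""
    s = s.replace("\\\\", "\\")                       # escaped backslashes
    s = s.replace("\\!", "")                          # inverse spaces
    s = s.replace("tfrac", "frac").replace("dfrac", "frac")
    s = s.replace("\\left", "").replace("\\right", "")
    s = s.replace("^{\\circ}", "").replace("^\\circ", "")
    s = s.replace("\\$", "").replace("$", "")
    s = s.replace("\\%", "").replace("%", "")
    s = "".join(s.split())                            # line breaks and whitespace
    if s.startswith("."):
        s = "0" + s                                   # .5 -> 0.5
    return s


def outcome_reward(solution, references):
    """1 if the boxed answer matches any reference after normalization, else 0."""
    pred = extract_last_boxed(solution)
    if pred is None:
        return 0
    target = simple_norm(pred)
    # Assumption: references are lowercased to match the lowercased suffix.
    return int(any(target == simple_norm(ref.lower()) for ref in references))


if __name__ == "__main__":
    print(outcome_reward("So the answer is \\boxed{Paris}.", ["Paris"]))
```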
## Appendix B Synthetic Hallucinated\-Evidence Diagnostic
### B\.1 Dataset Construction
We construct the hallucinated\-evidence diagnostic from SQuAD \(Rajpurkar et al\., [2016](https://arxiv.org/html/2606.03329#bib.bib6)\)\. Each original example contains a question, a supporting context, and a ground\-truth answer\. To reduce the influence of parametric knowledge, we retain only examples for which the scoring model fails to answer correctly without context but succeeds when conditioned on the original supporting context\. This filtering makes the diagnostic focus on context\-dependent evidence use rather than memorized answer recall\.
For each retained QA pair, we build a three\-context candidate set consisting of the original supporting context and two synthetic hallucinated contexts\. The hallucinated contexts are generated with Gemini 3 Flash preview \(Google, [2026](https://arxiv.org/html/2606.03329#bib.bib12)\)\. They are required to preserve the topic, discourse structure, and surface style of the original passage while replacing answer\-critical facts so that the resulting text no longer supports the ground\-truth answer\. This creates a controlled setting in which lexical and topical similarity alone is insufficient for identifying the truly answer\-supporting context\. After construction, each diagnostic instance contains one positive context and two hallucinated negative contexts, and each evidence score is evaluated by ranking these three candidates for the same question\-answer pair\.
### B\.2 Hallucinated Context Generation Prompt
We use Gemini 3 Flash preview to generate two hallucinated contexts for each retained QA pair\. The prompt provides the question, the ground\-truth answer, and the original supporting context\. The generation is constrained to keep the passage topically and stylistically similar to the original context while corrupting answer\-critical facts, so that the generated contexts remain plausible distractors but no longer support the ground\-truth answer\. The prompt template is shown below\.
```
You are given a QA sample.
Generate two hallucinated contexts.
Requirements:
1. Keep each context relevant to the question.
2. Preserve the topic, style, and structure of
the reference context as much as possible.
3. Modify key supporting facts so that the
context does not support the gold answer.
4. Do not change the question.
5. Do not remove all relevant information.
6. Do not reveal the gold answer directly.
7. Do not add meta text, warnings, or labels
indicating that the context is false.
Question:
{question}
Gold answer:
{gold_answer}
Reference context:
{context}
Return format:
{
"hallucinated_context_1": "...",
"hallucinated_context_2": "..."
}
```
We conduct manual spot checks to verify the quality of the generated hallucinated contexts, ensuring that they remain fluent and plausible while no longer supporting the ground\-truth answer\.
### B\.3 Evidence Score Computation
#### B\.3\.1 Information\-gain Score
For each candidate context $c$, we compute the answer\-conditioned information\-gain score with the same Qwen2\.5\-1\.5B\-Instruct model used in training:

$$r_{\mathrm{gain}}(x, c, y^{*}) = \frac{1}{|y^{*}|}\log P_{\theta}(y^{*} \mid x, c) - \frac{1}{|y^{*}|}\log P_{\theta}(y^{*} \mid x, \emptyset). \tag{11}$$

The two likelihood terms are computed under teacher forcing over the ground\-truth answer tokens\. The null\-context term uses the same query $x$ but replaces the candidate context with an empty context\. No answer sampling or decoding is used for this score\.
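Equation \(11\) reduces to a difference of two length\-normalized log\-likelihoods\. The sketch below assumes the per\-token log\-probabilities of the ground\-truth answer have already been obtained by teacher forcing, once with the candidate context and once with an empty context; how they are extracted from the scoring model is outside its scope, and the function names are hypothetical\.

```python
def per_token_logprob(token_logprobs):
    """Average log-probability over the answer tokens."""
    return sum(token_logprobs) / len(token_logprobs)


def info_gain(logprobs_with_context, logprobs_null):
    """Per-token log-likelihood gain of the answer from adding the context.

    Both inputs hold one log-probability per ground-truth answer token,
    scored under teacher forcing with and without the candidate context.
    """
    if not logprobs_with_context or len(logprobs_with_context) != len(logprobs_null):
        raise ValueError("both inputs must cover the same non-empty answer tokens")
    return per_token_logprob(logprobs_with_context) - per_token_logprob(logprobs_null)


if __name__ == "__main__":
    # Toy numbers: the context makes both answer tokens more likely.
    print(info_gain([-0.1, -0.3], [-1.0, -2.0]))
```

A positive value means the context raises the average answer\-token log\-likelihood relative to the empty context; a negative value means it lowers it\.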
#### B\.3\.2 Embedding Similarity
For embedding\-based scores, we compare each candidate context against a question\-answer query representation\. Specifically, we encode the candidate context $c$ and the template “Question: \{QUESTION\} Answer: \{ANSWER\}” with the same embedding model, then compute cosine similarity between the two embeddings\. We evaluate this procedure with three open\-source embedding models: Qwen3\-Embedding\-0\.6B \(Zhang et al\., [2025](https://arxiv.org/html/2606.03329#bib.bib8)\), BGE\-M3 \(Chen et al\., [2024](https://arxiv.org/html/2606.03329#bib.bib9)\), and multilingual\-E5\-large\-instruct \(Wang et al\., [2024](https://arxiv.org/html/2606.03329#bib.bib10)\)\. Scores are computed independently for each embedding model\.
#### B\.3\.3 Attention\-based Scores
For attention\-based scores, we use the same Qwen2\.5\-1\.5B\-Instruct model and feed the candidate context, question, and ground\-truth answer into the model\. We extract attention from the final attention layer\. Since the layer contains multiple attention heads, we first average attention weights across heads\. Let $A_{u,v}$ denote the resulting head\-averaged final\-layer attention from token $u$ to token $v$, $\mathcal{Y}$ denote the answer\-token positions, and $\mathcal{C}$ denote the candidate\-context token positions\.
Attn\-Mass measures the average total attention assigned from answer tokens to candidate\-context tokens:
$$s_{\mathrm{mass}} = \frac{1}{|\mathcal{Y}|}\sum_{u\in\mathcal{Y}}\sum_{v\in\mathcal{C}} A_{u,v}. \tag{12}$$
Attn\-Top1 measures the average strongest context\-token attention for each answer token:
$$s_{\mathrm{top1}} = \frac{1}{|\mathcal{Y}|}\sum_{u\in\mathcal{Y}}\max_{v\in\mathcal{C}} A_{u,v}. \tag{13}$$
Both attention scores are higher\-is\-better and are computed without using the generated answer\.
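Eqs. (12) and (13) can be illustrated on a small head-averaged attention matrix. This is a sketch on a made-up 5-token sequence (two context tokens, one question token, two answer tokens); the function names are hypothetical, and extracting and averaging the final-layer attention from the model is not shown.

```python
from typing import Sequence


def attn_mass(A: Sequence[Sequence[float]],
              answer_pos: Sequence[int],
              context_pos: Sequence[int]) -> float:
    """Average total attention from each answer token to all context tokens."""
    total = sum(sum(A[u][v] for v in context_pos) for u in answer_pos)
    return total / len(answer_pos)


def attn_top1(A: Sequence[Sequence[float]],
              answer_pos: Sequence[int],
              context_pos: Sequence[int]) -> float:
    """Average of the largest context-token attention for each answer token."""
    total = sum(max(A[u][v] for v in context_pos) for u in answer_pos)
    return total / len(answer_pos)


# Toy causal attention; row u holds attention from token u to tokens 0..4.
A = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.6, 0.4, 0.0, 0.0, 0.0],
    [0.3, 0.3, 0.4, 0.0, 0.0],
    [0.5, 0.2, 0.2, 0.1, 0.0],  # first answer token
    [0.1, 0.3, 0.2, 0.2, 0.2],  # second answer token
]
context_positions = [0, 1]
answer_positions = [3, 4]
s_mass = attn_mass(A, answer_positions, context_positions)
s_top1 = attn_top1(A, answer_positions, context_positions)
```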
### B\.4 Metrics: MRR and Z\-score SNR
For each question, the candidate set contains one true supporting context and two hallucinated contexts\. Since all scores are higher\-is\-better, we sort the three candidates in descending score order and record the rank of the true supporting context\. Given $N$ diagnostic examples, mean reciprocal rank is computed as
$$\mathrm{MRR} = \frac{1}{N}\sum_{q=1}^{N}\frac{1}{\operatorname{rank}_{\text{supporting context}}}. \tag{14}$$
MRR measures whether a score ranks the true supporting context ahead of hallucinated contexts\.
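A minimal sketch of Eq. (14) for the one-true, two-hallucinated candidate setting follows. The tie rule (only strictly higher hallucinated scores push the true context down) is an assumption, since the text does not say how ties are broken; the function names and scores are illustrative.

```python
from typing import Sequence, Tuple


def rank_of_true(true_score: float, hallucinated: Sequence[float]) -> int:
    """1-based rank of the true context when sorting by descending score.

    Assumption: ties are resolved in favor of the true context.
    """
    return 1 + sum(1 for s in hallucinated if s > true_score)


def mean_reciprocal_rank(
        examples: Sequence[Tuple[float, Sequence[float]]]) -> float:
    """Average of 1 / rank over (true_score, hallucinated_scores) examples."""
    if not examples:
        raise ValueError("need at least one example")
    return sum(1.0 / rank_of_true(t, h) for t, h in examples) / len(examples)


# Made-up scores: the true context is ranked 1st, 2nd, and 3rd respectively.
examples = [
    (0.9, [0.1, 0.2]),
    (0.5, [0.7, 0.2]),
    (0.1, [0.3, 0.4]),
]
mrr = mean_reciprocal_rank(examples)
```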
We also compute a Z\-score signal\-to\-noise ratio \(SNR\) to measure the stability of the true\-versus\-hallucinated separation\. For each question $q$, let $s_{q}^{+}$ denote the score of the true context and $s_{q,1}^{-}, s_{q,2}^{-}$ denote the two hallucinated\-context scores\. We first standardize the three scores within the same question:
$$z_{q,j} = \frac{s_{q,j}-\mu_{q}}{\sigma_{q}+\epsilon},\quad j\in\{+,1,2\}, \tag{15}$$
where $\mu_{q}$ and $\sigma_{q}$ are computed over the three candidate scores for question $q$, and $\epsilon$ is used for numerical stability\. We then compute two standardized margins for each question:
$$\Delta_{q,k} = z_{q}^{+}-z_{q,k}^{-},\quad k\in\{1,2\}. \tag{16}$$
Pooling all $2N$ margins into $\Delta$, the reported SNR is
$$\mathrm{SNR} = \frac{\operatorname{mean}(\Delta)}{\operatorname{std}(\Delta)}. \tag{17}$$
A higher SNR indicates that the true context is separated from hallucinated contexts with a larger and more stable standardized margin, making the score more suitable as a reward signal\.
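The following sketch walks through Eqs. (15) to (17) on toy scores. Two details are assumptions because the text does not fix them: the population standard deviation is used both within a question and over the pooled margins, and the stabilizer is set to 1e-8. The function names are hypothetical.

```python
import statistics
from typing import List, Sequence, Tuple


def standardized_margins(true_score: float,
                         hallucinated: Sequence[float],
                         eps: float = 1e-8) -> List[float]:
    """Z-score the candidates of one question, then return z+ minus each z-."""
    scores = [true_score, *hallucinated]
    mu = statistics.fmean(scores)
    sigma = statistics.pstdev(scores)  # Assumption: population std.
    z = [(s - mu) / (sigma + eps) for s in scores]
    return [z[0] - z_neg for z_neg in z[1:]]


def zscore_snr(examples: Sequence[Tuple[float, Sequence[float]]],
               eps: float = 1e-8) -> float:
    """mean(margins) / std(margins) over all pooled margins."""
    margins: List[float] = []
    for true_score, hallucinated in examples:
        margins.extend(standardized_margins(true_score, hallucinated, eps))
    spread = statistics.pstdev(margins)  # Assumption: population std.
    if spread == 0.0:
        raise ValueError("all margins are identical; SNR is undefined")
    return statistics.fmean(margins) / spread


# Made-up scores: the true context scores highest in both questions.
examples = [(3.0, [1.0, 2.0]), (5.0, [1.0, 3.0])]
snr = zscore_snr(examples)
```

Because scores are standardized within each question, the two toy questions yield the same margins even though their raw score scales differ, which is the property that makes the SNR comparable across scoring methods.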
## Appendix C Evaluation Protocols
### C\.1 Benchmark Summary
Table [6](https://arxiv.org/html/2606.03329#A3.T6) summarizes the evaluation benchmarks used in Section [5\.3](https://arxiv.org/html/2606.03329#S5.SS3)\. InfoMem and the outcome\-only GRPO baseline are evaluated under the same chunk\-wise memory\-agent framework, with the same memory budget, decoding configuration, benchmark subset, and metric for each benchmark\. ReMemR1 is evaluated under its original callback\-retrieval framework, described in Appendix [C\.2](https://arxiv.org/html/2606.03329#A3.SS2), to preserve train\-test consistency\.
Table 6: Summary of evaluation benchmarks and metrics\.
MRCR\-8needle \(OpenAI, [2025](https://arxiv.org/html/2606.03329#bib.bib15)\) is particularly challenging in the chunk\-wise memory\-agent setting because the model cannot directly attend over the full context\. Instead, it must preserve multiple sparse targets through compressed sequential memory updates\. This setting is difficult for a 1\.5B model under an 8\-needle retrieval task, so absolute sequence\-match scores are low across methods\. The comparison remains controlled because all methods use the same model scale, memory length, downsampled subset, and evaluation protocol\.
For RULER synthetic QA \(Yu et al\., [2025](https://arxiv.org/html/2606.03329#bib.bib1)\), we use the 262,144\-token evaluation setting and report F1\. CorpusQA \(Lu et al\., [2026](https://arxiv.org/html/2606.03329#bib.bib16)\) is evaluated on the 128K\-token subset, and LongMemEval \([Wu et al\.,](https://arxiv.org/html/2606.03329#bib.bib17)\) is evaluated on the 115K\-token LongMemEval\-S subset\. For CorpusQA and LongMemEval, we use LLM\-as\-judge evaluation with Kimi\-K2\.6; the judging protocol is described in Appendix [C\.3](https://arxiv.org/html/2606.03329#A3.SS3)\. For MRCR\-8needle, we use the same fixed 100\-example subset for all methods to reduce evaluation cost\. All model generations during benchmark evaluation are served with vLLM \(Kwon et al\., [2023](https://arxiv.org/html/2606.03329#bib.bib36)\)\. Each evaluation is run once\. Following the original ReMemR1 paper and code, ReMemR1 is evaluated with temperature sampling at $t=0.7$, whereas all other evaluations use greedy decoding\. The evaluation datasets are released under permissive MIT or Apache\-2\.0 licenses and can be used for academic research\.
### C\.2 ReMemR1 Callback\-Retrieval Framework
ReMemR1 augments the standard chunk\-wise memory\-agent framework with callback retrieval over previously generated memories\. At each chunk, the model maintains a current memory and a callback query\. The callback query is used to retrieve a relevant historical memory from the memories generated in earlier steps, and the retrieved memory is inserted into the next recurrent prompt together with the current memory, the current chunk, and the original question\. The model then generates an updated memory and a new callback query for the next step\. At the final step, ReMemR1 similarly uses the accumulated memory state and callback\-retrieved historical memory to produce the answer\.
Because this callback mechanism changes the inference framework rather than only the reward function, we do not force ReMemR1 into the pure chunk\-wise setting used by InfoMem\. Instead, we keep the shared settings aligned where applicable, including the base model, training data, chunk size, rollout decoding configuration, and benchmark evaluation protocol, while preserving ReMemR1\-specific callback and advantage settings\.
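The callback-retrieval loop described above can be sketched as follows. This is only an illustration of the control flow as summarized in this appendix, not ReMemR1's implementation: `step_fn` and `retrieve_fn` are hypothetical stand-ins for the model call and the retriever, and the initial state (empty memory, first callback query equal to the question) is an assumption.

```python
from typing import Callable, List, Sequence, Tuple

# step_fn(question, memory, recalled, chunk) -> (new_memory, next_query)
StepFn = Callable[[str, str, str, str], Tuple[str, str]]
# retrieve_fn(query, history) -> one historical memory
RetrieveFn = Callable[[str, List[str]], str]


def callback_memory_loop(question: str, chunks: Sequence[str],
                         step_fn: StepFn,
                         retrieve_fn: RetrieveFn) -> Tuple[str, str]:
    """Return the final memory and the memory recalled for the answer step."""
    memory, query = "", question  # Assumption: initial state.
    history: List[str] = []
    for chunk in chunks:
        # Recall one earlier memory using the current callback query.
        recalled = retrieve_fn(query, history) if history else ""
        memory, query = step_fn(question, memory, recalled, chunk)
        history.append(memory)
    final_recalled = retrieve_fn(query, history) if history else ""
    return memory, final_recalled


def toy_step(question: str, memory: str, recalled: str,
             chunk: str) -> Tuple[str, str]:
    # Toy stand-in for the model: append the chunk, reuse it as next query.
    return (memory + " " + chunk).strip(), chunk


def toy_retrieve(query: str, history: List[str]) -> str:
    # Toy stand-in for the retriever: pick the memory with most shared words.
    words = set(query.split())
    return max(history, key=lambda m: len(words & set(m.split())))


final_memory, final_recalled = callback_memory_loop(
    "toy question", ["a b", "c d", "e f"], toy_step, toy_retrieve)
```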
### C\.3 LLM\-as\-Judge Protocol
CorpusQA and LongMemEval are evaluated with the same binary answer\-equivalence judging protocol\. For each example, the judge receives the query, the ground\-truth answer candidates, and the model answer\. It does not receive the method name or any training configuration\. The judge returns a binary label, where 1 denotes a correct answer and 0 denotes an incorrect answer\. The reported score is the percentage of examples labeled correct\.
We use Kimi\-K2\.6 as the judge model and apply the following prompt:
```
SYSTEM_PROMPT = """You are a strict
answer-equivalence judge.
Return 1 only when the model answer fully
contains one ground-truth candidate with
exactly the same meaning: nothing meaningful
is missing and nothing meaningful is added.
Ignore only semantically empty surface
differences, such as articles like "the",
punctuation, spaces, line breaks, case, LaTeX,
or boxing / markup symbols.
If the answer is missing any part of the
ground truth, adds any meaningful content,
or you are unsure, return 0.
First write a brief reason. Then write the
final binary judgment inside
<answer></answer>.
The <answer> tag must contain exactly one
character: 0 or 1."""
USER_TEMPLATE = """Query:
{query}
Ground truth candidates:
{ground_truth_candidates}
Model answer:
{answer}
Does the model answer fully contain any
ground-truth candidate with exactly the same
meaning, allowing only semantically empty
surface characters such as "the", punctuation,
spaces, case differences, or boxing / markup
symbols?
Output format:
Reason: <brief reason>
<answer>0 or 1</answer>"""
```
The final decision is parsed from the binary value inside the <answer\> tag\. This protocol treats paraphrases and minor formatting differences as correct only when the predicted answer preserves the complete meaning of one ground\-truth candidate\. Answers are marked incorrect if they omit essential information, contradict the ground truth, add meaningful unsupported content, or cannot be confidently judged as equivalent\.
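A minimal sketch of the parsing and scoring step is shown below. The regular expression, the choice to read the last tag, and the fallback of counting unparsable judge outputs as incorrect are all assumptions for illustration; the source only states that the binary value inside the tag is parsed and that the score is the percentage labeled correct.

```python
import re
from typing import Sequence

# Matches a single 0 or 1 inside the answer tag, allowing stray whitespace.
_ANSWER_RE = re.compile(r"<answer>\s*([01])\s*</answer>")


def parse_judgment(judge_output: str) -> int:
    """Return the binary label from the judge output.

    Assumptions: use the last tag if several appear, and treat a missing
    or malformed tag as incorrect (0).
    """
    matches = _ANSWER_RE.findall(judge_output)
    return int(matches[-1]) if matches else 0


def judged_accuracy(judge_outputs: Sequence[str]) -> float:
    """Percentage of examples the judge labeled correct."""
    if not judge_outputs:
        raise ValueError("need at least one judged example")
    labels = [parse_judgment(out) for out in judge_outputs]
    return 100.0 * sum(labels) / len(labels)


# Made-up judge outputs in the format requested by the prompt.
outputs = [
    "Reason: same entity.\n<answer>1</answer>",
    "Reason: the date is missing.\n<answer>0</answer>",
    "Reason: the judge forgot the tag.",
    "Reason: equivalent wording.\n<answer> 1 </answer>",
]
accuracy = judged_accuracy(outputs)
```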
### C\.4 Human Calibration of LLM\-as\-Judge
We further conduct human calibration to assess the reliability of the LLM judge\. We hired an undergraduate annotator and presented each example with the query, the ground\-truth answer candidates, and the model answer\. The annotator assigned the same binary correctness label as the LLM\-as\-judge protocol, and we compared the human labels with the LLM\-judge outputs\. We explicitly informed the annotator that the annotations would be used for anonymized academic research\. The annotator was paid at an hourly rate ten times the local minimum wage\.
For calibration, we inspect three CorpusQA result files, corresponding to the initial instruct model, the Outcome\-only GRPO baseline, and InfoMem, as well as one LongMemEval result file from the Outcome\-only GRPO baseline\. This verification shows high agreement with human judgments, with 99\.5% agreement on CorpusQA and 96\.8% agreement on LongMemEval\.
### C\.5 Checkpoint Selection Protocol
For the main comparison experiments, all RL methods are trained for 120 optimization steps under the same training budget\. During training, we evaluate checkpoints every 2 optimization steps on the validation split described in Appendix [A\.1](https://arxiv.org/html/2606.03329#A1.SS1)\. The final checkpoint reported for each method is selected according to the highest validation accuracy under this fixed validation protocol\. If multiple checkpoints obtain the same validation accuracy, we select the earlier checkpoint to favor better generalization\.
This selection rule is fixed before benchmark comparison and is applied consistently to all training methods\. It prevents selecting checkpoints based on test\-set performance, while still accounting for the instability of long\-context RL fine\-tuning\. Diagnostic variants are evaluated with the same validation\-based selection principle when benchmark scores are reported\.
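The selection rule (highest validation accuracy, earliest step on ties) can be written in a few lines. This is a sketch with a hypothetical function name and made-up accuracies, assuming validation results are available as a step-to-accuracy mapping.

```python
from typing import Mapping


def select_checkpoint(val_accuracy: Mapping[int, float]) -> int:
    """Return the step with the highest validation accuracy.

    Ties are broken by choosing the earliest step.
    """
    if not val_accuracy:
        raise ValueError("no validation results to select from")
    best = max(val_accuracy.values())
    return min(step for step, acc in val_accuracy.items() if acc == best)


# Made-up validation accuracies, evaluated every 2 optimization steps.
val_accuracy = {2: 0.40, 4: 0.55, 6: 0.55, 8: 0.50}
chosen_step = select_checkpoint(val_accuracy)
```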
## Appendix D Prompt Templates
The memory\-update and final\-answer templates follow the MemAgent setting \(Yu et al\., [2025](https://arxiv.org/html/2606.03329#bib.bib1)\)\. We use the following template for chunk\-wise memory updates:
```
TEMPLATE = """
You are presented with a problem, a section of an
article that may contain the answer to the problem,
and a previous memory. Please read the provided
section carefully and update the memory with the
new information that helps to answer the problem.
Be sure to retain all relevant details from the
previous memory while adding any new, useful
information.
<problem>
{prompt}
</problem>
<memory>
{memory}
</memory>
<section>
{chunk}
</section>
Updated memory:
"""
```
The final answer is generated with the following template:
```
TEMPLATE_FINAL_BOXED = """
You are presented with a problem and a previous
memory. Please answer the problem based on the
previous memory and put the answer in \\boxed{{}}.
<problem>
{prompt}
</problem>
<memory>
{memory}
</memory>
Your answer:
"""
```
The information\-gain reward $r_{\mathrm{gain}}$ uses the same final\-answer template: the ground\-truth answer is appended after `Your answer:` and scored by teacher\-forced per\-token log\-likelihood, with either the final memory or an empty memory field\.
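To make the two scoring passes concrete, the sketch below builds the pair of inputs from the final-answer template: one with the final memory and one with an empty memory field. The helper name, the use of an empty string for the empty memory, and appending the raw answer text without boxing are assumptions; the model call that scores the answer tokens is not shown.

```python
from typing import Dict, Tuple

TEMPLATE_FINAL_BOXED = """
You are presented with a problem and a previous
memory. Please answer the problem based on the
previous memory and put the answer in \\boxed{{}}.
<problem>
{prompt}
</problem>
<memory>
{memory}
</memory>
Your answer:
"""


def build_scoring_inputs(question: str, final_memory: str, gold_answer: str,
                         empty_memory: str = "") -> Dict[str, Tuple[str, str]]:
    """Return (prefix, continuation) pairs for the two teacher-forced passes.

    Only the continuation (the ground-truth answer) tokens would be scored.
    Assumption: the empty memory field is an empty string.
    """
    with_memory = TEMPLATE_FINAL_BOXED.format(prompt=question,
                                              memory=final_memory)
    null_memory = TEMPLATE_FINAL_BOXED.format(prompt=question,
                                              memory=empty_memory)
    return {
        "with_memory": (with_memory, gold_answer),
        "null_memory": (null_memory, gold_answer),
    }


# Made-up example values.
inputs = build_scoring_inputs(
    question="Where was the treaty signed?",
    final_memory="The treaty was signed in Vienna.",
    gold_answer="Vienna",
)
```

Keeping the prefix identical except for the memory field means the difference between the two per-token log-likelihoods can be attributed to the memory content alone.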
QueryPMI instead uses the following query\-conditioned template:
```
TEMPLATE_QUERY_PMI = """
Based on the previous memory,
<memory>
{memory}
</memory>
we can answer the Query: """
```
## Appendix E Additional Training Curves and Diagnostics
### E\.1 Main Training Curves
Figure 5: Training curves for the main comparison runs and diagnostic variants\.
Figure 6: Validation curves for the main comparison runs and diagnostic variants\.
Figure 7: ReMemR1 training dynamics under its callback\-retrieval chunk\-wise framework: \(a\) training accuracy; \(b\) validation accuracy\. Both curves are smoothed with a sliding window of 5\.
Figures [5](https://arxiv.org/html/2606.03329#A5.F5) and [6](https://arxiv.org/html/2606.03329#A5.F6) report the training and validation curves for the main comparison runs and diagnostic variants\. All main comparison methods are trained for 120 optimization steps under the same data, model, rollout number, and validation protocol\. Checkpoints are evaluated every 2 optimization steps, and the reported benchmark checkpoint is selected according to the validation protocol described in Appendix [C\.5](https://arxiv.org/html/2606.03329#A3.SS5)\.
The no\-standardization ablation was additionally continued beyond 120 steps, because its curve was still increasing near the end of the original training budget\. After continuing the run, the best checkpoint was still step 118, which lies within the original 120\-step budget\. Therefore, the reported benchmark result for this ablation remains comparable to the other methods under the fixed checkpoint\-selection protocol\.
ReMemR1 uses a callback\-retrieval chunk\-wise framework rather than the pure chunk\-wise framework used by InfoMem and the outcome\-only GRPO baseline\. We therefore report its training dynamics separately in Figure [7](https://arxiv.org/html/2606.03329#A5.F7), instead of overlaying them with the main comparison curves\. Because of this framework difference, ReMemR1 is also trained for more than 120 optimization steps\.