What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA

arXiv cs.CL 05/25/26, 04:00 AM Papers
Summary
This paper empirically studies how the composition of training data (curriculum) affects the skills learned by RL-based memory agents in multi-session question answering. It finds that curriculum composition acts as a fine-grained lever on specialization, with mixed benchmarks yielding the best overall performance and narrow out-of-domain sets transferring targeted temporal reasoning skills.
arXiv:2605.23067v1 Announce Type: new Abstract: Reinforcement learning (RL) has emerged as a viable recipe for training LLM agents to reason over external memory banks in multi-session dialogue. Existing work trains exclusively on a single benchmark, leaving open how the composition of training data shapes the skills a memory agent acquires. We present a controlled empirical study that holds architecture, RL algorithm, and all hyperparameters fixed and varies only the training curriculum across three conditions: in-domain (LoCoMo), mixed-benchmark (LoCoMo + LongMemEval), and out-of-domain (LongMemEval only). Across two benchmarks and ten question types, curriculum composition acts as a fine-grained lever on specialization rather than a uniform scaling factor on performance. The mixed curriculum yields the strongest overall F1 on both evaluation sets. Training on a narrow out-of-domain set transfers a targeted skill - temporal reasoning - despite weak aggregate performance. Per-type differences substantially exceed aggregate differences, indicating that single-number benchmark comparisons systematically underreport curriculum effects. We further report two practical lessons from adapting GRPO to a single-GPU regime: cross-benchmark mixing requires filtering format-specific noise from memory banks to preserve training signal, and binary exact-match reward produces no learning signal at the small group sizes (G = 4) required on one GPU, motivating continuous reward functions in this regime.
# What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA
Source: [https://arxiv.org/html/2605.23067](https://arxiv.org/html/2605.23067)
Xinjie He1, Zhiyuan Lin2, Su Liu2, Jialun Wu3, Qiyang Xie4, Weikai Zhou2, and Shuai Xiao2 1Columbia University 2Independent Researcher 3Johns Hopkins University 4Northeastern University

\(May 2026\)

###### Abstract

Reinforcement learning \(RL\) has emerged as a viable recipe for training LLM agents to reason over external memory banks in multi\-session dialogue\. Existing work trains exclusively on a single benchmark, leaving open how the composition of training data shapes the skills a memory agent acquires\. We present a controlled empirical study that holds architecture, RL algorithm, and all hyperparameters fixed and varies only the training curriculum across three conditions: in\-domain \(LoCoMo\), mixed\-benchmark \(LoCoMo \+ LongMemEval\), and out\-of\-domain \(LongMemEval only\)\. Across two benchmarks and ten question types, curriculum composition acts as a fine\-grained lever on specialization rather than a uniform scaling factor on performance\. The mixed curriculum yields the strongest overall F1 on both evaluation sets\. Training on a narrow out\-of\-domain set transfers a targeted skill, temporal reasoning, despite weak aggregate performance\. Per\-type differences substantially exceed aggregate differences, indicating that single\-number benchmark comparisons systematically underreport curriculum effects\. We further report two practical lessons from adapting GRPO to a single\-GPU regime: cross\-benchmark mixing requires filtering format\-specific noise from memory banks to preserve training signal, and binary exact\-match reward produces no learning signal at the small group sizes \($G=4$\) required on one GPU, motivating continuous reward functions in this regime\.

## 1 Introduction

Large language models operate within fixed context windows, with no persistent memory across interactions\. This limitation is acute in multi\-session dialogue, where users expect the system to recall preferences, events, and relationships from prior conversations\. Recent work addresses this by augmenting LLMs with external memory banks: structured stores that persist across sessions and support retrieval at inference time \[[2](https://arxiv.org/html/2605.23067#bib.bib1), [27](https://arxiv.org/html/2605.23067#bib.bib2), [10](https://arxiv.org/html/2605.23067#bib.bib3), [18](https://arxiv.org/html/2605.23067#bib.bib4)\]\. A key challenge is learning to use retrieved memories well: selecting relevant entries from a noisy candidate set, reasoning across them, and producing concise answers\. Heuristic pipelines rely on fixed retrieval rules; RL\-based approaches \[[28](https://arxiv.org/html/2605.23067#bib.bib5)\] instead let the agent discover such selection and reasoning patterns through outcome\-driven training, achieving competitive results with small supervision budgets\.

The framing that has dominated this line of work is training on a single benchmark and reporting aggregate scores\. Yet memory\-augmented QA is a multi\-skill problem: different question types exercise different combinations of retrieval precision, multi\-hop composition, temporal ordering, and knowledge\-update tracking\. When a single benchmark exercises only a subset of these skills, single\-benchmark training implicitly selects which capabilities the RL signal reinforces\.

This raises two questions the existing literature leaves open\. First, does exposing the RL signal to a broader source of question types by mixing benchmarks bias the policy toward a more general memory skill set, or does it dilute the in\-domain skill without cross\-benchmark gains? Second, when only a small out\-of\-domain set is available, does RL surface a targeted capability, or does it fail entirely?
We refer to the resulting per\-question\-type profile as the policy’s *specialization*: systematic variation in per\-type strengths induced by training curriculum composition\.

To answer these questions, we fix architecture \(Qwen\-2\.5\-7B with LoRA\), RL algorithm \(GRPO\), and all optimization hyperparameters, and vary only the training curriculum\. Config A replicates the single\-benchmark baseline of prior RL\-memory work with 152 LoCoMo QA pairs\. Config B mixes LoCoMo with 60 LongMemEval pairs \(212 total\)\. Config C trains on the 60 LongMemEval pairs alone\. All three are evaluated on both LoCoMo \(1,307 test questions, 4 types\) and LongMemEval \(415 test questions, 6 types\)\.

Our contributions are as follows\. \(i\) We present a controlled curriculum study for RL\-based memory agents in which architecture, algorithm, and hyperparameters are held fixed and only the training source varies\. Under this design, per\-question\-type differences substantially exceed aggregate differences, so curriculum composition acts as a lever on specialization rather than on overall accuracy \(Section 4\.2\)\. \(ii\) We draw practical guidance from this design space: mixing benchmarks yields the strongest generalist in our setting, a narrow out\-of\-domain set can induce a targeted behavior \(temporal reasoning\), and the transition between targeted specialization and more stable aggregate gains appears between roughly 60 and 150 training examples in our setup \(Section 5\.3\)\. \(iii\) We document two engineering findings that affect the reproducibility of GRPO\-based memory training on a single GPU: cross\-benchmark data requires filtering format\-specific noise from the memory bank \(Section 5\.1\), and binary exact\-match reward collapses to zero advantage at $G=4$, strongly motivating continuous reward functions when the large\-group regime is not available \(Section 5\.2\)\.

![Refer to caption](https://arxiv.org/html/2605.23067v1/fig1_rl_agent_experiment.png)

Figure 1: Experimental design\. Three curricula, identical training recipe, evaluated on two benchmarks with per\-type F1 breakdown\.

## 2 Related Work

### 2\.1 Memory\-Augmented LLM Agents

The challenge of equipping LLMs with persistent memory has motivated several architectural approaches\. Early agent frameworks such as Reflexion \[[22](https://arxiv.org/html/2605.23067#bib.bib27)\] demonstrated the value of persistent state for multi\-step reasoning, though their memory policies are largely handcrafted\. Recent memory systems build on this foundation: MemGPT \[[14](https://arxiv.org/html/2605.23067#bib.bib28)\] treats the LLM’s context window as a virtual memory with OS\-inspired paging between a primary and secondary store\. Mem0 \[[2](https://arxiv.org/html/2605.23067#bib.bib1)\] provides a modular memory system with explicit CRUD operations\. A\-Mem \[[27](https://arxiv.org/html/2605.23067#bib.bib2)\] introduces dynamic agentic memory with structured entries\. LangMem \[[10](https://arxiv.org/html/2605.23067#bib.bib3)\] chains memory entries across sessions\. Zep \[[18](https://arxiv.org/html/2605.23067#bib.bib4)\] employs a temporally\-aware knowledge graph for agent memory, benchmarking directly against MemGPT on Deep Memory Retrieval and LongMemEval\. These systems rely on heuristic memory management policies\.

Recent work has begun applying RL to memory\-augmented agents\. Memory\-R1 \[[28](https://arxiv.org/html/2605.23067#bib.bib5)\] trains both a Memory Manager \(for CRUD operations\) and an Answer Agent \(for memory\-grounded QA\) using GRPO \[[21](https://arxiv.org/html/2605.23067#bib.bib8)\], achieving strong results with 152 training examples\. Our work focuses on the Answer Agent component, the agent that selects and reasons over retrieved memories to answer questions, and studies how training data composition affects its learned skills\. We use heuristic memory construction and focus our RL training on the answer generation policy, isolating the effect of curriculum composition from memory management quality\.

### 2\.2 Benchmarks for Long\-Term Memory

LoCoMo \[[12](https://arxiv.org/html/2605.23067#bib.bib6)\] \(ACL 2024\) provides multi\-session dialogues averaging 26,000 tokens with approximately 200 questions per dialogue spanning single\-hop, multi\-hop, open\-domain, and temporal reasoning\. It features rich multi\-party conversations between named speakers\. LongMemEval \[[25](https://arxiv.org/html/2605.23067#bib.bib7)\] \(ICLR 2025\) offers 500 questions across six categories: single\-session\-user, single\-session\-assistant, single\-session\-preference, multi\-session, temporal\-reasoning, and knowledge\-update\. Each question is paired with approximately 40 haystack sessions in user\-assistant chat format, testing precise retrieval from large conversation histories\. Both benchmarks exercise retrieval\-augmented generation \[[11](https://arxiv.org/html/2605.23067#bib.bib17)\] over dialogue, relying on dense retrievers \[[9](https://arxiv.org/html/2605.23067#bib.bib18), [19](https://arxiv.org/html/2605.23067#bib.bib19)\] to surface relevant context\. These benchmarks test complementary skills: LoCoMo emphasizes reasoning over rich dialogues, while LongMemEval emphasizes precise retrieval and temporal reasoning\. This complementarity motivates our mixed\-curriculum approach\.

### 2\.3 Curriculum Learning for RL\-Based LLM Training

Curriculum learning, the structuring of training data to improve learning, is a long\-standing idea \[[1](https://arxiv.org/html/2605.23067#bib.bib24), [5](https://arxiv.org/html/2605.23067#bib.bib25)\] that has been revisited for LLM post\-training\. Recent work in this area has focused on the *difficulty axis*: difficulty\-based curricula \[[15](https://arxiv.org/html/2605.23067#bib.bib9), [8](https://arxiv.org/html/2605.23067#bib.bib10)\] schedule examples from easy to hard, and distribution\-level curricula \[[24](https://arxiv.org/html/2605.23067#bib.bib11)\] reweight data sources to manage effective problem difficulty\. In RL specifically, reward variance and reward shaping have both been used as proxies for difficulty\.

Our axis is different\. We study *source composition* \(which benchmarks the training data is drawn from\) while keeping the per\-example difficulty signal unchanged\. We do not order examples, reweight them, or prune them by difficulty; Configs A, B, and C see the same per\-example reward function under the same optimizer\. The question is whether widening the source distribution at fixed per\-example signal changes which skills the policy acquires\. Under the tightly controlled design we adopt, this axis is orthogonal to difficulty\-based curricula, and the two could be combined in future work\. To our knowledge, source\-level curriculum composition has not been studied in the RL\-for\-memory\-agents setting\.

### 2\.4 GRPO and Reward Design

Group Relative Policy Optimization \(GRPO\) \[[21](https://arxiv.org/html/2605.23067#bib.bib8)\] computes advantages relative to a group of $G$ sampled completions, eliminating the need for the learned value function used in PPO\-based RLHF \[[13](https://arxiv.org/html/2605.23067#bib.bib20), [20](https://arxiv.org/html/2605.23067#bib.bib21)\]\. GRPO also sidesteps the preference\-pair formulation of DPO \[[17](https://arxiv.org/html/2605.23067#bib.bib22)\] by scoring completions directly against a task reward\. It was used successfully in DeepSeek\-R1 \[[3](https://arxiv.org/html/2605.23067#bib.bib12)\] and subsequent RL\-for\-LLM work \[[28](https://arxiv.org/html/2605.23067#bib.bib5)\]\. A less\-studied aspect of GRPO is the interaction between reward sparsity and group size\. We show that binary exact\-match reward, which is used for the Memory\-R1 Answer Agent \[[28](https://arxiv.org/html/2605.23067#bib.bib5)\] and for DeepSeek\-R1 verifiable tasks \[[3](https://arxiv.org/html/2605.23067#bib.bib12)\] in the large\-group regime, collapses to zero within\-group variance at the small group sizes \($G=4$\) that fit on a single GPU, producing no task\-relevant gradient for the Answer Agent\. We treat this as a practical barrier to single\-GPU reproduction rather than a theoretical claim, and analyze it in Section 5\.2\.
Taken together, these threads \(memory\-augmented agents \[[2](https://arxiv.org/html/2605.23067#bib.bib1), [27](https://arxiv.org/html/2605.23067#bib.bib2), [10](https://arxiv.org/html/2605.23067#bib.bib3), [18](https://arxiv.org/html/2605.23067#bib.bib4), [14](https://arxiv.org/html/2605.23067#bib.bib28)\], RL for memory \[[28](https://arxiv.org/html/2605.23067#bib.bib5)\], long\-term dialogue benchmarks \[[12](https://arxiv.org/html/2605.23067#bib.bib6), [25](https://arxiv.org/html/2605.23067#bib.bib7)\], and curriculum learning \[[15](https://arxiv.org/html/2605.23067#bib.bib9), [8](https://arxiv.org/html/2605.23067#bib.bib10), [24](https://arxiv.org/html/2605.23067#bib.bib11)\]\) leave this intersection comparatively underexplored\. Prior RL\-for\-memory work fixes a single source benchmark, so architecture/algorithm effects cannot be separated from data\-composition effects\. Prior curriculum work targets general mathematical or code reasoning rather than memory\-grounded QA\. And prior GRPO recipes assume a large\-group regime in which the reward\-variance issue is hidden\. Our work fills this gap with a controlled source\-level comparison and reports the small\-group reward\-variance constraint as a practical consequence\.

## 3 Method

### 3\.1 Task: Memory\-Augmented Question Answering

Given a multi\-session dialogue history and a question, the task is to answer the question using information distributed across sessions\. The agent receives a set of retrieved memory entries \(extracted from the dialogue\) and must select relevant entries, reason over them, and produce a concise answer\. Formally, the agent policy $\pi_{\theta}$ maps a question $q$ and retrieved memories $\mathcal{M}_{\text{ret}}$ to an answer $y$:

$$y \sim \pi_{\theta}(\cdot \mid q, \mathcal{M}_{\text{ret}})$$

where $\mathcal{M}_{\text{ret}}$ is a set of top\-$k$ memory entries retrieved via embedding similarity from the full memory bank\.
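As a minimal sketch of the retrieval step, the snippet below ranks memory entries by cosine similarity to a query vector and keeps the top $k$\. The function names `cosine` and `retrieve_top_k` and the toy two\-dimensional vectors are illustrative assumptions; a real system would use encoder embeddings rather than hand\-written vectors\.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_top_k(query_vec, memory_vecs, k):
    """Return indices of the k memory entries most similar to the query."""
    ranked = sorted(
        range(len(memory_vecs)),
        key=lambda i: cosine(query_vec, memory_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings (hypothetical values, for illustration only).
memory_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
top = retrieve_top_k([1.0, 0.1], memory_vecs, k=2)
```

Here `top` holds the indices of the retrieved set $\mathcal{M}_{\text{ret}}$; the policy would then be prompted with the corresponding memory texts\.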

### 3\.2 Training Configurations

All three configurations share identical hyperparameters, differing only in training data:

Table 1: Training configurations\. Data composition is the only variable across runs\.

### 3\.3 RL Training with GRPO

We fine\-tune Qwen\-2\.5\-7B\-Instruct \[[16](https://arxiv.org/html/2605.23067#bib.bib26)\] with LoRA \[[7](https://arxiv.org/html/2605.23067#bib.bib14)\] \($r=16$, $\alpha=32$\), a parameter\-efficient adapter approach \[[6](https://arxiv.org/html/2605.23067#bib.bib16)\] that keeps most of the base model frozen\. Training uses GRPO \[[21](https://arxiv.org/html/2605.23067#bib.bib8)\] with group size $G=4$, the largest value that fits on a single 48 GB GPU\. The reward is the sum of a token\-level F1 term between extracted answer and gold answer \(primary\) and a small XML format term capped at 0\.2 \(secondary\)\. We initially adopted binary exact\-match reward following prior work \[[28](https://arxiv.org/html/2605.23067#bib.bib5)\], but this produced zero gradient signal at $G=4$; the switch to F1 is analyzed in Section 5\.2\. We use a learning rate of $5\times 10^{-6}$ with cosine decay, effective batch size 4 \(batch size 1 with 4 gradient accumulation steps\), and a 512\-token completion cap\. Configs A and B train for 2 epochs and Config C for 3, corresponding to roughly 76, 106, and 45 gradient steps respectively; the larger epoch count on C partially compensates for its smaller dataset, but the total number of updates remains unequal\. All training runs on a single NVIDIA L40S \(48 GB\)\.

The agent is prompted to output structured XML with <selected\_memories\>, <reasoning\>, and <answer\> tags\. The format reward provides a small bonus for correct structure while the F1 reward drives task learning\. Full hyperparameters are in Appendix A\.
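The reward described above can be sketched as follows, under stated assumptions\. The text specifies only a token\-level F1 term plus an XML format term capped at 0\.2; the SQuAD\-style normalization, the equal per\-tag split of the format bonus, and all function names here are hypothetical choices rather than the authors' implementation\.

```python
import re
import string
from collections import Counter

FORMAT_CAP = 0.2  # ceiling on the format term, as stated in the text
TAGS = ("selected_memories", "reasoning", "answer")


def normalize(text):
    """Assumed normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(pred, gold):
    """Token-level F1 between a predicted and a gold answer string."""
    p, g = normalize(pred), normalize(gold)
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def extract_answer(completion):
    """Pull the text inside <answer>...</answer>, or '' if absent."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return match.group(1).strip() if match else ""


def format_bonus(completion):
    """Assumed scheme: each well-formed tag pair earns an equal share."""
    present = sum(
        1 for t in TAGS
        if re.search(rf"<{t}>.*?</{t}>", completion, re.DOTALL)
    )
    return FORMAT_CAP * present / len(TAGS)


def reward(completion, gold):
    """Continuous reward: F1 on the extracted answer plus format bonus."""
    return token_f1(extract_answer(completion), gold) + format_bonus(completion)
```

Because the F1 term is continuous, two completions that differ by a single token typically receive different rewards, which is what keeps within\-group variance non\-zero at small group sizes\.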

### 3\.4 Memory Construction and Retrieval

For each example we construct a memory bank from dialogue sessions\. We first extract substantive turns, keeping only those longer than five words and discarding greetings\. For chat\-format data \(LongMemEval\), we additionally filter out long assistant responses \(more than 40 words\), retaining user turns and short confirmations; this reduces the bank by roughly 50% and improves retrieval precision, as analyzed in Section 5\.1\. The memory bank is capped at 200 entries per conversation\. At inference time we retrieve the top 60 entries by embedding similarity using a Sentence\-BERT\-style encoder\[[19](https://arxiv.org/html/2605.23067#bib.bib19)\]\(all\-MiniLM\-L6\-v2\), and truncate any prompt exceeding roughly 3,000 tokens to prevent context overflow\.
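The filtering rules in this section can be sketched as below\. The thresholds \(five words, 40 words, 200 entries\) come from the text; the greeting heuristic, the choice to keep the first 200 entries when capping, the `(speaker, text)` turn format, and the function names are assumptions made for illustration\.

```python
MIN_WORDS = 5             # keep only turns longer than five words
MAX_ASSISTANT_WORDS = 40  # chat-format data: drop longer assistant turns
BANK_CAP = 200            # maximum entries per conversation

# Hypothetical greeting prefixes; the text does not say how greetings
# are detected.
GREETING_PREFIXES = ("hello there", "good morning", "how are you",
                     "nice to meet you")


def looks_like_greeting(text):
    """Assumed heuristic: the turn opens with a known greeting phrase."""
    return text.lower().lstrip().startswith(GREETING_PREFIXES)


def build_memory_bank(turns, chat_format=False):
    """Build a memory bank from (speaker, text) dialogue turns."""
    bank = []
    for speaker, text in turns:
        n_words = len(text.split())
        if n_words <= MIN_WORDS or looks_like_greeting(text):
            continue
        if (chat_format and speaker == "assistant"
                and n_words > MAX_ASSISTANT_WORDS):
            continue
        bank.append(f"{speaker}: {text}")
    # Assumption: when over the cap, keep the earliest entries.
    return bank[:BANK_CAP]


turns = [
    ("user", "Hello there!"),
    ("user", "I adopted a grey cat named Miso last spring"),
    ("assistant", "word " * 50),
    ("assistant", "Got it, I will remember that for you"),
]
filtered = build_memory_bank(turns, chat_format=True)
unfiltered = build_memory_bank(turns, chat_format=False)
```

With `chat_format=True` the long assistant turn is dropped while the short confirmation is retained, mirroring the LongMemEval\-specific filter described above\.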

### 3\.5 Evaluation

We evaluate on two benchmarks\. LoCoMo \[[12](https://arxiv.org/html/2605.23067#bib.bib6)\] contributes 1,307 test questions drawn from 10 dialogues and split across four question types: single\-hop \(238\), multi\-hop \(279\), open\-domain \(705\), and temporal \(85\)\. LongMemEval \[[25](https://arxiv.org/html/2605.23067#bib.bib7)\] contributes 415 test questions across 415 conversations in six categories: single\-session\-user \(55\), single\-session\-assistant \(39\), single\-session\-preference \(20\), multi\-session \(118\), temporal\-reasoning \(119\), and knowledge\-update \(64\)\. Token\-level F1 after answer normalization is our primary metric\. We additionally collect an LLM\-as\-Judge rating on a 1–5 scale using Claude 3 Haiku to score each answer on accuracy, relevance, and completeness, which we use as a compatibility check against a coarser semantic evaluator \(Section 4\.3\)\. Throughout, we use training progress bins Q1–Q4 to denote the four equal\-length quartiles of gradient steps within a run\.
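The Q1–Q4 binning can be illustrated with a short sketch\. It assumes a per\-step list of mean rewards and uses integer boundaries, so when the step count is not divisible by four the bins are only approximately equal in length; `quartile_means` is a hypothetical helper name\.

```python
def quartile_means(step_rewards):
    """Mean reward in each of four consecutive bins of gradient steps."""
    n = len(step_rewards)
    if n < 4:
        raise ValueError("need at least four steps to form quartiles")
    # Integer boundaries: bin i covers steps [i*n//4, (i+1)*n//4).
    bounds = [i * n // 4 for i in range(5)]
    return [
        sum(step_rewards[lo:hi]) / (hi - lo)
        for lo, hi in zip(bounds, bounds[1:])
    ]


# Toy trajectory (made-up values) with an upward trend.
q1, q2, q3, q4 = quartile_means([1, 1, 2, 2, 3, 3, 4, 4])
```

Comparing the Q1 and Q4 means gives a coarse view of whether reward trends upward or downward over a run\.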

## 4 Results

### 4\.1 Main Results

Table 2 presents overall results\. Config B \(mixed curriculum\) achieves the highest F1 on both benchmarks, outperforming the baseline by \+0\.012 on LoCoMo and \+0\.014 on LongMemEval\.

Table 2: Overall F1 scores\. Best RL\-trained configuration in bold\.

While overall gains are modest, the per\-question\-type analysis reveals substantially larger and more informative differences\.

### 4\.2 Per\-Question\-Type Analysis: Where Curriculum Composition Matters

The central empirical finding of this paper is that different training curricula produce models whose per\-type profiles diverge far more than their aggregate scores suggest\. Tables 3 and 4 show per\-type differences that are, in magnitude, several times larger than the gap in overall F1 between any two configurations\. Different question types stress different combinations of retrieval filtering and compositional reasoning\. Mixed curricula expose the policy to a broader distribution of such behaviors, while narrow curricula reinforce a smaller subset\. We therefore expect curriculum composition to shape specialization patterns more strongly than aggregate performance\. The per\-type results below are broadly consistent with this interpretation\.

Table 3: LoCoMo F1 by question type\. Bold indicates the best\-performing RL\-trained configuration; Best $\Delta$ is that configuration’s gain over baseline, or a dash if no trained configuration exceeds baseline\.

On LoCoMo, per\-category F1 gaps between configurations are small, within roughly 0\.02 F1 in every row of Table 3\. Config A shows the largest per\-category gains over the no\-RL baseline, most visibly on temporal \(\+0\.015\) and multi\-hop \(\+0\.013\), reflecting the advantage of in\-domain training on category\-specific retrieval patterns\. We note that Config B has 40% more training examples than Config A \(212 vs 152\)\. However, the additional 60 examples come from LongMemEval, and Config C shows that 60 LongMemEval examples alone produce negligible overall improvement\. This suggests Config B’s gains stem from curriculum diversity \(the combination of LoCoMo and LongMemEval skills\) rather than simply having more data\.

Table 4: LongMemEval F1 by question type\. Bold indicates the best\-performing RL\-trained configuration; Best $\Delta$ is that configuration’s gain over baseline, or a dash if no trained configuration exceeds baseline\.

We read Tables 3 and 4 as a set of directional observations; we do not claim individual per\-type gains are statistically robust at the evaluation sizes available \(category $n$ ranges from 20 to 705\)\. Different curricula concentrate their benefit on different question types\. Config A’s largest LoCoMo gains \(temporal \+0\.015, multi\-hop \+0\.013\) target categories that stress in\-domain retrieval and reasoning\. Config B’s largest LongMemEval gains \(knowledge\-update \+0\.023, single\-session\-user \+0\.035\) target categories that stress cross\-session fact tracking and preference retrieval\. Config B’s aggregate advantage on both benchmarks \(Table 2\) is therefore not a uniform lift but a redistribution across categories\.

The small out\-of\-domain curriculum produces a more concentrated specialization: Config C posts the highest temporal\-reasoning score \(0\.202\) despite only a modest aggregate lift over the no\-RL baseline, suggesting that 60 LongMemEval examples suffice to expose the policy to temporal composition patterns but not to broadly strengthen retrieval\. On categories where no trained model exceeds baseline, notably single\-session\-preference \($n=20$\), the differences are small in absolute terms and likely reflect sampling variation at that category size\. The largest single per\-type gain in the study \(\+0\.057 on single\-session\-assistant, Config A, $n=39$\) should be read cautiously for the same reason\. The aggregate\-vs\-per\-type gap \(per\-type differences several times the size of any overall F1 gap\) is the pattern we emphasize, because it is present across categories of different sizes and across both benchmarks\.

![Refer to caption](https://arxiv.org/html/2605.23067v1/fig2_heatmap.png)

Figure 2: Per\-question\-type F1 across all models and both benchmarks\. Color intensity shows delta from baseline \(green = improvement, red = regression\)\. Bold values indicate the best RL\-trained configuration per type\. Curriculum effects concentrate in specific question types, with per\-type differences several times larger than overall gaps\.

### 4\.3 LLM\-as\-Judge as a Compatibility Check

A natural concern about using token\-level F1 as the primary metric is that F1 under\-rewards semantically correct but lexically different answers, which could in principle hide curriculum effects that a semantic judge would surface\. To check for this, we score every model’s output with Claude 3 Haiku on a 1–5 scale covering accuracy, relevance, and completeness \(Appendix C\)\. Mean scores cluster tightly across all four models \(3\.22–3\.39\), and no model falls below a mean of 3\.0 on either benchmark, so RL training does not degrade answer quality in a way the judge would flag\. Critically, the per\-model ordering under the judge is not systematically different from the ordering under F1 — the judge does not reveal a hidden curriculum effect that F1 missed\. We therefore keep F1 as the primary metric for the rest of the analysis; we do not read the judge as a competing signal, only as a compatibility check against a coarser semantic evaluator\.

## 5 Analysis

### 5\.1 Memory Bank Preprocessing Matters for Cross\-Benchmark Transfer

LongMemEval uses a user\-assistant chat format where assistant responses are typically long, generic advice \(e\.g\., "That sounds great\! Here are some tips for…"\)\. When included in the memory bank, these responses constitute approximately 50% of entries but contribute no useful facts for answering questions about the user\. We compared two versions of the Config B training run\. The first version \(v1\) stored all dialogue turns as memories, yielding about 466 entries per example with roughly 50% assistant filler\. The second version \(v2\) filtered out long assistant responses, leaving about 231 user\-focused entries per example\.

Table 5: Effect of memory bank preprocessing on Config B training signal\.

The filtered version showed 22% higher mean F1 during training and a stronger upward trend\. This indicates that, in our setup, memory bank quality directly affects RL training signal quality\. Practitioners building cross\-benchmark curricula should preprocess data to remove format\-specific noise rather than naively mixing sources\.

### 5\.2 Reward Sparsity and GRPO Group Size

Binary exact\-match \(EM\) reward assigns 1\.0 for exact matches and 0\.0 otherwise\. GRPO standardizes rewards within a group of $G$ candidates: $A_{i}=(r_{i}-\bar{r})/\sigma_{r}$\. When all candidates in a group receive the same reward, $\sigma_{r}=0$ and no gradient flows\. With large group sizes \($G\approx 128$\) some candidates are likely to hit exact match, preserving within\-group variance; with $G=4$ on a single GPU, the probability of any candidate exactly matching the gold answer is near zero for open\-ended QA\.

In our initial experiments with EM reward, the task reward component was 0\.0 on every training step across the full run \(several hundred steps\)\. The total reward was capped at 0\.2 \(the format reward ceiling\), and GRPO produced no task\-relevant gradient signal\. Switching to token\-level F1 reward resolved this immediately: at step 5, task reward was 0\.23 with std=0\.03, providing sufficient variance for learning\.

We do not claim this issue is unique to GRPO; sparse rewards are a known difficulty\. Rather, GRPO makes the variance collapse especially visible because advantages are normalized within each sampled group\. Under continuous reward \(F1\), within\-group variance is preserved at any group size; under binary reward in the small\-group regime it is not\. The practical rule of thumb is straightforward: on single\-GPU GRPO, prefer continuous reward functions\. This also intersects with known concerns about reward specification, where overly sparse or mis\-specified rewards cause agents to exploit proxy signals rather than learn the intended skill \[[23](https://arxiv.org/html/2605.23067#bib.bib23)\]\.
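The variance collapse can be reproduced in a few lines\. This is a sketch under assumptions: it uses the population standard deviation, returns zero advantages when the group has no spread \(practical implementations often add a small constant to the denominator instead\), and treats completions as independent with a made\-up per\-completion match probability\. All names are hypothetical\.

```python
import statistics


def group_advantages(rewards):
    """Standardize rewards within one sampled group (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # No within-group spread: every advantage is zero, so the
        # policy-gradient term for this group vanishes.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


def prob_uniform_binary_group(p_match, group_size):
    """Chance that all completions share one binary reward, assuming
    independent completions with match probability p_match."""
    return (1.0 - p_match) ** group_size + p_match ** group_size


# Binary EM with no matches: only the 0.2 format ceiling is earned.
em_adv = group_advantages([0.2, 0.2, 0.2, 0.2])

# Continuous F1-style rewards (made-up values) keep some spread.
f1_adv = group_advantages([0.1, 0.2, 0.3, 0.4])

# Hypothetical 1% exact-match rate, small versus large group.
small_group = prob_uniform_binary_group(0.01, 4)
large_group = prob_uniform_binary_group(0.01, 128)
```

Under these assumptions almost every group of four carries no signal with a binary reward, whereas a group of 128 is much more likely to contain at least one match\.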

### 5\.3 Training Set Size and Observed Thresholds

Config C’s mean reward declines from Q1 to Q4 \($0.344 \to 0.325$\) across its three epochs, while Configs A and B both trend upward\. This is consistent with Config C revisiting each example three times and beginning to fit the training distribution rather than continuing to improve on held\-out questions\. At the same time, Config C is not simply noise around the no\-RL baseline: it achieves the highest temporal\-reasoning score of any configuration \(0\.202\), indicating that a narrow 60\-example set can transfer a targeted capability under GRPO\. Combined with the upward trajectories of Configs A \($n=152$\) and B \($n=212$\), the pattern in our study is that very small training sets \($n\approx 60$\) can induce targeted behaviors but do not yield reliable aggregate improvement, while roughly 150 examples suffice for positive aggregate deltas\. The transition between targeted specialization and more stable aggregate gains therefore appears between roughly 60 and 150 examples in our setup; a finer scan would be needed to pin it down\.

Table 6: Reward trajectory and overall F1 gain by training set size\. F1 gain is the absolute change over baseline; the range spans LoCoMo and LongMemEval test sets\.

## 6 Limitations and Future Work

Single\-GPU constraints\. Our experiments use LoRA \[[7](https://arxiv.org/html/2605.23067#bib.bib14)\] with $G=4$, compared to full fine\-tuning with $G\approx 128$ in prior RL\-for\-memory work \[[28](https://arxiv.org/html/2605.23067#bib.bib5)\]\. Memory\-efficient variants such as QLoRA \[[4](https://arxiv.org/html/2605.23067#bib.bib15)\] could further reduce the footprint but were not needed on the L40S\. The smaller group size necessitated switching from exact\-match to F1 reward and likely limits absolute performance\. Full fine\-tuning, and the resulting larger groups, would strengthen the curriculum comparison\.

Heuristic memory construction\. We use heuristic extraction rather than a trained Memory Manager\. Prior work \[[28](https://arxiv.org/html/2605.23067#bib.bib5)\] attributes roughly 7\.5 F1 points to RL\-trained memory management\. Whether curriculum composition has analogous specialization effects on a learned Memory Manager is an open question\.

Limited benchmarks\. We evaluate on LoCoMo and LongMemEval\. MSC \[[26](https://arxiv.org/html/2605.23067#bib.bib13)\] and EverMemBench would broaden the evaluation axis\.

Single model family\. All experiments use Qwen\-2\.5\-7B\-Instruct\. Replication across Llama\-family and Mistral\-family backbones would test whether the specialization pattern we report is model\-specific\.

Coarse data\-size scan\. Our three configurations test 60, 152, and 212 training examples\. A finer scan between 60 and 152 would sharpen any claim about the effective data\-size regime\.

## 7 Conclusion

We presented a controlled study of training data composition in RL\-based memory agents, holding model, algorithm, and optimizer fixed and varying only the curriculum source\. Across two benchmarks and ten question types, curriculum composition acts more like a lever on specialization than a uniform scaling factor on aggregate performance: the mixed curriculum produces the strongest generalist in our setting, a narrow out\-of\-domain set transfers a specific temporal\-reasoning behavior despite weak aggregate scores, and per\-type gaps are several times the size of any aggregate gap\. Alongside the curriculum findings, we report two practical lessons from adapting GRPO to a single GPU: cross\-benchmark data must be preprocessed to remove format\-specific noise to preserve training signal, and binary exact\-match reward collapses to zero advantage at the small group sizes that fit on one GPU, strongly motivating continuous reward functions in this regime\. We hope this controlled slice of the design space is useful to practitioners reasoning about how to allocate limited RL training budget across benchmarks\.

## 8 Reproducibility Statement

All experiments were conducted on a single NVIDIA L40S GPU \(48 GB\) using LoRA fine\-tuning\. Training configurations, hyperparameters, and evaluation scripts are provided in the supplementary material\.

Trained checkpoints and prediction files are available at: [https://github\.com/EvaxHe/rl\-memory\-curriculum](https://github.com/EvaxHe/rl-memory-curriculum)\. The exact code revision used for this preprint is tagged `v1.0-arxiv` in that repository\.

## References

- \[1\] \(2009\) Curriculum Learning\. In International Conference on Machine Learning \(ICML\)\.
- \[2\] P\. Chhikara, D\. Khant, S\. Aryan, T\. Singh, et al\. \(2025\) Mem0: Building Production\-Ready AI Agents with Scalable Long\-Term Memory\. arXiv preprint arXiv:2504\.19413\.
- \[3\] DeepSeek\-AI, D\. Guo, D\. Yang, H\. Zhang, J\. Song, et al\. \(2025\) DeepSeek\-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning\. arXiv preprint arXiv:2501\.12948\.
- \[4\] T\. Dettmers, A\. Pagnoni, A\. Holtzman, and L\. Zettlemoyer \(2023\) QLoRA: Efficient Finetuning of Quantized LLMs\. In Advances in Neural Information Processing Systems \(NeurIPS\)\. arXiv:2305\.14314\.
- \[5\] A\. Graves, M\. G\. Bellemare, J\. Menick, R\. Munos, and K\. Kavukcuoglu \(2017\) Automated Curriculum Learning for Neural Networks\. In International Conference on Machine Learning \(ICML\)\. arXiv:1704\.03003\.
- \[6\] N\. Houlsby, A\. Giurgiu, S\. Jastrzebski, B\. Morrone, Q\. de Laroussilhe, et al\. \(2019\) Parameter\-Efficient Transfer Learning for NLP\. In International Conference on Machine Learning \(ICML\)\. arXiv:1902\.00751\.
- \[7\] E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\) LoRA: Low\-Rank Adaptation of Large Language Models\. In International Conference on Learning Representations \(ICLR\)\. arXiv:2106\.09685\.
- \[8\] G\. Jiang, W\. Feng, G\. Quan, C\. Hao, Y\. Zhang, G\. Liu, and H\. Wang \(2025\) VCRL: Variance\-Based Curriculum Reinforcement Learning for Large Language Models\. arXiv preprint arXiv:2509\.19803\.
- \[9\] V\. Karpukhin, B\. Oguz, S\. Min, P\. Lewis, L\. Wu, S\. Edunov, D\. Chen, and W\.\-t\. Yih \(2020\) Dense Passage Retrieval for Open\-Domain Question Answering\. In Proceedings of the Conference on Empirical Methods in Natural Language Processing \(EMNLP\)\. arXiv:2004\.04906\.
- \[10\] LangChain \(2025\) LangMem SDK for Agent Long\-Term Memory\. Blog post, [https://www\.langchain\.com/blog/langmem\-sdk\-launch](https://www.langchain.com/blog/langmem-sdk-launch)\. Accessed May 2026\.
- \[11\] P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, et al\. \(2020\) Retrieval\-Augmented Generation for Knowledge\-Intensive NLP Tasks\. In Advances in Neural Information Processing Systems \(NeurIPS\)\. arXiv:2005\.11401\.
- \[12\] A\. Maharana, D\.\-H\. Lee, S\. Tulyakov, M\. Bansal, F\. Barbieri, and Y\. Fang \(2024\) Evaluating Very Long\-Term Conversational Memory of LLM Agents\. In Proceedings of the Annual Meeting of the Association for Computational Linguistics \(ACL\)\. arXiv:2402\.17753\.
- \[13\] L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. Wainwright, et al\. \(2022\) Training Language Models to Follow Instructions with Human Feedback\. In Advances in Neural Information Processing Systems \(NeurIPS\)\. arXiv:2203\.02155\.
- \[14\] C\. Packer, S\. Wooders, K\. Lin, V\. Fang, S\. G\. Patil, I\. Stoica, and J\. E\. Gonzalez \(2023\) MemGPT: Towards LLMs as Operating Systems\. arXiv preprint arXiv:2310\.08560\.
- \[15\] S\. Parashar, S\. Gui, X\. Li, H\. Ling, S\. Vemuri, et al\. \(2025\) Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning\. arXiv preprint arXiv:2506\.06632\.
- \[16\] Qwen Team \(2024\) Qwen2\.5 Technical Report\. arXiv preprint arXiv:2412\.15115\.
- \[17\] R\. Rafailov, A\. Sharma, E\. Mitchell, S\. Ermon, C\. D\. Manning, and C\. Finn \(2023\) Direct Preference Optimization: Your Language Model Is Secretly a Reward Model\. In Advances in Neural Information Processing Systems \(NeurIPS\)\. arXiv:2305\.18290\.
- \[18\] P\. Rasmussen, P\. Paliychuk, T\. Beauvais, J\. Ryan, and D\. Chalef \(2025\) Zep: A Temporal Knowledge Graph Architecture for Agent Memory\. arXiv preprint arXiv:2501\.13956\.
- \[19\] N\. Reimers and I\. Gurevych \(2019\) Sentence\-BERT: Sentence Embeddings Using Siamese BERT\-Networks\. In Proceedings of the Conference on Empirical Methods in Natural Language Processing \(EMNLP\)\. arXiv:1908\.10084\.
- \[20\] J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\) Proximal Policy Optimization Algorithms\. arXiv preprint arXiv:1707\.06347\.
- \[21\] Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, et al\. \(2024\) DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models\. arXiv preprint arXiv:2402\.03300\.
- \[22\] N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\) Reflexion: Language Agents with Verbal Reinforcement Learning\. In Advances in Neural Information Processing Systems \(NeurIPS\)\. arXiv:2303\.11366\.
- \[23\] J\. Skalse, N\. H\. R\. Howe, D\. Krasheninnikov, and D\. Krueger \(2022\) Defining and Characterizing Reward Hacking\. In Advances in Neural Information Processing Systems \(NeurIPS\)\. arXiv:2209\.13085\.
- \[24\] Z\. Wang, G\. Cui, K\. Wan, and W\. Zhao \(2025\) DUMP: Automated Distribution\-Level Curriculum Learning for RL\-Based LLM Post\-Training\. arXiv preprint arXiv:2504\.09710\.
- \[25\] D\. Wu, H\. Wang, W\. Yu, Y\. Zhang, K\.\-W\. Chang, and D\. Yu \(2025\) LongMemEval: Benchmarking Chat Assistants on Long\-Term Interactive Memory\. In International Conference on Learning Representations \(ICLR\)\. arXiv:2410\.10813\.
- \[26\] J\. Xu, A\. Szlam, and J\. Weston \(2022\) Beyond Goldfish Memory: Long\-Term Open\-Domain Conversation\. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(ACL\), Volume 1: Long Papers, pp\. 5180–5197\. arXiv:2107\.07567\.
- \[27\] W\. Xu, K\. Mei, H\. Gao, J\. Tan, Z\. Liang, and Y\. Zhang \(2025\) A\-Mem: Agentic Memory for LLM Agents\. arXiv preprint arXiv:2502\.12110\.
- \[28\] S\. Yan, X\. Yang, Z\. Huang, E\. Nie, Z\. Ding, et al\. \(2025\) Memory\-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning\. arXiv preprint arXiv:2508\.19828\.

## Appendix A Hyperparameters

Table 7: Full training hyperparameters\.

## Appendix B Prompt Template

```
You are an Answer Agent for a conversational AI assistant.
You have access to a memory bank containing facts from past conversations.

Given a question and retrieved memories, you must:
1. Select the most relevant memories for answering the question.
2. Reason step-by-step using the selected memories.
3. Provide a concise, accurate answer.

Output format:
<selected_memories>[list the memory IDs or snippets]</selected_memories>
<reasoning>[your step-by-step reasoning]</reasoning>
<answer>[your final answer - be concise]</answer>
```
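
The tagged output format above lends itself to simple post\-processing, for example to pull out the final answer before scoring it\. The following is a minimal sketch of one way to do this; the function name `parse_response`, the choice to return `None` for a missing tag, and the sample response are assumptions for illustration and do not describe the paper's evaluation scripts\.

```python
import re

# Tag names taken from the prompt template's output format.
TAGS = ("selected_memories", "reasoning", "answer")


def parse_response(text):
    """Return a dict mapping each tag to its stripped content, or None if absent."""
    fields = {}
    for tag in TAGS:
        # Non-greedy match so only the first well-formed pair is captured;
        # DOTALL lets the reasoning span multiple lines.
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        fields[tag] = match.group(1).strip() if match else None
    return fields


# Hypothetical model response following the template.
sample = (
    "<selected_memories>[m12]</selected_memories>\n"
    "<reasoning>Memory m12 mentions the size of the new TV.</reasoning>\n"
    "<answer>55-inch</answer>"
)
parsed = parse_response(sample)
print(parsed["answer"])
```

Returning `None` rather than raising keeps malformed generations easy to detect, for instance so they can be assigned a zero reward; whether the original pipeline treats them that way is not stated in the paper\.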

## Appendix C LLM\-as\-Judge Scores

Table 8: Mean LLM\-as\-Judge scores \(Claude 3 Haiku, 1–5 scale\)\. See Section 4\.3 for interpretation\.

## Appendix D Qualitative Examples

#### Example 1: Knowledge\-update \(Config B correct, Baseline wrong\)

Question: "What is the name of the music streaming service I have been using lately?" Gold answer: "Spotify"

Table 9: Model answers for Example 1: Knowledge\-update \(Config B correct, Baseline wrong\)\. Config B’s mixed training, which includes LongMemEval examples with user\-preference questions, helps the agent locate and retrieve the specific fact from the memory bank\.

#### Example 2: Temporal reasoning \(Config C strongest\)

Question: "How many days passed between the day I started watering my herb garden and the day I harvested my first batch of fresh herbs?" Gold answer: "24 days"

Table 10: Model answers for Example 2: Temporal reasoning \(Config C strongest\)\. Although Config C’s answer is numerically incorrect, it attempts temporal reasoning rather than declining to answer, a behavioral difference attributable to LongMemEval’s temporal\-reasoning training examples\.

#### Example 3: Single\-session retrieval \(all trained models improve\)

Question: "What size is my new Samsung TV?" Gold answer: "55\-inch"

Table 11: Model answers for Example 3: Single\-session retrieval \(all trained models improve\)\. RL training broadly improves the agent’s ability to locate specific facts in the memory bank, regardless of curriculum composition\.