The Past Is Prologue: A Plug-in Controller for Selective Updates in Sequentially Evolving LLM Memory


Summary

Introduces Janus, a plug-in memory controller for LLMs that selectively accepts or rejects candidate memory updates using a Memory Momentum Trigger and a compact hybrid evaluation set, improving average accuracy by +2.7 to +4.6 points across multiple datasets.

arXiv:2606.31121v1 Announce Type: new Abstract: Sequentially evolving LLM memory enables agents to reuse past experience, but existing systems usually deploy each locally generated memory update without checking whether it improves future behavior. As a result, updates that help the current task may overwrite useful knowledge, introduce over-specific rules, or bias the final memory toward recent examples. We propose Janus, a plug-in memory controller that decides whether to accept a candidate memory update or retain the previous memory. To make this decision efficient, Janus uses a Memory Momentum Trigger to identify suspicious deviations in the memory-update trajectory, and compares old and new memories on a compact hybrid evaluation set of coverage, boundary, and fresh tasks instead of replaying the full history. Janus is method-agnostic and wraps existing updaters without changing their update rules. Across six datasets, two backbone LLMs, and two memory updaters, Janus improves average accuracy by +2.7 to +4.6 points over the corresponding base updaters.

Cached at: 07/01/26, 05:37 AM

# The Past Is Prologue: A Plug-in Controller for Selective Updates in Sequentially Evolving LLM Memory
Source: [https://arxiv.org/html/2606.31121](https://arxiv.org/html/2606.31121)
Zihan Chen♠, Songwei Dong♠, Chengshuai Shi♣, Peng Wang♠, Song Wang♠, Cong Shen♠, Jundong Li♠

♠University of Virginia, ♣Princeton University, ♠University of Central Florida

{brf3rx, hxt5ap, pw7nc, cong, jundong}@virginia.edu, cs1083@princeton.edu, song.wang@ucf.edu

###### Abstract

Sequentially evolving LLM memory enables agents to reuse past experience, but existing systems usually deploy each locally generated memory update without checking whether it improves future behavior\. As a result, updates that help the current task may overwrite useful knowledge, introduce over\-specific rules, or bias the final memory toward recent examples\. We propose Janus, a plug\-in memory controller that decides whether to accept a candidate memory update or retain the previous memory\. To make this decision efficient, Janus uses a Memory Momentum Trigger to identify suspicious deviations in the memory\-update trajectory, and compares old and new memories on a compact hybrid evaluation set of coverage, boundary, and fresh tasks instead of replaying the full history\. Janus is method\-agnostic and wraps existing updaters without changing their update rules\. Across six datasets, two backbone LLMs, and two memory updaters, Janus improves average accuracy by \+2\.7 to \+4\.6 points over the corresponding base updaters\.


## 1 Introduction

Large language models (LLMs) are increasingly deployed as sequential task-solving agents that learn from past interactions through external memory (Xiang et al., [2026](https://arxiv.org/html/2606.31121#bib.bib13); Fang et al., [2025b](https://arxiv.org/html/2606.31121#bib.bib12); Wei et al., [2025](https://arxiv.org/html/2606.31121#bib.bib11)). In this setting, the LLM repeatedly encounters tasks, produces answers, receives feedback, and updates its memory for future use. The resulting trajectories contain successful solutions, failed attempts, feedback signals, and intermediate reasoning traces, which can serve as reusable experience for improving future decisions (Zhao et al., [2024](https://arxiv.org/html/2606.31121#bib.bib10); Wei et al., [2025](https://arxiv.org/html/2606.31121#bib.bib11); Suzgun et al., [2026](https://arxiv.org/html/2606.31121#bib.bib1); Zhou et al., [2025](https://arxiv.org/html/2606.31121#bib.bib14)). This form of *sequentially evolving memory* goes beyond static conversational recall: memory is not merely a record of past interactions, but a test-time adaptation mechanism that shapes the LLM's behavior on future tasks. Such systems are important for reasoning assistants (Ho et al., [2025](https://arxiv.org/html/2606.31121#bib.bib15)), tool-use agents (Wang et al., [2025b](https://arxiv.org/html/2606.31121#bib.bib16)), and interactive decision-making systems (Zheng et al., [2025](https://arxiv.org/html/2606.31121#bib.bib17); Agrawal et al., [2025](https://arxiv.org/html/2606.31121#bib.bib18)), where performance depends not only on the current input but also on the accumulated experience.

![Refer to caption](https://arxiv.org/html/2606.31121v1/x1.png)
![Refer to caption](https://arxiv.org/html/2606.31121v1/x2.png)

Figure 1: Top: Existing sequential memory updates do not guarantee that the final memory will better support future tasks. Bottom: On GPQA, intermediate memory snapshots from two sequential memory methods exhibit non-monotonic test set performance, motivating the need to control which memory updates are deployed.

A common mechanism behind these systems is to update memory using feedback from the current task (Wang et al., [2025c](https://arxiv.org/html/2606.31121#bib.bib21); Wei et al., [2025](https://arxiv.org/html/2606.31121#bib.bib11)). This feedback is often expressed as a natural-language signal, sometimes viewed as a form of *text gradient*: instead of providing numerical gradients over parameters, the environment provides textual feedback that suggests how the agent should revise its reasoning, strategy, or memory for future tasks. Existing sequential memory methods typically follow a retrieve–solve–update loop: the agent retrieves memory, solves the current task, and then uses the current trajectory and feedback to update the memory (Suzgun et al., [2026](https://arxiv.org/html/2606.31121#bib.bib1); Zhao et al., [2024](https://arxiv.org/html/2606.31121#bib.bib10); Wei et al., [2025](https://arxiv.org/html/2606.31121#bib.bib11)). This design is intuitive and efficient, but it also introduces a fundamental risk. Each update is usually optimized locally for the most recent task, while its effect on the final memory state is rarely evaluated globally (Figure [1](https://arxiv.org/html/2606.31121#S1.F1)). As a result, a memory update that appears useful for the current task may overwrite previously useful knowledge, introduce noisy task-specific rules, or bias the memory toward recent examples. The key challenge is therefore not only how to generate memory updates, but also how to decide whether a proposed update should actually be deployed.

This creates a memory validation problem. Ideally, before deploying a candidate memory, the agent should estimate whether it improves performance beyond the current task. The most direct way to obtain such a signal is to compare the previous memory and the candidate memory on tasks that approximate the future task distribution. Since the future is unavailable, previously encountered tasks provide a natural proxy. However, validating every candidate memory on the full history is computationally impractical: the replay cost grows with the number of seen tasks and would introduce substantial latency after each update. A cheaper alternative is to compare memories at fixed intervals, such as every $N$ steps, but such schedules are heuristic and may miss harmful updates that occur between scheduled checks. Similarly, using only a small fixed replay set reduces cost but risks overfitting the deployment decision to stale examples. Thus, an effective memory controller must answer two coupled questions: *when* should old and new memory be compared, and *what* tasks should be used for comparison?

To address these challenges, we propose Janus, a plug-in memory controller that wraps existing sequential memory updaters and decides whether each candidate memory update should be accepted or rejected. Instead of treating every generated memory update as automatically beneficial, Janus views memory updating as a deployment decision: a candidate memory should replace the previous memory only when it is likely to improve the utility of the final memory for future tasks. Janus introduces two key designs to make this decision efficient. First, it uses a Memory Momentum Trigger (MMT) to decide when old and new memory should be explicitly compared. Rather than triggering comparison after every task or at a fixed interval, MMT tracks the trajectory of memory changes and triggers comparison when a candidate update deviates substantially from the recent update direction, indicating that the update may either introduce useful new knowledge or distort the memory toward recent task-specific information. Second, when comparison is triggered, Janus evaluates the previous and candidate memories on a compact hybrid evaluation batch instead of replaying the full task history. This batch combines a stored support set, which includes both coverage tasks representing the seen task distribution and boundary tasks where memory choices previously changed correctness, with a fresh slice of recently encountered tasks to avoid overfitting the decision to a fixed support set. In this way, Janus improves memory deployment by selectively testing suspicious updates while keeping replay cost controlled. Our main contributions are summarized as follows:

- Memory deployment challenge. We identify a key limitation of sequential LLM memory systems: locally generated memory updates are not necessarily globally useful, and blindly accepting them may produce final memories that are biased toward noisy task-specific information.
- Efficient plug-in memory control. We propose Janus, a method-agnostic controller that wraps existing memory updaters and efficiently decides whether to accept or reject candidate memory updates. Janus combines a Memory Momentum Trigger with a compact hybrid evaluation batch over coverage, boundary, and fresh tasks, enabling old-versus-new memory selection without modifying the underlying updater or replaying the full task history.
- Strong empirical results. Across six datasets, two backbone LLMs, and two sequential memory updater backbones, Janus consistently improves final memory usefulness, with average gains ranging from $+2.7$ to $+4.6$ percentage points over the corresponding base updaters.

## 2 Method

### 2.1 Problem Setting

![Refer to caption](https://arxiv.org/html/2606.31121v1/x3.png)

Figure 2: Overview of Janus. Given a task and the current memory $M_{t-1}$, a base updater proposes a candidate memory $\widehat{M}_t$. Janus acts as a plug-in controller that decides whether to deploy this candidate or retain the previous memory. It first uses a Memory Momentum Trigger to detect whether the candidate update deviates from the recent memory-update trajectory. If triggered, Janus compares $M_{t-1}$ and $\widehat{M}_t$ on a compact hybrid evaluation set composed of coverage, boundary, and fresh tasks, and deploys the memory with better evaluation performance. If not triggered, Janus accepts the candidate directly to avoid unnecessary replay.

Following Wei et al. ([2025](https://arxiv.org/html/2606.31121#bib.bib11)), we consider a sequentially evolving memory setting where an LLM $\mathcal{L}$ solves a stream of tasks while maintaining an external memory. Let $\mathcal{D}=\{(x_t,y_t)\}_{t=1}^{T}$ denote the task sequence. At step $t$, the model uses the previous memory $M_{t-1}$ to predict $\hat{y}_t=\mathcal{L}(x_t,M_{t-1})$. After receiving feedback $f_t$, such as correctness signals or textual critiques, a base memory updater proposes a revised memory:

$$\widehat{M}_t=\texttt{Update}\left(M_{t-1},(x_t,\hat{y}_t,f_t),\mathcal{L}\right). \tag{1}$$

Existing sequential memory methods typically deploy this candidate directly by setting $M_t=\widehat{M}_t$. In contrast, Janus treats memory updating as a deployment decision: given $M_{t-1}$ and $\widehat{M}_t$, it chooses

$$M_t\in\{M_{t-1},\widehat{M}_t\}, \tag{2}$$

with the goal of maintaining a final memory $M_T$ that generalizes well to future unseen tasks.
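The setting in Eqs. (1) and (2) can be sketched as a loop with an accept/reject gate. This is a minimal illustration, not the authors' implementation: the callables `solve`, `update`, and `accept` are hypothetical stand-ins for the LLM $\mathcal{L}$, the base updater, and the controller, and feedback is assumed here to be a plain correctness signal.

```python
from typing import Callable, List, Tuple

Memory = str
Task = Tuple[str, str]  # (input x_t, reference answer y_t)


def run_stream(
    tasks: List[Task],
    solve: Callable[[str, Memory], str],
    update: Callable[[Memory, str, str, bool], Memory],
    accept: Callable[[Memory, Memory, int], bool],
    initial_memory: Memory = "",
) -> Memory:
    """Solve-update loop where each candidate memory passes through a gate."""
    memory = initial_memory
    for t, (x, y) in enumerate(tasks, start=1):
        prediction = solve(x, memory)        # y_hat_t = L(x_t, M_{t-1})
        feedback = prediction == y           # f_t (assumed: correctness only)
        candidate = update(memory, x, prediction, feedback)  # Eq. (1)
        # Eq. (2): M_t is either the candidate or the previous memory.
        memory = candidate if accept(memory, candidate, t) else memory
    return memory
```

Passing `accept=lambda old, new, t: True` recovers the usual behavior of deploying every candidate, which is the baseline that the controller is compared against.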

### 2.2 Janus: Plug-in Memory Control

As illustrated in Figure [1](https://arxiv.org/html/2606.31121#S1.F1), blindly accepting every memory update can leave an LLM with a final memory that fails to support future unseen tasks. Motivated by this observation, we propose Janus, a plug-in controller that wraps a base memory updater and decides whether each candidate memory should be deployed. After solving the $t$-th task, the base updater proposes a candidate memory $\widehat{M}_t$ based on the previous memory $M_{t-1}$ and the current interaction. Rather than directly setting $M_t=\widehat{M}_t$, Janus chooses whether to accept the candidate memory or retain the previous memory. This design addresses two core challenges in sequential memory control: *when* to compare the previous and candidate memories, and *what* tasks to use for comparison. For the first challenge, Janus introduces a Memory Momentum Trigger, which detects suspicious deviations in the memory-update trajectory and avoids unnecessary comparisons. For the second challenge, Janus uses a compact hybrid evaluation set, combining representative, memory-sensitive, and fresh tasks to approximate full-history replay at much lower cost.

#### Memory Momentum Trigger \(MMT\)\.

Ideally, we would verify every memory update before deployment. However, comparing $M_{t-1}$ and $\widehat{M}_t$ after every task incurs substantial replay cost, while triggering comparisons at fixed intervals, e.g., every $N$ steps, is heuristic and may miss abrupt harmful updates. Janus instead views memory updates as a trajectory and triggers comparison only when the candidate update deviates substantially from recent memory evolution. Let $\phi(\cdot)$ denote a text encoder that maps a memory state into a vector space. We represent the candidate update direction as

$$z_t=\phi(\widehat{M}_t)-\phi(M_{t-1}). \tag{3}$$

Janus maintains an exponential moving average of previous update directions:

$$m_t=\beta m_{t-1}+(1-\beta)z_t, \tag{4}$$

where $\beta$ controls the strength of memory momentum. The intuition is that comparison is most valuable when a candidate update substantially changes the recent trajectory of memory evolution. Even when tasks arrive in a shuffled order, consecutive candidate updates may still induce aligned changes to the memory, corresponding to incremental refinements rather than major memory transitions. In such cases, repeatedly comparing old and new memories provides limited additional benefit. In contrast, a sharp directional deviation indicates that the candidate memory may significantly alter the deployed memory: it may introduce useful new knowledge, but it may also overwrite broadly useful information with recent-task-specific content. Janus therefore treats the historical momentum $m_{t-1}$ as a compact summary of recent memory evolution and uses directional misalignment as a signal for when explicit validation is needed.

Specifically, Janus triggers an old-versus-new memory comparison when

$$\cos(z_t,m_{t-1})<\tau, \tag{5}$$

where $\tau$ is a threshold. If the trigger does not fire, Janus directly accepts the candidate memory, i.e., $M_t=\widehat{M}_t$. If the trigger fires, Janus evaluates $M_{t-1}$ and $\widehat{M}_t$ on a compact evaluation set and deploys the memory state with better performance.
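A minimal sketch of the trigger in Eqs. (3) to (5) follows, taking precomputed embeddings $\phi(M_{t-1})$ and $\phi(\widehat{M}_t)$ as plain lists of floats. The class name `MomentumTrigger` is hypothetical. Two details are assumptions not stated in this section: the first update never fires because no momentum exists yet (equivalent to $m_0=0$), and the momentum is updated with $z_t$ regardless of whether the candidate is later accepted.

```python
import math
from typing import List, Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


class MomentumTrigger:
    def __init__(self, beta: float = 0.9, tau: float = 0.0) -> None:
        self.beta = beta
        self.tau = tau
        self.momentum: Optional[List[float]] = None  # m_{t-1}

    def step(self, old_emb: Sequence[float], new_emb: Sequence[float]) -> bool:
        """Return True when old and candidate memory should be compared."""
        z = [n - o for o, n in zip(old_emb, new_emb)]          # Eq. (3)
        if self.momentum is None:
            fire = False                     # assumption: no history, no trigger
            previous = [0.0] * len(z)        # assumption: m_0 = 0
        else:
            fire = cosine(z, self.momentum) < self.tau         # Eq. (5)
            previous = self.momentum
        # Eq. (4): exponential moving average of update directions.
        self.momentum = [
            self.beta * m + (1 - self.beta) * zi for m, zi in zip(previous, z)
        ]
        return fire
```

With the default $\tau=0.0$, the trigger fires only when the candidate update points against the recent momentum; an update orthogonal to it has cosine exactly zero and does not fire.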

#### Hybrid Trigger\-Time Evaluation Set\.

When the memory momentum trigger fires, Janus needs to decide whether the candidate memory $\widehat{M}_t$ should replace the current memory $M_{t-1}$. A reliable comparison should approximate full-history replay, but evaluating on all previously seen tasks would make the replay cost grow linearly with the task stream. Janus therefore constructs a compact hybrid evaluation set that serves three goals: covering the global support of previously seen tasks, focusing on tasks that are sensitive to memory changes, and incorporating newly encountered tasks. Formally, at trigger time $t$, Janus evaluates the two memory states on

$$\mathcal{E}_t=\mathcal{S}^{\mathrm{cov}}_t\cup\mathcal{S}^{\mathrm{bdry}}_t\cup\mathcal{F}_t, \tag{6}$$

where $\mathcal{S}^{\mathrm{cov}}_t$ is a coverage set, $\mathcal{S}^{\mathrm{bdry}}_t$ is a boundary set, and $\mathcal{F}_t$ is a fresh set. The coverage set summarizes the broad distribution of previously seen tasks, the boundary set stores memory-sensitive tasks where memory choices have previously changed the agent's behavior, and the fresh set injects tasks encountered since the last trigger. Together, these subsets provide a bounded-cost approximation to full-history replay. Given $\mathcal{E}_t$, Janus evaluates both memories using the same LLM and deploys the memory with higher evaluation performance:

$$M_t=\operatorname*{arg\,max}_{M\in\{M_{t-1},\widehat{M}_t\}}\mathrm{Performance}(M;\mathcal{E}_t,\mathcal{L}).$$

As the task stream evolves, Janus refreshes each subset of $\mathcal{E}_t$ to maintain an informative estimate of memory utility. We update each subset as follows.

Coverage set update. The coverage set $\mathcal{S}^{\mathrm{cov}}_t$ preserves representative support of the observed tasks. At each trigger time $t$, Janus refreshes this set by clustering the embeddings of all tasks seen so far:

$$\mathcal{U}^{\mathrm{cov}}_t=\{x_1,\ldots,x_t\}. \tag{7}$$

Using the previous centroids as initialization when available, Janus clusters $\{\phi(x):x\in\mathcal{U}^{\mathrm{cov}}_t\}$ and selects the task closest to each centroid as the new representative. This global refresh keeps the coverage set compact while ensuring that its representatives remain aligned with the full set of observed tasks.
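The coverage refresh can be sketched with a small warm-started k-means over task embeddings. The section does not name the clustering algorithm, so Lloyd's k-means is an assumption here, as are the function name `refresh_coverage`, the fixed iteration count, and the fallback of seeding centroids from the first `k` embeddings when no previous centroids exist.

```python
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(u: Vector, v: Vector) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v))


def refresh_coverage(
    embeddings: List[Vector],
    k: int,
    init_centroids: Optional[List[Vector]] = None,
    iters: int = 10,
) -> Tuple[List[int], List[List[float]]]:
    """Return (indices of representative tasks, final centroids)."""
    if not embeddings:
        return [], []
    # Warm start from previous centroids when available (as in the text);
    # otherwise seed with the first k embeddings (an assumption).
    seeds = init_centroids if init_centroids else embeddings[:k]
    centroids = [list(c) for c in seeds]
    for _ in range(iters):
        clusters: List[List[Vector]] = [[] for _ in centroids]
        for e in embeddings:
            nearest = min(range(len(centroids)),
                          key=lambda i: sq_dist(e, centroids[i]))
            clusters[nearest].append(e)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                dim = len(members[0])
                centroids[j] = [sum(m[d] for m in members) / len(members)
                                for d in range(dim)]
    # The task closest to each centroid becomes that cluster's representative.
    representatives: List[int] = []
    for c in centroids:
        best = min(range(len(embeddings)),
                   key=lambda i: sq_dist(embeddings[i], c))
        if best not in representatives:
            representatives.append(best)
    return representatives, centroids
```

Returning the centroids alongside the representatives lets the next trigger pass them back as `init_centroids`, which matches the warm-start behavior described above.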

Boundary set update. The boundary set $\mathcal{S}^{\mathrm{bdry}}_t$ focuses the evaluation set on memory-sensitive tasks. After comparing $M_{t-1}$ and $\widehat{M}_t$ on $\mathcal{E}_t$, Janus identifies the flip set:

$$\mathcal{B}_t=\left\{x\in\mathcal{E}_t:\mathbf{1}_{\mathrm{corr}}(x;M_{t-1})\neq\mathbf{1}_{\mathrm{corr}}(x;\widehat{M}_t)\right\}, \tag{8}$$

where $\mathbf{1}_{\mathrm{corr}}(x;M)\in\{0,1\}$ indicates whether the LLM $\mathcal{L}$ answers task $x$ correctly when using memory $M$. These tasks are informative because the choice of memory state directly changes the agent's correctness. Janus first removes tasks already selected into the coverage set:

$$\widetilde{\mathcal{B}}_t=\mathcal{B}_t\setminus\mathcal{S}^{\mathrm{cov}}_t. \tag{9}$$

It then fills the boundary set with tasks from $\widetilde{\mathcal{B}}_t$. If the number of new flip tasks is insufficient, Janus retains non-overlapping tasks from the previous boundary set $\mathcal{S}^{\mathrm{bdry}}_{t^-}$:

$$\mathcal{S}^{\mathrm{bdry}}_t\subseteq\widetilde{\mathcal{B}}_t\cup\left(\mathcal{S}^{\mathrm{bdry}}_{t^-}\setminus\mathcal{S}^{\mathrm{cov}}_t\right). \tag{10}$$

This update rule preserves hard, behavior-changing examples while allowing the boundary set to evolve as new memory-sensitive regions are discovered.
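The boundary update in Eqs. (8) to (10) reduces to a few set operations. The sketch below assumes tasks are hashable identifiers and that per-task correctness under each memory was recorded during the comparison; the function name `update_boundary` is hypothetical. How to choose among flip tasks when there are more than the boundary budget is not specified in this section, so keeping them in evaluation order is an assumption.

```python
from typing import Dict, Hashable, List, Sequence, Set


def update_boundary(
    eval_tasks: Sequence[Hashable],
    correct_old: Dict[Hashable, bool],
    correct_new: Dict[Hashable, bool],
    coverage: Set[Hashable],
    prev_boundary: Sequence[Hashable],
    size: int,
) -> List[Hashable]:
    """Build the new boundary set from flip tasks, topping up from the old one."""
    # Eq. (8): tasks whose correctness differs between old and candidate memory.
    flips = [x for x in eval_tasks if correct_old[x] != correct_new[x]]
    # Eq. (9): drop tasks already used as coverage representatives.
    new_flips = [x for x in flips if x not in coverage]
    boundary = new_flips[:size]  # assumption: keep flips in evaluation order
    # Eq. (10): if short, retain non-overlapping tasks from the previous set.
    for x in prev_boundary:
        if len(boundary) >= size:
            break
        if x not in coverage and x not in boundary:
            boundary.append(x)
    return boundary
```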

Fresh set update. Let $\ell(t)$ denote the time step of the most recent trigger before $t$. At trigger time $t$, Janus constructs the fresh set $\mathcal{F}_t$ by sampling from tasks encountered after $\ell(t)$:

$$\mathcal{F}_t\subseteq\{x_{\ell(t)+1},\ldots,x_t\}. \tag{11}$$

This set prevents memory comparison from becoming a closed loop over a fixed replay buffer. By including newly observed tasks, Janus can test whether the candidate memory helps in recently explored regions of the task stream and can also discover new behavior-changing cases.
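Putting the three subsets together, the sketch below samples a fresh set as in Eq. (11), forms the union in Eq. (6), and picks the memory with higher accuracy on it. The names `sample_fresh` and `select_memory` and the `is_correct` callable are hypothetical. Uniform sampling, keeping the previous memory on ties, and accepting the candidate when the evaluation set is empty are assumptions; the section only says the better-performing memory is deployed.

```python
import random
from typing import Callable, Hashable, List, Sequence


def sample_fresh(
    stream: Sequence[Hashable],
    last_trigger: int,
    t: int,
    k_fresh: int,
    rng: random.Random,
) -> List[Hashable]:
    """Sample up to k_fresh tasks from x_{l(t)+1}, ..., x_t (Eq. 11).

    stream[i] holds task x_{i+1}, so the window is stream[last_trigger:t].
    """
    window = list(stream[last_trigger:t])
    return rng.sample(window, min(k_fresh, len(window)))


def select_memory(
    old: str,
    new: str,
    coverage: Sequence[Hashable],
    boundary: Sequence[Hashable],
    fresh: Sequence[Hashable],
    is_correct: Callable[[Hashable, str], bool],
) -> str:
    """Deploy whichever memory scores higher on the hybrid evaluation set."""
    # Eq. (6): order-preserving union of the three subsets.
    eval_set = list(dict.fromkeys([*coverage, *boundary, *fresh]))
    if not eval_set:
        return new  # assumption: nothing to compare against, accept candidate

    def accuracy(memory: str) -> float:
        return sum(is_correct(x, memory) for x in eval_set) / len(eval_set)

    # Assumption: ties retain the previous memory.
    return new if accuracy(new) > accuracy(old) else old
```

Because the evaluation set has at most $K+K_{\mathcal{F}}$ tasks, the cost of each comparison stays bounded no matter how long the stream grows.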

## 3 Experiment

### 3.1 Experimental Settings

Table 1: Main results on six datasets with two LLMs. Accuracy (%) is reported for each dataset. Janus is applied as a plug-in controller to two base memory updaters. Shaded rows indicate Janus-enhanced variants.

Datasets and LLMs. We evaluate Janus on six datasets covering mathematical reasoning (MATH500; Hendrycks et al. ([2021](https://arxiv.org/html/2606.31121#bib.bib6)); Lightman et al. ([2024](https://arxiv.org/html/2606.31121#bib.bib7))), scientific reasoning (GPQA Diamond; Rein et al. ([2023](https://arxiv.org/html/2606.31121#bib.bib3))), professional STEM reasoning (MMLU-Pro Engineering and Physics; Wang et al. ([2024](https://arxiv.org/html/2606.31121#bib.bib2))), code generation (HumanEval; Chen et al. ([2021](https://arxiv.org/html/2606.31121#bib.bib4))), and tool/API use (APIBench-HF; Patil et al. ([2024](https://arxiv.org/html/2606.31121#bib.bib5))). For memory construction, we sample training examples from each dataset: 500 for MATH, non-overlapping GPQA Main examples for GPQA Diamond, 250 examples for each MMLU-Pro subset, 50 HumanEval problems, and 250 APIBench-HF examples. Evaluation is conducted on the corresponding held-out test sets, with another 250 examples used for testing on MMLU-Pro and APIBench-HF. We focus on APIBench-HF because other APIBench subsets are relatively small and nearly saturated in our setting. Full task prompts are provided in Appendix [C](https://arxiv.org/html/2606.31121#A3). Our main experiments use Qwen3-8B (Yang et al., [2025](https://arxiv.org/html/2606.31121#bib.bib8)) and DeepSeek-V4-Flash (DeepSeek-AI, [2026](https://arxiv.org/html/2606.31121#bib.bib9)). Unless otherwise specified, we set max-new-tokens=8192 and temperature=0.7; we enable thinking mode for Qwen3-8B and use non-thinking mode for DeepSeek-V4-Flash for cost efficiency.

Baselines. We compare Janus with a representative set of sequential memory baselines under the same task stream. Memory-free solves each task using only the underlying LLM without persistent memory, serving as a reference for measuring the benefit of external memory. ExpRAG (Wei et al., [2025](https://arxiv.org/html/2606.31121#bib.bib11)) is a lightweight experience-reuse baseline that retrieves the top-$k$ most similar past tasks according to embedding similarity. DC-RS (Suzgun et al., [2026](https://arxiv.org/html/2606.31121#bib.bib1)) maintains a dynamic cheatsheet and uses retrieval-and-synthesis to organize past experience into structured memory. ExpeL (Zhao et al., [2024](https://arxiv.org/html/2606.31121#bib.bib10)) derives reusable insights from successful and failed trajectories through reflection; we adapt it to the sequential memory setting while following its original multi-attempt strategy. For all base memory methods, we follow the original papers' recommended hyperparameter settings. For Janus, unless otherwise specified, we use support-set size $K=20$, where $K'=12$ tasks are allocated to the coverage set $\mathcal{S}^{\mathrm{cov}}_t$ and the remaining $K-K'=8$ tasks are allocated to the boundary set $\mathcal{S}^{\mathrm{bdry}}_t$. We set the fresh-set size to $K_{\mathcal{F}}=5$, the MMT threshold to $\tau=0.0$, and the momentum coefficient to $\beta=0.9$. We provide detailed algorithmic descriptions in Appendix [D](https://arxiv.org/html/2606.31121#A4) and study Janus's sensitivity to key hyperparameters in Section [3.6](https://arxiv.org/html/2606.31121#S3.SS6).

### 3.2 Main Results

Table [1](https://arxiv.org/html/2606.31121#S3.T1) reports the main results across six datasets and two LLMs. We highlight two key observations.

Structured memory updates are not always better than raw experience reuse. Compared with the memory-free baseline, most memory-based methods improve performance, showing the general value of external memory in sequential task solving. However, stronger memory abstraction does not guarantee better final performance. Although DC-RS and ExpeL extract higher-level cheatsheets or reusable insights from past trajectories, they do not consistently outperform the lightweight ExpRAG baseline, which simply retrieves similar past experiences. This suggests that extracted memory can sometimes lose useful task-specific details, introduce noisy or over-specialized rules, or bias the final memory toward recent tasks. This observation motivates our central claim: the key issue is not only how to generate memory, but also whether each generated memory should be deployed.

Janus consistently improves final memory usefulness across base updaters and LLMs. When applied to DC-RS and ExpeL, Janus improves the corresponding base updater across both LLMs and all datasets. On average, DC-RS+Janus improves over DC-RS from $79.5$ to $83.2$ on Qwen3-8B and from $76.7$ to $81.3$ on DeepSeek-V4-Flash. Similarly, ExpeL+Janus improves over ExpeL from $78.3$ to $81.5$ on Qwen3-8B and from $79.6$ to $82.3$ on DeepSeek-V4-Flash. The best-performing method in each LLM block is also a Janus-enhanced variant. These results support Janus as a method-agnostic controller: instead of replacing the base updater, it improves sequential memory systems by filtering locally generated updates and maintaining a more useful final memory for future tasks.

### 3.3 MMT Trigger Ablation

Table 2: Ablation study of the Memory Momentum Trigger. We compare Janus with alternative trigger policies using Qwen3-8B on GPQA and HumanEval. “Base” denotes the base updater without Janus. “Always” compares old and new memory after every task, while “Random” and “Periodic” trigger comparisons under comparable trigger budgets to Janus.

Table [2](https://arxiv.org/html/2606.31121#S3.T2) studies whether the Memory Momentum Trigger identifies useful moments for old-versus-new memory comparison. We compare Janus with four alternatives: *Base*, which never triggers comparison and directly accepts each candidate update; *Always*, which compares after every task; *Random*, which triggers comparison at each step with the same trigger probability as Janus; and *Periodic*, which triggers every $N$ steps, with $N$ chosen to approximately match Janus's trigger count for the same updater and dataset. This ablation targets the “when to compare” challenge: an effective controller should trigger comparison at informative memory-transition points rather than relying on a hand-designed schedule.
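The budget-matched Random and Periodic baselines can be sketched as follows, given a stream length and the number of triggers Janus used on that stream. The function name `matched_baselines` is hypothetical, and rounding the period to the nearest integer is an assumption about how "approximately match" is realized.

```python
import random
from typing import List, Tuple


def matched_baselines(
    num_steps: int, janus_triggers: int, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (random trigger steps, periodic trigger steps), 1-indexed."""
    if janus_triggers <= 0:
        return [], []
    prob = janus_triggers / num_steps                    # same trigger rate
    period = max(1, round(num_steps / janus_triggers))   # N for "Periodic"
    rng = random.Random(seed)
    steps = range(1, num_steps + 1)
    random_steps = [t for t in steps if rng.random() < prob]
    periodic_steps = [t for t in steps if t % period == 0]
    return random_steps, periodic_steps
```

The Random policy matches the trigger count only in expectation, whereas the Periodic policy matches it up to rounding of the period.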

The results show that trigger timing matters\. Compared with random and periodic triggering, Janus achieves stronger or comparable accuracy under similar trigger budgets, especially on GPQA\. This suggests that MMT is not merely reducing the number of comparisons, but is able to capture meaningful changes in the memory\-update trajectory where deployment decisions are more likely to affect final performance\. At the same time, Janus approaches the performance of the always\-trigger policy with substantially fewer comparisons\. On HumanEval, Janus uses only a small fraction of triggers for both DC\-RS and ExpeL while remaining close to always triggering\. On GPQA, Janus even outperforms always triggering for DC\-RS, indicating that more frequent comparison is not necessarily better when the evaluation set is compact and potentially noisy\. Overall, the results support MMT as an efficient trigger mechanism for deciding when memory comparison is needed\.

### 3.4 Support Set Composition Ablation

![Refer to caption](https://arxiv.org/html/2606.31121v1/x4.png)

Figure 3: Ablation study of the hybrid trigger-time evaluation set. We evaluate Janus with Qwen3-8B and the DC-RS updater on GPQA and HumanEval, reporting final test accuracy. Each ablation removes one component: coverage, boundary, or fresh tasks.

Figure [3](https://arxiv.org/html/2606.31121#S3.F3) studies the composition of the hybrid evaluation set used when Janus triggers an old-versus-new memory comparison. This ablation targets the “what to compare” challenge: after deciding that a candidate memory should be checked, Janus must evaluate it on a compact but informative set that approximates full-history replay. We compare full Janus against three variants that remove coverage, boundary, or fresh tasks. The results show that all three components contribute to reliable memory selection. Removing the coverage set or the boundary set consistently reduces performance, suggesting that they provide complementary signals. Coverage tasks preserve broad representativeness over the seen task distribution, while boundary tasks focus the comparison on memory-sensitive cases where different memory states are likely to change the prediction. The largest drop comes from removing the fresh set, on both GPQA and HumanEval. This indicates that relying only on the stored support set can make the comparison stale and self-reinforcing. Fresh tasks provide recent pending evidence that has not yet been absorbed into the support set, allowing Janus to evaluate candidate memories on newly encountered regions of the task stream. Overall, the ablation confirms that Janus's hybrid evaluation set is important for making effective memory deployment decisions under a bounded comparison budget.

### 3.5 Memory Deployment Ablation

![Refer to caption](https://arxiv.org/html/2606.31121v1/x5.png)

Figure 4: Old-vs-new deployment decision ablation. We evaluate memory checkpoints obtained after processing $20\%$, $40\%$, $60\%$, $80\%$, and $100\%$ of the task stream using Qwen3-8B on GPQA and MMLU-Pro (Eng.). The base updater directly deploys every candidate memory, while Janus selectively accepts or rejects candidate updates.

Figure [4](https://arxiv.org/html/2606.31121#S3.F4) evaluates whether the core deployment decision in Janus is useful. Instead of only reporting the final memory after the full stream, we measure the test accuracy of intermediate memory states after processing $20\%$, $40\%$, $60\%$, $80\%$, and $100\%$ of the training stream. This ablation directly tests whether selectively choosing between the previous memory and the candidate memory helps preserve useful information as more tasks are observed.

The results show that directly accepting every memory update does not necessarily lead to better future\-task performance\. For both DC\-RS and ExpeL, the base updater often improves in the early or middle stages but then stagnates or drops as the stream continues\. This indicates that additional memory updates can introduce noisy, overly specific, or conflicting information, causing the final memory to become less useful than earlier memory states\. In contrast, Janus produces a more stable upward trend in most settings and achieves better final accuracy across all four comparisons\. This supports the central motivation of Janus: sequential memory systems need a deployment controller, because the newest memory is not always the best memory for future tasks\. The improvement is especially clear near the end of the stream\. While the base updaters can degrade after processing more tasks, Janus tends to preserve or improve test accuracy, suggesting that its old\-vs\-new decision helps retain broadly useful memory while filtering harmful candidate updates\. Overall, this ablation confirms that Janus is not merely changing the update schedule; its explicit accept\-or\-reject decision is important for maintaining a final memory that generalizes better to unseen tasks\. We further analyze robustness to different task\-stream orders in Appendix[A](https://arxiv.org/html/2606.31121#A1), where Janus also yields more stable final\-memory performance\.

### 3.6 Hyperparameter Sensitivity

Table 3: Hyperparameter sensitivity of Janus-DC-RS. We sweep the support-set size $K=|\mathcal{S}_{t}|$ and the MMT threshold $\tau$ using Qwen3-8B on GPQA. Unless otherwise specified, we use the default setting $K=20$, $K^{\prime}=|\mathcal{S}^{\mathrm{cov}}_{t}|=12$, $|\mathcal{S}^{\mathrm{bdry}}_{t}|=K-K^{\prime}=8$, $K_{\mathcal{F}}=|\mathcal{F}_{t}|=5$, and $\tau=0.0$. Evaluation cost is approximated by $\#\mathrm{Trig.}\times(K+K_{\mathcal{F}})$.

Table [3](https://arxiv.org/html/2606.31121#S3.T3) analyzes how Janus is affected by two key hyperparameters: the support-set size $K$ and the MMT threshold $\tau$. These two parameters correspond to the two main efficiency dimensions of Janus. The support-set size $K$ controls the coverage–boundary portion of the trigger-time evaluation set, while $\tau$ controls how frequently MMT triggers such comparisons. We report final test accuracy together with the number of triggers and the estimated evaluation cost. The results show that increasing the comparison budget does not always improve performance. For the support-set sweep, $K=20$ gives the best accuracy among the tested values, while larger support sets increase evaluation cost without yielding better performance. This suggests that a compact but informative support set is preferable to simply using more replay tasks. For the trigger threshold sweep, increasing $\tau$ makes Janus more sensitive to memory-trajectory deviations and therefore increases the trigger rate. A larger $\tau$ can improve accuracy, but it also incurs higher evaluation cost. Overall, these results show that Janus provides a controllable trade-off between final memory performance and replay cost, with the default setting offering a strong balance between accuracy and efficiency.
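The cost approximation in the Table 3 caption is simple arithmetic. The following is a minimal sketch of it, assuming a hypothetical trigger count: the function name `evaluation_cost` and the value of 10 triggers are illustrative only and are not taken from the reported results.

```python
def evaluation_cost(num_triggers: int, k: int = 20, k_fresh: int = 5) -> int:
    """Approximate evaluation cost as #Trig. x (K + K_F)."""
    if min(num_triggers, k, k_fresh) < 0:
        raise ValueError("counts must be non-negative")
    return num_triggers * (k + k_fresh)


# Default sizes from the text: K = 20 (12 coverage + 8 boundary), K_F = 5.
# The trigger count of 10 is a made-up value for illustration only.
print(evaluation_cost(num_triggers=10))  # 10 * (20 + 5) = 250
```

Under this approximation, the cost grows linearly in both the number of triggers and the evaluation-set size, which is why the two hyperparameters act as separate efficiency knobs.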

## 4 Related Work

Sequentially Evolving LLM Memory. Memory has become a central mechanism for enabling LLM agents to adapt over sequential interactions (Xiang et al., [2026](https://arxiv.org/html/2606.31121#bib.bib13); Fang et al., [2025a](https://arxiv.org/html/2606.31121#bib.bib28); Wei et al., [2025](https://arxiv.org/html/2606.31121#bib.bib11); Chhikara et al., [2025](https://arxiv.org/html/2606.31121#bib.bib30)). Recent memory-augmented agents extract reusable information from prior trajectories, feedback, reflections, workflows, or task solutions, and use such information to improve future reasoning and decision-making (Wang et al., [2025a](https://arxiv.org/html/2606.31121#bib.bib26); Fang et al., [2025b](https://arxiv.org/html/2606.31121#bib.bib12); Zhong et al., [2024](https://arxiv.org/html/2606.31121#bib.bib27); Xu et al., [2025](https://arxiv.org/html/2606.31121#bib.bib24)). Representative approaches include reflection- or experience-based methods such as ExpeL (Zhao et al., [2024](https://arxiv.org/html/2606.31121#bib.bib10)), dynamic memory construction methods such as Dynamic Cheatsheet (Suzgun et al., [2026](https://arxiv.org/html/2606.31121#bib.bib1)), workflow-level memory methods such as Agent Workflow Memory (Wang et al., [2025c](https://arxiv.org/html/2606.31121#bib.bib21)), and retrieval- or structure-based memory systems such as G-Memory (Zhang et al., [2026](https://arxiv.org/html/2606.31121#bib.bib23)) and Memento (Zhou et al., [2025](https://arxiv.org/html/2606.31121#bib.bib14)). These methods mainly focus on how to construct or update memory, and assume that each newly generated memory should be directly deployed for future tasks. In contrast, Janus studies the deployment decision itself: given a candidate memory produced by an existing updater, should the agent accept it or keep the previous memory? This distinction is important because locally useful updates can still bias the final memory toward recent tasks or noisy task-specific rules, reducing its usefulness for future unseen tasks.

Test-Time Adaptation and Memory Control. Test-time adaptation aims to improve model behavior during deployment, often by using feedback from the test environment rather than updating the model solely during offline training (Wang et al., [2021](https://arxiv.org/html/2606.31121#bib.bib33); Zhang et al., [2022](https://arxiv.org/html/2606.31121#bib.bib34)). Recent LLM agent systems extend this idea through natural-language feedback loops, where agents reflect on failures, revise solutions, refine prompts, or update strategies across interactions (Shinn et al., [2023](https://arxiv.org/html/2606.31121#bib.bib35); Madaan et al., [2023](https://arxiv.org/html/2606.31121#bib.bib29); Gou et al., [2024](https://arxiv.org/html/2606.31121#bib.bib36); Asai et al., [2024](https://arxiv.org/html/2606.31121#bib.bib37)). A related line of work formalizes such feedback as textual gradients, where natural-language critiques play a role analogous to gradients by indicating how an output, prompt, or policy should be improved (Yuksekgonul et al., [2025](https://arxiv.org/html/2606.31121#bib.bib32)). These methods show that feedback can support continual improvement without directly modifying model parameters. However, feedback-driven updates are often local to the current task or recent trajectory, and may not improve the final memory used for future unseen tasks. Janus addresses this deployment challenge by selectively validating candidate updates with a momentum-based trigger and a compact hybrid evaluation set, deciding which feedback-induced memories should actually be deployed.

## 5 Conclusion

Sequentially evolving LLM memory raises a different scaling question from simply storing more experience or updating memory more often\. As agents accumulate longer task histories, the critical issue becomes how to allocate additional inference effort: not every new memory should be trusted, and not every update deserves expensive validation\. Janus points to a selective scaling strategy for self\-evolving agents, where extra tokens are spent only when a candidate memory update is likely to change the trajectory of future behavior\. By treating memory updating as a deployment decision, Janus shifts the focus from blindly accepting locally generated memories to controlling which memories should shape future inference\. Across multiple datasets, backbone LLMs, and base memory updaters, Janus consistently improves final memory performance while avoiding unnecessary comparisons\. Our ablations further show that both the trigger mechanism and the hybrid evaluation set are important for reliable memory deployment\. This perspective suggests that robust sequentially evolving LLM memory requires not only better memory writers, but also mechanisms that decide when memory refinement is worth the cost\.

## Limitations

#### Scope of memory mechanisms\.

Our evaluation focuses on prompt-based sequential memory systems, where the base LLM remains fixed and memory is updated through retrieval, summarization, reflection, cheatsheets, or other external text-based mechanisms. This scope matches many current memory-augmented agent systems and allows us to compare different updaters under a unified test-time protocol. However, another emerging direction is training-based evolution, where experience is used to update policies, skills, or model behavior through learning, such as reinforcement learning-based methods (Yan et al., [2025](https://arxiv.org/html/2606.31121#bib.bib31); Xia et al., [2026](https://arxiv.org/html/2606.31121#bib.bib22)). Extending Janus to jointly control external memory updates and parameter- or policy-level adaptation is an important direction for future work.

#### Evaluation coverage\.

Our experiments cover multiple task types and two backbone LLMs, but they do not exhaust all possible sequential agent environments\. In particular, long\-horizon interactive environments, multi\-agent settings, and tasks with non\-stationary distributions may introduce additional challenges for memory control\. Further evaluation in these settings would help better characterize the generality of Janus\.

## Ethical considerations

Our work focuses on improving the reliability of sequential LLM memory systems by controlling whether candidate memory updates should be deployed. In real-world applications, memory systems may store sensitive user information or incorrect task-specific rules, so they should be used with appropriate privacy protections, data filtering, and human oversight when necessary. We do not foresee major negative societal impacts from the proposed method itself, but responsible deployment depends on the underlying LLM, task domain, and memory contents.

## References

- L. A. Agrawal, S. Tan, D. Soylu, N. Ziems, R. Khare, K. Opsahl-Ong, A. Singhvi, H. Shandilya, M. J. Ryan, M. Jiang, et al. (2025) Gepa: reflective prompt evolution can outperform reinforcement learning. arXiv preprint arXiv:2507.19457. Cited by: [§1](https://arxiv.org/html/2606.31121#S1.p1.1).
- A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi (2024) Self-rag: learning to retrieve, generate, and critique through self-reflection. In International conference on learning representations, Vol. 2024, pp. 9112–9141. Cited by: [§4](https://arxiv.org/html/2606.31121#S4.p2.1).
- M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. (2021) Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374. Cited by: [§3.1](https://arxiv.org/html/2606.31121#S3.SS1.p1.1).
- P. Chhikara, D. Khant, S. Aryan, T. Singh, and D. Yadav (2025) Mem0: building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413. Cited by: [§4](https://arxiv.org/html/2606.31121#S4.p1.1).
- DeepSeek-AI (2026) DeepSeek-v4: towards highly efficient million-token context intelligence. Cited by: [§3.1](https://arxiv.org/html/2606.31121#S3.SS1.p1.1).
- J. Fang, Y. Peng, X. Zhang, Y. Wang, X. Yi, G. Zhang, Y. Xu, B. Wu, S. Liu, Z. Li, et al. (2025a) A comprehensive survey of self-evolving ai agents: a new paradigm bridging foundation models and lifelong agentic systems. arXiv preprint arXiv:2508.07407. Cited by: [§4](https://arxiv.org/html/2606.31121#S4.p1.1).
- R. Fang, Y. Liang, X. Wang, J. Wu, S. Qiao, P. Xie, F. Huang, H. Chen, and N. Zhang (2025b) Memp: exploring agent procedural memory. arXiv preprint arXiv:2508.06433. Cited by: [§1](https://arxiv.org/html/2606.31121#S1.p1.1), [§4](https://arxiv.org/html/2606.31121#S4.p1.1).
- Z. Gou, Z. Shao, Y. Gong, Y. Yang, N. Duan, W. Chen, et al. (2024) Critic: large language models can self-correct with tool-interactive critiquing. In International Conference on Learning Representations, Vol. 2024, pp. 57734–57811. Cited by: [§4](https://arxiv.org/html/2606.31121#S4.p2.1).
- D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021) Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874. Cited by: [§3.1](https://arxiv.org/html/2606.31121#S3.SS1.p1.1).
- M. Ho, C. Si, Z. Feng, F. Yu, Y. Yang, Z. Liu, Z. Hu, and L. Qin (2025) Arcmemo: abstract reasoning composition with lifelong llm memory. arXiv preprint arXiv:2509.04439. Cited by: [§1](https://arxiv.org/html/2606.31121#S1.p1.1).
- H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2024) Let’s verify step by step. In International Conference on Learning Representations, Vol. 2024, pp. 39578–39601. Cited by: [§3.1](https://arxiv.org/html/2606.31121#S3.SS1.p1.1).
- A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, et al. (2023) Self-refine: iterative refinement with self-feedback. Advances in neural information processing systems 36, pp. 46534–46594. Cited by: [§4](https://arxiv.org/html/2606.31121#S4.p2.1).
- S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez (2024) Gorilla: large language model connected with massive apis. Advances in Neural Information Processing Systems 37, pp. 126544–126565. Cited by: [§3.1](https://arxiv.org/html/2606.31121#S3.SS1.p1.1).
- D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2023) Gpqa: a graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022. Cited by: [§3.1](https://arxiv.org/html/2606.31121#S3.SS1.p1.1).
- N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao (2023) Reflexion: language agents with verbal reinforcement learning. Advances in neural information processing systems 36, pp. 8634–8652. Cited by: [§4](https://arxiv.org/html/2606.31121#S4.p2.1).
- M. Suzgun, M. Yuksekgonul, F. Bianchi, D. Jurafsky, and J. Zou (2026) Dynamic cheatsheet: test-time learning with adaptive memory. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7080–7106. Cited by: [§1](https://arxiv.org/html/2606.31121#S1.p1.1), [§1](https://arxiv.org/html/2606.31121#S1.p2.1), [§3.1](https://arxiv.org/html/2606.31121#S3.SS1.p2.9), [§4](https://arxiv.org/html/2606.31121#S4.p1.1).
- D. Wang, E. Shelhamer, S. Liu, B. Olshausen, and T. Darrell (2021) Tent: fully test-time adaptation by entropy minimization. In International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=uXl3bZLkr3c). Cited by: [§4](https://arxiv.org/html/2606.31121#S4.p2.1).
- J. Wang, Z. Guo, W. Ma, and M. Zhang (2025a) How far can llms improve from experience? measuring test-time learning ability in llms with human comparison. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 25688–25702. Cited by: [§4](https://arxiv.org/html/2606.31121#S4.p1.1).
- J. Wang, Q. Yan, Y. Wang, Y. Tian, S. S. Mishra, Z. Xu, M. Gandhi, P. Xu, and L. L. Cheong (2025b) Reinforcement learning for self-improving agent with skill library. arXiv preprint arXiv:2512.17102. Cited by: [§1](https://arxiv.org/html/2606.31121#S1.p1.1).
- Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, et al. (2024) Mmlu-pro: a more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems 37, pp. 95266–95290. Cited by: [§3.1](https://arxiv.org/html/2606.31121#S3.SS1.p1.1).
- Z. Z. Wang, J. Mao, D. Fried, and G. Neubig (2025c) Agent workflow memory. In International Conference on Machine Learning, pp. 63897–63911. Cited by: [§1](https://arxiv.org/html/2606.31121#S1.p2.1), [§4](https://arxiv.org/html/2606.31121#S4.p1.1).
- T. Wei, N. Sachdeva, B. Coleman, Z. He, Y. Bei, X. Ning, M. Ai, Y. Li, J. He, E. H. Chi, et al. (2025) Evo-memory: benchmarking llm agent test-time learning with self-evolving memory. arXiv preprint arXiv:2511.20857. Cited by: [§1](https://arxiv.org/html/2606.31121#S1.p1.1), [§1](https://arxiv.org/html/2606.31121#S1.p2.1), [§2.1](https://arxiv.org/html/2606.31121#S2.SS1.p1.6), [§3.1](https://arxiv.org/html/2606.31121#S3.SS1.p2.9), [§4](https://arxiv.org/html/2606.31121#S4.p1.1).
- P. Xia, J. Chen, H. Wang, J. Liu, K. Zeng, Y. Wang, S. Han, Y. Zhou, X. Zhao, H. Chen, et al. (2026) Skillrl: evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234. Cited by: [Scope of memory mechanisms.](https://arxiv.org/html/2606.31121#Sx1.SS0.SSS0.Px1.p1.1).
- Z. Xiang, C. Yang, Z. Chen, Z. Wei, Y. Tang, Z. Teng, Z. Peng, Z. Li, C. Huang, Y. He, et al. (2026) A systematic survey of self-evolving agents: from model-centric to environment-driven co-evolution. Available at SSRN 6626878. Cited by: [§1](https://arxiv.org/html/2606.31121#S1.p1.1), [§4](https://arxiv.org/html/2606.31121#S4.p1.1).
- W. Xu, Z. Liang, K. Mei, H. Gao, J. Tan, and Y. Zhang (2025) A-mem: agentic memory for llm agents. arXiv preprint arXiv:2502.12110. Cited by: [§4](https://arxiv.org/html/2606.31121#S4.p1.1).
- S. Yan, X. Yang, Z. Huang, E. Nie, Z. Ding, Z. Li, X. Ma, J. Bi, K. Kersting, J. Z. Pan, et al. (2025) Memory-r1: enhancing large language model agents to manage and utilize memories via reinforcement learning. arXiv preprint arXiv:2508.19828. Cited by: [Scope of memory mechanisms.](https://arxiv.org/html/2606.31121#Sx1.SS0.SSS0.Px1.p1.1).
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: [§3.1](https://arxiv.org/html/2606.31121#S3.SS1.p1.1).
- M. Yuksekgonul, F. Bianchi, J. Boen, S. Liu, P. Lu, Z. Huang, C. Guestrin, and J. Zou (2025) Optimizing generative ai by backpropagating language model feedback. Nature 639(8055), pp. 609–616. Cited by: [§4](https://arxiv.org/html/2606.31121#S4.p2.1).
- G. Zhang, M. Fu, K. Wang, F. Wan, M. Yu, and S. Yan (2026) G-memory: tracing hierarchical memory for multi-agent systems. Advances in Neural Information Processing Systems 38, pp. 12988–13018. Cited by: [§4](https://arxiv.org/html/2606.31121#S4.p1.1).
- M. Zhang, S. Levine, and C. Finn (2022) Memo: test time robustness via adaptation and augmentation. Advances in neural information processing systems 35, pp. 38629–38642. Cited by: [§4](https://arxiv.org/html/2606.31121#S4.p2.1).
- A. Zhao, D. Huang, Q. Xu, M. Lin, Y. Liu, and G. Huang (2024) Expel: llm agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, pp. 19632–19642. Cited by: [§1](https://arxiv.org/html/2606.31121#S1.p1.1), [§1](https://arxiv.org/html/2606.31121#S1.p2.1), [§3.1](https://arxiv.org/html/2606.31121#S3.SS1.p2.9), [§4](https://arxiv.org/html/2606.31121#S4.p1.1).
- B. Zheng, M. Y. Fatemi, X. Jin, Z. Z. Wang, A. Gandhi, Y. Song, Y. Gu, J. Srinivasa, G. Liu, G. Neubig, et al. (2025) Skillweaver: web agents can self-improve by discovering and honing skills. arXiv preprint arXiv:2504.07079. Cited by: [§1](https://arxiv.org/html/2606.31121#S1.p1.1).
- W. Zhong, L. Guo, Q. Gao, H. Ye, and Y. Wang (2024) Memorybank: enhancing large language models with long-term memory. In Proceedings of the AAAI conference on artificial intelligence, Vol. 38, pp. 19724–19731. Cited by: [§4](https://arxiv.org/html/2606.31121#S4.p1.1).
- H. Zhou, Y. Chen, S. Guo, X. Yan, K. H. Lee, Z. Wang, K. Y. Lee, G. Zhang, K. Shao, L. Yang, et al. (2025) Memento: fine-tuning llm agents without fine-tuning llms. arXiv preprint arXiv:2508.16153. Cited by: [§1](https://arxiv.org/html/2606.31121#S1.p1.1), [§4](https://arxiv.org/html/2606.31121#S4.p1.1).

## Appendix A Stream Order Analysis

Figure[5](https://arxiv.org/html/2606.31121#A1.F5)analyzes the sensitivity of sequential memory methods to the order of the task stream\. This experiment is closely related to our motivation: because base memory updaters revise memory using local feedback from the current task, the final memory may depend heavily on which tasks appear later in the stream\. We therefore shuffle the training stream with different random seeds and evaluate the final frozen memory on the same HumanEval test set\. The results show that Janus improves both accuracy and stability under stream\-order changes\. For DC\-RS, the base updater exhibits noticeable variance across different orders, indicating that directly accepting every candidate memory can lead to order\-dependent final memories\. With Janus, the final performance is consistently higher and the variation across orders becomes smaller\. A similar pattern appears for ExpeL: the base updater is sensitive to the stream order, while ExpeL\+Janus achieves stronger and more stable performance\.

These results support the role of Janus as a memory\-deployment controller\. By selectively accepting or rejecting candidate updates, Janus reduces the effect of myopic and recent\-biased memory revisions\. Rather than letting the final memory be determined by the particular order in which tasks are encountered, Janus helps maintain a memory state that is more robust for future inference\.
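The stream-order protocol described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: `order_sensitivity`, `run_and_eval`, the seed values, and the use of the population standard deviation as the measure of variation are all hypothetical choices, and `toy_run` merely stands in for running an updater over the stream and testing the frozen final memory.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple


def order_sensitivity(
    stream: Sequence[str],
    run_and_eval: Callable[[List[str]], float],
    seeds: Sequence[int] = (0, 1, 2, 3, 4),
) -> Tuple[float, float]:
    """Mean and std of final accuracy across seeded shuffles of the stream."""
    accuracies = []
    for seed in seeds:
        order = list(stream)
        random.Random(seed).shuffle(order)  # reproducible per-seed order
        accuracies.append(run_and_eval(order))
    return statistics.mean(accuracies), statistics.pstdev(accuracies)


# Toy stand-in for a recency-biased updater: the final accuracy depends on
# which task happens to arrive last in the stream.
def toy_run(order: List[str]) -> float:
    return 0.6 if order[-1].startswith("hard") else 0.5


stream = [f"easy-{i}" for i in range(8)] + [f"hard-{i}" for i in range(2)]
mean_acc, std_acc = order_sensitivity(stream, toy_run)
print(f"mean={mean_acc:.3f} std={std_acc:.3f}")
```

A controller that reduces order dependence would show up here as a smaller standard deviation for the same set of seeds.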

![Refer to caption](https://arxiv.org/html/2606.31121v1/x6.png)

Figure 5: Stream order analysis. We evaluate the final frozen-memory test accuracy under five different shuffled task-stream orders using Qwen3-8B on HumanEval. Each point corresponds to one stream order, and error bars show the variation across orders.

## Appendix B Qualitative Case Study

We provide examples to illustrate how Janus behaves as a memory-deployment controller. Each example is taken from a triggered decision during training. When the Memory Momentum Trigger fires, Janus compares the previous memory $M_{t-1}$ and the candidate memory $\widehat{M}_{t}$ on the compact evaluation set, and then either rolls back to $M_{t-1}$ or deploys $\widehat{M}_{t}$. For consistency, we use APIBench-HF for the DC-RS cases and MATH for the ExpeL cases. For rollback cases, we show the candidate memory rejected by Janus; for accept cases, we show the previous memory that Janus replaces.
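The triggered old-vs-new comparison can be illustrated with a short sketch. It is written under assumptions and is not the authors' implementation: `support_score`, `choose_memory`, and `toy_solves` are hypothetical names, the scoring is plain percentage accuracy, and keeping the previous memory on ties is an assumption, since the tie rule is not stated in this section.

```python
from typing import Callable, Sequence, Tuple


def support_score(memory: str, tasks: Sequence[str],
                  solves: Callable[[str, str], bool]) -> float:
    """Percentage of evaluation tasks solved when the agent uses `memory`."""
    if not tasks:
        raise ValueError("evaluation set must be non-empty")
    return 100.0 * sum(solves(memory, task) for task in tasks) / len(tasks)


def choose_memory(prev: str, cand: str, tasks: Sequence[str],
                  solves: Callable[[str, str], bool]) -> Tuple[str, float, float]:
    """Return the memory to deploy together with both scores.

    Assumption: on a tie, the previous memory is kept (rollback).
    """
    s_prev = support_score(prev, tasks, solves)
    s_cand = support_score(cand, tasks, solves)
    deployed = cand if s_cand > s_prev else prev
    return deployed, s_prev, s_cand


# Toy stand-in for "run the agent with this memory on a task".
def toy_solves(memory: str, task: str) -> bool:
    return task in memory


tasks = ["text", "image", "audio"]
deployed, s_prev, s_cand = choose_memory(
    "text rules", "text and image rules", tasks, toy_solves
)
print(deployed, round(s_prev, 1), round(s_cand, 1))  # text and image rules 33.3 66.7
```

The decision depends only on the two scores, so the same routine would cover both the rollback and the accept cases discussed in this appendix.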

Table 4: Representative Janus memory-deployment decisions. All cases are triggered by MMT. Janus compares the previous memory and the candidate memory on the compact evaluation set and deploys the memory with better support-set performance.

#### DC-RS rollback: rejecting a narrow recent-task rewrite.

In the first APIBench-HF case, the current task asks the agent to classify news into categories such as technology, sports, or politics. Before this step, the deployed memory $M_{t-1}$ contains API-use patterns from a recent robotics-related trajectory, including:

> *Use Hugging Face pipelines for robotics task automation;* *Load pretrained Decision Transformer models for motor control tasks;* *Implement zero\-shot reinforcement learning with multimodal transformers\.*

This previous memory is not necessarily ideal for every future APIBench task. However, the candidate memory $\widehat{M}_{t}$ proposed by DC-RS rewrites the cheatsheet almost entirely around the latest news-classification task:

> *Use zero\-shot classification for multi\-domain text categorization;* *Optimize classification with cross\-encoder models;* *Utilize transformer\-based pipelines for real\-time classification\.*

The first rule directly reflects the current label space, while the remaining rules mostly repeat zero-shot text-classification advice. Thus, the candidate does not merely add a useful insight; it shifts the whole memory toward one recent task type. On the compact evaluation set, this rewrite reduces the support-set score from 58.3 to 25.0. Janus therefore rolls back to $M_{t-1}$. This case illustrates that Janus does not require the previous memory to be semantically perfect; it only needs to detect that the candidate deployment would make the memory less useful under the support-set comparison.

#### DC\-RS accept: deploying a useful new tool pattern\.

The second APIBench\-HF case shows the opposite behavior\. Here, the current task requires classifying product images of electronic devices\. The previous memory is dominated by multilingual text\-classification rules:

> *Use the Hugging Face pipeline API for text summarization with minimal code;* *Implement zero\-shot classification with XLM\-RoBERTa models for multilingual text categorization;* *Apply zero\-shot classification with pre\-trained multilingual models for cross\-lingual categorization\.*

The candidate memory adds a new vision\-oriented API pattern:

> *Use CLIP\-based zero\-shot image classification for electronic device categorization;* *Deploy Hugging Face pipelines for zero\-shot image classification with custom device categories;* *Apply zero\-shot image classification with CLIP models for cross\-domain product categorization\.*

Unlike the previous rollback case, this update introduces a genuinely new and reusable capability that the old memory lacks. The support-set score increases from 53.3 to 86.7, so Janus accepts $\widehat{M}_{t}$. This case shows that Janus is not conservative by default; it deploys candidate memories when they improve broader support-set behavior.

#### ExpeL rollback: rejecting a correct but narrow task\-derived rule\.

In the first MATH case, the current problem asks for the greatest whole number that must divide the product of any three consecutive positive integers\. The previous ExpeL memory contains a diverse set of reusable mathematical heuristics, including:

> *Combine like terms in polynomial expressions by grouping and summing coefficients;* *Use Fermat’s Little Theorem for modular arithmetic when dealing with exponents modulo prime numbers;* *Apply Simon’s Favorite Factoring Trick to solve equations by adding constants to factor expressions\.*

The candidate memory keeps most previous rules but adds the following rule:

> *The product of any three consecutive positive integers is always divisible by 6\.*

This rule is mathematically correct, but it is tightly tied to the current problem pattern and has limited coverage over the support tasks used in this comparison. In other words, correctness on the latest task does not imply that deploying the updated memory improves broader memory utility. The support-set comparison confirms this: the previous memory scores 73.3, while the candidate memory drops to 53.3. Janus therefore rejects the candidate and keeps the earlier memory. This example shows that Janus can filter even correct local updates when they reduce the usefulness of the deployed memory for the broader task stream.
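The correctness of the rejected rule is easy to confirm: among any three consecutive integers one is a multiple of 3 and at least one is even, and the product for $n=1$ is exactly 6, so no larger divisor can work. The snippet below is a quick numerical check over a finite range (an illustration, not a proof, and not part of the paper's pipeline).

```python
from functools import reduce
from math import gcd

# Greatest integer dividing n(n+1)(n+2) for every n in 1..200.
products = [n * (n + 1) * (n + 2) for n in range(1, 201)]
print(reduce(gcd, products))  # 6
```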

#### ExpeL accept: replacing a low\-yield rule with a reusable procedure\.

The second MATH case involves converting repeating decimals into a common fraction\. Before the update, the last rule in the deployed memory is a simple percentage heuristic:

> *Calculate percentage by dividing the count of desired elements by total elements and multiplying by 100\.*

The candidate memory replaces this low\-yield rule with a procedural rule for converting repeating decimals:

> *Convert repeating decimals to fractions by setting $x$ equal to the decimal, multiplying by $10^{n}$ where $n$ is the number of repeating digits, subtracting the original equation to eliminate the repeating part, and solving for $x$ as a fraction.*

Although the two rules do not address the same mathematical topic, this is exactly the deployment decision Janus is designed to make under a bounded memory budget: whether the candidate memory state is more useful than the previous one. The new rule encodes a general algebraic procedure for a recurring class of problems, whereas the replaced percentage rule is a simpler heuristic with less support-set utility in this context. The support-set score improves from 42.9 to 57.1, so Janus accepts the candidate memory.
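To make the accepted procedure concrete, here is a small sketch of it for purely repeating decimals of the form $0.\overline{d_1 \dots d_n}$. The helper name `repeating_to_fraction` is hypothetical, and decimals with a non-repeating prefix are deliberately out of scope for this illustration.

```python
from fractions import Fraction


def repeating_to_fraction(repeat: str) -> Fraction:
    """Convert 0.(repeat) to a fraction, e.g. "36" -> 0.363636... = 4/11.

    With x = 0.(repeat) and n = len(repeat): 10^n * x - x = int(repeat),
    so x = int(repeat) / (10^n - 1).
    """
    if not repeat.isdigit():
        raise ValueError("repeat must be a non-empty digit string")
    n = len(repeat)
    return Fraction(int(repeat), 10**n - 1)


print(repeating_to_fraction("3"))   # 1/3
print(repeating_to_fraction("36"))  # 4/11
```

`Fraction` reduces to lowest terms automatically, which matches the "common fraction" form the task asks for.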

#### Takeaway\.

These cases show that Janus does not judge whether a candidate memory is locally correct or semantically similar to the replaced rule\. Instead, it evaluates whether the resulting memory state is more useful on a compact support set\. Thus, Janus can reject correct but narrow updates when they reduce broader utility, and accept task\-motivated updates when they introduce reusable procedures or capabilities\.

## Appendix C Task Prompts

MATH Prompt

> Solve the following math problem.
>
> Please do your derivation in LaTeX as much as possible. Keep the reasoning concise but complete. At the end, put only the final answer in the LAST `\boxed{...}`. If the final answer is a fraction, use LaTeX fraction form like `\frac{a}{b}` (not decimal approximation unless required).
>
> Problem: {question}

GPQA Prompt

> Answer the following multiple-choice question.
>
> Question: {question}
>
> Choices: {choices}
>
> Think carefully and then output only the final option letter (A/B/C/D).

MMLU-Pro Prompt

> Answer the following multiple-choice question.
>
> Question: {question}
>
> Choices: {choices}
>
> Think carefully and then output only the final option letter ({valid_labels}).

HumanEval Prompt

> Complete the following Python function. Implement {entry_point} so that it satisfies the docstring.
>
> `` ```python `` {question} `` ``` ``
>
> Return the full implementation (including the function signature) inside a single `` ```python ... ``` `` code block. Do not add any example usage or extra prose after the code block.

APIBench Prompt

> You are a helpful API writer who can write APIs based on requirements.
>
> {question}
>
> Write a Python program in 1 to 2 lines to call API in {framework}.
>
> The answer should follow the format: `<<<domain>>> $DOMAIN`, `<<<api_call>>>: $API_CALL`, `<<<api_provider>>>: $API_PROVIDER`, `<<<explanation>>>: $EXPLANATION`, `<<<code>>>: $CODE`. Here are the requirements:
>
> 1. `$DOMAIN` should be inferred from the task description.
> 2. `$API_CALL` should have only one line of code that calls the API.
> 3. `$API_PROVIDER` should be the programming framework used.
> 4. `$EXPLANATION` should be a step-by-step explanation.
> 5. `$CODE` is the Python code.
> 6. Do not repeat the format in your answer.

## Appendix D Algorithms

Algorithm 1ExpeL0:Sequence of tasks

𝒯=\{Ti\}i=1N\{\\mathcal\{T\}\}=\\\{T\_\{i\}\\\}\_\{i=1\}^\{N\}, large language model

ℒ\{\\mathcal\{L\}\}, self\-reflection model

ℒreflect\{\\mathcal\{L\}\}\_\{\\textsc\{reflect\}\}, insight prompt

Pi​nP\_\{in\}, retrieval budget

KK, step size

LL, max tries

ZZ
1:Initialize:Experience pool

ℬ←Fmanual\{\\mathcal\{B\}\}\\leftarrow F\_\{\\textsc\{manual\}\}\(seed demos if available\), recent success set

𝒮←∅\{\\mathcal\{S\}\}\\leftarrow\\emptyset, insight set

$\hat{\iota} \leftarrow \emptyset$

2: for Task $T_i = (x_i, y_i) \in \mathcal{T}$ do

3: Retrieve Top-$K$ similar success cases $F_{\textsc{sim}}$ from $\mathcal{B}$ $\triangleright$ Search

4: $\nu \leftarrow \texttt{""}$

5: $\tau^{\textsc{succ}} \leftarrow \emptyset$, $\tau^{\textsc{fail}} \leftarrow \emptyset$

6: for $z = 1$ to $Z$ do

7: $\hat{y}_i^{(z)} \leftarrow \mathcal{L}(x_i, F_{\textsc{sim}}, \hat{\iota}, \nu)$ $\triangleright$ Synthesis

8: $\mathit{success}^{(z)} \leftarrow \mathrm{Evaluator}(x_i, y_i, \hat{y}_i^{(z)})$

9: if $\mathit{success}^{(z)} = 1$ then

10: $\tau^{\textsc{succ}} \leftarrow (x_i, \hat{y}_i^{(z)})$

11: $\mathcal{B} \leftarrow \mathcal{B} \cup \{\tau^{\textsc{succ}}\}$ $\triangleright$ Evolve

12: break

13: else

14: $\tau^{\textsc{fail}} \leftarrow (x_i, \hat{y}_i^{(z)})$

15: $\nu \leftarrow \textsc{Concat}(\nu,\; \mathcal{L}_{\textsc{reflect}}(\tau^{\textsc{fail}}))$ $\triangleright$ Reflect

16: end if

17: end for

18: if $\tau^{\textsc{succ}} \neq \emptyset$ then

19: $\mathcal{S} \leftarrow \mathcal{S} \cup \{\tau^{\textsc{succ}}\}$

20: end if

21: if $\tau^{\textsc{succ}} \neq \emptyset$ and $\tau^{\textsc{fail}} \neq \emptyset$ then

22: $\hat{\iota} \leftarrow \mathcal{L}(P_{in}, \tau^{\textsc{succ}}, \tau^{\textsc{fail}}, \hat{\iota})$ $\triangleright$ Pair Update

23: end if

24: if $|\mathcal{S}| = L$ then

25: $\hat{\iota} \leftarrow \mathcal{L}(P_{in}, \mathcal{S}, \hat{\iota})$ $\triangleright$ Batch Update

26: $\mathcal{S} \leftarrow \emptyset$

27: end if

28: end for
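
As a rough illustration, the loop that closes at step 28 can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the callables `retrieve`, `solve`, `evaluate`, `reflect`, `pair_update`, and `batch_update` are hypothetical stand-ins for the Top-$K$ search, the LLM calls $\mathcal{L}$ and $\mathcal{L}_{\textsc{reflect}}$, and the Evaluator, and starting the case bank $\mathcal{B}$ and the buffer $\mathcal{S}$ empty is also an assumption. The `toy_*` functions are deterministic placeholders so the control flow can be run without a model.

```python
from typing import Callable, List, Optional, Tuple

Case = Tuple[str, str]  # (task input x_i, model output y_hat_i)


def run_sequential_updater(
    tasks: List[Tuple[str, str]],
    retrieve: Callable[[List[Case], str], List[Case]],
    solve: Callable[[str, List[Case], str, str], str],
    evaluate: Callable[[str, str, str], bool],
    reflect: Callable[[Case], str],
    pair_update: Callable[[Case, Case, str], str],
    batch_update: Callable[[List[Case], str], str],
    max_attempts: int = 3,  # Z in the listing
    batch_size: int = 2,    # L in the listing
) -> Tuple[str, List[Case]]:
    insights = ""           # iota-hat, starts empty
    bank: List[Case] = []   # B (assumption: starts empty)
    batch: List[Case] = []  # S (assumption: starts empty)
    for x, y in tasks:
        similar = retrieve(bank, x)                      # step 3: Search
        notes = ""                                       # step 4: nu
        succ: Optional[Case] = None                      # step 5
        fail: Optional[Case] = None
        for _ in range(max_attempts):                    # step 6
            y_hat = solve(x, similar, insights, notes)   # step 7: Synthesis
            if evaluate(x, y, y_hat):                    # steps 8-9
                succ = (x, y_hat)                        # step 10
                bank.append(succ)                        # step 11: Evolve
                break                                    # step 12
            fail = (x, y_hat)                            # step 14
            notes += reflect(fail)                       # step 15: Reflect
        if succ is not None:                             # step 18
            batch.append(succ)                           # step 19
        if succ is not None and fail is not None:        # step 21
            insights = pair_update(succ, fail, insights) # step 22: Pair Update
        if len(batch) == batch_size:                     # step 24
            insights = batch_update(batch, insights)     # step 25: Batch Update
            batch = []                                   # step 26
    return insights, bank


# Hypothetical toy stand-ins (no real LLM or similarity search involved).
def toy_retrieve(bank: List[Case], x: str, k: int = 2) -> List[Case]:
    return bank[-k:]  # placeholder for Top-K similarity search


def toy_solve(x: str, similar: List[Case], insights: str, notes: str) -> str:
    return x.upper() if notes else x  # only succeeds after one reflection


def toy_evaluate(x: str, y: str, y_hat: str) -> bool:
    return y_hat == y


def toy_reflect(fail: Case) -> str:
    return f"'{fail[1]}' was wrong; "


def toy_pair_update(succ: Case, fail: Case, insights: str) -> str:
    return insights + "P"


def toy_batch_update(batch: List[Case], insights: str) -> str:
    return insights + "B"


final_insights, final_bank = run_sequential_updater(
    [("a", "A"), ("b", "B")],
    toy_retrieve, toy_solve, toy_evaluate,
    toy_reflect, toy_pair_update, toy_batch_update,
)
print(final_insights, len(final_bank))  # prints: PPB 2
```

In the toy run each task fails once and then succeeds, so every task triggers a pair update, and the second success fills the buffer and triggers one batch update.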

Algorithm 2 Dynamic Cheatsheet Retrieval-and-Synthesis (DC-RS)

Input: Sequence of tasks $\mathcal{T} = \{T_i\}_{i=1}^{N}$, large language model $\mathcal{L}$, retrieval budget $K$, generator template $P_{gen}$, curator template $P_{cur}$
1: Initialize: Cheatsheet $\mathcal{M}_0 \leftarrow \emptyset$, history $\mathcal{H}_0 \leftarrow \emptyset$

2: for Task $T_i = (x_i, y_i) \in \mathcal{T}$ do

3: if $|\mathcal{H}_{i-1}| > 0$ then

4: Retrieve Top-$K$ similar past cases $C_{retr}$ from $\mathcal{H}_{i-1}$ $\triangleright$ Search

5: else

6: $\mathcal{P} \leftarrow \emptyset$

7: end if

8: Update memory state $\mathcal{M}_i \leftarrow \mathcal{L}(P_{cur},\; C_{retr},\; x_i,\; \mathcal{M}_{i-1})$ $\triangleright$ Evolve

9: Answer $\hat{y}_i \leftarrow \mathcal{L}(P_{gen},\; x_i,\; \mathcal{M}_i)$ $\triangleright$ Synthesis

10: Update history $\mathcal{H}_i \leftarrow \mathcal{H}_{i-1} \cup \{x_i, \hat{y}_i\}$ $\triangleright$ Evolve

11: end for
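
The DC-RS loop can likewise be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: `retrieve`, `curate`, and `generate` are hypothetical callables standing in for the Top-$K$ search and for the two LLM calls with templates $P_{cur}$ and $P_{gen}$. Treating the empty-history branch as an empty retrieval set is an assumption of this sketch. The gold label $y_i$ is omitted because no step of the listing consumes it.

```python
from typing import Callable, List, Tuple

Case = Tuple[str, str]  # (task input x_i, model answer y_hat_i)


def run_dc_rs(
    inputs: List[str],
    retrieve: Callable[[List[Case], str], List[Case]],
    curate: Callable[[List[Case], str, str], str],
    generate: Callable[[str, str], str],
) -> Tuple[str, List[Case]]:
    memory = ""                # cheatsheet M_0, starts empty
    history: List[Case] = []   # history H_0, starts empty
    for x in inputs:
        # Steps 3-7: retrieve only when there is history to search.
        # Assumption: the else branch yields an empty retrieval set.
        retrieved = retrieve(history, x) if history else []
        memory = curate(retrieved, x, memory)  # step 8: Evolve (uses P_cur)
        y_hat = generate(x, memory)            # step 9: Synthesis (uses P_gen)
        history.append((x, y_hat))             # step 10: Evolve
    return memory, history


# Hypothetical toy stand-ins (no real LLM or similarity search involved).
def toy_retrieve_history(history: List[Case], x: str, k: int = 2) -> List[Case]:
    return history[-k:]  # placeholder for Top-K similarity search


def toy_curate(retrieved: List[Case], x: str, memory: str) -> str:
    return memory + f"[{x}|{len(retrieved)}]"  # log input and retrieval size


def toy_generate(x: str, memory: str) -> str:
    return f"{x}!"


final_memory, final_history = run_dc_rs(
    ["q1", "q2", "q3"], toy_retrieve_history, toy_curate, toy_generate
)
print(final_memory)  # prints: [q1|0][q2|1][q3|2]
```

Note the ordering: the cheatsheet is curated before the answer is generated, so the answer to task $i$ is conditioned on $\mathcal{M}_i$ rather than $\mathcal{M}_{i-1}$, and the history stores the model's own answer $\hat{y}_i$.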
