Is One Score Enough? Rethinking the Evaluation of Sequentially Evolving LLM Memory
Summary
Introduces SeqMem-Eval, a diagnostic evaluation framework for sequentially evolving LLM memory that measures multiple dimensions beyond aggregate metrics, revealing trade-offs between adaptability and stability.
# Is One Score Enough? Rethinking the Evaluation of Sequentially Evolving LLM Memory
Source: [https://arxiv.org/html/2605.15384](https://arxiv.org/html/2605.15384)
Songwei Dong (University of Virginia, hxt5ap@virginia.edu), Zihan Chen∗ (University of Virginia, brf3rx@virginia.edu), Chengshuai Shi (Princeton University, cs1083@princeton.edu), Peng Wang (University of Virginia, pw7nc@virginia.edu), Jundong Li (University of Virginia, jundong@virginia.edu), Cong Shen (University of Virginia, cong@virginia.edu)
###### Abstract
Memory plays a central role in enabling large language models (LLMs) to operate over sequential tasks by accumulating and reusing experience over time. However, existing evaluations of LLM memory mostly rely on aggregate metrics such as final hold-out accuracy or cumulative online performance. We argue that these metrics can be misleading: they collapse distinct memory behaviors into a single number and obscure critical failure modes such as forgetting and negative transfer. In this paper, we introduce SeqMem-Eval, a diagnostic evaluation framework for sequentially evolving LLM memory. Drawing inspiration from continual learning, it targets a distinct test-time setting in which memory is external, prompt-mediated, and updated without changing model parameters. Rather than only measuring whether the final memory state improves performance, SeqMem-Eval examines how memory states evolve, generalize, consolidate experience, and retain useful information during sequential inference. Specifically, it measures online utility, hold-out generalization, backward transfer, and forgetting, providing a finer-grained view of whether memory updates help current tasks, generalize to unseen tasks, improve past predictions, or degrade previously acquired knowledge. Through extensive experiments across diverse tasks and memory methods, we uncover several previously overlooked phenomena. In particular, we show that higher final or cumulative accuracy does not necessarily imply better memory quality: many methods exhibit strong performance gains while suffering from significant forgetting or negative transfer. Moreover, different memory designs exhibit distinct trade-offs between adaptability and stability, which are invisible under standard evaluation metrics. Our findings show that aggregate metrics systematically miss several recurring failure modes, suggesting that a multi-dimensional perspective is essential for understanding LLM memory.
Code and evaluation framework are available at: [https://github.com/ShenGroup/SeqMem-Eval](https://github.com/ShenGroup/SeqMem-Eval)
## 1 Introduction
Large language models (LLMs) are increasingly equipped with external memory to evolve over sequential tasks (Xiang et al., [2026](https://arxiv.org/html/2605.15384#bib.bib12); Fang et al., [2025b](https://arxiv.org/html/2605.15384#bib.bib11); Wei et al., [2025](https://arxiv.org/html/2605.15384#bib.bib10)), where models are expected to accumulate experience and adapt their behavior over time. Through continual interaction with tasks or environments, LLMs generate rich trajectories that contain not only successful solutions but also failed attempts, feedback signals, and intermediate reasoning traces (Zhao et al., [2024](https://arxiv.org/html/2605.15384#bib.bib9)). Rather than serving as passive records of past interactions, these trajectories can provide valuable experience for refining future decisions, improving task strategies, and enabling test-time adaptation (Wei et al., [2025](https://arxiv.org/html/2605.15384#bib.bib10); Suzgun et al., [2025](https://arxiv.org/html/2605.15384#bib.bib6); Zhou et al., [2025](https://arxiv.org/html/2605.15384#bib.bib13)). Such sequentially evolving systems are central to many emerging applications, including reasoning assistants (Ho et al., [2025](https://arxiv.org/html/2605.15384#bib.bib14)), tool-use agents (Wang et al., [2025b](https://arxiv.org/html/2605.15384#bib.bib15)), and interactive decision-making systems (Zheng et al., [2025](https://arxiv.org/html/2605.15384#bib.bib16); Agrawal et al., [2025](https://arxiv.org/html/2605.15384#bib.bib17)), where performance depends not only on the current input but also on previously encountered tasks. Wei et al. ([2025](https://arxiv.org/html/2605.15384#bib.bib10)) formalize this setting by viewing memory as an evolving state that is retrieved, synthesized, and updated throughout a task sequence. This perspective marks an important step beyond static conversational recall, highlighting the role of memory in test-time adaptation and experience reuse.
Despite this progress, the evaluation of sequentially evolving LLM memory remains incomplete. Existing studies typically assess memory methods using aggregate performance metrics, such as final hold-out accuracy after memory construction (Zhao et al., [2024](https://arxiv.org/html/2605.15384#bib.bib9)) or cumulative online accuracy along the sequence (Wei et al., [2025](https://arxiv.org/html/2605.15384#bib.bib10); Suzgun et al., [2025](https://arxiv.org/html/2605.15384#bib.bib6)). While useful, these metrics collapse the behavior of an evolving memory system into a single number. As a result, they do not reveal whether memory updates genuinely improve future behavior, whether later experiences help consolidate earlier ones, or whether the system forgets knowledge that was previously useful. In practice, similar final or average accuracy may mask fundamentally different learning dynamics: one method may steadily accumulate reusable knowledge, while another may exhibit oscillatory behavior or transient improvements followed by degradation. Thus, aggregate metrics can create an illusion of comparable memory quality, even when the underlying memory dynamics are substantially different.
Figure 1: SeqMem-Eval: Beyond aggregate evaluation of LLM memory. Left: In sequential settings, an LLM processes a stream of tasks while maintaining an evolving memory state. Middle: Existing evaluations reduce memory performance to aggregate metrics, which collapses complex memory dynamics and hides important behaviors. Right: SeqMem-Eval decomposes memory quality into multiple dimensions, including online utility, hold-out generalization, backward transfer, forgetting, and efficiency, enabling fine-grained analysis of how memory evolves.

In this paper, we propose SeqMem-Eval, a diagnostic evaluation framework for sequentially evolving LLM memory. While motivated by continual learning (Wu et al., [2022](https://arxiv.org/html/2605.15384#bib.bib19); Lopez-Paz and Ranzato, [2017](https://arxiv.org/html/2605.15384#bib.bib18)), sequentially evolving LLM memory is a distinct test-time setting: the LLM remains fixed, and adaptation occurs through updates to an external textual memory that influences predictions via retrieval and context construction. Therefore, evaluation should focus not only on final task performance, but also on how these memory updates affect predictions throughout the sequence. SeqMem-Eval captures this behavior through five complementary dimensions: *online utility*, *hold-out generalization*, *backward transfer*, *forgetting*, and *efficiency*. Together, these diagnostics reveal whether memory updates are useful, transferable, stable, and cost-effective, rather than merely improving aggregate accuracy.
We conduct a systematic empirical study across diverse tasks, models, and representative memory methods under the SeqMem-Eval protocol. Our results show that standard aggregate metrics can be misleading: methods with strong final or online accuracy may still exhibit substantial forgetting, limited backward transfer, or weak generalization from accumulated experience. These findings suggest that current evaluation practices can overestimate the effectiveness of LLM memory and obscure important failure modes. Our contributions are summarized as follows:
- Diagnostic evaluation framework. We introduce SeqMem-Eval, a continual-learning-inspired framework for evaluating sequentially evolving LLM memory beyond aggregate accuracy.
- Comprehensive empirical study. We provide a systematic comparison of representative memory methods across diverse tasks and models under a unified sequential evaluation protocol.
- Actionable findings for memory design. We identify key failure modes, including forgetting, limited backward transfer, and weak generalization, offering design implications for more reliable memory-augmented LLMs.
## 2 Related Work
#### Sequentially evolving LLM memory and evaluation\.
Memory has become a central mechanism for enabling LLM agents to move beyond isolated inputs and adapt over sequential interactions (Xiang et al., [2026](https://arxiv.org/html/2605.15384#bib.bib12); Fang et al., [2025a](https://arxiv.org/html/2605.15384#bib.bib24); Madaan et al., [2023](https://arxiv.org/html/2605.15384#bib.bib25); Chhikara et al., [2025](https://arxiv.org/html/2605.15384#bib.bib26)). Recent memory-augmented agents extract reusable information from prior trajectories, feedback, reflections, or workflows, and use such information to improve future reasoning and decision-making (Wang et al., [2025a](https://arxiv.org/html/2605.15384#bib.bib30); Chen et al., [2025](https://arxiv.org/html/2605.15384#bib.bib29); Fang et al., [2025b](https://arxiv.org/html/2605.15384#bib.bib11); Zhong et al., [2024](https://arxiv.org/html/2605.15384#bib.bib27); Xu et al., [2025](https://arxiv.org/html/2605.15384#bib.bib28)). Representative approaches include reflection- or experience-based methods such as ExpeL (Zhao et al., [2024](https://arxiv.org/html/2605.15384#bib.bib9)), dynamic memory construction methods such as Dynamic Cheatsheet (Suzgun et al., [2025](https://arxiv.org/html/2605.15384#bib.bib6)), workflow-level memory methods such as Agent Workflow Memory (AWM) (Wang et al., [2024c](https://arxiv.org/html/2605.15384#bib.bib7)), and retrieval- or structure-based memory systems such as G-Memory (Zhang et al., [2025](https://arxiv.org/html/2605.15384#bib.bib8)) and Memento (Zhou et al., [2025](https://arxiv.org/html/2605.15384#bib.bib13)). These methods differ in how they store, retrieve, and update memory, but share the goal of improving future behavior through test-time experience reuse (Tang et al., [2025](https://arxiv.org/html/2605.15384#bib.bib33); Feng et al., [2025](https://arxiv.org/html/2605.15384#bib.bib32); Ho et al., [2025](https://arxiv.org/html/2605.15384#bib.bib14)).
Recent benchmark efforts further formalize this setting: for example, Evo-Memory (Wei et al., [2025](https://arxiv.org/html/2605.15384#bib.bib10)) converts static datasets into sequential task streams and evaluates agents whose memory is searched, synthesized, and evolved after each interaction. Broader work on self-evolving agents also frames memory and trajectory reuse as part of environment-centric self-evolution (Xiang et al., [2026](https://arxiv.org/html/2605.15384#bib.bib12); Gao et al., [2025](https://arxiv.org/html/2605.15384#bib.bib31)). However, existing evaluations still largely rely on aggregate metrics such as final hold-out accuracy, cumulative online accuracy, or average success rate (Ouyang et al., [2025](https://arxiv.org/html/2605.15384#bib.bib34); Wu et al., [2025](https://arxiv.org/html/2605.15384#bib.bib35)). Such metrics are useful for comparing end performance, but they provide limited insight into memory dynamics, such as whether a method retains useful information. Our work complements prior memory methods and benchmarks by focusing on diagnostic evaluation rather than proposing another memory architecture.
#### Continual learning evaluation and diagnostic metrics\.
Our evaluation perspective is inspired by continual learning, where models learn from a sequence of tasks while attempting to acquire new knowledge without forgetting old knowledge (Biesialska et al., [2020](https://arxiv.org/html/2605.15384#bib.bib36); Kirkpatrick et al., [2017](https://arxiv.org/html/2605.15384#bib.bib37); Wang et al., [2024a](https://arxiv.org/html/2605.15384#bib.bib38)). Classical continual learning evaluation uses metrics such as backward transfer, forward transfer, and stability–plasticity trade-offs to characterize learning dynamics beyond final accuracy (Chaudhry et al., [2018](https://arxiv.org/html/2605.15384#bib.bib42), [2019](https://arxiv.org/html/2605.15384#bib.bib41); Lopez-Paz and Ranzato, [2017](https://arxiv.org/html/2605.15384#bib.bib18); Wu et al., [2022](https://arxiv.org/html/2605.15384#bib.bib19); Qi et al., [2023](https://arxiv.org/html/2605.15384#bib.bib39); Wang et al., [2023](https://arxiv.org/html/2605.15384#bib.bib40)). Recent works on continual learning for LLMs further highlight the importance of updating large models with new knowledge and skills while preserving previous capabilities (Shi et al., [2025](https://arxiv.org/html/2605.15384#bib.bib44); Wu et al., [2024](https://arxiv.org/html/2605.15384#bib.bib43)). Sequential LLM memory shares this temporal structure, but differs in a crucial way: most memory-augmented agents do not update model parameters, and instead rely on external, prompt-mediated, or retrieval-based memory states (Zheng et al., [2023](https://arxiv.org/html/2605.15384#bib.bib47); Liang et al., [2025](https://arxiv.org/html/2605.15384#bib.bib46); Li et al., [2025](https://arxiv.org/html/2605.15384#bib.bib45); Wei et al., [2025](https://arxiv.org/html/2605.15384#bib.bib10)). Therefore, continual learning metrics cannot be directly reused without adaptation.
We adapt their diagnostic principles to sequential LLM memory by defining online utility, hold\-out generalization, backward transfer, and forgetting over evolving memory states, enabling fine\-grained analysis of memory behavior beyond aggregate performance\.
## 3 SeqMem-Eval: A Diagnostic Evaluation Framework
We propose SeqMem-Eval, a diagnostic framework for evaluating sequentially evolving LLM memory. Unlike classical continual learning, this setting keeps LLMs fixed and changes behavior through external memory updates, retrieval, and prompt construction. Thus, evaluation should go beyond endpoint performance to measure whether memory is useful online, generalizes to unseen tasks, consolidates past experience, preserves acquired utility, and remains computationally efficient.
### 3.1 Sequential Memory Evaluation Setting
Following Wei et al. ([2025](https://arxiv.org/html/2605.15384#bib.bib10)), we consider a sequentially evolving memory setting, where an LLM $\mathcal{L}$ interacts with a stream of tasks and maintains an external memory that is updated over time. Let $\mathcal{D}=\{(x_t,y_t)\}_{t=1}^{T}$ denote a task sequence, where $x_t$ is the input at step $t$ and $y_t$ is the corresponding target. At each step, the model maintains a memory state $M_t$, which may contain raw trajectories, retrieved examples, summaries, workflows, reflections, or other forms of accumulated experience from previous interactions. Given input $x_t$, the system retrieves or constructs a context $C_t$ from the current memory state $M_t$, and the LLM produces a prediction $\hat{y}_t=\mathcal{L}(x_t,C_t)$. After prediction, the system may receive feedback $f_t$, such as correctness signals, execution results, or environment feedback. The memory is then updated by a method-specific update function:

$$M_{t+1}=\texttt{Update}(M_t,x_t,\hat{y}_t,f_t;\mathcal{L}),\tag{1}$$

where $\mathcal{L}$ is included because some memory methods use the LLM itself to generate, refine, compress, or reorganize memory entries. For methods that do not rely on LLM-based memory updates, this argument can be omitted, and $\texttt{Update}$ reduces to a non-parametric operation, such as appending the current trajectory or updating a retrieval index.
This formulation abstracts a broad class of sequential memory methods, including retrieval-based memory, summarization-based memory, workflow memory, and reflection-based memory. We define diagnostic metrics over the evolving memory states $\{M_t\}_{t=1}^{T}$ to evaluate different aspects of memory behavior. Before introducing the formal definitions in the following subsections, Table [1](https://arxiv.org/html/2605.15384#S3.T1) provides an overview of the SeqMem-Eval metrics and the memory behavior captured by each metric.
Table 1: Overview of SeqMem-Eval diagnostics. The metrics cover five complementary perspectives of sequential memory evaluation, with formal definitions provided in the following subsections.

| Perspective | Metric | What it measures |
| --- | --- | --- |
| Online Utility | OnlineAcc | Final cumulative online performance. |
| | PED | Peak-to-end degradation in the online trajectory. |
| | MER | Recovery from the worst online phase. |
| | $r_{\min}$ | Timing of the lowest online performance. |
| Hold-out Generalization | HoldOutAcc | Final performance on unseen examples. |
| | $\mathrm{Trend}_{\mathrm{HO}}$ | Direction of hold-out performance over memory evolution. |
| Backward Transfer | $\mathrm{BWT}(t)$ | Effect of later memory updates on earlier examples. |
| | IV | Immediate reusability of the newly updated memory. |
| Forgetting | $\mathrm{F}(t)$ | Loss of previously achieved capability over time. |
| Efficiency | Token Consumption | Total token cost of sequential evaluation. |
| | Runtime | End-to-end wall-clock time. |
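The interaction loop of Section 3.1 (build a context from the current memory, predict, receive feedback, update) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `run_sequence` and the toy `build_context`, `predict`, `feedback_fn`, and `update` functions are hypothetical names, the "LLM" is a lookup stub, and the feedback is assumed to reveal the target.

```python
from typing import Any, Callable, List, Sequence, Tuple


def run_sequence(
    tasks: Sequence[Tuple[Any, Any]],
    initial_memory: Any,
    build_context: Callable[[Any, Any], Any],
    predict: Callable[[Any, Any], Any],
    feedback_fn: Callable[[Any, Any, Any], Any],
    update: Callable[[Any, Any, Any, Any], Any],
) -> Tuple[List[int], List[Any]]:
    """Run one pass over the task stream, following Eq. (1).

    Returns the per-step correctness A(t) and the memory states
    [M_1, ..., M_{T+1}], where M_t is the state used to answer x_t.
    """
    memory = initial_memory
    states = [memory]
    correct = []
    for x, y in tasks:
        context = build_context(memory, x)       # C_t built from M_t
        y_hat = predict(x, context)              # y_hat_t = L(x_t, C_t)
        correct.append(int(y_hat == y))          # A(t) in {0, 1}
        f = feedback_fn(x, y_hat, y)             # f_t
        memory = update(memory, x, y_hat, f)     # M_{t+1}
        states.append(memory)
    return correct, states


# Toy instantiation (all assumptions): memory is an immutable tuple of
# (input, revealed answer) pairs and the "LLM" only recalls exact matches.
def build_context(memory, x):
    return [entry for entry in memory if entry[0] == x]

def predict(x, context):
    return context[-1][1] if context else "?"

def feedback_fn(x, y_hat, y):
    return y  # assumed: the environment reveals the target

def update(memory, x, y_hat, f):
    return memory + ((x, f),)  # non-parametric append


tasks = [("a", "A"), ("b", "B"), ("a", "A")]
correct, states = run_sequence(tasks, (), build_context, predict, feedback_fn, update)
```

Keeping every intermediate state in `states` is what makes the later diagnostics possible, since backward transfer and forgetting re-evaluate earlier inputs under later memory states.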
### 3.2 Online Utility
*Online utility* reflects how well the model performs as it progressively updates its memory over the task sequence. Rather than reducing online performance to a single scalar, we treat it as a trajectory that captures how the model's behavior evolves over time.
Formally, let $\mathcal{D}=\{(x_\tau,y_\tau)\}_{\tau=1}^{T}$ denote the online task sequence. For each step $\tau$, we record the per-step online performance and the cumulative online accuracy as

$$A(\tau)=\mathrm{Acc}(x_\tau;M_\tau),\quad \overline{A}(\tau)=\frac{1}{\tau}\sum_{i=1}^{\tau}A(i),\tag{2}$$

where $\mathrm{Acc}(x;M)\in\{0,1\}$ indicates whether the model correctly answers $x$ using memory state $M$.
We report the following online utility metrics:
- Online Accuracy (OnlineAcc). The final value of the cumulative online accuracy curve:

$$\mathrm{OnlineAcc}=\overline{A}(T).\tag{3}$$

While $\mathrm{OnlineAcc}$ summarizes endpoint performance, it does not capture how the trajectory reaches that endpoint. We further derive three trajectory-level diagnostics from the cumulative curve.

- Peak-to-End Drop (PED). The drop from the best cumulative online performance to the final value:

$$\mathrm{PED}=\max_{\tau\in[1,T]}\overline{A}(\tau)-\overline{A}(T).\tag{4}$$

A larger $\mathrm{PED}$ indicates that the method reaches strong intermediate performance but later degrades.

- Minimum-to-End Recovery (MER). The recovery from the worst cumulative online performance to the final value:

$$\mathrm{MER}=\overline{A}(T)-\min_{\tau\in[1,T]}\overline{A}(\tau).\tag{5}$$

A larger $\mathrm{MER}$ indicates that the method improves after a low-performance phase.

- Extremum Timing $r_{\min}$ [Optional]. The relative location of the minimum cumulative online accuracy:

$$r_{\min}=\frac{1}{T}\arg\min_{\tau\in[1,T]}\overline{A}(\tau).\tag{6}$$

This secondary diagnostic helps distinguish trajectory shapes with similar $\mathrm{MER}$ and $\mathrm{PED}$, such as gradual improvement versus delayed recovery.

Overall, these diagnostics complement $\mathrm{OnlineAcc}$ by characterizing whether the online trajectory reflects improvement, recovery, degradation, or volatile memory dynamics. Table [2](https://arxiv.org/html/2605.15384#S3.T2) summarizes the qualitative interpretation of these trajectory patterns. Since strong online performance may still arise from short-term adaptation or task-order effects, we next evaluate whether the accumulated memory generalizes beyond the observed stream using hold-out test sets.

Table 2: Qualitative interpretation of online trajectory diagnostics. $\mathrm{MER}$ and $\mathrm{PED}$ provide coarse indicators of desirable or undesirable online dynamics, while $r_{\min}$ further refines the interpretation by distinguishing different improvement patterns.

| Trajectory pattern | MER | PED | $r_{\min}$ | Preference (Interpretation) |
| --- | --- | --- | --- | --- |
| Gradual improvement | High | Low / 0 | Early | ✓ Preferred (Effective accumulation) |
| Drop-then-recover | High | Low / 0 | Middle | ✓ Acceptable (Delayed recovery) |
| Early peak then degradation | Low / 0 | High | Late | ✗ Undesirable (Unstable evolution) |
| Rapid drop then stabilization | Low / 0 | High | Late | ✗ Undesirable (Persistent degradation) |
| Stable but non-improving | Low | Low | Arbitrary | ⚫ Mixed (Limited memory effect) |
| Highly fluctuating | High | High | Arbitrary | ⚫ Mixed (Volatile memory dynamics) |
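As a worked illustration of Eqs. (2) to (6), the online diagnostics can be computed from a list of per-step correctness values. This is a minimal sketch assuming NumPy; the function name `online_diagnostics` is hypothetical, and ties in the minimum are resolved to the earliest step, which is an assumption the definitions above leave open.

```python
import numpy as np


def online_diagnostics(per_step_correct):
    """Compute OnlineAcc, PED, MER, and r_min from A(1..T) in {0, 1}."""
    a = np.asarray(per_step_correct, dtype=float)
    T = len(a)
    # Cumulative online accuracy: A_bar(tau) = (1/tau) * sum_{i<=tau} A(i)
    cum = np.cumsum(a) / np.arange(1, T + 1)
    return {
        "OnlineAcc": float(cum[-1]),              # Eq. (3)
        "PED": float(cum.max() - cum[-1]),        # Eq. (4)
        "MER": float(cum[-1] - cum.min()),        # Eq. (5)
        # Eq. (6); np.argmin is 0-based and returns the first minimum (assumed tie rule)
        "r_min": (int(np.argmin(cum)) + 1) / T,
    }


# Example stream: correct, wrong, wrong, correct, correct.
# The cumulative curve is 1, 1/2, 1/3, 1/2, 3/5.
metrics = online_diagnostics([1, 0, 0, 1, 1])
```

For this stream, OnlineAcc is 0.6, PED is 0.4 (the curve peaks at step 1), MER is about 0.267, and $r_{\min}$ is 0.6 (the minimum occurs at step 3 of 5). Because the cumulative curve starts at either 0 or 1, the first step is often an extremum, so these values are most informative on longer sequences.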
### 3.3 Hold-out Generalization
Online utility measures performance within the observed task stream, but strong online performance does not necessarily imply that memory has learned reusable knowledge. To evaluate this aspect, we measure *hold-out generalization* on a fixed set of unseen tasks under different memory states.
Let $\mathcal{D}_{\mathrm{test}}=\{x_i^{\mathrm{test}}\}_{i=1}^{M}$ denote a hold-out test set that is never used for memory updates. At step $\tau$, after processing the first $\tau$ online tasks, the model maintains memory state $M_\tau$.

- Hold-out Accuracy (HoldOutAcc). We define the hold-out accuracy trajectory and its final value as

$$H(\tau)=\frac{1}{M}\sum_{i=1}^{M}\mathrm{Acc}(x_i^{\mathrm{test}};M_\tau),\quad \mathrm{HoldOutAcc}=H(T).\tag{7}$$

The final value $\mathrm{HoldOutAcc}$ measures the generalization ability of the final memory state after processing the full online sequence.

- Hold-out Trend ($\mathrm{Trend}_{\mathrm{HO}}$). To capture the overall direction of memory generalization, we compute the slope of the least-squares line fitted to the normalized hold-out trajectory:

$$\mathrm{Trend}_{\mathrm{HO}}=\mathrm{slope}\left(\{(\tau/T,H(\tau))\}_{\tau=1}^{T}\right).\tag{8}$$

A positive $\mathrm{Trend}_{\mathrm{HO}}$ indicates that hold-out performance generally improves as more experiences are incorporated into memory, while a flat or negative value suggests limited generalization benefit or possible degradation.
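Eqs. (7) and (8) can be illustrated with a short sketch, assuming NumPy and a hypothetical layout in which `correct[s][i]` is 1 when hold-out example $i$ is answered correctly under the $(s+1)$-th evaluated memory state. The slope is obtained with `np.polyfit` at degree 1, which fits an ordinary least-squares line; the function names are illustrative only.

```python
import numpy as np


def holdout_curve(correct):
    """H(tau): mean correctness over the hold-out set for each memory state."""
    return np.asarray(correct, dtype=float).mean(axis=1)


def holdout_trend(h):
    """Slope of the least-squares line through the points (tau / T, H(tau))."""
    h = np.asarray(h, dtype=float)
    T = len(h)
    if T < 2:
        raise ValueError("at least two evaluated memory states are needed")
    x = np.arange(1, T + 1) / T           # normalized step tau / T
    slope, _intercept = np.polyfit(x, h, deg=1)
    return float(slope)


# Four memory states evaluated on four hold-out examples (toy data).
correct = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
]
H = holdout_curve(correct)      # 0.25, 0.5, 0.75, 1.0
holdout_acc = float(H[-1])      # HoldOutAcc = H(T)
trend = holdout_trend(H)
```

Normalizing the horizontal axis by $T$ makes the slope read as the total change in hold-out accuracy over the whole sequence, so values from streams of different lengths remain comparable.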
### 3.4 Backward Transfer
Hold-out generalization evaluates whether memory improves performance on unseen tasks. We next consider the complementary question: how later memory updates affect previously encountered tasks. This is related to backward transfer in continual learning (Lopez-Paz and Ranzato, [2017](https://arxiv.org/html/2605.15384#bib.bib18)), but here it is defined over evolving external memory states rather than model parameters.

- Backward Transfer (BWT). For a temporal horizon $t$, we define

$$\mathrm{BWT}(t)=\frac{1}{T-t}\sum_{\tau=1}^{T-t}\left[\mathrm{Acc}(x_\tau;M_{\tau+t})-\mathrm{Acc}(x_\tau;M_\tau)\right],\tag{9}$$

where $M_\tau$ denotes the memory state after processing the $\tau$-th task, and $M_{\tau+t}$ denotes the memory state after the next $t$ updates. A positive $\mathrm{BWT}(t)$ indicates that later memory updates improve performance on earlier tasks, a value close to zero suggests limited retrospective effect, and a negative value indicates negative backward transfer.

- Immediate Validity (IV). We define *Immediate Validity* as the one-step backward transfer:

$$\mathrm{IV}=\mathrm{BWT}(1)=\frac{1}{T-1}\sum_{\tau=1}^{T-1}\left[\mathrm{Acc}(x_\tau;M_{\tau+1})-\mathrm{Acc}(x_\tau;M_\tau)\right].\tag{10}$$

$\mathrm{IV}$ evaluates whether the memory update induced by the current task immediately makes that experience reusable. A positive $\mathrm{IV}$ suggests that the update effectively incorporates the new experience, while a value close to zero or below indicates that the update provides limited reusable benefit.
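Eqs. (9) and (10) can be sketched as a computation over a matrix of re-evaluations. The layout is an assumption made for illustration: `acc[i][j]` stores $\mathrm{Acc}(x_{i+1};M_{j+1})$ with 0-based indices, so only entries on or above the diagonal are used, and `backward_transfer` is a hypothetical helper name.

```python
import numpy as np


def backward_transfer(acc, t):
    """BWT(t): mean of Acc(x_tau; M_{tau+t}) - Acc(x_tau; M_tau) over tau = 1..T-t."""
    acc = np.asarray(acc, dtype=float)
    T = acc.shape[0]
    if not 1 <= t < T:
        raise ValueError("horizon t must satisfy 1 <= t < T")
    rows = np.arange(T - t)                     # 0-based tau
    later = acc[rows, rows + t]                 # Acc(x_tau; M_{tau+t})
    baseline = acc[rows, rows]                  # Acc(x_tau; M_tau)
    return float(np.mean(later - baseline))


# Toy 4x4 evaluation matrix (rows: tasks, columns: memory states).
# Entries below the diagonal are never read.
acc = [
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
iv = backward_transfer(acc, 1)      # Immediate Validity = BWT(1)
bwt2 = backward_transfer(acc, 2)
bwt3 = backward_transfer(acc, 3)
```

In this toy matrix, IV is 2/3 and BWT(2) is 1.0, while BWT(3) is 0: the first task is solved after one and two further updates but is answered incorrectly again after three, which a single horizon would not reveal.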
### 3.5 Forgetting
BWT measures net transfer from later memory updates to earlier tasks, but it does not capture whether a capability that was temporarily acquired is later lost. We therefore measure *forgetting*, which evaluates whether previously achieved performance is retained over time.

- Forgetting. For a temporal horizon $t$, we define forgetting as

$$\mathrm{F}(t)=\frac{1}{T-t}\sum_{\tau=1}^{T-t}\left[\max_{\tau'\in[\tau,\,\tau+t]}\mathrm{Acc}(x_\tau;M_{\tau'})-\mathrm{Acc}(x_\tau;M_{\tau+t})\right],\tag{11}$$

where the first term is the best performance achieved on $x_\tau$ within $[\tau,\tau+t]$, and the second term is its performance after $t$ additional memory updates. A larger $\mathrm{F}(t)$ indicates stronger loss of previously acquired capability, while values close to zero suggest stable retention.

Computing $\mathrm{F}(t)$ exactly requires evaluating each past task under all intermediate memory states, which can be expensive for long sequences. In practice, we use a checkpoint-based approximation that reuses the evaluations performed for backward transfer. Let $\mathcal{T}=\{t_1,t_2,\dots\}$ denote a discrete set of temporal horizons. For each task $x_\tau$ and horizon $t$, we approximate

$$\max_{\tau'\in[\tau,\,\tau+t]}\mathrm{Acc}(x_\tau;M_{\tau'})\approx\max_{t_i\in\mathcal{T},\,t_i\leq t}\mathrm{Acc}(x_\tau;M_{\tau+t_i}).\tag{12}$$

This approximation reduces computation while still capturing whether performance achieved at earlier checkpoints is lost after subsequent memory updates. Together, BWT and forgetting provide complementary views: BWT measures net retrospective transfer, whereas forgetting measures degradation from the best previously achieved performance.
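The exact definition in Eq. (11) and the checkpoint approximation in Eq. (12) can be compared with a small sketch. As an illustrative assumption, `acc[i][j]` stores $\mathrm{Acc}(x_{i+1};M_{j+1})$ with 0-based indices, and `forgetting` is a hypothetical helper. One further assumption: the horizon $t$ itself is always added to the checkpoint set so that the approximated value cannot become negative.

```python
import numpy as np


def forgetting(acc, t, horizons=None):
    """F(t), exact when horizons is None, else the checkpoint approximation."""
    acc = np.asarray(acc, dtype=float)
    T = acc.shape[0]
    if not 1 <= t < T:
        raise ValueError("horizon t must satisfy 1 <= t < T")
    if horizons is None:
        offsets = list(range(t + 1))            # every state in [tau, tau + t]
    else:
        # Checkpoints t_i <= t, plus t itself (assumption, keeps F(t) >= 0).
        offsets = sorted({h for h in horizons if 0 <= h <= t} | {t})
    rows = np.arange(T - t)
    # Best accuracy on x_tau over the selected memory states.
    best = np.max([acc[rows, rows + h] for h in offsets], axis=0)
    final = acc[rows, rows + t]                 # Acc(x_tau; M_{tau+t})
    return float(np.mean(best - final))


# Toy matrix: the first task is solved under the second and third
# memory states but answered incorrectly under the fourth.
acc = [
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
exact = forgetting(acc, 3)                      # 1.0
sparse = forgetting(acc, 3, horizons=[3])       # 0.0, misses the earlier success
denser = forgetting(acc, 3, horizons=[1, 3])    # 1.0, recovers the exact value
```

The sparse case shows the trade-off of the approximation: a success that occurs only between checkpoints is invisible, so the checkpoint estimate can only equal or underestimate the exact $\mathrm{F}(t)$.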
### 3.6 Efficiency
Memory methods may improve performance by using longer contexts, storing more experiences, or invoking the LLM multiple times\. Thus, memory quality should be evaluated together with computational cost\. We measure efficiency along two dimensions\.
- Token Consumption. We report the total number of tokens used during sequential evaluation, including prompts, retrieved memories, final answers, and intermediate reasoning or memory updates.
- Runtime. We report the wall-clock time required to process the full task sequence, including LLM calls and overhead from retrieval, memory construction, and summarization.
Together, these metrics reveal whether performance gains come from better memory behavior or substantially larger computational budgets\.
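One simple way to collect both efficiency measurements is to wrap the sequential run in a small accumulator. The sketch below is illustrative only: `CostMeter` is a hypothetical class, and whitespace splitting stands in for real token counting, which in practice would come from the model's tokenizer or the usage fields reported by the serving API.

```python
import time


class CostMeter:
    """Accumulates token usage and wall-clock time across a sequential run."""

    def __init__(self):
        self.tokens = 0
        self.seconds = 0.0
        self._start = None

    def record_call(self, prompt: str, completion: str) -> None:
        # Assumption: whitespace-separated words approximate tokens.
        self.tokens += len(prompt.split()) + len(completion.split())

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.seconds += time.perf_counter() - self._start
        return False  # do not suppress exceptions


meter = CostMeter()
with meter:
    # Every LLM call is recorded, including memory-update calls.
    meter.record_call("what is 2 + 2", "4")
    meter.record_call("summarize the last trajectory", "stored one entry")
```

Recording memory-update and summarization calls alongside answer calls matters here, because a method can look cheap at prediction time while spending most of its budget maintaining memory.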
## 4 Experiments and Analysis
We evaluate sequentially evolving LLM memory under the proposed SeqMem-Eval framework across diverse tasks, models, and representative memory methods. We provide the detailed experimental setup in Appendix [A](https://arxiv.org/html/2605.15384#A1). Rather than only comparing final accuracy, we organize our analysis around the following research questions (RQs):
- RQ1: Online Utility. How does memory affect performance during sequential inference, and what online trajectory patterns emerge as memory evolves?
- RQ2: Hold-out Generalization. Does the evolving memory improve performance on unseen tasks, including in-distribution and out-of-distribution hold-out sets?
- RQ3: Backward Transfer. Do later memory updates help consolidate knowledge for previously encountered tasks, or do they induce negative transfer?
- RQ4: Forgetting. Does the memory retain capabilities achieved during the sequence, or does performance on earlier tasks degrade over time?
- RQ5: Efficiency. What are the efficiency trade-offs of different memory mechanisms in terms of computational cost?
### 4.1 Analysis of Online Utility (RQ1)
Finding 1. Memory improves aggregate online accuracy, but the gains do not imply stable accumulation. Table [3](https://arxiv.org/html/2605.15384#S4.T3) reports the standard aggregate view of online performance. Under this view, memory-augmented methods often appear beneficial: most methods improve over the memory-free baseline across datasets, and stronger memory mechanisms such as G-Memory and ExpeL-MT achieve clear gains across both LLM backbones. However, the cumulative trajectories in Figure [2](https://arxiv.org/html/2605.15384#S4.F2) reveal a more nuanced picture. We also observe that monotonic improvement is largely absent: across HumanEval and MMLU-Pro-Engineering, many methods either decline after an initially strong phase or rapidly drop and then plateau. This suggests that additional memory updates do not consistently translate into sustained experience accumulation.
Table 3: Accuracy (%) across LLMs, tasks, and methods. Values marked ↑ or ↓ indicate changes relative to the corresponding memory-free baseline. Best results are bolded.

| Model | Method | HumanEval | MATH500 | APIBench | MMLU-Eng. | MMLU-Phys. | ALFWorld |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-8B | Memory-free | 81.8 | 68.6 | 64.5 | 50.3 | 65.6 | 52.9 |
| | ExpRec$_{k=10}$ | 84.1 (↑2.3) | 73.6 (↑5.0) | 64.3 (↓0.2) | 52.1 (↑1.8) | 67.5 (↑1.9) | – |
| | ExpRec$_{k=3}$ | 82.6 (↑0.8) | 72.8 (↑4.2) | 63.9 (↓0.6) | 51.7 (↑1.4) | 69.4 (↑3.8) | – |
| | ExpRAG$_{k=3}$ | 82.6 (↑0.8) | 73.0 (↑4.4) | 67.0 (↑2.5) | 53.1 (↑2.8) | 68.4 (↑2.8) | – |
| | DC-RS | 84.8 (↑3.0) | 68.0 (↓0.6) | 71.4 (↑6.9) | 51.5 (↑1.2) | 65.5 (↓0.1) | – |
| | AWM | 84.1 (↑2.3) | 68.8 (↑0.2) | 73.9 (↑9.4) | 50.5 (↑0.2) | 67.2 (↑1.6) | 51.4 (↓1.5) |
| | G-Memory | 83.3 (↑1.5) | 73.6 (↑5.0) | 80.8 (↑16.3) | 51.5 (↑1.2) | 68.9 (↑3.3) | 62.9 (↑10.0) |
| | ExpeL-ST | 86.4 (↑4.6) | 73.4 (↑4.8) | 70.6 (↑6.1) | 53.6 (↑3.3) | 67.8 (↑2.2) | 59.3 (↑6.4) |
| | ExpeL-MT | **90.9** (↑9.1) | **80.2** (↑11.6) | **82.0** (↑17.5) | **67.2** (↑16.9) | **79.7** (↑14.1) | **79.3** (↑26.4) |
| MiniMax-M2.7 | Memory-free | 95.0 | 81.4 | 52.7 | 60.1 | 74.6 | 68.6 |
| | ExpRec$_{k=10}$ | 95.5 (↑0.5) | 82.4 (↑1.0) | 50.3 (↓2.4) | 64.2 (↑4.1) | 82.5 (↑7.9) | – |
| | ExpRec$_{k=3}$ | 95.5 (↑0.5) | 83.6 (↑2.2) | 53.6 (↑0.9) | 63.2 (↑3.1) | 85.5 (↑10.9) | – |
| | ExpRAG$_{k=3}$ | 95.5 (↑0.5) | 81.6 (↑0.2) | 53.5 (↑0.8) | 62.7 (↑2.6) | 85.0 (↑10.4) | – |
| | DC-RS | 91.7 (↓3.3) | 75.6 (↓5.8) | 62.0 (↑9.3) | 63.3 (↑3.2) | 82.3 (↑7.7) | – |
| | AWM | 95.5 (↑0.5) | 82.6 (↑1.2) | 66.3 (↑13.6) | 61.7 (↑1.6) | 82.2 (↑7.6) | 72.1 (↑3.5) |
| | G-Memory | 97.0 (↑2.0) | 83.6 (↑2.2) | 73.5 (↑20.8) | 72.2 (↑12.1) | 83.9 (↑9.3) | 83.6 (↑15.0) |
| | ExpeL-ST | 96.2 (↑1.2) | 83.0 (↑1.6) | 65.1 (↑12.4) | 62.0 (↑1.9) | 84.7 (↑10.1) | 70.0 (↑1.4) |
| | ExpeL-MT | **97.7** (↑2.7) | **87.8** (↑6.4) | **80.8** (↑28.1) | **75.8** (↑15.7) | **90.2** (↑15.6) | **88.6** (↑20.0) |
Finding 2\.Methods with similar online accuracy can exhibit substantially different memory dynamics\.Table[4](https://arxiv.org/html/2605.15384#S4.T4)summarizes the MER and PED diagnostics across all datasets and methods, together with the extremum timing metricrminr\_\{\\min\}\. The results show that methods with comparable finalOnlineAcccan still occupy substantially different regions in the diagnostic space\. On HumanEval, several methods have non\-trivial PED but small MER, corresponding to early high performance followed by degradation\. On ALFWorld, the separation is more pronounced: ExpeL\-MT achieves high MER with relatively lower PED compared with other agentic methods, while AWM\-Online suffers from large PED and near\-zero MER, indicating severe peak\-to\-end degradation without recovery\.
Figure 2: Online accuracy over sequential steps for different models. Panels: (a) Qwen3-8B, (b) MiniMax-M2.7.

Overall, RQ1 shows that aggregate OnlineAcc provides only a coarse endpoint measure. Although memory methods often improve online performance on average, their trajectories reveal whether these gains are stable, recoverable, or transient. This supports the need to evaluate online utility as an evolving trajectory rather than a single aggregate score.
Table 4: Online trajectory diagnostics on Qwen3-8B and MiniMax-M2.7. Each cell reports MER↑ / PED↓ / $r_{\min}$↓. Higher MER and lower PED are generally preferred; $r_{\min}$ indicates the relative timing of the lowest online performance and should be interpreted together with MER and PED.

Backbone LLM: Qwen3-8B

| Method | HumanEval | MATH500 | APIBench | MMLU-Eng. | MMLU-Phys. | ALFWorld |
|---|---|---|---|---|---|---|
| ExpRec k=10 | 0.001 / 0.121 / 0.99 | 0.007 / 0.084 / 0.66 | 0.019 / 0.113 / 0.28 | 0.013 / 0.067 / 0.38 | 0.018 / 0.060 / 0.61 | – / – / – |
| ExpRec k=3 | 0.001 / 0.174 / 0.99 | 0.018 / 0.061 / 0.62 | 0.034 / 0.070 / 0.50 | 0.020 / 0.088 / 0.50 | 0.011 / 0.076 / 0.89 | – / – / – |
| ExpRAG k=3 | 0.001 / 0.136 / 0.99 | 0.030 / 0.061 / 0.10 | 0.031 / 0.095 / 0.50 | 0.033 / 0.055 / 0.24 | 0.018 / 0.024 / 0.17 | – / – / – |
| DC-RS | 0.001 / 0.120 / 0.99 | 0.013 / 0.087 / 0.62 | 0.036 / 0.076 / 0.50 | 0.028 / 0.073 / 0.50 | 0.072 / 0.010 / 0.12 | – / – / – |
| AWM | 0.001 / 0.121 / 0.99 | 0.000 / 0.084 / 1.00 | 0.024 / 0.060 / 0.50 | 0.031 / 0.057 / 0.24 | 0.047 / 0.003 / 0.18 | 0.000 / 0.486 / 1.00 |
| G-Memory | 0.000 / 0.114 / 1.00 | 0.004 / 0.071 / 0.73 | 0.045 / 0.008 / 0.11 | 0.022 / 0.068 / 0.23 | 0.034 / 0.022 / 0.10 | 0.007 / 0.246 / 0.94 |
| ExpeL-ST | 0.007 / 0.136 / 0.90 | 0.022 / 0.084 / 0.62 | 0.030 / 0.118 / 0.50 | 0.028 / 0.051 / 0.24 | 0.057 / 0.008 / 0.12 | 0.005 / 0.152 / 0.12 |
| ExpeL-MT | 0.005 / 0.091 / 0.71 | 0.003 / 0.100 / 0.62 | 0.035 / 0.013 / 0.19 | 0.013 / 0.035 / 0.16 | 0.020 / 0.012 / 0.12 | 0.150 / 0.122 / 0.10 |

Backbone LLM: MiniMax-M2.7

| Method | HumanEval | MATH500 | APIBench | MMLU-Eng. | MMLU-Phys. | ALFWorld |
|---|---|---|---|---|---|---|
| ExpRec k=10 | 0.033 / 0.019 / 0.39 | 0.006 / 0.070 / 0.87 | 0.033 / 0.089 / 0.50 | 0.009 / 0.052 / 0.10 | 0.002 / 0.039 / 0.92 | – / – / – |
| ExpRec k=3 | 0.046 / 0.010 / 0.42 | 0.028 / 0.047 / 0.10 | 0.018 / 0.186 / 0.50 | 0.001 / 0.103 / 0.95 | 0.027 / 0.035 / 0.17 | – / – / – |
| ExpRAG k=3 | 0.185 / 0.000 / 0.10 | 0.000 / 0.056 / 0.87 | 0.008 / 0.170 / 0.50 | 0.005 / 0.065 / 0.97 | 0.013 / 0.039 / 0.17 | – / – / – |
| DC-RS | 0.004 / 0.083 / 0.95 | 0.014 / 0.074 / 0.26 | 0.024 / 0.121 / 0.50 | 0.006 / 0.073 / 0.97 | 0.027 / 0.023 / 0.14 | – / – / – |
| AWM | 0.033 / 0.046 / 0.39 | 0.006 / 0.057 / 0.87 | 0.047 / 0.067 / 0.50 | 0.040 / 0.026 / 0.10 | 0.019 / 0.018 / 0.11 | 0.021 / 0.229 / 0.91 |
| G-Memory | 0.042 / 0.030 / 0.42 | 0.002 / 0.066 / 0.87 | 0.040 / 0.070 / 0.50 | 0.004 / 0.091 / 0.97 | 0.004 / 0.066 / 0.89 | 0.004 / 0.164 / 0.94 |
| ExpeL-ST | 0.039 / 0.012 / 0.10 | 0.014 / 0.064 / 0.32 | 0.040 / 0.065 / 0.50 | 0.028 / 0.034 / 0.30 | 0.028 / 0.017 / 0.18 | 0.000 / 0.170 / 1.00 |
| ExpeL-MT | 0.036 / 0.023 / 0.39 | 0.018 / 0.037 / 0.21 | 0.057 / 0.041 / 0.28 | 0.025 / 0.040 / 0.34 | 0.009 / 0.030 / 0.17 | 0.086 / 0.034 / 0.18 |
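To make the trajectory diagnostics concrete, the following is a minimal Python sketch of how such quantities could be computed from a running online-accuracy curve. The definitions used here are assumptions inferred from how the metrics are described in this section, not the paper's formal definitions: MER is taken as the gain from the trajectory minimum to the final value, PED as the drop from the trajectory peak to the final value, and $r_{\min}$ as the normalized position of the minimum. The function name `trajectory_diagnostics` is hypothetical.

```python
from typing import Dict, Sequence


def trajectory_diagnostics(online_acc: Sequence[float]) -> Dict[str, float]:
    """Summarize a running online-accuracy trajectory (values in [0, 1]).

    Assumed definitions (illustrative, not taken from the paper):
      MER   = final value minus trajectory minimum (recovery after the low point)
      PED   = trajectory maximum minus final value (peak-to-end degradation)
      r_min = position of the minimum, scaled to [0, 1]
    """
    if len(online_acc) < 2:
        raise ValueError("need at least two trajectory points")
    values = list(online_acc)
    final = values[-1]
    lowest = min(values)
    highest = max(values)
    # Index of the first occurrence of the minimum; ties resolve to the earliest step.
    argmin = values.index(lowest)
    return {
        "MER": final - lowest,
        "PED": highest - final,
        "r_min": argmin / (len(values) - 1),
    }


if __name__ == "__main__":
    # Two hypothetical trajectories with the same mean (0.75) but opposite dynamics.
    decaying = [0.9, 0.8, 0.7, 0.6]
    improving = [0.6, 0.7, 0.8, 0.9]
    print(trajectory_diagnostics(decaying))
    print(trajectory_diagnostics(improving))
```

Under these assumed definitions, the two hypothetical trajectories share the same average but land in opposite corners of the diagnostic space: the decaying one has zero MER and a late minimum, while the improving one has zero PED and an early minimum.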
### 4.2 Analysis of Hold-out Generalization (RQ2)
Finding 3. Online gains do not reliably translate to hold-out generalization. Table [5](https://arxiv.org/html/2605.15384#S4.T5) shows that final in-distribution (ID) hold-out generalization is often weak, even when memory methods improve online accuracy. Across several datasets, many methods only match or fall below the memory-free baseline on the final hold-out set. For example, on APIBench, most memory methods underperform the memory-free baseline in final hold-out accuracy, despite showing gains in aggregate online performance. This gap suggests that memory updates can improve performance on the observed stream without necessarily learning reusable knowledge that transfers to unseen examples.
Finding 4. Final hold-out accuracy can hide unstable generalization dynamics. The trend values in Table [5](https://arxiv.org/html/2605.15384#S4.T5) show that endpoint hold-out accuracy provides an incomplete view of memory generalization. Positive $\mathrm{Trend}_{\mathrm{HO}}$ values are not consistent across methods or datasets, and several methods achieve competitive final hold-out accuracy despite negative trends. This indicates that their final memory state may not reflect stable improvement over time. The hold-out trajectories in Appendix [C.1](https://arxiv.org/html/2605.15384#A3.SS1) further support this observation: intermediate memory states can outperform the final memory state, suggesting that useful memory may be overwritten, diluted, or destabilized by later updates. We provide a supporting case study in Appendix [F](https://arxiv.org/html/2605.15384#A6).
The OOD hold-out results in Appendix [C.2](https://arxiv.org/html/2605.15384#A3.SS2) show a similar but stronger pattern, where cross-task generalization is substantially weaker than ID generalization and memory updates can even hurt performance compared with the memory-free baseline. Overall, RQ2 shows that hold-out generalization should be analyzed through both endpoint performance and trajectory trends. These results highlight the need for memory methods to validate update quality, preserve useful intermediate states, and prevent harmful updates.
Table 5: In-distribution hold-out generalization on Qwen3-8B. We report the final $\mathrm{HoldOutAcc}$ using the final memory state and the hold-out trajectory trend $\mathrm{Trend}_{\mathrm{HO}}$. Higher values are better. Each cell reports HO-Acc / $\mathrm{Trend}_{\mathrm{HO}}$.

| Method | HumanEval | MATH500 | APIBench | MMLU-Eng | MMLU-Phys | ALFWorld |
|---|---|---|---|---|---|---|
| Memory-free | 75.0 / – | 67.5 / – | 67.8 / – | 50.0 / – | 69.0 / – | 75.0 / – |
| ExpRec k=10 | 78.1 / -0.001 | 71.4 / +0.030 | 54.2 / -0.111 | 53.1 / +0.039 | 66.7 / +0.004 | – / – |
| ExpRec k=3 | 78.1 / -0.017 | 70.4 / -0.003 | 67.5 / -0.145 | 55.2 / +0.011 | 69.8 / +0.028 | – / – |
| ExpRAG | 68.8 / -0.107 | 67.5 / -0.023 | 61.7 / -0.014 | 53.1 / -0.014 | 65.9 / +0.004 | – / – |
| DC-RS | 75.0 / +0.068 | 61.8 / -0.010 | 67.5 / +0.015 | 51.0 / +0.006 | 65.9 / -0.004 | – / – |
| AWM | 71.9 / -0.012 | 67.5 / +0.012 | 65.0 / -0.053 | 53.1 / +0.089 | 67.4 / -0.008 | 58.3 / +0.032 |
| ExpeL-ST | 78.1 / -0.012 | 70.0 / -0.014 | 66.7 / -0.054 | 58.3 / +0.066 | 65.9 / -0.004 | 58.3 / +0.088 |
| ExpeL-MT | 75.0 / -0.005 | 67.5 / +0.009 | 73.3 / +0.072 | 54.2 / -0.095 | 72.1 / +0.057 | 62.5 / +0.220 |
| G-Memory | 75.0 / -0.034 | 68.6 / -0.006 | 67.5 / -0.064 | 57.3 / +0.030 | 68.2 / -0.002 | 70.8 / -0.090 |
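As an illustration of why an endpoint and a trend can disagree, the sketch below computes a trend as the least-squares slope of hold-out accuracy over normalized checkpoint position. This is only one plausible reading of a trajectory trend and is an assumption here; the formal definition of $\mathrm{Trend}_{\mathrm{HO}}$ is given in the paper's metric section and may differ. The function name `holdout_trend` and the example values are hypothetical.

```python
from typing import Sequence


def holdout_trend(checkpoint_acc: Sequence[float]) -> float:
    """Least-squares slope of hold-out accuracy against checkpoint position.

    Checkpoint positions are scaled to [0, 1], so the slope reads as the
    fitted change in accuracy from the first to the last memory state.
    (Assumed definition for illustration; not taken from the paper.)
    """
    n = len(checkpoint_acc)
    if n < 2:
        raise ValueError("need at least two checkpoints")
    xs = [i / (n - 1) for i in range(n)]
    mean_x = sum(xs) / n
    mean_y = sum(checkpoint_acc) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, checkpoint_acc))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


if __name__ == "__main__":
    # Hypothetical hold-out accuracies at four successive memory checkpoints:
    # the final value looks competitive, but the fitted trend is negative.
    trajectory = [0.80, 0.78, 0.74, 0.75]
    print(holdout_trend(trajectory))
```

In this hypothetical trajectory the first checkpoint outperforms the final one, which is the kind of pattern an endpoint-only report would not reveal.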
### 4.3 Analysis of Backward Transfer (RQ3)
Figure 3: Backward transfer (top row) and forgetting (bottom row) for Qwen3-8B across different temporal horizons $t$. Higher BWT is better, while lower forgetting indicates better retention.

Finding 5. Later memory updates provide short-term benefits, but do not reliably consolidate transferable knowledge. As shown in the top row of Figure [3](https://arxiv.org/html/2605.15384#S4.F3), BWT is often positive at short horizons, indicating that recent memory updates can improve performance on earlier tasks. However, this effect does not consistently strengthen as the horizon $t$ grows. Instead, most methods exhibit fluctuating or diminishing BWT over longer horizons, suggesting that later memory updates do not reliably accumulate into increasingly useful retrospective knowledge.
This overall pattern arises from different failure modes across memory designs\. Retrieval\-based methods reveal a recency–retention trade\-off: ExpRecent can exhibit negative BWT once past tasks fall outside its recent window, while ExpRAG is more stable but often remains near zero, indicating limited consolidation beyond instance retrieval\. More structured memory methods, such as G\-Memory, sometimes achieve stronger positive BWT, but their curves are still highly non\-monotonic\. These results suggest that current memory methods can reuse nearby or directly relevant experiences, but often fail to transform longer streams of interaction into stable transferable knowledge\.
Finding 6. Immediate validity is often weak even right after a memory update. The case $t=1$ corresponds to immediate validity, which measures whether the memory update induced by the current task improves performance on that same task. Surprisingly, although the memory has just incorporated the current task and its feedback, $\mathrm{BWT}(1)$ is often below $5\%$ and can even be negative. This suggests that many memory updates are relatively shallow: they may append, retrieve, or summarize the latest interaction, but do not necessarily extract reusable information that can immediately benefit the same task.
Overall, RQ3 reveals a key limitation of existing memory mechanisms: they can often absorb local feedback, but they do not reliably consolidate it into transferable knowledge that benefits earlier tasks over longer horizons. This highlights the need for memory updates that go beyond storage and perform more effective reflection, abstraction, and validation.
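The horizon-indexed backward transfer discussed above can be sketched as follows. This is a minimal illustration under an assumed indexing convention, not the paper's formal definition: `perf[i][j]` is the score on task `i` when solved with the memory state obtained after `j` updates, so `perf[i][i]` is the online score (task `i` not yet incorporated) and `perf[i][i + 1]` is the score right after the update induced by task `i`. The function name `backward_transfer` and the example matrix are hypothetical.

```python
from typing import Sequence


def backward_transfer(perf: Sequence[Sequence[float]], t: int) -> float:
    """Average gain on task i from re-solving it t memory updates later.

    perf has one row per task and (num_tasks + 1) columns, one per memory state.
    BWT(t) = mean over i of perf[i][i + t] - perf[i][i]   (assumed definition).
    t = 1 corresponds to immediate validity.
    """
    if t < 1:
        raise ValueError("horizon t must be at least 1")
    num_tasks = len(perf)
    gains = [
        perf[i][i + t] - perf[i][i]
        for i in range(num_tasks)
        if i + t <= num_tasks  # memory state i + t must exist
    ]
    if not gains:
        raise ValueError("horizon exceeds the length of the task stream")
    return sum(gains) / len(gains)


if __name__ == "__main__":
    # Hypothetical binary outcomes for 3 tasks under memory states 0..3.
    perf = [
        [0, 1, 1, 0],  # task 0: fixed right after its own update, lost at the end
        [0, 0, 0, 1],  # task 1: only fixed two updates later
        [1, 1, 1, 1],  # task 2: solved regardless of memory
    ]
    for horizon in (1, 2, 3):
        print(horizon, backward_transfer(perf, horizon))
```

In this toy stream the transfer is non-monotonic in the horizon, mirroring the fluctuating curves described for Figure 3.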
### 4.4 Analysis of Forgetting (RQ4)
Finding 7. Sequential memory updates often interfere with previously acquired utility. The bottom row of Figure [3](https://arxiv.org/html/2605.15384#S4.F3) shows that forgetting generally increases with the horizon $t$. This indicates that performance previously achieved on earlier tasks is often not retained after additional memory updates. As more experiences are incorporated, memory states may overwrite useful information, dilute earlier knowledge, or retrieve less relevant content for past tasks.
Together with the BWT results, RQ4 reveals a central tension in sequential LLM memory: methods can often incorporate recent feedback and sometimes improve past performance, but they struggle to preserve these gains over longer horizons. Future memory mechanisms should therefore not only add or summarize new experiences, but also preserve useful knowledge, validate updates, and prevent interference between old and new memory content.
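A horizon-indexed forgetting measure can be sketched in the same spirit. The version below follows the common continual-learning convention of comparing the best score achieved so far with the current score; treating this as the paper's forgetting metric is an assumption. As an illustrative convention, `perf[i][j]` is the score on task `i` using the memory state after `j` updates, and the function name `forgetting` is hypothetical.

```python
from typing import Sequence


def forgetting(perf: Sequence[Sequence[float]], t: int) -> float:
    """Average drop from the best score reached so far to the score at horizon t.

    For each task i, compare the best score over memory states i .. i + t with
    the score at state i + t (assumed definition; always non-negative).
    """
    if t < 1:
        raise ValueError("horizon t must be at least 1")
    num_tasks = len(perf)
    drops = []
    for i in range(num_tasks):
        if i + t > num_tasks:  # memory state i + t does not exist
            continue
        best_so_far = max(perf[i][i : i + t + 1])
        drops.append(best_so_far - perf[i][i + t])
    if not drops:
        raise ValueError("horizon exceeds the length of the task stream")
    return sum(drops) / len(drops)


if __name__ == "__main__":
    # Hypothetical scores for 3 tasks under memory states 0..3.
    perf = [
        [0.0, 1.0, 1.0, 0.0],  # task 0 is solved after its update, then lost
        [0.0, 0.0, 1.0, 0.5],  # task 1 peaks one update late, then degrades
        [1.0, 1.0, 1.0, 1.0],  # task 2 is unaffected by memory
    ]
    for horizon in (1, 2, 3):
        print(horizon, forgetting(perf, horizon))
```

Because the comparison is against the best earlier state, this quantity can grow with the horizon even when the average backward transfer stays positive, which is the tension the two metrics are meant to expose.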
Figure 4:Efficiency–performance trade\-off on Qwen3\-8B\. Bubble labels indicate total token usage normalized by the memory\-free baseline\.
### 4.5 Analysis of Efficiency (RQ5)
Finding 8. Stronger memory performance often comes with substantial token and runtime overhead. Figure [4](https://arxiv.org/html/2605.15384#S4.F4) summarizes the efficiency–performance trade-off on Qwen3-8B. Most memory-augmented methods incur clear overhead over the memory-free baseline. Even lightweight retrieval methods such as ExpRecent and ExpRAG require about 4.3–4.7× more tokens while providing only modest gains. More complex methods often achieve stronger performance but at substantially higher cost: ExpeL-MT obtains the best Avg-5 accuracy with 7.4× token usage and the largest runtime increase, while G-Memory uses around 13× more tokens. DC-RS is especially costly, requiring about 3.2× runtime and 36× token usage, yet performs close to much cheaper methods.
Overall, RQ5 shows that memory effectiveness should be interpreted together with computational cost. Strong performance can reflect substantially higher token and latency budgets, rather than more efficient memory utilization. Detailed runtime and token breakdowns are provided in Appendix [C.3](https://arxiv.org/html/2605.15384#A3.SS3).
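One way to read accuracy and cost together is to relate the accuracy gain over the memory-free baseline to the extra token budget. The sketch below does this with the Qwen3-8B accuracies from Table 3 and the approximate token multiples quoted in Finding 8. Two things are assumptions rather than the paper's procedure: the mean over the five non-ALFWorld datasets is used as a stand-in that may or may not coincide with Avg-5, and the "gain per extra budget" ratio is an illustrative quantity that the paper does not define.

```python
# Qwen3-8B accuracies (%) from Table 3:
# HumanEval, MATH500, APIBench, MMLU-Eng., MMLU-Phys.
ACCURACY = {
    "Memory-free": [81.8, 68.6, 64.5, 50.3, 65.6],
    "ExpeL-MT": [90.9, 80.2, 82.0, 67.2, 79.7],
    "G-Memory": [83.3, 73.6, 80.8, 51.5, 68.9],
    "DC-RS": [84.8, 68.0, 71.4, 51.5, 65.5],
}

# Approximate token usage relative to the memory-free baseline (from Finding 8).
TOKEN_RATIO = {"Memory-free": 1.0, "ExpeL-MT": 7.4, "G-Memory": 13.0, "DC-RS": 36.0}


def mean_accuracy(method: str) -> float:
    """Unweighted mean accuracy over the five datasets listed above."""
    scores = ACCURACY[method]
    return sum(scores) / len(scores)


def gain_per_extra_budget(method: str, baseline: str = "Memory-free") -> float:
    """Accuracy points gained per additional baseline-sized token budget.

    Illustrative ratio only; not a metric defined in the paper.
    """
    extra_budget = TOKEN_RATIO[method] - TOKEN_RATIO[baseline]
    if extra_budget <= 0:
        raise ValueError("method must use more tokens than the baseline")
    return (mean_accuracy(method) - mean_accuracy(baseline)) / extra_budget


if __name__ == "__main__":
    for name in ("ExpeL-MT", "G-Memory", "DC-RS"):
        print(name, round(mean_accuracy(name), 2), round(gain_per_extra_budget(name), 3))
```

Under these assumptions the ordering by gain per extra budget is ExpeL-MT, then G-Memory, then DC-RS, which matches the qualitative reading that DC-RS pays the most for the least.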
## 5 Overall Comparison
Figure 5: Overall radar comparison on Qwen3-8B across multiple dimensions. Larger radial values indicate better within-dimension rankings.

To provide a holistic view of method behavior, Figure [5](https://arxiv.org/html/2605.15384#S5.F5) summarizes representative memory mechanisms across the five diagnostic dimensions of SeqMem-Eval. For each dimension, we normalize the underlying metrics, average them into a dimension-level score, and rank methods accordingly. The radar plot reports these relative ranks to provide a compact profile of method trade-offs.
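The normalize, average, and rank procedure described above can be sketched as follows. The normalization scheme is not specified in this section, so min-max scaling is an assumption, as is the choice of which metrics feed a given dimension. The example uses the Qwen3-8B ALFWorld values of MER and PED from Table 4 as a stand-in for an online-utility dimension; the function names are hypothetical.

```python
from typing import Dict, List


def minmax(values: List[float], higher_is_better: bool) -> List[float]:
    """Scale values to [0, 1] so that 1 is always the preferred end."""
    low, high = min(values), max(values)
    if high == low:
        return [0.5] * len(values)  # no spread: every method gets the same score
    scaled = [(v - low) / (high - low) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def dimension_ranks(
    metrics: Dict[str, Dict[str, float]],
    higher_is_better: Dict[str, bool],
) -> Dict[str, int]:
    """Rank methods within one dimension; a larger rank value means better."""
    methods = sorted(next(iter(metrics.values())))
    scores = {m: 0.0 for m in methods}
    for metric_name, per_method in metrics.items():
        normalized = minmax([per_method[m] for m in methods], higher_is_better[metric_name])
        for method, value in zip(methods, normalized):
            scores[method] += value / len(metrics)  # average across metrics
    worst_to_best = sorted(methods, key=lambda m: scores[m])
    return {method: rank for rank, method in enumerate(worst_to_best, start=1)}


if __name__ == "__main__":
    # MER (higher is better) and PED (lower is better) on ALFWorld, Qwen3-8B, from Table 4.
    online_utility = {
        "MER": {"AWM": 0.000, "G-Memory": 0.007, "ExpeL-ST": 0.005, "ExpeL-MT": 0.150},
        "PED": {"AWM": 0.486, "G-Memory": 0.246, "ExpeL-ST": 0.152, "ExpeL-MT": 0.122},
    }
    print(dimension_ranks(online_utility, {"MER": True, "PED": False}))
```

Reporting ranks rather than raw scores keeps the radar axes comparable across dimensions with very different scales, at the cost of hiding how large the gaps between methods are.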
The radar profiles reveal that different algorithmic designs lead to distinct trade\-offs\. ExpeL\-MT performs strongly on online utility and hold\-out generalization, which is consistent with its use of multiple attempts and reflection\-based experience extraction: repeated trials increase the chance of solving the current task, while distilled experiential rules can provide reusable signals for later tasks\. However, the same design also introduces substantial generation and update overhead, explaining its weaker efficiency profile\. ExpeL\-ST removes the multi\-try component and therefore offers a different trade\-off: it retains some benefits of reflection\-based memory while reducing the cost associated with repeated inference, suggesting that experience abstraction can be useful even without aggressive inference\-time search\.
Retrieval\-based methods exhibit the opposite behavior\. ExpRecent and ExpRAG are comparatively efficient because they mainly reuse stored examples through recent\-window or similarity\-based retrieval, without expensive memory synthesis\. However, their weaker profiles on generalization and transfer suggest that instance\-level retrieval alone provides limited abstraction: it can reuse nearby or similar experiences, but does not reliably consolidate them into higher\-level knowledge that transfers across time or tasks\. This observation is consistent with the BWT analysis, where retrieval\-based methods show limited consolidation beyond local reuse\.
Structured memory methods occupy an intermediate regime\. G\-Memory achieves a relatively balanced profile, especially on transfer\- and retention\-related dimensions, suggesting that organizing experiences into a more structured memory can help preserve and reuse information beyond raw instance retrieval\. At the same time, its efficiency is constrained by graph construction, insight maintenance, and additional retrieval operations\. DC\-RS shows another form of trade\-off\. Its retrieve–update–solve pipeline can help the current task by reusing retrieved past history, but its fallback behavior may also weaken memory abstraction: when a new cheatsheet cannot be extracted, the retrieved history is used directly as the updated memory\. This makes the memory useful as an ICL\-style context for nearby tasks, but can also produce verbose final memories that are costly and less reliable for long\-term generalization or retention\.
Overall, this comparison reinforces the main message of SeqMem-Eval: memory quality is not determined by a single dominant score, but by how a memory mechanism balances adaptation, generalization, consolidation, retention, and cost. The profiles also suggest concrete design implications. Methods should move beyond raw instance storage toward mechanisms that abstract reusable experience, but such abstraction must be paired with validation and cost control; otherwise, stronger online performance may come primarily from additional inference or update budget rather than from more reliable memory evolution.
## 6 Conclusion
Sequentially evolving LLM memory is an important step toward agents that can accumulate experience and adapt beyond independent task solving. However, existing evaluations often reduce memory quality to a single aggregate score. Our study showed that this view is incomplete: similar aggregate performance can hide substantially different memory dynamics, such as unstable online trajectories. We introduced SeqMem-Eval, a diagnostic evaluation framework that decomposes sequential memory behavior into online utility, hold-out generalization, backward transfer, forgetting, and efficiency. Across diverse tasks and memory methods, our analysis showed that these diagnostics provide a more complete view of memory behavior than aggregate scores alone. The resulting observations also offer useful insights for future memory research, highlighting the need to design and evaluate memory mechanisms in terms of generalization, transfer, retention, and efficiency. Overall, SeqMem-Eval reframes LLM memory evaluation as the analysis of an evolving process rather than a final endpoint. We hope this perspective helps guide the development of more reliable, generalizable, and efficient memory-augmented LLM agents.
## References
- [1] L. A. Agrawal, S. Tan, D. Soylu, N. Ziems, R. Khare, K. Opsahl-Ong, A. Singhvi, H. Shandilya, M. J. Ryan, M. Jiang, et al. (2025) Gepa: reflective prompt evolution can outperform reinforcement learning. arXiv preprint arXiv:2507.19457.
- [2] M. Biesialska, K. Biesialska, and M. R. Costa-Jussa (2020) Continual lifelong learning in natural language processing: a survey. In Proceedings of the 28th international conference on computational linguistics, pp. 6523–6541.
- [3] A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny (2018) Efficient lifelong learning with a-gem. arXiv preprint arXiv:1812.00420.
- [4] A. Chaudhry, M. Rohrbach, M. Elhoseiny, T. Ajanthan, P. K. Dokania, P. H. Torr, and M. Ranzato (2019) On tiny episodic memories in continual learning. arXiv preprint arXiv:1902.10486.
- [5] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba (2021) Evaluating large language models trained on code. External Links: 2107.03374, [Link](https://arxiv.org/abs/2107.03374).
- [6] S. Chen, S. Lin, Y. Shi, H. Lian, X. Gu, L. Yun, D. Chen, L. Cao, J. Liu, N. Xia, et al. (2025) Swe-exp: experience-driven software issue resolution. arXiv preprint arXiv:2507.23361.
- [7] P. Chhikara, D. Khant, S. Aryan, T. Singh, and D. Yadav (2025) Mem0: building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413.
- [8] J. Fang, Y. Peng, X. Zhang, Y. Wang, X. Yi, G. Zhang, Y. Xu, B. Wu, S. Liu, Z. Li, et al. (2025) A comprehensive survey of self-evolving ai agents: a new paradigm bridging foundation models and lifelong agentic systems. arXiv preprint arXiv:2508.07407.
- [9] R. Fang, Y. Liang, X. Wang, J. Wu, S. Qiao, P. Xie, F. Huang, H. Chen, and N. Zhang (2025) Memp: exploring agent procedural memory. arXiv preprint arXiv:2508.06433.
- [10] E. Feng, W. Zhou, Z. Liu, L. Chen, Y. Dong, C. Zhang, Y. Zhao, D. Du, Z. Hua, Y. Xia, et al. (2025) Get experience from practice: llm agents with record & replay. arXiv preprint arXiv:2505.17716.
- [11] H. Gao, J. Geng, W. Hua, M. Hu, X. Juan, H. Liu, S. Liu, J. Qiu, X. Qi, Y. Wu, et al. (2025) A survey of self-evolving agents: what, when, how, and where to evolve on the path to artificial super intelligence. arXiv preprint arXiv:2507.21046.
- [12] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021) Measuring mathematical problem solving with the math dataset. External Links: 2103.03874, [Link](https://arxiv.org/abs/2103.03874).
- [13] M. Ho, C. Si, Z. Feng, F. Yu, Y. Yang, Z. Liu, Z. Hu, and L. Qin (2025) Arcmemo: abstract reasoning composition with lifelong llm memory. arXiv preprint arXiv:2509.04439.
- [14] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114 (13), pp. 3521–3526.
- [15] Z. Li, S. Song, H. Wang, S. Niu, D. Chen, J. Yang, C. Xi, H. Lai, J. Zhao, Y. Wang, et al. (2025) Memos: an operating system for memory-augmented generation (mag) in large language models. arXiv preprint arXiv:2505.22101.
- [16] X. Liang, M. Tao, Y. Xia, J. Wang, K. Li, Y. Wang, Y. He, J. Yang, T. Shi, Y. Wang, et al. (2025) Sage: self-evolving agents with reflective and memory-augmented abilities. Neurocomputing 647, pp. 130470.
- [17] D. Lopez-Paz and M. Ranzato (2017) Gradient episodic memory for continual learning. Advances in neural information processing systems 30.
- [18] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, et al. (2023) Self-refine: iterative refinement with self-feedback. Advances in neural information processing systems 36, pp. 46534–46594.
- [19] MiniMax (2026) MiniMax M2.7: early echoes of self-evolution. Note: [https://www.minimax.io/news/minimax-m27-en](https://www.minimax.io/news/minimax-m27-en) Accessed: 2026-05-04.
- [20] S. Ouyang, J. Yan, I. Hsu, Y. Chen, K. Jiang, Z. Wang, R. Han, L. T. Le, S. Daruki, X. Tang, et al. (2025) Reasoningbank: scaling agent self-evolving with reasoning memory. arXiv preprint arXiv:2509.25140.
- [21] S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez (2023) Gorilla: large language model connected with massive apis. External Links: 2305.15334, [Link](https://arxiv.org/abs/2305.15334).
- [22] X. Qi, Y. Zeng, T. Xie, P. Chen, R. Jia, P. Mittal, and P. Henderson (2023) Fine-tuning aligned language models compromises safety, even when users do not intend to!. arXiv preprint arXiv:2310.03693.
- [23] N. Reimers and I. Gurevych (2019) Sentence-bert: sentence embeddings using siamese bert-networks. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pp. 3982–3992.
- [24] H. Shi, Z. Xu, H. Wang, W. Qin, W. Wang, Y. Wang, Z. Wang, S. Ebrahimi, and H. Wang (2025) Continual learning of large language models: a comprehensive survey. ACM Computing Surveys 58 (5), pp. 1–42.
- [25] M. Shridhar, X. Yuan, M. Côté, Y. Bisk, A. Trischler, and M. Hausknecht (2021) ALFWorld: aligning text and embodied environments for interactive learning. External Links: 2010.03768, [Link](https://arxiv.org/abs/2010.03768).
- [26] M. Suzgun, M. Yuksekgonul, F. Bianchi, D. Jurafsky, and J. Zou (2025) Dynamic cheatsheet: test-time learning with adaptive memory. External Links: 2504.07952, [Link](https://arxiv.org/abs/2504.07952).
- [27] X. Tang, T. Qin, T. Peng, Z. Zhou, D. Shao, T. Du, X. Wei, P. Xia, F. Wu, H. Zhu, et al. (2025) Agent kb: leveraging cross-domain experience for agentic problem solving. arXiv preprint arXiv:2507.06229.
- [28] Q. Team (2025) Qwen3 technical report. External Links: 2505.09388, [Link](https://arxiv.org/abs/2505.09388).
- [29] J. Wang, Z. Guo, W. Ma, and M. Zhang (2025) How far can llms improve from experience? measuring test-time learning ability in llms with human comparison. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 25688–25702.
- [30] J. Wang, Q. Yan, Y. Wang, Y. Tian, S. S. Mishra, Z. Xu, M. Gandhi, P. Xu, and L. L. Cheong (2025) Reinforcement learning for self-improving agent with skill library. arXiv preprint arXiv:2512.17102.
- [31] L. Wang, X. Zhang, H. Su, and J. Zhu (2024) A comprehensive survey of continual learning: theory, method and application. IEEE transactions on pattern analysis and machine intelligence 46 (8), pp. 5362–5383.
- [32] X. Wang, Y. Zhang, T. Chen, S. Gao, S. Jin, X. Yang, Z. Xi, R. Zheng, Y. Zou, T. Gui, et al. (2023) Trace: a comprehensive benchmark for continual learning in large language models. arXiv preprint arXiv:2310.06762.
- [33] Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, T. Li, M. Ku, K. Wang, A. Zhuang, R. Fan, X. Yue, and W. Chen (2024) MMLU-pro: a more robust and challenging multi-task language understanding benchmark. External Links: 2406.01574, [Link](https://arxiv.org/abs/2406.01574).
- [34] Z. Z. Wang, J. Mao, D. Fried, and G. Neubig (2024) Agent workflow memory. External Links: 2409.07429, [Link](https://arxiv.org/abs/2409.07429).
- [35] T. Wei, N. Sachdeva, B. Coleman, Z. He, Y. Bei, X. Ning, M. Ai, Y. Li, J. He, E. H. Chi, et al. (2025) Evo-memory: benchmarking llm agent test-time learning with self-evolving memory. arXiv preprint arXiv:2511.20857.
- [36] R. Wu, X. Wang, J. Mei, P. Cai, D. Fu, C. Yang, L. Wen, X. Yang, Y. Shen, Y. Wang, et al. (2025) Evolver: self-evolving llm agents through an experience-driven lifecycle. arXiv preprint arXiv:2510.16079.
- [37] T. Wu, M. Caccia, Z. Li, Y. F. Li, G. Qi, and G. Haffari (2022) Pretrained language model in continual learning: a comparative study. In International Conference on Learning Representations 2022.
- [38] T. Wu, L. Luo, Y. Li, S. Pan, T. Vu, and G. Haffari (2024) Continual learning for large language models: a survey. arXiv preprint arXiv:2402.01364.
- [39] P. Xia, J. Chen, H. Wang, J. Liu, K. Zeng, Y. Wang, S. Han, Y. Zhou, X. Zhao, H. Chen, et al. (2026) Skillrl: evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234.
- [40] Z. Xiang, C. Yang, Z. Chen, Z. Wei, Y. Tang, Z. Teng, Z. Peng, Z. Li, C. Huang, Y. He, et al. (2026) A systematic survey of self-evolving agents: from model-centric to environment-driven co-evolution.
- [41] W. Xu, Z. Liang, K. Mei, H. Gao, J. Tan, and Y. Zhang (2025) A-mem: agentic memory for llm agents. arXiv preprint arXiv:2502.12110.
- [42] G. Zhang, M. Fu, G. Wan, M. Yu, K. Wang, and S. Yan (2025) G-memory: tracing hierarchical memory for multi-agent systems. External Links: 2506.07398, [Link](https://arxiv.org/abs/2506.07398).
- [43] A. Zhao, D. Huang, Q. Xu, M. Lin, Y. Liu, and G. Huang (2024) ExpeL: llm agents are experiential learners. External Links: 2308.10144, [Link](https://arxiv.org/abs/2308.10144).
- [44] B. Zheng, M. Y. Fatemi, X. Jin, Z. Z. Wang, A. Gandhi, Y. Song, Y. Gu, J. Srinivasa, G. Liu, G. Neubig, et al. (2025) Skillweaver: web agents can self-improve by discovering and honing skills. arXiv preprint arXiv:2504.07079.
- [45] L. Zheng, R. Wang, X. Wang, and B. An (2023) Synapse: trajectory-as-exemplar prompting with memory for computer control. arXiv preprint arXiv:2306.07863.
- [46] W. Zhong, L. Guo, Q. Gao, H. Ye, and Y. Wang (2024) Memorybank: enhancing large language models with long-term memory. In Proceedings of the AAAI conference on artificial intelligence, Vol. 38, pp. 19724–19731.
- [47] H. Zhou, Y. Chen, S. Guo, X. Yan, K. H. Lee, Z. Wang, K. Y. Lee, G. Zhang, K. Shao, L. Yang, et al. (2025) Memento: fine-tuning llm agents without fine-tuning llms. arXiv preprint arXiv:2508.16153.
## Appendix A Detailed Experimental Setup
### A\.1 Dataset configuration\.
We evaluate on a diverse set of benchmarks spanning programming, mathematical reasoning, factual and domain\-specific reasoning, tool use, and embodied interaction\. For code generation, we use HumanEval \[[5](https://arxiv.org/html/2605.15384#bib.bib1)\]; for mathematical reasoning, we adopt MATH500 \[[12](https://arxiv.org/html/2605.15384#bib.bib2)\]; for factual and domain\-specific reasoning, we include MMLU\-Pro \[[33](https://arxiv.org/html/2605.15384#bib.bib3)\], focusing on advanced subjects such as engineering and physics; for tool\-use and API grounding, we evaluate on APIBench \[[21](https://arxiv.org/html/2605.15384#bib.bib4)\]; and for long\-horizon goal\-oriented interaction, we include ALFWorld \[[25](https://arxiv.org/html/2605.15384#bib.bib5)\]\. Together, these benchmarks cover both single\-turn and multi\-step settings, enabling evaluation of sequential memory across diverse forms of reasoning and interaction\.
For datasets without dedicated training splits \(e\.g\., HumanEval and MMLU\-Pro\), we construct the ID hold\-out set by taking the last 20% of the test set\. For APIBench, MATH500, and ALFWorld, the hold\-out sets are sampled from the training data\. The sampling is performed proportionally across task categories to preserve the original distribution, while ensuring that each category contains at least one task\. For APIBench, we only use the HF subset, as the TF and TH subsets are relatively small and are nearly saturated by most models\. Dataset statistics, including total size and hold\-out split size, are summarized in Table [6](https://arxiv.org/html/2605.15384#A1.T6)\.
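The two hold\-out constructions described above can be sketched as follows\. This is a minimal illustration, not the authors' code: the function names `tail_holdout` and `stratified_holdout`, the rounding rule, and the fixed seed are all assumptions\.

```python
import random
from collections import defaultdict


def tail_holdout(examples, fraction=0.2):
    """Split an ordered test set: the last `fraction` becomes the hold-out set."""
    n_hold = int(len(examples) * fraction)  # assumption: round down
    cut = len(examples) - n_hold
    return examples[:cut], examples[cut:]


def stratified_holdout(examples, category_of, n_holdout, seed=0):
    """Sample roughly n_holdout items proportionally per category.

    Every category contributes at least one item, so the returned set can be
    slightly larger than n_holdout when there are many small categories.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for example in examples:
        by_category[category_of(example)].append(example)

    holdout = []
    total = len(examples)
    for category in sorted(by_category):
        items = by_category[category]
        quota = max(1, round(n_holdout * len(items) / total))
        quota = min(quota, len(items))
        holdout.extend(rng.sample(items, quota))
    return holdout
```

Under the round\-down assumption, `tail_holdout` applied to a 164\-item list returns 132 sequential items and 32 hold\-out items\.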
Table 6: Dataset statistics, including total size and ID hold\-out size\.

| Dataset | Sequential test | Hold\-out |
| --- | --- | --- |
| HumanEval | 132 | 32 |
| MATH500 | 500 | 280 |
| MMLU\-Pro\-Engineering | 873 | 96 |
| MMLU\-Pro\-Physics | 1170 | 129 |
| APIBench\-HF | 911 | 120 |
| ALFWorld | 140 | 24 |

#### Sequential Evaluation Protocol\.
All datasets are evaluated under a unified sequential protocol\. Each example is processed once in sequence, and the memory state is incrementally updated after each interaction\. To ensure fair comparison, all methods are evaluated using the same retrieval budget\. Each dataset is evaluated under a fixed task order shared by all methods\. This design ensures that different memory mechanisms are compared under the same sequence of experiences\. We emphasize, however, that task ordering is an intrinsic factor in sequential memory evaluation: different orders may expose different forms of adaptation, interference, and forgetting\. Therefore, our reported results should be interpreted as diagnostics under controlled task streams rather than order\-invariant estimates of memory quality\. This protocol enables consistent measurement of online utility, hold\-out generalization, backward transfer, forgetting, and efficiency across tasks and memory methods\.
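A minimal sketch of this protocol, under stated assumptions: the `retrieve`/`update` memory interface, the `solve` and `is_correct` callables, and the `eval_every` checkpoint interval are hypothetical names introduced here for illustration\. Hold\-out evaluation reads the current memory state without updating it\.

```python
def run_stream(tasks, solve, memory, is_correct, holdout=(), eval_every=0, budget=3):
    """Process each task once, in a fixed order, updating memory after each one.

    Returns the online accuracy and a list of (step, hold-out accuracy) pairs.
    """
    online_hits = []
    holdout_curve = []
    for step, (x, y) in enumerate(tasks, start=1):
        context = memory.retrieve(x, budget)   # same retrieval budget for all methods
        y_hat = solve(x, context)
        hit = is_correct(y_hat, y)
        online_hits.append(hit)
        memory.update(x, y_hat, hit)           # incremental update after the interaction

        if eval_every and holdout and step % eval_every == 0:
            # Hold-out questions only read the memory state; they never update it.
            hits = [is_correct(solve(hx, memory.retrieve(hx, budget)), hy)
                    for hx, hy in holdout]
            holdout_curve.append((step, sum(hits) / len(hits)))

    online_acc = sum(online_hits) / len(online_hits) if online_hits else 0.0
    return online_acc, holdout_curve


class CountingMemory:
    """Toy memory for illustration: it only remembers how many updates it has seen."""

    def __init__(self):
        self.n = 0

    def retrieve(self, x, budget):
        return str(self.n)

    def update(self, x, y_hat, hit):
        self.n += 1
```

Because the task order is an argument of the loop rather than a property of the memory, rerunning `run_stream` with a permuted `tasks` list is a direct way to probe the order sensitivity noted above\.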
### A\.2 Model configuration\.
We conduct experiments using two representative large language models: Qwen3\-8B \[[28](https://arxiv.org/html/2605.15384#bib.bib20)\] and MiniMax\-M2\.7 \[[19](https://arxiv.org/html/2605.15384#bib.bib21)\]\.
#### Qwen3\-8B model configuration\.
We use the officially recommended temperature of 0\.7 to balance generation quality and diversity\. For HumanEval, MATH500, APIBench, and MMLU\-Pro, reasoning is disabled, and the maximum number of generated tokens is set to 2048 for all methods\. For ALFWorld, due to its higher task complexity, we enable reasoning and increase the maximum number of generated tokens to 8192 to allow for extended reasoning steps\. All experiments are conducted on a single A100\-80GB GPU\.
#### MiniMax\-M2\.7 model configuration\.
We access the MiniMax model via the OpenRouter API\. For all tasks and methods, we use a unified configuration with temperature set to 1\.0, reasoning level set to low, and the maximum number of generated tokens set to 16384\.
### A\.3 Methods
We evaluate a representative set of sequentially evolving LLM memory methods to systematically analyze memory behaviors under a unified sequential setting\.
\(1\) Memory\-free baseline\. The memory\-free baseline solves each task using only the underlying LLM without persistent memory\. It serves as a reference point for measuring whether external memory provides benefits beyond the base model\.
\(2\) Instance\-level experience retrieval\. We include ExpRecent \($k=10$\), ExpRecent \($k=3$\), and ExpRAG \($k=3$\) as lightweight baselines for raw experience reuse \[[35](https://arxiv.org/html/2605.15384#bib.bib10)\]\. ExpRecent conditions on the most recent $k$ tasks, while ExpRAG retrieves the top\-$k$ most similar past tasks based on embedding similarity to the current input\. This comparison separates recency\-based memory from relevance\-based retrieval and evaluates whether instance\-level experiences alone are sufficient for sequential adaptation without additional memory abstraction or refinement\.
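The difference between the two baselines comes down to how the $k$ experiences are selected\. The sketch below illustrates that selection step only; the function names and the plain\-list embedding format are assumptions, and the cosine similarity is written out by hand rather than calling the embedding library used in the experiments\.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def exp_recent(history, k):
    """Recency-based selection: the k most recent past experiences."""
    return history[-k:] if k > 0 else []


def exp_rag(history, embeddings, query_vec, k):
    """Relevance-based selection: the k past experiences most similar to the query.

    `embeddings[i]` is assumed to be the embedding of `history[i]`.
    """
    ranked = sorted(range(len(history)),
                    key=lambda i: cosine(embeddings[i], query_vec),
                    reverse=True)
    return [history[i] for i in ranked[:k]]
```

With the same `history`, the two functions can return disjoint sets whenever the most similar past tasks are not the most recent ones, which is the contrast this comparison is designed to expose\.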
\(3\) Structured and evolving memory\. We evaluate the Retrieval\-and\-Synthesis variant of Dynamic Cheatsheet \(DC\-RS\) \[[26](https://arxiv.org/html/2605.15384#bib.bib6)\], Agent Workflow Memory \(AWM\) \[[34](https://arxiv.org/html/2605.15384#bib.bib7)\], and G\-Memory \[[42](https://arxiv.org/html/2605.15384#bib.bib8)\]\. These methods go beyond raw experience retrieval by organizing memory into higher\-level structures such as cheatsheets, workflows, or hierarchical memory graphs\. They allow us to examine whether structured memory representations improve transfer, retention, and generalization during sequential inference\.
\(4\) Experiential reflection\. We include ExpeL \[[43](https://arxiv.org/html/2605.15384#bib.bib9)\], which derives reusable insights from successful and failed trajectories through reflection\. Since the original formulation is not directly designed for our sequential evaluation protocol, we refactor it into a sequential memory setting\. We evaluate its multi\-try version as ExpeL\-MT, following the original strategy of allowing multiple attempts per task, and additionally implement a single\-try variant, ExpeL\-ST, for controlled comparison\. This comparison helps distinguish gains from experiential memory from those introduced by additional inference\-time attempts\. Detailed algorithmic descriptions and implementation details for all methods are provided in Appendix [D](https://arxiv.org/html/2605.15384#A4)\.
#### Method configuration\.
Method configurations are summarized in Table [7](https://arxiv.org/html/2605.15384#A1.T7)\. For all methods involving retrieval, we standardize the pipeline by using Qwen3\-0\.6B\-embedding \[[28](https://arxiv.org/html/2605.15384#bib.bib20)\] as the embedding model\. Dense retrieval is performed by computing cosine similarity using the Sentence Transformers library \[[23](https://arxiv.org/html/2605.15384#bib.bib23)\]\.
Table 7: Method\-specific configurations used in sequential experiments\.

| Method | Configuration |
| --- | --- |
| ExpRecent | top\-k=3, top\-k=10 |
| ExpRAG | top\-k=3 |
| AWM | top\-k=3, induce\-steps=1 |
| DC\-RS | top\-k=3 |
| ExpeL\-MT | top\-k=3, max\-tries=3, batch\-update\-size=8, max\-num\-rules=20 |
| ExpeL\-ST | top\-k=3, batch\-update\-size=8, max\-num\-rules=20 |
| G\-Memory | successful\-topk=2, failed\-topk=1, insights\-topk=10 |
## Appendix B Practical Use of SeqMem\-Eval
SeqMem\-Eval is not intended to collapse memory quality back into a single universal score\. Different deployment settings may prioritize different aspects of memory behavior, such as online adaptation, hold\-out generalization, retention, or computational cost\. For example, an interactive agent may prioritize online utility and latency, whereas a long\-term reasoning assistant may place more emphasis on hold\-out generalization and forgetting\.
As a practical guideline for method comparison, we suggest a two\-stage procedure\. First, identify methods that are Pareto competitive with respect to primary deployment constraints, such as $\mathrm{OnlineAcc}$, $\mathrm{HoldOutAcc}$, and efficiency\. Second, among these candidates, use the diagnostic metrics to inspect memory dynamics: prefer methods with positive $\mathrm{Trend}_{\mathrm{HO}}$, non\-negative BWT at short and medium horizons, low forgetting, and low PED\. This procedure preserves the diagnostic nature of the framework while providing a concrete way to compare methods without reducing memory quality to a single score\.
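The two\-stage procedure can be sketched as a Pareto filter followed by a diagnostic check\. Everything below is illustrative: the dictionary keys, the function names, and the thresholds for "low" forgetting and "low" PED are assumptions, since the text does not fix numerical cut\-offs\.

```python
def pareto_front(methods, maximize=("online_acc", "holdout_acc"), minimize=("cost",)):
    """Stage 1: keep the methods that no other method dominates."""

    def dominates(a, b):
        no_worse = (all(a[m] >= b[m] for m in maximize)
                    and all(a[m] <= b[m] for m in minimize))
        strictly_better = (any(a[m] > b[m] for m in maximize)
                           or any(a[m] < b[m] for m in minimize))
        return no_worse and strictly_better

    return [name for name, scores in methods.items()
            if not any(dominates(other, scores)
                       for other_name, other in methods.items()
                       if other_name != name)]


def passes_diagnostics(d, max_forgetting=0.05, max_ped=0.05):
    """Stage 2: check memory dynamics (thresholds are hypothetical placeholders)."""
    return (d["trend_ho"] > 0
            and d["bwt_short"] >= 0
            and d["bwt_medium"] >= 0
            and d["forgetting"] <= max_forgetting
            and d["ped"] <= max_ped)
```

Keeping the two stages as separate functions mirrors the intent of the guideline: the Pareto step narrows the field by deployment constraints, while the diagnostic step stays a set of per\-metric checks rather than a weighted sum\.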
## Appendix C Additional Experimental Results
### C\.1 ID hold\-out test results
Figure 6: In\-distribution \(ID\) hold\-out accuracy over sequential steps for different models\. Panels: \(a\) Qwen3\-8B; \(b\) MiniMax\-M2\.7\.

[Figure 6](https://arxiv.org/html/2605.15384#A3.F6) presents the full ID hold\-out trajectories over sequential memory updates\. Consistent with the main\-text analysis, hold\-out generalization rarely improves monotonically as memory evolves\. Across many datasets, trajectories fluctuate substantially around the memory\-free baseline, and later memory states often fail to outperform earlier ones\. This pattern is particularly pronounced on APIBench and MMLU\-Pro\-Engineering, where several methods exhibit transient improvements followed by degradation, suggesting that memory updates may overfit to recently observed interactions rather than accumulate transferable knowledge\. In multiple cases, peak hold\-out performance occurs at intermediate stages rather than in the final memory state, indicating that useful information can later be diluted, overwritten, or negatively affected by subsequent updates\.
The trajectory\-level analysis also reveals clear differences across memory mechanisms\. Retrieval\-based methods generally produce relatively stable, low\-variance trajectories close to the baseline, implying limited long\-term knowledge accumulation despite reduced instability\. In contrast, stronger memory systems such as ExpeL\-MT and G\-Memory often achieve higher peak hold\-out performance but exhibit highly non\-stationary trajectories with substantial oscillations across memory steps\. This trade\-off suggests that aggressive memory synthesis and refinement may improve adaptability while reducing memory stability\. Moreover, identical methods can exhibit qualitatively different hold\-out dynamics across LLM backbones, indicating that memory generalization depends not only on stored experience but also on the backbone’s ability to consistently exploit evolving memory states\. Overall, these results further support the central claim of RQ2: final hold\-out accuracy alone obscures important temporal properties of memory generalization, including instability, reversibility, and sensitivity to later updates\.
### C\.2 OOD hold\-out test results on Qwen3\-8B
We evaluate OOD generalization by transferring memory states learned from the MATH500 stream to AIME2024, AIME2025, and MMLU\-Pro\-Physics, with results shown in [Figure 7](https://arxiv.org/html/2605.15384#A3.F7)\. Compared with the ID setting, OOD trajectories exhibit weaker and less stable gains, indicating limited transferability of learned memory beyond the training distribution\.
On AIME2024 and AIME2025, memory provides only marginal improvements over the memory\-free baseline, likely due to partial overlap in mathematical reasoning patterns\. However, these gains are inconsistent across methods and memory stages, suggesting that the learned memory captures limited transferable structure\. The degradation is more pronounced on MMLU\-Pro\-Physics, where performance remains consistently below the baseline, even for methods that construct higher\-level synthesized insights\. This behavior indicates weak cross\-domain transfer and limited generalization of memory beyond the source domain\.
Across all OOD datasets, hold\-out trajectories exhibit substantial fluctuations, and later memory states frequently fail to outperform earlier ones\. Although this instability mirrors the behavior observed under ID evaluation, it becomes more pronounced under distribution shift, highlighting the lack of stable and transferable memory evolution\.
Figure 7:Out\-of\-distribution \(OOD\) hold\-out accuracy over sequential steps for the Qwen3\-8B model, evaluated on unseen tasks not encountered during the online updating process\.
### C\.3 Efficiency Results
Figures [8](https://arxiv.org/html/2605.15384#A3.F8) and [9](https://arxiv.org/html/2605.15384#A3.F9) present detailed runtime and token\-consumption breakdowns across datasets and memory mechanisms for Qwen3\-8B and MiniMax\-M2\.7\. Consistent with the main\-text analysis, runtime is dominated by generation, while token overhead primarily originates from memory construction and update operations\. The breakdown further shows that different memory mechanisms induce distinct efficiency profiles\. Retrieval\-based methods such as ExpRecent and ExpRAG mainly increase input tokens through longer retrieved contexts while incurring relatively small update overhead, resulting in moderate and stable computational growth\. In contrast, methods based on synthesized or structured memory introduce substantially larger update costs\. DC\-RS exhibits particularly high token consumption because each update requires long structured prompts for cheatsheet synthesis, causing update tokens to dominate total usage on several datasets\. G\-Memory shows a different overhead pattern: although its token growth is smaller than that of DC\-RS, runtime increases substantially due to graph construction, insight maintenance, and additional retrieval operations during inference\.
The dataset\-level breakdown further indicates that efficiency depends strongly on task characteristics\. On reasoning\-intensive datasets such as MMLU\-Pro\-Engineering and MMLU\-Pro\-Physics, runtime increases disproportionately even under moderate token growth, suggesting that latency is affected not only by context length but also by more complex inference trajectories\. ALFWorld exhibits the most distinct behavior: methods with iterative interaction or retrieval mechanisms incur extremely large runtime despite comparatively moderate token usage, indicating that multi\-step environment interaction amplifies latency beyond pure language generation cost\. In addition, identical memory mechanisms can exhibit substantially different efficiency profiles across LLM backbones\. MiniMax\-M2\.7 generally produces higher runtime despite similar or lower token counts, implying that memory overhead is jointly determined by both the external memory algorithm and the backbone\-specific inference characteristics\. Overall, these results show that aggregate token usage alone is insufficient to characterize computational efficiency, as memory update complexity, retrieval structure, and interaction dynamics contribute to end\-to\-end cost differently\.
Figure 8: Efficiency analysis of token consumption and runtime for the Qwen3\-8B model\. Panels: \(a\) Token consumption; \(b\) Runtime\.

Figure 9: Efficiency analysis of token consumption and runtime for the MiniMax\-M2\.7 model\. Panels: \(a\) Token consumption; \(b\) Runtime\.
## Appendix D Algorithms
Algorithm 1 Dynamic Cheatsheet Retrieval\-and\-Synthesis \(DC\-RS\)

Input: Sequence of tasks $\mathcal{T}=\{T_i\}_{i=1}^{N}$, large language model $\mathcal{L}$, retrieval budget $K$, generator template $P_{gen}$, curator template $P_{cur}$

1: Initialize: Cheatsheet $\mathcal{M}_0\leftarrow\emptyset$, history $\mathcal{H}_0\leftarrow\emptyset$
2: for Task $T_i=(x_i,y_i)\in\mathcal{T}$ do
3:   if $|\mathcal{H}_{i-1}|>0$ then
4:     Retrieve Top\-$K$ similar past cases $C_{retr}$ from $\mathcal{H}_{i-1}$ $\triangleright$ Search
5:   else
6:     $C_{retr}\leftarrow\emptyset$
7:   end if
8:   Update cheatsheet $\mathcal{M}_i\leftarrow\mathcal{L}(P_{cur},\;C_{retr},\;x_i,\;\mathcal{M}_{i-1})$ $\triangleright$ Evolve
9:   Answer $\hat{y}_i\leftarrow\mathcal{L}(P_{gen},\;x_i,\;\mathcal{M}_i)$ $\triangleright$ Synthesis
10:   Update history $\mathcal{H}_i\leftarrow\mathcal{H}_{i-1}\cup\{x_i,\hat{y}_i\}$ $\triangleright$ Evolve
11: end for
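The control flow of Algorithm 1 can be sketched in Python as below\. This is a minimal illustration rather than the reference implementation: `llm` and `retrieve` are hypothetical callables supplied by the caller, and the keyword\-argument calling convention for `llm` is an assumption\.

```python
def run_dc_rs(tasks, llm, retrieve, k, p_gen, p_cur):
    """Sketch of the DC-RS loop: retrieve, curate the cheatsheet, then answer.

    tasks:    iterable of (x, y) pairs; y is not used inside the loop
    llm:      hypothetical callable llm(template, **fields) -> str
    retrieve: hypothetical callable retrieve(x, history, k) -> list of past cases
    """
    cheatsheet = ""   # M_0 (empty cheatsheet)
    history = []      # H_0 (empty history)
    answers = []
    for x, _y in tasks:
        # Search: retrieval is skipped while the history is still empty.
        cases = retrieve(x, history, k) if history else []
        # Evolve: the curator rewrites the cheatsheet before answering.
        cheatsheet = llm(p_cur, cases=cases, question=x, cheatsheet=cheatsheet)
        # Synthesis: the generator answers with the updated cheatsheet.
        y_hat = llm(p_gen, question=x, cheatsheet=cheatsheet)
        # Evolve: the history stores the model's own prediction.
        history.append((x, y_hat))
        answers.append(y_hat)
    return answers, cheatsheet, history
```

Note that the ground\-truth label never enters the loop: as in the pseudocode, both the cheatsheet update and the stored history depend only on the input and the model's own output\.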
Algorithm 2 Agent Workflow Memory \(AWM\)

Input: Sequence of tasks $\mathcal{T}=\{T_i\}_{i=1}^{N}$, large language model $\mathcal{L}$, induction prompt $P_{in}$, one\-shot prompt $P_{one}$

1: Initialize: base memory $\mathcal{M}_0\leftarrow\emptyset$
2: for Task $T_i=(x_i,y_i)\in\mathcal{T}$ do
3:   Answer $\hat{y}_i\leftarrow\mathcal{L}(x_i,\mathcal{M}_{i-1})$ and get experience $e_i\leftarrow(x_i,\hat{y}_i)$ $\triangleright$ Synthesis
4:   $success\leftarrow\mathrm{Evaluator}(e_i)$
5:   if $success=1$ then
6:     $w_i\leftarrow\mathcal{L}(P_{in}\;\|\;P_{one}\;\|\;e_i\;\|\;\texttt{"\#\# Summary Workflows"})$ $\triangleright$ Evolve
7:     $\mathcal{M}_i\leftarrow\mathcal{M}_{i-1}\cup w_i$ $\triangleright$ Evolve
8:   else
9:     $\mathcal{M}_i\leftarrow\mathcal{M}_{i-1}$
10:   end if
11: end for
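A compact Python sketch of the AWM loop follows\. It is illustrative only: `solve`, `induce`, and `evaluator` are hypothetical callables standing in for the two LLM calls and the success check, and the way the experience is serialized into the induction prompt is an assumption\.

```python
def run_awm(tasks, solve, induce, evaluator, p_in, p_one):
    """Sketch of AWM: induce a workflow only from successful experiences.

    solve:     hypothetical callable solve(x, memory) -> y_hat
    induce:    hypothetical callable induce(prompt) -> workflow text
    evaluator: hypothetical callable evaluator((x, y_hat)) -> bool
    """
    memory = []    # M_0: list of induced workflows
    answers = []
    for x, _y in tasks:
        y_hat = solve(x, memory)              # Synthesis with the previous memory
        experience = (x, y_hat)
        answers.append(y_hat)
        if evaluator(experience):
            # Evolve: concatenate the prompts, the experience, and the header.
            prompt = "\n".join([p_in, p_one, f"{x} -> {y_hat}", "## Summary Workflows"])
            memory.append(induce(prompt))
        # On failure the memory is left unchanged.
    return answers, memory
```

Since failed experiences never reach `induce`, the memory in this sketch can only grow; nothing in the loop removes or rewrites an existing workflow\.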
Algorithm 3 ExpeL\-ST

Input: Sequence of tasks $\mathcal{T}=\{T_i\}_{i=1}^{N}$, large language model $\mathcal{L}$, insight extraction prompt $P_{in}$, retrieval budget $K$, step size $L$

1: Initialize: Experience pool $\mathcal{B}\leftarrow F_{\textsc{manual}}$ \(seed few\-shot demos if available\), recent success trajectories $\mathcal{S}\leftarrow\emptyset$, insight set $\hat{\iota}\leftarrow\emptyset$
2: for Task $T_i=(x_i,y_i)\in\mathcal{T}$ do
3:   Retrieve Top\-$K$ similar success cases $F_{\textsc{sim}}$ from $\mathcal{B}$ $\triangleright$ Search
4:   Answer $\hat{y}_i\leftarrow\mathcal{L}(x_i,F_{\textsc{sim}},\hat{\iota})$ $\triangleright$ Synthesis
5:   $success\leftarrow\mathrm{Evaluator}(x_i,y_i,\hat{y}_i)$
6:   if $success=1$ then
7:     $\mathcal{S}\leftarrow\mathcal{S}\cup\{(x_i,\hat{y}_i)\}$
8:     $\mathcal{B}\leftarrow\mathcal{B}\cup\{(x_i,\hat{y}_i)\}$ $\triangleright$ Evolve
9:     if $|\mathcal{S}|=L$ then
10:       $\hat{\iota}\leftarrow\mathcal{L}(P_{in},\mathcal{S},\hat{\iota})$ $\triangleright$ Evolve
11:       $\mathcal{S}\leftarrow\emptyset$
12:     end if
13:   end if
14: end for
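The batch\-update logic of ExpeL\-ST can be sketched as follows\. The callables `solve`, `evaluator`, `extract_insights`, and `retrieve` are hypothetical stand\-ins for the LLM calls and retrieval step; only the control flow is taken from Algorithm 3\.

```python
def run_expel_st(tasks, solve, evaluator, extract_insights, retrieve,
                 k, step_size, seed_demos=()):
    """Sketch of ExpeL-ST: one attempt per task, insights refreshed every
    `step_size` successes."""
    pool = list(seed_demos)   # B: experience pool of successful cases
    recent = []               # S: successes since the last insight update
    insights = []             # current insight set
    answers = []
    for x, y in tasks:
        demos = retrieve(x, pool, k)              # Search
        y_hat = solve(x, demos, insights)         # Synthesis
        answers.append(y_hat)
        if evaluator(x, y, y_hat):
            recent.append((x, y_hat))
            pool.append((x, y_hat))               # Evolve: grow the pool
            if len(recent) == step_size:
                # Evolve: batch update of the insight set, then reset S.
                insights = extract_insights(recent, insights)
                recent = []
    return answers, pool, insights
```

Failed attempts leave both the pool and the insight set untouched, so in this single\-try variant memory changes are driven entirely by successes\.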
Algorithm 4 ExpeL\-MT

Input: Sequence of tasks $\mathcal{T}=\{T_i\}_{i=1}^{N}$, large language model $\mathcal{L}$, self\-reflection model $\mathcal{L}_{\textsc{reflect}}$, insight prompt $P_{in}$, retrieval budget $K$, step size $L$, max tries $Z$

1: Initialize: Experience pool $\mathcal{B}\leftarrow F_{\textsc{manual}}$ \(seed demos if available\), recent success set $\mathcal{S}\leftarrow\emptyset$, insight set $\hat{\iota}\leftarrow\emptyset$
2: for Task $T_i=(x_i,y_i)\in\mathcal{T}$ do
3:   Retrieve Top\-$K$ similar success cases $F_{\textsc{sim}}$ from $\mathcal{B}$ $\triangleright$ Search
4:   $\nu\leftarrow\texttt{""}$
5:   $\tau^{\textsc{succ}}\leftarrow\emptyset$, $\tau^{\textsc{fail}}\leftarrow\emptyset$
6:   for $z=1$ to $Z$ do
7:     $\hat{y}_i^{(z)}\leftarrow\mathcal{L}(x_i,F_{\textsc{sim}},\hat{\iota},\nu)$ $\triangleright$ Synthesis
8:     $success^{(z)}\leftarrow\mathrm{Evaluator}(x_i,y_i,\hat{y}_i^{(z)})$
9:     if $success^{(z)}=1$ then
10:       $\tau^{\textsc{succ}}\leftarrow(x_i,\hat{y}_i^{(z)})$
11:       $\mathcal{B}\leftarrow\mathcal{B}\cup\{\tau^{\textsc{succ}}\}$ $\triangleright$ Evolve
12:       break
13:     else
14:       $\tau^{\textsc{fail}}\leftarrow(x_i,\hat{y}_i^{(z)})$
15:       $\nu\leftarrow\textsc{Concat}(\nu,\;\mathcal{L}_{\textsc{reflect}}(\tau^{\textsc{fail}}))$ $\triangleright$ Reflect
16:     end if
17:   end for
18:   if $\tau^{\textsc{succ}}\neq\emptyset$ then
19:     $\mathcal{S}\leftarrow\mathcal{S}\cup\{\tau^{\textsc{succ}}\}$
20:   end if
21:   if $\tau^{\textsc{succ}}\neq\emptyset$ and $\tau^{\textsc{fail}}\neq\emptyset$ then
22:     $\hat{\iota}\leftarrow\mathcal{L}(P_{in},\tau^{\textsc{succ}},\tau^{\textsc{fail}},\hat{\iota})$ $\triangleright$ Pair Update
23:   end if
24:   if $|\mathcal{S}|=L$ then
25:     $\hat{\iota}\leftarrow\mathcal{L}(P_{in},\mathcal{S},\hat{\iota})$ $\triangleright$ Batch Update
26:     $\mathcal{S}\leftarrow\emptyset$
27:   end if
28: end for
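What separates ExpeL\-MT from ExpeL\-ST is the inner retry\-and\-reflect loop \(steps 4 to 17\)\. A minimal sketch of just that loop is given below; `solve`, `evaluator`, and `reflect` are hypothetical callables, and reflections are concatenated as plain strings for simplicity\.

```python
def attempt_with_reflection(x, y, solve, evaluator, reflect, demos, insights, max_tries):
    """Sketch of the ExpeL-MT inner loop: retry up to max_tries, reflecting on failures.

    Returns (success_trajectory, last_failed_trajectory, reflection_notes);
    a trajectory is None if no such attempt occurred.
    """
    notes = ""        # nu: accumulated self-reflections for this task
    success = None    # tau^succ
    failure = None    # tau^fail (the most recent failed attempt)
    for _ in range(max_tries):
        y_hat = solve(x, demos, insights, notes)   # Synthesis
        if evaluator(x, y, y_hat):
            success = (x, y_hat)
            break
        failure = (x, y_hat)
        notes += reflect(failure)                  # Reflect on the failed attempt
    return success, failure, notes
```

When both `success` and `failure` are set, the outer loop has a success/failure pair for the same task, which is the condition for the pair update in step 21\. Calling the function with `max_tries=1` removes the retry path, so the result depends only on the memory passed in\.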
Algorithm 5 G\-Memory

Input: Task stream $\mathcal{T}$, multi\-agent system \(MAS\) $\mathcal{G}$, query graph $\mathcal{G}_{\textsc{qry}}=(\mathcal{Q},\mathcal{E}_{\textsc{qry}})$ with query nodes $\mathcal{Q}=\{q_i\}$, insight graph $\mathcal{G}_{\textsc{ins}}=(\mathcal{I},\mathcal{E}_{\textsc{ins}})$, budgets $K,M$

1: for $(Q_i,y_i)\in\mathcal{T}$ do
2:   $\mathcal{Q}^{S}\leftarrow\textsc{Retrieve}(Q_i,\mathcal{Q},K)$ $\triangleright$ Search
3:   $\widetilde{\mathcal{Q}}^{S}\leftarrow\textsc{Expand}(\mathcal{Q}^{S},\mathcal{G}_{\textsc{qry}})$ $\triangleright$ Search
4:   $\mathcal{I}^{S}\leftarrow\Pi_{Q\rightarrow I}(\widetilde{\mathcal{Q}}^{S},\mathcal{G}_{\textsc{ins}})$ $\triangleright$ Search
5:   $\mathcal{Q}^{R}\leftarrow\textsc{Select}(\widetilde{\mathcal{Q}}^{S},M)$
6:   $\widehat{\mathcal{G}}_{\textsc{inter}}\leftarrow\textsc{Sparsify}(\mathcal{Q}^{R},Q_i)$
7:   for $C_j\in\mathcal{G}$ do
8:     $Mem_j\leftarrow\Phi(\mathcal{I}^{S},\widehat{\mathcal{G}}_{\textsc{inter}},Role_j,Q_i)$
9:   end for
10:   $\hat{y}_i,\tau_i\leftarrow\mathcal{G}(Q_i)$ $\triangleright$ Synthesis
11:   $\Psi_i\leftarrow\mathrm{Evaluator}(\hat{y}_i,y_i)$
12:   $\mathcal{G}^{(Q_i)}_{\textsc{inter}}\leftarrow\textsc{Trace}(\tau_i)$
13:   $\mathcal{G}_{\textsc{qry}}\leftarrow\textsc{UpdateQueryGraph}(\mathcal{G}_{\textsc{qry}},Q_i,\Psi_i,\mathcal{G}^{(Q_i)}_{\textsc{inter}},\mathcal{Q}^{R},\mathcal{I}^{S})$ $\triangleright$ Evolve
14:   $\mathcal{G}_{\textsc{ins}}\leftarrow\textsc{UpdateInsightGraph}(\mathcal{G}_{\textsc{ins}},\mathcal{I}^{S},\mathcal{G}^{(Q_i)}_{\textsc{inter}},\Psi_i)$ $\triangleright$ Evolve
15: end for
## Appendix E Prompts
AIME2024 / AIME2025 Prompt

Solve the following AIME problem\. The final answer is a non\-negative integer between 0 and 999\.

Think step by step\. Keep the reasoning concise but complete\. At the very end, output ONLY the final integer inside a LaTeX box, e\.g\., \\boxed\{42\}\. Do not wrap anything except the integer answer inside \\boxed\{\}\.

Problem: \{question\}

MATH500 Prompt

Solve the following math problem\.

Please do your derivation in LaTeX as much as possible\. Keep the reasoning concise but complete\. At the end, put only the final answer in the LAST \\boxed\{\.\.\.\}\. If the final answer is a fraction, use LaTeX fraction form like \\frac\{a\}\{b\} \(not decimal approximation unless required\)\.

Problem: \{question\}

MMLU\-Pro Prompt

Answer the following multiple\-choice \{subject\} question\. Exactly one option is correct\.

Think step by step\. Keep the reasoning concise but complete\. At the very end, output ONLY the letter of the correct option inside a LaTeX box, e\.g\., \\boxed\{C\}\. Do not wrap anything except the single letter A\-\-J inside \\boxed\{\}\.

Question: \{question\}

Options: \{options\}
HumanEval Prompt

Complete the following Python function\. Keep the given signature and docstring unchanged; only fill in the body \(you may add helper functions above it if needed\)\. Your function must be named exactly \{entry\_point\}\.

‘‘‘python
\{question\}
‘‘‘

You may reason briefly first, but only the code will be executed against hidden unit tests\. Put your FINAL complete function inside a single fenced block:

‘‘‘python
\# include the full def \{entry\_point\}\(\.\.\.\) here, plus any helpers it depends on
‘‘‘

Do not output any other code fence\. No commentary after the code block\.

ANSWER:
ALFWorld Prompt

Imagine you are an intelligent agent in a household environment, and your goal is to perform actions to complete the task\. At the beginning of the interaction, you will be given a detailed description of the environment and the goal\.

For each turn, think briefly and then output your next action\.

The available actions are:

- go to \{recep\}
- take \{obj\} from \{recep\}
- move \{obj\} to \{recep\}
- open \{recep\}
- close \{recep\}
- use \{obj\}
- clean \{obj\} with \{recep\}
- heat \{obj\} with \{recep\}
- cool \{obj\} with \{recep\}

where \{obj\} and \{recep\} correspond to objects and receptacles\.

After each turn, the environment will provide feedback\. If the feedback is ‘‘Nothing happened’’, the previous action is invalid and you should try a different action\.

Your response must follow this format:

- Thought:
- Action:
APIBench Prompt

You are a helpful API writer who can write APIs based on requirements\.

\{question\}

Write a Python program in 1 to 2 lines to call API in \{framework\}\.

The answer should follow the format: `<<<domain>>>` $DOMAIN, `<<<api_call>>>`: $API\_CALL, `<<<api_provider>>>`: $API\_PROVIDER, `<<<explanation>>>`: $EXPLANATION, `<<<code>>>`: $CODE\. Here are the requirements:

1\. $DOMAIN should be inferred from the task description\.
2\. $API\_CALL should have only one line of code that calls the API\.
3\. $API\_PROVIDER should be the programming framework used\.
4\. $EXPLANATION should be a step\-by\-step explanation\.
5\. $CODE is the Python code\.
6\. Do not repeat the format in your answer\.
## Appendix F Case Study: Useful Memory State Destroyed by Subsequent Updates
This case study isolates a representative failure mode in sequentially evolving memory systems: a memory state that is initially useful gradually becomes corrupted or overwritten by later updates, eventually reducing downstream performance despite continued exposure to in\-distribution data\. We analyze ExpeL\-MT on the MMLU\-Pro\-Engineering benchmark using Qwen3\-8B as the backbone model, and focus on a single hold\-out question on which the model initially answers correctly and progressively fails as the memory evolves\. The full reconstruction is based on intermediate memory snapshots, retrieved trajectories, and checkpoint\-level evaluation logs\.
Hold\-out evaluation is conducted with max\-tries=1 so that each checkpoint reflects the quality of the memory state itself rather than additional retry\-and\-reflect behavior\.
### F\.1 The Hold\-out Question
Question 87: A $1\frac{1}{2}$\-in schedule\-40 steam pipe is laid in the atmosphere where the temperature is $50^{\circ}$F\. The steam inside it is saturated at 100 psia\. Consider the pipe to be a grey body and uninsulated\. The coefficient of heat transfer by natural convection from the outside surface is 2\.0 $\mathrm{Btu/(hr\cdot ft^{2}\cdot{}^{\circ}R)}$\. Calculate the amount of steam condensed per hour per unit length of pipe\.

A\. 0\.65 B\. 0\.50 C\. 0\.70 D\. 0\.55 E\. 0\.40 F\. 0\.80 G\. 0\.75 H\. 0\.90 I\. 1\.00 J\. 0\.60
The question belongs to the Heat Transfer / Thermodynamics category\. The standard solution accounts for both natural convection and thermal radiation from the pipe surface\. At 100 psia, saturated steam has $T_s\approx 327.8^{\circ}$F and $h_{fg}\approx 880~\mathrm{Btu/lbm}$\. Using the 1\.5\-in schedule\-40 outside diameter $D_o\approx 1.90$ in $=0.158~\mathrm{ft}$, the outside area per foot is

$$A=\pi D_o L\approx\pi\cdot 0.158\cdot 1=0.497~\mathrm{ft^{2}/ft}.$$

The convective heat loss is

$$\dot{Q}_{\mathrm{conv}}=hA(T_s-T_\infty)=2.0\cdot 0.497\cdot(327.8-50)\approx 276~\mathrm{Btu/(hr\cdot ft)}.$$

Because the pipe is a grey body and uninsulated, radiation must also be included, with both temperatures expressed on an absolute scale:

$$\dot{Q}_{\mathrm{rad}}=\epsilon\sigma A(T_s^{4}-T_\infty^{4}).$$

Using a typical grey\-body emissivity for an oxidized metal surface, the radiative loss is on the same order as convection, yielding a total heat loss of about

$$\dot{Q}_{\mathrm{tot}}\approx 530~\mathrm{Btu/(hr\cdot ft)}.$$

Therefore,

$$\dot{m}=\frac{\dot{Q}_{\mathrm{tot}}}{h_{fg}}\approx\frac{530}{880}\approx 0.60~\mathrm{lb/(hr\cdot ft)}.$$

Thus, the standard answer is $\boxed{\text{J}}$\.
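The arithmetic above can be checked with a short script\. This is a sketch under stated assumptions: the emissivity value \(0\.9\) is not given in the text and is chosen here only as a plausible value for an oxidized metal surface, the pipe surface is taken to be at the steam temperature, and the Stefan\-Boltzmann constant is used in English units\.

```python
import math

SIGMA = 0.1714e-8  # Stefan-Boltzmann constant, Btu/(hr*ft^2*R^4)

# Answer options from the hold-out question.
OPTIONS = {"A": 0.65, "B": 0.50, "C": 0.70, "D": 0.55, "E": 0.40,
           "F": 0.80, "G": 0.75, "H": 0.90, "I": 1.00, "J": 0.60}


def steam_condensed_per_ft(emissivity=0.9, h_conv=2.0, t_steam_f=327.8,
                           t_air_f=50.0, d_out_in=1.90, h_fg=880.0):
    """Return (q_conv, q_rad, m_dot) per foot of pipe.

    emissivity is an assumed value; the other defaults are the numbers quoted
    in the worked solution. Heat rates are in Btu/(hr*ft), m_dot in lb/(hr*ft).
    """
    area = math.pi * (d_out_in / 12.0)             # outside area per foot, ft^2/ft
    q_conv = h_conv * area * (t_steam_f - t_air_f)
    ts_r = t_steam_f + 459.67                      # Fahrenheit to Rankine
    ta_r = t_air_f + 459.67
    q_rad = emissivity * SIGMA * area * (ts_r ** 4 - ta_r ** 4)
    return q_conv, q_rad, (q_conv + q_rad) / h_fg


def closest_option(value):
    """Letter of the answer option numerically closest to value."""
    return min(OPTIONS, key=lambda letter: abs(OPTIONS[letter] - value))


q_conv, q_rad, m_dot = steam_condensed_per_ft()
```

With the assumed emissivity, the radiative term comes out comparable in size to the convective term and `closest_option(m_dot)` selects option J\. Dropping radiation \(`emissivity=0`\) leaves only the convective loss, which lands well below 0\.60 and illustrates why omitting it leads to a wrong option\.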
### F\.2 Memory Evolution and Prediction Transition
We reconstruct the full 20\-slot ExpeL\-MT insight pool from intermediate memory snapshots to analyze how thermo\-related rules evolve throughout the sequential stream\. A rule is considered thermo\-related if it contains terms such as steam, condensation, enthalpy, heat transfer, latent heat, or $h_{fg}$\. Representative early\-stage rules include:
- “Calculate heat transferred using final and initial enthalpy, considering temperature changes, specific heat capacities, and molecular weight for unit conversion\.”
- “Calculate exit temperature of a fluid in a heat exchanger…”
- “Calculate initial quality…”
During the early phase of the stream, which primarily contains Thermodynamics and Heat Transfer questions, the bounded insight pool gradually accumulates thermo\-related rules\. However, after the stream transitions to electrical engineering and communication topics, newly inserted rules progressively overwrite the earlier thermo\-related insights\.
ExpeL\-MT relies on two memory sources during inference: \(1\) retrieved top\-kktrajectories from the experience pool, and \(2\) an evolving insight pool produced through reflection and summarization\. To identify the source of failure, we reconstruct both components across intermediate memory stages\. Table[8](https://arxiv.org/html/2605.15384#A6.T8)summarizes the prediction trajectory together with the number of thermo\-related rules retained in memory\. Prediction correctness strongly correlates with thermo\-rule retention: the model remains mostly correct while thermo\-related insights are preserved, but becomes consistently unstable once these rules disappear afterK=582K=582\.
Importantly, retrieval quality does not degrade over time\. Both early and late checkpoints retrieve highly relevant steam\-condensation trajectories, and the retrieval similarity score increases from $0.678$ to $0.753$\. Thus, the retrieved top\-$k$ trajectories remain semantically relevant throughout the stream, indicating that the downstream degradation cannot be attributed to retrieval failure\.
Instead, the failure originates from corruption within the evolving insight pool\. As thermo\-related insights are progressively replaced by unrelated electrical\-engineering rules, the model loses the thermo\-specific contextual anchors required for stable reasoning\. The transition approximately coincides with the disappearance of thermo\-related rules from the bounded insight pool\. By the final checkpoint, all 20 insight slots are occupied by unrelated rules involving SCR currents, signal\-processing formulas, torque equations, capacitance relations, and Reynolds\-number similarity constraints\. These observations suggest that the degradation is primarily caused by destructive memory updates rather than insufficient retrieval quality or backbone reasoning capability\.
Table 8: Prediction trajectory of the hold\-out thermodynamics question under evolving ExpeL\-MT memory states\.

| Checkpoint $K$ | Pred\. | Correct? | \# Thermo Rules | Model Behavior |
|---|---|---|---|---|
| 97 | J | ✓ | 5 | Uses correct thermo constants and computes $\dot{m} \approx 0.60$, mapped to J\. |
| 194 | D | ✗ | 6 | Mis\-applies LMTD and obtains an incorrect temperature difference\. |
| 291 | J | ✓ | 4 | Same reasoning path as $K=97$\. |
| 388 | J | ✓ | 3 | Correct thermo grounding preserved\. |
| 485 | J | ✓ | 1 | Final checkpoint with stable thermo\-context grounding\. |
| 582 | A | ✗ | 0 | Thermo\-related insights disappear; latent heat value drifts and $\dot{m}$ shifts downward\. |
| 679 | E | ✗ | 0 | Saturation temperature also drifts after thermo\-context removal\. |
| 776 | C | ✗ | 0 | Partial recovery of thermo constants but incorrect option selection\. |
| 873 | B | ✗ | 0 | Prediction remains unstable after complete thermo\-context overwrite\. |
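The relationship between correctness and rule retention in Table 8 can be tallied directly\. The sketch below transcribes the table rows and compares accuracy at checkpoints that retain at least one thermo\-related rule against checkpoints that retain none\. It is descriptive only: the rows cover a single hold\-out question, and the helper name `accuracy` is illustrative\.

```python
# (checkpoint K, prediction, correct?, number of thermo rules),
# transcribed from Table 8.
TABLE_8 = [
    (97, "J", True, 5),
    (194, "D", False, 6),
    (291, "J", True, 4),
    (388, "J", True, 3),
    (485, "J", True, 1),
    (582, "A", False, 0),
    (679, "E", False, 0),
    (776, "C", False, 0),
    (873, "B", False, 0),
]

def accuracy(rows):
    """Fraction of checkpoints at which the prediction is correct."""
    return sum(correct for _, _, correct, _ in rows) / len(rows)

with_rules = [row for row in TABLE_8 if row[3] > 0]
without_rules = [row for row in TABLE_8 if row[3] == 0]

print(accuracy(with_rules), accuracy(without_rules))  # 0.8 0.0
```

The one error among the rule\-retaining checkpoints is $K=194$, which is why the text describes the model as mostly, rather than always, correct in the early phase\.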
### F\.3 Failure Mechanism and Implication
The single\-item trajectory closely mirrors the aggregate hold\-out behavior across all 96 Engineering questions: strong early improvement is followed by gradual erosion and eventual regression toward the no\-memory baseline\. The case study therefore reflects a broader structural failure mode rather than an isolated prediction error\.
This degradation emerges from the interaction between bounded memory capacity and sequential subject\-localized updates\. In ExpeL\-MT, the insight pool is capped at $\texttt{max-num-rules} = 20$\. Once the pool becomes saturated, newly generated rules implicitly replace or edit earlier entries\. However, the eviction process does not estimate downstream utility, subject importance, or future dependency\. As a result, memory retention becomes dominated by recency rather than long\-term usefulness\.
The effect is amplified by the subject\-blocked ordering of the online stream\. Neighboring tasks frequently belong to the same subject category, causing memory updates to become highly concentrated within local domains\. During the early phase of the stream, thermo\-related rules accumulate and stabilize prediction behavior on the hold\-out Heat Transfer question\. However, after the stream transitions into electrical\-engineering and communication topics, newly inserted rules progressively overwrite the earlier thermo\-related insights\.
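The interaction between a capped pool and a subject\-blocked stream can be illustrated with a toy simulation\. This is a sketch under stated assumptions, not ExpeL\-MT's actual update rule: eviction is modeled as strict first\-in\-first\-out as a stand\-in for recency\-dominated replacement, each step inserts exactly one rule, and the stream composition is hypothetical\. Only the cap of 20 comes from the text\.

```python
from collections import deque

MAX_NUM_RULES = 20  # pool cap stated in the text

def simulate(stream, capacity=MAX_NUM_RULES):
    """Yield the number of 'thermo' rules in the pool after each insertion.

    Assumption: strict FIFO eviction, a simplified stand-in for
    recency-dominated replacement (not ExpeL-MT's real update logic).
    """
    pool = deque(maxlen=capacity)  # a full deque drops its oldest entry
    for subject in stream:
        pool.append(subject)
        yield sum(1 for s in pool if s == "thermo")

# Hypothetical subject-blocked stream: thermo rules first, then electrical.
stream = ["thermo"] * 25 + ["electrical"] * 25
counts = list(simulate(stream))

# Thermo rules held after 25, 35, and 45 insertions.
print(counts[24], counts[34], counts[44])  # 20 10 0
```

Under these assumptions the thermo count falls to zero exactly 20 insertions after the subject switch, regardless of how useful the evicted rules were\.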
Importantly, the failure is not caused by retrieval degradation\. Retrieved top\-$k$ trajectories remain highly relevant throughout the stream, and retrieval similarity even increases over time\. Instead, the degradation originates from corruption within the bounded insight pool itself\. Once thermo\-related rules disappear from memory, the model loses the thermo\-specific contextual anchors required for stable reasoning, leading to drifting constants, unstable intermediate calculations, and eventually persistent prediction failure\.
Consequently, memory updates become competitive rather than accumulative: later subjects overwrite previously useful subject\-specific context instead of consolidating it\. Together, these dynamics produce a clear form of catastrophic forgetting through bounded memory eviction\.
More broadly, this case suggests that long\-horizon memory systems likely require utility\-aware retention, adaptive eviction policies, subject\-balanced allocation, or retrieval\-time filtering over larger persistent stores\. Otherwise, useful contextual anchors accumulated early in the stream may be systematically overwritten by later updates, leading to transient gains followed by irreversible degradation\.
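As one possible reading of utility\-aware retention, the sketch below evicts the lowest\-scoring rule instead of the oldest one\. Everything here is hypothetical: the `Rule` class, the `utility` field, and `insert_with_utility_eviction` are illustrative names, and how to estimate a rule's utility is left open\.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    text: str
    utility: float  # hypothetical score; how to estimate it is left open

def insert_with_utility_eviction(pool, new_rule, capacity=20):
    """Return a new pool containing new_rule, evicting by lowest utility.

    The incoming rule is itself an eviction candidate, so a low-value
    newcomer cannot displace a more useful older rule.
    """
    candidates = pool + [new_rule]
    if len(candidates) <= capacity:
        return candidates
    victim = min(candidates, key=lambda rule: rule.utility)
    candidates.remove(victim)
    return candidates

# Toy pool with capacity 2; the scores are placeholders, not measurements.
pool = [Rule("thermo rule", 0.9), Rule("electrical rule A", 0.2)]
pool = insert_with_utility_eviction(pool, Rule("electrical rule B", 0.5), capacity=2)
print([rule.text for rule in pool])  # ['thermo rule', 'electrical rule B']
```

In this toy case the older thermo rule survives because its score is higher than that of either electrical rule, whereas recency\-based eviction would have removed it first\.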
## Appendix G Limitations
#### Task\-order sensitivity\.
A limitation of our current study is that most of the main experiments are conducted under a fixed task order for each dataset\. This choice enables controlled comparison across memory methods, but it does not fully characterize uncertainty induced by different sequential orders\. In principle, SeqMem\-Eval can be applied to multiple randomized task streams to report confidence intervals for the metrics\. In practice, this is computationally expensive because each task order requires rerunning the full sequential memory process, including retrieval, memory updates, hold\-out evaluation at intermediate checkpoints, and retrospective evaluation for BWT and forgetting\. For LLM\-based memory methods such as reflection\- or synthesis\-based approaches, this cost is further amplified by additional API calls and generated tokens\. We therefore leave large\-scale multi\-order uncertainty estimation as an important direction for future benchmark extensions\.
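A multi\-order protocol of this kind could be sketched as follows\. This is an illustration under assumptions, not part of the reported experiments: `shuffled_orders` and `mean_with_ci` are hypothetical helpers, the interval uses a normal approximation, and the metric values in the usage lines are placeholders rather than results\.

```python
import random
import statistics

def shuffled_orders(tasks, num_orders, seed=0):
    """Return num_orders independently shuffled copies of the task list."""
    rng = random.Random(seed)
    orders = []
    for _ in range(num_orders):
        order = list(tasks)
        rng.shuffle(order)
        orders.append(order)
    return orders

def mean_with_ci(values, z=1.96):
    """Mean and normal-approximation 95% interval across task orders."""
    mean = statistics.mean(values)
    if len(values) < 2:
        return mean, (mean, mean)
    half_width = z * statistics.stdev(values) / len(values) ** 0.5
    return mean, (mean - half_width, mean + half_width)

# Placeholder metric values, one per task order; NOT results from the paper.
values = [0.5, 0.6, 0.7]
mean, (ci_low, ci_high) = mean_with_ci(values)
print(f"{mean:.3f} [{ci_low:.3f}, {ci_high:.3f}]")
```

Each value would come from one full sequential run, which is the source of the cost discussed above\. With only a handful of orders, a Student\-t interval would be wider and more appropriate than the normal approximation used here\.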
#### Scope of memory mechanisms\.
Our evaluation focuses primarily on prompt\-based sequentially evolving memory, where the base LLM remains fixed and memory is updated through retrieval, summarization, reflection, workflows, or other external text\-based mechanisms\. This scope matches many current memory\-augmented agent systems and allows us to compare diverse methods under a unified test\-time protocol\. However, another emerging direction is training\-based evolution, where experience is used not only to update an external memory state but also to improve policies, skills, or model behavior through learning\. For example, recent work such as SkillRL \[[39](https://arxiv.org/html/2605.15384#bib.bib22)\] studies recursive skill\-augmented reinforcement learning, where a skill library co\-evolves with the agent’s policy during RL optimization\. Such training\-based approaches may exhibit different memory dynamics, including longer\-term policy adaptation, different forms of transfer, and different efficiency trade\-offs\. While SeqMem\-Eval’s diagnostic perspective may still be useful for analyzing these systems, our current experiments do not directly evaluate training\-based evolving agents\. Extending the framework to jointly assess external memory evolution and parameter\- or policy\-level adaptation is an important direction for future work\.
#### Evaluation cost of diagnostic metrics\.
Although SeqMem\-Eval provides a more fine\-grained view of memory behavior, some of its metrics require additional evaluation beyond standard online accuracy\. For example, hold\-out generalization requires evaluating intermediate memory states on unseen tasks, while BWT and forgetting require revisiting previously encountered tasks under later memory states\. These retrospective evaluations increase the total number of LLM calls, especially for long task streams or methods with expensive memory updates\. In this work, we mitigate this cost by evaluating BWT and forgetting over selected temporal horizons and by sampling memory checkpoints for hold\-out evaluation\. Future work could explore more efficient approximations, such as adaptive checkpoint selection, smaller diagnostic probes, or surrogate estimators for memory quality\.
## Appendix H Broader Impacts
In this paper, we propose SeqMem\-Eval, a diagnostic evaluation framework for sequential LLM memory\. Our work aims to provide a more comprehensive understanding of how memory\-augmented LLMs behave over time\. We believe better evaluation of memory behavior may help identify failure modes before deployment and support safer use of LLM agents in long\-term interactive settings\. While we emphasize the importance of responsible use, we do not anticipate any major negative societal impacts from our work\.