ElasticMem: Latent Memory as a Learnable Resource for LLM Agents
Summary
ElasticMem introduces a learnable latent memory mechanism for LLM agents that adaptively allocates variable budgets to retrieved memories, improving performance on memory-intensive QA and embodied agent tasks while reducing token costs.
# ElasticMem: Latent Memory as a Learnable Resource for LLM Agents
Source: [https://arxiv.org/html/2605.30690](https://arxiv.org/html/2605.30690)
Tao Feng¹, Chongrui Ye¹, Tianyang Luo¹, Jingjun Xu¹, Xueqiang Xu¹, Haozhen Zhang², Ge Liu¹, Jiaxuan You¹

¹University of Illinois Urbana-Champaign, ²Nanyang Technological University
###### Abstract
Long-term memory is essential for LLM agents to reason coherently across extended interactions, personalize responses, and reuse past experience. However, existing memory-augmented methods typically treat memory as a fixed resource: text-space approaches concatenate retrieved memories into the context window, causing substantial token overhead and sensitivity to noisy evidence, while latent-space approaches reduce textual cost but still rely on rigid retrieval or fixed-capacity memory interfaces. This creates a mismatch between query-dependent memory utility and fixed memory allocation. We propose ElasticMem, a memory-augmented LLM framework that learns to use memory as an elastic latent resource. ElasticMem builds an offline latent memory bank with retrieval keys and content caches, retrieves memories adaptively from the reasoner’s hidden state, assigns each retrieved memory a variable latent budget through a learned policy, and injects selected latent states as soft memory tokens for generation. The full memory-use process is optimized with downstream task rewards through group-relative policy optimization. We evaluate ElasticMem on MemorySuite, covering memory-intensive QA and embodied agent control. Across Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct backbones, ElasticMem improves weighted average QA accuracy by 26.2% and 24.6%, and improves ALFWorld success rate by 66.3% and 27.2%, respectively, over the strongest baselines, while achieving the lowest ALFWorld token cost. Ablations and qualitative analyses further show that adaptive retrieval and elastic budget allocation help ElasticMem prioritize useful evidence and transferable plans beyond rigid cosine similarity. Our code for ElasticMem will be released at [https://github.com/ulab-uiuc/ElasticMem](https://github.com/ulab-uiuc/ElasticMem).
## 1 Introduction
Long-term memory has become essential for enabling large language model (LLM) agents to reason coherently across extended interactions, personalize responses across sessions, and reuse past experience in long-horizon tasks (Park et al., [2023](https://arxiv.org/html/2605.30690#bib.bib44); Shinn et al., [2023](https://arxiv.org/html/2605.30690#bib.bib42); Zhong et al., [2024](https://arxiv.org/html/2605.30690#bib.bib13); Wang et al., [2023](https://arxiv.org/html/2605.30690#bib.bib41); Hu et al., [2025](https://arxiv.org/html/2605.30690#bib.bib91); Zhang et al., [2025b](https://arxiv.org/html/2605.30690#bib.bib92)). In realistic settings, an agent may need to retrieve user preferences from past conversations (Sun et al., [2025](https://arxiv.org/html/2605.30690#bib.bib93); Pan et al., [2025](https://arxiv.org/html/2605.30690#bib.bib94)), track facts scattered across long dialogue histories (Du et al., [2025](https://arxiv.org/html/2605.30690#bib.bib95); Tavakoli et al., [2025](https://arxiv.org/html/2605.30690#bib.bib96)), or reuse procedural experience from previous trajectories (Zheng et al., [2023](https://arxiv.org/html/2605.30690#bib.bib97); Mi et al., [2026](https://arxiv.org/html/2605.30690#bib.bib98)). Thus, memory is not merely an auxiliary context source; it is a core mechanism for extending LLM agents beyond the current context window (Yan et al., [2025](https://arxiv.org/html/2605.30690#bib.bib100)).
Despite this importance, existing memory-augmented LLM systems largely treat memory as a *fixed* resource, as shown in Table [1](https://arxiv.org/html/2605.30690#S1.T1). Text-space memory methods store memories as natural-language records and retrieve, summarize, update, or organize them before injecting selected memories into the prompt (Zhong et al., [2024](https://arxiv.org/html/2605.30690#bib.bib13); Xu et al., [2025a](https://arxiv.org/html/2605.30690#bib.bib14); Fang et al., [2025a](https://arxiv.org/html/2605.30690#bib.bib15); Chhikara et al., [2025](https://arxiv.org/html/2605.30690#bib.bib16); Kang et al., [2025](https://arxiv.org/html/2605.30690#bib.bib17); Fang et al., [2025b](https://arxiv.org/html/2605.30690#bib.bib18); LangChain, [2024](https://arxiv.org/html/2605.30690#bib.bib19)). While interpretable, this paradigm couples memory use with prompt length: irrelevant or redundant memories still consume context tokens, and useful memories may be diluted by noisy textual evidence. Latent-space memory methods reduce textual overhead by compressing memory into continuous vectors or soft tokens (Zhang et al., [2025a](https://arxiv.org/html/2605.30690#bib.bib20); Chevalier et al., [2023](https://arxiv.org/html/2605.30690#bib.bib90); Wang et al., [2025](https://arxiv.org/html/2605.30690#bib.bib22)). However, they typically expose a rigid memory interface: retrieval is often based on fixed similarity scores, and each retrieved or compressed memory receives a uniform or largely pre-defined latent capacity (Karpukhin et al., [2020](https://arxiv.org/html/2605.30690#bib.bib101); Reimers and Gurevych, [2019](https://arxiv.org/html/2605.30690#bib.bib106); Zhang et al., [2026](https://arxiv.org/html/2605.30690#bib.bib102)). As a result, a memory chunk that is lexically similar to the query may be retrieved even if it provides little useful evidence, while a less similar chunk containing the key fact or transferable plan may be underused.
This creates a central mismatch: memory usefulness is query\-dependent and task\-dependent, but both memory retrieval and memory capacity are often determined by fixed design choices\.
This mismatch motivates a different view of memory use. Rather than asking how to retrieve more memories or compress all memories equally, we ask: how can an LLM agent learn to retrieve and allocate memory adaptively, so that useful memories receive more representational budget while unhelpful memories are compressed or suppressed? Answering this question requires addressing three challenges. First, retrieval should adapt to the current query and the reasoner’s internal state. Most existing memory systems rely on rigid similarity-based retrieval, where memories are selected by a fixed encoder or surface semantic matching. However, semantic similarity does not necessarily indicate downstream utility: a memory can be topically similar but unhelpful, while a less similar memory may contain the key evidence or transferable plan (Gan et al., [2024](https://arxiv.org/html/2605.30690#bib.bib99); Dai et al., [2025](https://arxiv.org/html/2605.30690#bib.bib107); Xu et al., [2025b](https://arxiv.org/html/2605.30690#bib.bib108); Shi et al., [2023](https://arxiv.org/html/2605.30690#bib.bib109); Liu et al., [2024b](https://arxiv.org/html/2605.30690#bib.bib110)). Second, memory capacity should be allocated adaptively. Different retrieved chunks play different roles: some contain direct evidence, some provide reusable procedures, some are redundant, and some are misleading. A fixed latent budget cannot distinguish these cases (Yan et al., [2024](https://arxiv.org/html/2605.30690#bib.bib103)). Third, retrieval and allocation should be learned from task outcomes. Since the usefulness of a memory is only revealed after generation or interaction, the memory system should learn which memories to retrieve and how much capacity to assign based on downstream feedback rather than similarity scores alone (Asai et al., [2023](https://arxiv.org/html/2605.30690#bib.bib104); Gao et al., [2024](https://arxiv.org/html/2605.30690#bib.bib105)).
Table 1: Comparison with representative memory-augmented LLM paradigms. Text-space memory methods retrieve and inject natural-language memories, which are interpretable but token-expensive. Existing latent-space methods reduce textual overhead, but they still rely on fixed-capacity memory interfaces or rigid retrieval mechanisms. ElasticMem performs reasoner-state adaptive retrieval and allocates elastic latent capacity to each retrieved memory.

| Paradigm | Representative Methods | Memory Space | Token Efficient | Reasoner-State Adaptive Retrieval | Elastic Capacity |
| --- | --- | --- | --- | --- | --- |
| Textual long-term memory | MemoryBank (Zhong et al., [2024](https://arxiv.org/html/2605.30690#bib.bib13)) / Mem0 (Chhikara et al., [2025](https://arxiv.org/html/2605.30690#bib.bib16)) / MemoryOS (Kang et al., [2025](https://arxiv.org/html/2605.30690#bib.bib17)) | Text | ✗ | ✗ | ✗ |
| Textual agent memory | A-MEM (Xu et al., [2025a](https://arxiv.org/html/2605.30690#bib.bib14)) / LightMem (Fang et al., [2025a](https://arxiv.org/html/2605.30690#bib.bib15)) / MemP (Fang et al., [2025b](https://arxiv.org/html/2605.30690#bib.bib18)) | Text | ✗ | ✗ | ✗ |
| Fixed latent compression | AutoCompressor (Chevalier et al., [2023](https://arxiv.org/html/2605.30690#bib.bib90)) | Latent | ✓ | ✗ | ✗ |
| Latent memory retrieval | MemGen (Zhang et al., [2025a](https://arxiv.org/html/2605.30690#bib.bib20)) / M+ (Wang et al., [2025](https://arxiv.org/html/2605.30690#bib.bib22)) | Latent | ✓ | ✗ | ✗ |
| Elastic latent memory | ElasticMem | Latent | ✓ | ✓ | ✓ |
To address these challenges, we propose ElasticMem, a memory-augmented LLM framework that treats memory as an elastic latent resource. ElasticMem first constructs an offline latent memory bank by encoding each memory chunk into a retrieval key and a content cache. At inference time, instead of retrieving memories with a fixed external retriever, ElasticMem derives a query-conditioned retrieval state from the LoRA-adapted reasoner’s hidden representation after sampling a retrieval-control token. This enables retrieval to adapt to the current query and the model’s internal reasoning state, rather than relying only on rigid similarity matching. After retrieving candidate memories, ElasticMem uses a lightweight Transformer budget policy to assign each retrieved chunk a variable number of latent tokens. Useful or evidence-bearing memories can receive larger latent budgets, while redundant or misleading memories can receive few tokens or be suppressed entirely. The selected latent states are then projected into soft memory tokens and injected into the reasoner for generation. Crucially, ElasticMem does not train retrieval, allocation, and generation as isolated modules. Instead, it jointly optimizes the memory-use process with downstream task rewards through group-relative policy optimization (Shao et al., [2024](https://arxiv.org/html/2605.30690#bib.bib29); Guo et al., [2025](https://arxiv.org/html/2605.30690#bib.bib65)). The reward signal supervises the retrieval-control decision, the memory-budget allocations, the latent projector, and the LoRA-adapted (Hu et al., [2022](https://arxiv.org/html/2605.30690#bib.bib68)) reasoner. This design aligns memory management with actual task utility: the model learns not only what to answer, but also which memories to retrieve and how much representational capacity each memory deserves.
We evaluate ElasticMem on MemorySuite, a memory-oriented evaluation suite covering two complementary settings: MemorySuite-QA, including PersonaMem-32K, PersonaMem-128K, LoCoMo, and LongMemEval for personalized memory recall, long-dialogue understanding, temporal reasoning, and long-context memory evaluation (Jiang et al., [2025a](https://arxiv.org/html/2605.30690#bib.bib1); Maharana et al., [2024](https://arxiv.org/html/2605.30690#bib.bib10); Wu et al., [2024](https://arxiv.org/html/2605.30690#bib.bib11)); and MemorySuite-Agentic, which uses ALFWorld to evaluate memory-augmented embodied decision-making (Shridhar et al., [2020](https://arxiv.org/html/2605.30690#bib.bib12)). Across both Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct backbones, ElasticMem consistently outperforms strong text-space and latent-space memory baselines. On MemorySuite-QA, ElasticMem improves the weighted average accuracy by 26.2% with Qwen2.5-3B-Instruct and by 24.6% with Qwen2.5-7B-Instruct over the strongest baseline. On ALFWorld, ElasticMem improves the weighted average success rate by 66.3% with Qwen2.5-3B-Instruct and by 27.2% with Qwen2.5-7B-Instruct. Notably, ElasticMem also achieves the lowest token cost on ALFWorld among all compared methods, indicating that its gains do not come from longer interaction trajectories or excessive memory use. Ablation studies further confirm the importance of adaptive retrieval, elastic budget allocation, and a moderate per-chunk capacity limit, while qualitative analyses show that ElasticMem learns to allocate larger budgets to memories that contain useful evidence or transferable plans, even when they are not the most similar chunks under cosine retrieval.
## 2 Preliminaries
### 2.1 Memory-Augmented Language Models
We consider a memory-augmented language model that answers a query $q$ with access to an external memory corpus $\mathcal{M}=\{m_i\}_{i=1}^{N}$. Each memory item $m_i$ is a chunk of long-term context, such as a dialogue segment, a document passage, or a procedural skill card. A standard retrieval-augmented system first embeds the query and memory chunks, retrieves the top-ranked chunks, and then prepends the retrieved content to the model input:

$$\mathcal{R}(q)=\operatorname{TopK}_{m_i\in\mathcal{M}}\operatorname{sim}(e_q,e_i),\tag{1}$$

where $e_q$ and $e_i$ denote the query and memory embeddings, respectively.
This retrieval\-and\-concatenation paradigm provides a simple mechanism for long\-term memory access, but it imposes a rigid interface between memory and generation\. The system typically retrieves a fixed number of chunks, injects them as plain text, and allocates context\-window capacity in proportion to their original token lengths rather than their query\-specific utility\. Consequently, irrelevant or redundant chunks may consume substantial context, while compact but crucial evidence may receive insufficient representational capacity\.
### 2.2 Latent Memory Interfaces
An alternative is to represent memory using continuous latent vectors rather than natural-language text. Let $E(\cdot)$ denote the token embedding layer of an LLM with hidden dimension $d$. A latent or soft memory token $z\in\mathbb{R}^{d}$ can be inserted directly into the input embedding sequence:

$$X=\bigl[E(x_1),\ldots,E(x_T),z_1,\ldots,z_B\bigr].\tag{2}$$

Such latent interfaces avoid re-rendering retrieved memories as text and can reduce prompt overhead. However, existing latent-memory methods often still expose a fixed-capacity interface: each retrieved chunk is assigned the same number of latent tokens regardless of its relevance, redundancy, or information density. This motivates a more flexible memory interface. Ideally, a model should adapt memory use at two levels: it should select the memory chunks that are useful for the current query, and it should allocate more latent capacity to chunks that are more informative for solving the task.
## 3 ElasticMem: Memory Use as a Learnable Resource
### 3.1 Overview
Figure 1: Overview of ElasticMem. ElasticMem learns to use long-term memory as an elastic latent resource. (1) Latent memory bank construction. Memory chunks from dialogues, passages, and skill cards are encoded once by a frozen offline LLM encoder. Each chunk is stored as a retrieval key and a latent content cache, forming a read-only memory bank $\mathcal{B}$ that is not updated during training. (2) Query-conditioned elastic memory retrieval and allocation. For each query, the LoRA-adapted reasoner generates a retrieval-control token and derives a query state for retrieving top-ranked memory chunks. A budget policy then assigns each retrieved chunk a variable latent budget $b_j\in\{0,\ldots,B_{\max}\}$, allowing useful chunks to receive more capacity while weak or redundant chunks are compressed or suppressed. (3) Latent memory injection. The selected cached states are projected into soft memory tokens and inserted into the final model input before generation. (4) Reward-based optimization. Task rewards optimize the LoRA-adapted reasoner, budget policy, and latent projector through GRPO, while the latent memory bank remains frozen.

We propose ElasticMem, a memory-augmented LLM framework that treats memory use as a learnable resource allocation problem. Given a query $q$ and a memory corpus $\mathcal{M}$, ElasticMem decides not only which memories to retrieve, but also how much latent capacity each retrieved memory should receive. This design is motivated by the observation that memory usefulness is highly query-dependent. Some retrieved chunks contain direct evidence or transferable plans and require more latent capacity, while others are only superficially similar, redundant, or irrelevant and should receive little or no capacity.
As shown in Figure [1](https://arxiv.org/html/2605.30690#S3.F1), ElasticMem consists of four stages. First, it builds an offline latent memory bank by encoding each memory chunk into a retrieval key and a content cache. Second, it performs query-conditioned retrieval using the reasoner’s own hidden state rather than a standalone retrieval encoder. Third, it uses a lightweight Transformer budget policy to assign a variable number of latent tokens to each retrieved chunk. Fourth, it injects the selected latent states as soft memory tokens and jointly optimizes retrieval control, budget allocation, latent projection, and generation with downstream task rewards. The full training procedure is summarized in Appendix [C](https://arxiv.org/html/2605.30690#A3).
### 3.2 Latent Memory Bank Construction
We partition the memory corpus into $N$ chunks $\mathcal{M}=\{m_i\}_{i=1}^{N}$. Each chunk is encoded once offline by a frozen base LLM encoder $f_{\theta_0}$:

$$H_i=f_{\theta_0}(m_i)\in\mathbb{R}^{L_i\times d},\tag{3}$$

where $L_i$ is the length of chunk $m_i$ and $d$ is the hidden dimension. From the hidden states $H_i$, we extract two representations:

$$k_i=H_i[-1]\in\mathbb{R}^{d},\tag{4}$$

$$C_i=H_i[-N_c:]\in\mathbb{R}^{N_c\times d}.\tag{5}$$

The final hidden state $k_i$ serves as a compact retrieval key, while $C_i$ stores the last $N_c$ hidden states as a latent content cache. The resulting memory bank is

$$\mathcal{B}=\{(m_i,k_i,C_i)\}_{i=1}^{N}.\tag{6}$$

Since $\mathcal{B}$ is constructed offline and kept fixed, query-time memory access does not require re-encoding the full memory corpus.
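The extraction step in Eqs. (4) to (6) can be illustrated with a minimal Python sketch. It assumes the hidden states $H_i$ have already been produced by some frozen encoder and are passed in as plain lists; the names `MemoryEntry` and `build_entry` are hypothetical and not taken from the ElasticMem code.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass(frozen=True)
class MemoryEntry:
    text: str            # m_i, the raw memory chunk
    key: Vector          # k_i = H_i[-1], the retrieval key
    cache: List[Vector]  # C_i = H_i[-N_c:], the latent content cache


def build_entry(text: str, hidden_states: List[Vector], n_cache: int) -> MemoryEntry:
    """Derive key and cache from precomputed hidden states.

    `hidden_states` stands in for H_i = f_{theta_0}(m_i); running a real
    frozen LLM encoder is outside the scope of this sketch.
    """
    if not hidden_states:
        raise ValueError("chunk has no hidden states")
    if n_cache < 1:
        raise ValueError("n_cache must be positive")
    key = list(hidden_states[-1])
    # Assumption: a chunk shorter than n_cache simply yields a shorter cache.
    # The text does not say how such chunks are handled.
    cache = [list(h) for h in hidden_states[-n_cache:]]
    return MemoryEntry(text=text, key=key, cache=cache)
```

Because each entry depends only on its own chunk, the bank can be built once and reused across queries, which is the property the paragraph above relies on.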
### 3.3 Query-Conditioned Retrieval
Standard retrieval systems usually encode the query with a separate retriever. This decouples memory access from the reasoner’s internal state. In contrast, ElasticMem derives the retrieval query from the reasoner itself, so retrieval is conditioned on how the model interprets the current input.
Given a query $q$, the reasoner first samples a retrieval-control token:

$$a\sim p_\theta(\cdot\mid q),\tag{7}$$

where $\theta$ denotes the trainable LoRA-adapted reasoner parameters. The query and retrieval-control token are then passed through the reasoner, and the last hidden state is used as the query representation:

$$h_q=f_\theta(q,a)[-1]\in\mathbb{R}^{d}.\tag{8}$$

We score each memory chunk by cosine similarity between $h_q$ and the cached retrieval key:

$$s_i=\cos(h_q,k_i)=\frac{h_q^{\top}k_i}{\|h_q\|_2\,\|k_i\|_2}.\tag{9}$$

The top-$Z$ chunks are selected as

$$I=\operatorname{TopZ}_{i\in[N]}s_i,\tag{10}$$

where $I=(I_1,\ldots,I_Z)$ denotes the retrieved chunk indices in descending retrieval-score order.
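A minimal sketch of the scoring and selection in Eqs. (9) and (10), assuming the query state $h_q$ and the keys $k_i$ are already available as plain Python lists. The function names `cosine` and `top_z` are hypothetical, and the zero-norm fallback is an assumption added for robustness.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of Eq. (9); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def top_z(h_q: Sequence[float], keys: Sequence[Sequence[float]], z: int) -> List[int]:
    """Indices of the z highest-scoring keys, in descending score order (Eq. 10)."""
    scores = [cosine(h_q, k) for k in keys]
    order = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return order[:z]
```

The returned order matters downstream: the budget policy in Section 3.4 consumes a rank embedding, so position $j$ in the list corresponds to retrieval rank.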
### 3.4 Elastic Latent Budget Allocation
After retrieval, ElasticMem allocates a variable latent budget to each retrieved chunk. Let $b_j$ denote the number of soft memory tokens assigned to the $j$-th retrieved chunk $m_{I_j}$. We impose a per-chunk maximum budget:

$$b_j\in\{0,1,\ldots,B_{\max}\}.\tag{11}$$

The option $b_j=0$ allows the model to suppress retrieved chunks that are similar to the query but not useful for the downstream task. Thus, the budget policy is not merely a compression module; it decides which retrieved memories deserve more representational capacity and which should contribute little or no latent signal.
##### Budget policy network.
We instantiate the budget allocator as a lightweight Transformer policy $\pi_\phi$ over the retrieved set. For each retrieved chunk $m_{I_j}$, we construct a chunk-level feature vector

$$r_j=W_r\left[h_q;\,k_{I_j};\,h_q\odot k_{I_j};\,s_{I_j};\,e_j\right]\in\mathbb{R}^{d_b},\tag{12}$$

where $h_q$ is the query-conditioned retrieval state, $k_{I_j}$ is the retrieved memory key, $s_{I_j}$ is its retrieval score, $e_j$ is a learned rank embedding, $\odot$ denotes element-wise multiplication, and $W_r$ projects the concatenated features into the policy hidden dimension $d_b$. The sequence of retrieved-memory features is then processed by a small Transformer encoder:

$$o_1,\ldots,o_Z=\operatorname{Transformer}_\phi(r_1,\ldots,r_Z),\qquad o_j\in\mathbb{R}^{d_b}.\tag{13}$$

This self-attention structure allows the policy to compare retrieved memories against one another before assigning capacity.
For each retrieved chunk, the policy produces logits over discrete budget actions:
$$\ell_j=W_b\,o_j\in\mathbb{R}^{B_{\max}+1}.\tag{14}$$

The budget for chunk $m_{I_j}$ is then sampled from the categorical distribution:

$$b_j\sim\pi_\phi(\cdot\mid h_q,I)=\operatorname{Cat}\left(\operatorname{softmax}(\ell_j)\right).\tag{15}$$

Unlike similarity-based reweighting, $\pi_\phi$ is not trained to imitate cosine scores. It is optimized only through downstream rewards, allowing it to assign larger budgets to memories that improve the final answer or action even when they are not the most similar retrieved chunks.
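The sampling step of Eq. (15) can be sketched as follows. This is an illustration only: the per-chunk logits $\ell_j$ are taken as given inputs, so the feature construction and Transformer encoder of Eqs. (12) to (14) are not modeled, and the names `softmax` and `sample_budgets` are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one chunk's budget logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_budgets(
    logits_per_chunk: Sequence[Sequence[float]], rng: random.Random
) -> Tuple[List[int], float]:
    """Sample one budget b_j in {0, ..., B_max} per chunk.

    Each inner sequence has B_max + 1 logits. Returns the sampled budgets
    and the sum of their log-probabilities, which is the budget term that
    a policy-gradient objective would need.
    """
    budgets: List[int] = []
    log_prob = 0.0
    for logits in logits_per_chunk:
        probs = softmax(logits)
        b = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        budgets.append(b)
        log_prob += math.log(probs[b])
    return budgets, log_prob
```

Sampling (rather than taking the arg-max) is what lets reward-based training explore different allocations for the same retrieved set.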
### 3.5 Latent Memory Injection
Given the allocated budget $b_j$ for retrieved chunk $m_{I_j}$, ElasticMem selects the last $b_j$ hidden states from its content cache:

$$\widetilde{C}_j=C_{I_j}[-b_j:]\in\mathbb{R}^{b_j\times d}.\tag{16}$$

If $b_j=0$, the chunk contributes no latent memory tokens. Otherwise, the selected states are mapped into the reasoner’s input embedding space through a learned projector $P_\psi$:

$$Z_j=P_\psi(\widetilde{C}_j)\in\mathbb{R}^{b_j\times d}.\tag{17}$$

The final input sequence is

$$X=\bigl[E(q);\,E(a);\,Z_1;\cdots;Z_Z;\,E(u)\bigr],\tag{18}$$

where $u$ denotes the task instruction and optional answer choices. The reasoner generates

$$y\sim p_\theta(\cdot\mid X).\tag{19}$$

Soft memory tokens are excluded from the language modeling loss. They influence generation through the reasoner’s hidden states and are learned through downstream reward signals.
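A short sketch of the selection and projection in Eqs. (16) and (17), assuming caches are lists of row vectors and that `project` is a stand-in callable for the learned projector $P_\psi$ applied row by row. Both function names are hypothetical. The explicit zero-budget branch is worth noting: in Python-style slicing, `cache[-0:]` returns the whole cache rather than nothing.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def select_cached_states(cache: List[Vector], budget: int) -> List[Vector]:
    """Return the last `budget` rows of a content cache (Eq. 16)."""
    if budget < 0 or budget > len(cache):
        raise ValueError("budget must lie in [0, len(cache)]")
    if budget == 0:
        return []  # suppressed chunk; cache[-0:] would return every row
    return cache[-budget:]


def assemble_memory_tokens(
    caches: Sequence[List[Vector]],
    budgets: Sequence[int],
    project: Callable[[Vector], Vector],
) -> List[Vector]:
    """Concatenate the projected states Z_1; ...; Z_Z in retrieval order."""
    tokens: List[Vector] = []
    for cache, b in zip(caches, budgets):
        tokens.extend(project(row) for row in select_cached_states(cache, b))
    return tokens
```

The total number of soft tokens is the sum of the budgets, so it is bounded by $Z\cdot B_{\max}$ regardless of how long the original chunks were.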
### 3.6 Reward-Based Optimization
The usefulness of retrieval and budget decisions is revealed only after the model produces an answer or completes an action. We therefore optimize the full memory-use process with group-relative policy optimization. For each query-target pair $(q,y^{\star})$, ElasticMem samples $G$ trajectories:

$$\tau_g=(a_g,I_g,\mathbf{b}_g,y_g),\qquad g=1,\ldots,G,\tag{20}$$

where $a_g$ is the retrieval-control token, $I_g=(I_{g,1},\ldots,I_{g,Z})$ is the retrieved memory index sequence, $\mathbf{b}_g=(b_{g,1},\ldots,b_{g,Z})$ is the sampled budget allocation, and $y_g$ is the generated output. Each trajectory receives a task reward:

$$r_g=R(y_g,y^{\star}),\tag{21}$$

where $R$ is task-specific, such as accuracy, token-level F1, LLM-judge score, or environment success.
We compute a group\-relative advantage by normalizing rewards within the sampled group:
$$A_g=\frac{r_g-\operatorname{mean}\left(\{r_{g'}\}_{g'=1}^{G}\right)}{\max\left(\operatorname{std}\left(\{r_{g'}\}_{g'=1}^{G}\right),\sigma_{\min}\right)},\tag{22}$$

where $\sigma_{\min}$ is a small constant for numerical stability.
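Eq. (22) can be checked with a few lines of Python. Two details here are assumptions rather than statements from the text: the population standard deviation is used for $\operatorname{std}$, and the default `sigma_min=1e-6` is an illustrative value. The function name is hypothetical.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], sigma_min: float = 1e-6
) -> List[float]:
    """Normalize rewards within one sampled group, as in Eq. (22)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    denom = max(std, sigma_min)       # guards against a zero-variance group
    return [(r - mean) / denom for r in rewards]
```

For binary rewards `[1.0, 0.0, 1.0, 0.0]` the mean and standard deviation are both 0.5, giving advantages of +1 and -1. When every trajectory in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.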
We minimize the clipped GRPO objective:
$$\mathcal{L}_{\mathrm{GRPO}}=-\frac{1}{G}\sum_{g=1}^{G}\left[\min\left(\rho_g A_g,\;\operatorname{clip}\left(\rho_g,\,1-\epsilon,\,1+\epsilon\right)A_g\right)\right],\tag{23}$$

where the probability ratio is

$$\rho_g=\frac{p_{\theta,\phi,\psi}(\tau_g)}{p_{\theta_{\mathrm{old}},\phi_{\mathrm{old}},\psi_{\mathrm{old}}}(\tau_g)}.\tag{24}$$

The trajectory likelihood factorizes as
$$p_{\theta,\phi,\psi}(\tau_g)=p_\theta(a_g\mid q)\left[\prod_{j=1}^{Z}\pi_\phi\!\left(b_{g,j}\mid h_{q,g},\,I_g\right)\right]p_\theta\!\left(y_g\mid X_g(\psi)\right),\tag{25}$$

where $h_{q,g}=f_\theta(q,a_g)[-1]$ and $X_g(\psi)=\bigl[E(q);\,E(a_g);\,P_\psi(\widetilde{C}_{g,1});\cdots;\,P_\psi(\widetilde{C}_{g,Z});\,E(u)\bigr]$ is the final input sequence whose soft memory tokens are produced by the latent projector $P_\psi$, with $\widetilde{C}_{g,j}=C_{I_{g,j}}[-b_{g,j}:]$ as in Eq. (16). Thus, the reward signal supervises not only final generation, but also the retrieval-control token and the memory-budget decisions. The trainable components are the LoRA-adapted reasoner $\theta$, the budget policy $\pi_\phi$, and the latent projector $P_\psi$. The offline memory bank $\mathcal{B}$ remains fixed during training. The detailed training procedure is provided in Appendix [C](https://arxiv.org/html/2605.30690#A3).
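The following sketch shows how the clipped objective, the probability ratio, and the factorized likelihood fit together numerically. It works in log space and takes the per-component log-probabilities as given numbers, so no model is involved. The function names are hypothetical, and the default `epsilon=0.2` is an assumed illustrative value, not one stated in this section.

```python
import math
from typing import Sequence


def trajectory_log_prob(
    lp_control: float, lp_budgets: Sequence[float], lp_output: float
) -> float:
    """log p(tau) = log p(a|q) + sum_j log pi(b_j|h_q, I) + log p(y|X)."""
    return lp_control + sum(lp_budgets) + lp_output


def grpo_loss(
    new_log_probs: Sequence[float],
    old_log_probs: Sequence[float],
    advantages: Sequence[float],
    epsilon: float = 0.2,  # assumed clipping range
) -> float:
    """Clipped surrogate loss averaged over the G trajectories of one group."""
    total = 0.0
    for lp_new, lp_old, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(lp_new - lp_old)  # rho_g
        clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)
```

Because the log-likelihood is a sum over the control token, the budgets, and the output, a single scalar advantage per trajectory credits or penalizes all three kinds of decision at once.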
## 4 Experiments
To evaluate the effectiveness of ElasticMem in memory-augmented language modeling, we conduct experiments on MemorySuite, which contains two complementary settings: memory-intensive question answering and embodied agentic decision-making (see Appendix [D](https://arxiv.org/html/2605.30690#A4)).
Tasks and Metrics. MemorySuite-QA includes PersonaMem-32K (Jiang et al., [2025a](https://arxiv.org/html/2605.30690#bib.bib1)), PersonaMem-128K (Jiang et al., [2025a](https://arxiv.org/html/2605.30690#bib.bib1)), LoCoMo (Maharana et al., [2024](https://arxiv.org/html/2605.30690#bib.bib10)), and LongMemEval (Wu et al., [2024](https://arxiv.org/html/2605.30690#bib.bib11)). We report accuracy as the primary metric. MemorySuite-Agentic uses ALFWorld (Shridhar et al., [2020](https://arxiv.org/html/2605.30690#bib.bib12)) and evaluates both seen and unseen splits, using success rate (SR) as the primary metric. For all tasks, we also report the number of consumed tokens (#Tok.) to measure memory efficiency.
Baselines and Settings. We compare ElasticMem with representative memory-based baselines from two paradigms. Text-space memory baselines include MemoryBank (Zhong et al., [2024](https://arxiv.org/html/2605.30690#bib.bib13)), A-MEM (Xu et al., [2025a](https://arxiv.org/html/2605.30690#bib.bib14)), LightMem (Fang et al., [2025a](https://arxiv.org/html/2605.30690#bib.bib15)), Mem0 (Chhikara et al., [2025](https://arxiv.org/html/2605.30690#bib.bib16)), MemoryOS (Kang et al., [2025](https://arxiv.org/html/2605.30690#bib.bib17)), MemP (Fang et al., [2025b](https://arxiv.org/html/2605.30690#bib.bib18)), and LangMem (LangChain, [2024](https://arxiv.org/html/2605.30690#bib.bib19)). Latent-space memory baselines include MemGen (Zhang et al., [2025a](https://arxiv.org/html/2605.30690#bib.bib20)), AutoCompressor (Chevalier et al., [2023](https://arxiv.org/html/2605.30690#bib.bib90)), and M+ (Wang et al., [2025](https://arxiv.org/html/2605.30690#bib.bib22)). All methods are evaluated with Qwen2.5-3B-Instruct ([https://huggingface.co/Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct)) and Qwen2.5-7B-Instruct ([https://huggingface.co/Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)) backbones under a no-few-shot-example setting.
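Table 2: Main comparison results on PersonaMem-32K, PersonaMem-128K, LoCoMo, LongMemEval, and ALFWorld. For Acc. and SR, higher is better; for #Tok., lower is better. Bold and underline denote the best and second-best results. The Avg. columns are computed as weighted averages using the number of test examples or games in each benchmark. The first ten numeric columns are Memory QA tasks; the last six are agentic embodied interactive tasks.

Qwen2.5-3B-Instruct

| Memory Space | Method | PersonaMem-32K Acc. | PersonaMem-32K #Tok. | PersonaMem-128K Acc. | PersonaMem-128K #Tok. | LoCoMo Acc. | LoCoMo #Tok. | LongMemEval Acc. | LongMemEval #Tok. | QA Avg. Acc. | QA Avg. #Tok. | ALF-Seen SR | ALF-Seen #Tok. | ALF-Unseen SR | ALF-Unseen #Tok. | ALFWorld Avg. SR | ALFWorld Avg. #Tok. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Text | MemoryBank | 0.50 | 2,473 | 0.33 | 4,618 | 0.41 | 2,202 | 0.29 | 9,591 | 0.37 | 4,089 | 0.06 | 76,016 | 0.09 | 71,173 | 0.08 | 73,648 |
| Text | A-MEM | 0.48 | 4,899 | 0.38 | 3,530 | 0.39 | 3,348 | 0.45 | 15,625 | 0.40 | 5,282 | 0.04 | 17,239 | 0.05 | 14,248 | 0.04 | 15,776 |
| Text | LightMem | 0.46 | 851 | 0.35 | 1,306 | 0.40 | 3,468 | 0.44 | 1,248 | 0.39 | 2,239 | 0.14 | 13,252 | 0.17 | 12,617 | 0.15 | 12,941 |
| Text | Mem0 | 0.42 | 778 | 0.39 | 662 | 0.33 | 836 | 0.33 | 2,843 | 0.36 | 1,062 | 0.14 | 43,055 | 0.12 | 43,677 | 0.13 | 43,359 |
| Text | MemoryOS | 0.48 | 4,419 | 0.32 | 9,440 | 0.33 | 4,420 | 0.40 | 13,854 | 0.35 | 7,452 | 0.11 | 68,131 | 0.13 | 65,845 | 0.12 | 67,013 |
| Text | MemP | 0.42 | 12,002 | 0.38 | 29,567 | 0.26 | 5,140 | 0.41 | 26,571 | 0.33 | 16,873 | 0.11 | 101,580 | 0.10 | 110,019 | 0.11 | 105,707 |
| Text | LangMem | 0.40 | 675 | 0.30 | 1,626 | 0.40 | 936 | 0.28 | 687 | 0.35 | 1,112 | 0.20 | 12,276 | 0.11 | 12,449 | 0.15 | 12,361 |
| Latent | MemGen | 0.62 | 502 | 0.58 | 393 | 0.60 | 198 | 0.55 | 242 | 0.59 | 291 | 0.29 | 14,708 | 0.25 | 14,416 | 0.27 | 14,565 |
| Latent | AutoCompressor | 0.48 | 717 | 0.45 | 1,127 | 0.47 | 234 | 0.44 | 998 | 0.46 | 677 | 0.16 | 21,145 | 0.13 | 21,571 | 0.14 | 21,353 |
| Latent | M+ | 0.56 | 447 | 0.52 | 350 | 0.54 | 154 | 0.50 | 198 | 0.53 | 247 | 0.20 | 14,246 | 0.16 | 14,233 | 0.18 | 14,240 |
| Latent | ElasticMem | 0.74 | 528 | 0.77 | 539 | 0.74 | 334 | 0.68 | 378 | 0.74 | 423 | 0.49 | 11,028 | 0.41 | 12,846 | 0.45 | 11,917 |

Qwen2.5-7B-Instruct

| Memory Space | Method | PersonaMem-32K Acc. | PersonaMem-32K #Tok. | PersonaMem-128K Acc. | PersonaMem-128K #Tok. | LoCoMo Acc. | LoCoMo #Tok. | LongMemEval Acc. | LongMemEval #Tok. | QA Avg. Acc. | QA Avg. #Tok. | ALF-Seen SR | ALF-Seen #Tok. | ALF-Unseen SR | ALF-Unseen #Tok. | ALFWorld Avg. SR | ALFWorld Avg. #Tok. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Text | MemoryBank | 0.53 | 2,639 | 0.35 | 4,724 | 0.40 | 2,301 | 0.39 | 10,822 | 0.39 | 4,358 | 0.11 | 36,602 | 0.19 | 39,385 | 0.15 | 37,963 |
| Text | A-MEM | 0.52 | 9,731 | 0.37 | 11,098 | 0.42 | 3,289 | 0.50 | 31,489 | 0.42 | 10,408 | 0.06 | 29,543 | 0.09 | 29,531 | 0.07 | 29,537 |
| Text | LightMem | 0.52 | 864 | 0.33 | 1,228 | 0.41 | 3,702 | 0.53 | 939 | 0.41 | 2,275 | 0.33 | 11,150 | 0.37 | 10,617 | 0.35 | 10,889 |
| Text | Mem0 | 0.54 | 794 | 0.39 | 676 | 0.38 | 921 | 0.48 | 3,202 | 0.41 | 1,157 | 0.43 | 26,165 | 0.40 | 27,327 | 0.42 | 26,733 |
| Text | MemoryOS | 0.52 | 4,695 | 0.38 | 9,748 | 0.42 | 4,998 | 0.53 | 14,489 | 0.43 | 7,926 | 0.38 | 60,129 | 0.40 | 50,238 | 0.39 | 55,292 |
| Text | MemP | 0.52 | 11,964 | 0.40 | 29,581 | 0.22 | 5,382 | 0.48 | 26,580 | 0.34 | 16,985 | 0.24 | 143,768 | 0.22 | 115,492 | 0.23 | 129,940 |
| Text | LangMem | 0.68 | 1,359 | 0.42 | 3,097 | 0.35 | 1,478 | 0.42 | 781 | 0.41 | 1,911 | 0.40 | 9,735 | 0.28 | 10,140 | 0.34 | 9,933 |
| Latent | MemGen | 0.70 | 502 | 0.66 | 393 | 0.68 | 198 | 0.63 | 242 | 0.67 | 291 | 0.39 | 12,933 | 0.34 | 13,551 | 0.36 | 13,235 |
| Latent | AutoCompressor | 0.56 | 717 | 0.52 | 1,127 | 0.55 | 234 | 0.51 | 998 | 0.54 | 677 | 0.24 | 19,875 | 0.20 | 19,317 | 0.22 | 19,602 |
| Latent | M+ | 0.64 | 447 | 0.59 | 350 | 0.62 | 154 | 0.57 | 198 | 0.60 | 247 | 0.32 | 12,520 | 0.28 | 13,382 | 0.30 | 12,942 |
| Latent | ElasticMem | 0.90 | 528 | 0.84 | 539 | 0.84 | 334 | 0.76 | 378 | 0.83 | 423 | 0.56 | 9,035 | 0.49 | 9,031 | 0.53 | 9,033 |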
### 4.1 ElasticMem Outperforms Memory-based Methods in Text-Space and Latent-Space
We evaluate ElasticMem on MemorySuite, covering memory-intensive QA and embodied agentic decision-making, comparing against representative text-space and latent-space memory baselines with Qwen2.5-3B/7B-Instruct (Table [2](https://arxiv.org/html/2605.30690#S4.T2)). Two observations follow.
ElasticMem Consistently Achieves the Best Task Performance. ElasticMem obtains the best results across all evaluated settings. On MemorySuite-QA, ElasticMem achieves the highest accuracy over all datasets, improving the weighted average accuracy from 0.588 to 0.742 with Qwen2.5-3B-Instruct and from 0.668 to 0.832 with Qwen2.5-7B-Instruct compared with the strongest baseline. On ALFWorld, ElasticMem also achieves the highest average success rate, improving from 0.270 to 0.449 with Qwen2.5-3B-Instruct and from 0.416 to 0.529 with Qwen2.5-7B-Instruct. These results show that ElasticMem consistently improves both memory recall and sequential decision-making across different model scales and task types.
ElasticMem Provides a Strong Accuracy-Efficiency Trade-off. Beyond task performance, ElasticMem is also more efficient than both text-space and fixed-capacity latent-space memory methods. Text-space baselines inject retrieved memories as natural-language context, often incurring large token overhead on ALFWorld while still underperforming ElasticMem. Latent-space baselines reduce prompt-token usage, but their fixed or non-elastic memory interfaces limit their ability to separate useful memories from redundant or weakly relevant ones. In contrast, ElasticMem operates in latent space and learns to allocate memory capacity elastically. Compared with the strongest latent baseline, ElasticMem improves average QA accuracy by a relative 26.2% on Qwen2.5-3B-Instruct and 24.6% on Qwen2.5-7B-Instruct, showing that its gains come not merely from using latent memory, but from learning how much capacity each memory should receive. On ALFWorld, ElasticMem further achieves the highest success rate while using the fewest tokens per game, improving average SR by a relative 66.3% and 27.2% on the two backbones, respectively. This indicates that ElasticMem does not improve through longer exploration, but by using memory more selectively and completing tasks more efficiently. Overall, Table [2](https://arxiv.org/html/2605.30690#S4.T2) shows that ElasticMem offers a stronger accuracy-efficiency trade-off than both text-space and fixed-capacity latent-space memory methods.
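The percentage gains quoted in this subsection are relative improvements over the strongest baseline average. As a small worked check, the sketch below recomputes them from the averages given in the text; the function name `relative_gain` and the dictionary labels are illustrative only.

```python
def relative_gain(baseline, ours):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0

# (strongest baseline average, ElasticMem average), as quoted in this subsection.
reported = {
    "QA accuracy, Qwen2.5-3B-Instruct": (0.588, 0.742),
    "QA accuracy, Qwen2.5-7B-Instruct": (0.668, 0.832),
    "ALFWorld SR, Qwen2.5-3B-Instruct": (0.270, 0.449),
    "ALFWorld SR, Qwen2.5-7B-Instruct": (0.416, 0.529),
}

gains = {name: round(relative_gain(b, o), 1) for name, (b, o) in reported.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain}%")
```

The four results round to 26.2%, 24.6%, 66.3%, and 27.2%, matching the figures stated above.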
Figure 2: Ablation studies on Qwen2.5-7B-Instruct. (a) Effect of budget policy: ElasticMem outperforms Random Budget, Uniform Budget, and MLP Budget Policy, showing that the Transformer budget policy better learns how to allocate latent capacity across retrieved memories. (b) Effect of retrieval design: ElasticMem outperforms Semantic Retrieval, Frozen-State Retrieval, and Query-State Retrieval, demonstrating the benefit of using a trainable reasoner-state retrieval mechanism with a sampled retrieval-control token. (c) Effect of per-chunk maximum budget: ElasticMem achieves the best overall performance at $B_{\max}=20$, suggesting that a moderate per-memory capacity limit is more effective than overly small or overly large per-chunk budgets.
### 4.2 Ablation Studies Validate ElasticMem’s Key Components
ElasticMem Benefits from Utility-Aware Budget Allocation. We first study how the latent budget policy affects performance. As shown in Figure [2](https://arxiv.org/html/2605.30690#S4.F2)(a), ElasticMem consistently outperforms Random Budget, Uniform Budget, and MLP Budget Policy across the five evaluation settings. Random Budget performs worst because it assigns latent capacity without considering query relevance or memory utility. Uniform Budget is more stable but treats all retrieved chunks equally, ignoring the fact that different memories contribute unequally to the final prediction. MLP Budget Policy improves over these two variants by learning query-dependent allocation, but it still underperforms ElasticMem because it scores each retrieved memory without explicit cross-memory interaction. These results show that the Transformer budget policy in ElasticMem better compares retrieved memories and allocates latent capacity according to downstream usefulness.
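To make the contrast between uniform and utility-aware allocation concrete, here is a minimal sketch. It is not the paper's Transformer budget policy: the utility scores are hypothetical inputs, and the softmax-proportional split is a crude stand-in chosen only because it makes each memory's budget depend on the scores of all retrieved memories. The names `uniform_budgets`, `utility_budgets`, and `total_budget` are assumptions introduced for illustration.

```python
import math

def uniform_budgets(num_memories, per_memory):
    """Uniform Budget variant: every retrieved memory gets the same budget."""
    return [per_memory] * num_memories

def utility_budgets(utilities, total_budget, b_max):
    """Split `total_budget` latent tokens in proportion to softmax(utilities),
    then cap each memory at `b_max`. Capped-off tokens are not redistributed."""
    peak = max(utilities)
    weights = [math.exp(u - peak) for u in utilities]  # numerically stable softmax
    norm = sum(weights)
    return [min(b_max, int(total_budget * w / norm)) for w in weights]

# Hypothetical utility scores for four retrieved memories.
scores = [2.0, 0.1, -1.0, 1.5]
flat = uniform_budgets(len(scores), per_memory=10)
elastic = utility_budgets(scores, total_budget=40, b_max=20)

print(flat)     # [10, 10, 10, 10]
print(elastic)  # [20, 3, 1, 13]
```

With the same nominal total of 40 latent tokens, the score-based split concentrates capacity on the two highest-scoring memories and nearly suppresses the others, which is the qualitative behavior the ablation attributes to utility-aware allocation.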
ElasticMem Improves Retrieval with a Trainable Reasoner State. Figure [2](https://arxiv.org/html/2605.30690#S4.F2)(b) ablates the retrieval mechanism; ElasticMem achieves the best performance. Semantic Retrieval performs the worst, indicating that rigid similarity matching is insufficient for identifying useful memories. Frozen-State Retrieval improves by using the reasoner’s hidden state as the retrieval signal but lacks task-adaptive updates. Query-State Retrieval further improves with a trainable reasoner state, yet still falls behind ElasticMem. This gap highlights the importance of the retrieval-control token for forming a flexible, query-specific retrieval state. Overall, retrieval is more effective when it is coupled with the trainable reasoner’s internal state rather than relying on fixed semantic similarity.
ElasticMem Performs Best with a Moderate Per-Chunk Budget. Finally, we vary the per-chunk maximum latent budget $B_{\max}\in\{10,20,30,40\}$. As shown in Figure [2](https://arxiv.org/html/2605.30690#S4.F2)(c), ElasticMem performs best when $B_{\max}=20$. A smaller budget such as $B_{\max}=10$ can be too restrictive to preserve complex evidence or transferable plans. Larger budgets such as $B_{\max}=30$ or $B_{\max}=40$ do not further improve performance and can slightly degrade results, likely because they allow excessive capacity to be assigned to less useful retrieved chunks. These results suggest that elastic memory allocation benefits from a moderate per-memory capacity limit.
### 4.3 ElasticMem Learns Evidence Utility Beyond Retrieval Similarity
We provide qualitative case studies showing that ElasticMem learns to distinguish useful evidence from superficially similar memories. This is important because cosine retrieval mainly captures semantic overlap, while memory utility depends on relevance to the current task.
ElasticMem Suppresses Similar Memories That Do Not Provide Useful Evidence. In PersonaMem-32K, the query asks about a summer mentoring program involving study techniques, meditation, and stress relief. The correct answer requires recalling that meditation helped the user improve focus and reduce anxiety during study. However, several top-ranked retrieved chunks are only topically similar, such as memories about a group project, a peer study group, or a near-verbatim restatement of the query. Although these chunks receive high cosine similarity, ElasticMem assigns them low latent budgets. Instead, it allocates larger budgets to lower-ranked chunks that provide useful evidence, including memories connecting yoga and meditation with stress reduction and mental balance, and memories linking anxiety, sleep deprivation, and cognitive function. This allows the reasoner to recover the evidence chain meditation $\rightarrow$ focus $\rightarrow$ anxiety $\rightarrow$ study performance, while a uniform-budget baseline fails. This case shows that ElasticMem does not simply follow retrieval similarity, but learns to prioritize evidence that is useful for answering the query.
ElasticMem Aligns Memory Budgets with Transferable Plan Structure Rather Than Task Labels. In ALFWorld, for the task *“heat some apple and put it on the countertop”*, similarity-based retrieval tends to favor memories with the same `heat` task type. In contrast, ElasticMem assigns high budget to a skill card from a different `cool` task because its plan places an object on a `shelf`, which is structurally similar to a `countertop` as an open horizontal surface. Meanwhile, ElasticMem assigns low budgets to superficially related cards whose destinations are appliances such as a `coffeemachine` or `stoveburner`, since their placement plans are less compatible with the current goal. Even for two cards with the same objective string, *“heat some mug and put it in coffeemachine”*, ElasticMem assigns different budgets depending on whether the action plan searches open surfaces or irrelevant cabinets and appliances. This shows that the budget policy captures transferable procedural structure rather than relying only on task labels or objective text.
## 5 Additional Related Work
Long-term memory is increasingly important for LLM agents, enabling persistent user information, reuse of past interactions, and improved decision-making across sessions. Existing memory systems can be broadly divided into text-space and latent-space approaches. Text-space methods store memories as natural-language records and retrieve, summarize, update, or organize them before injecting selected content into the model context. Representative systems include MemoryBank (Zhong et al., [2024](https://arxiv.org/html/2605.30690#bib.bib13)), Mem0 (Chhikara et al., [2025](https://arxiv.org/html/2605.30690#bib.bib16)), MemoryOS (Kang et al., [2025](https://arxiv.org/html/2605.30690#bib.bib17)), and LangMem (LangChain, [2024](https://arxiv.org/html/2605.30690#bib.bib19)), as well as experience-oriented frameworks such as A-MEM (Xu et al., [2025a](https://arxiv.org/html/2605.30690#bib.bib14)), LightMem (Fang et al., [2025a](https://arxiv.org/html/2605.30690#bib.bib15)), and MemP (Fang et al., [2025b](https://arxiv.org/html/2605.30690#bib.bib18)). While effective for persistence and experience reuse, these methods operate primarily in text space, causing substantial token overhead and making memory use sensitive to irrelevant or redundant retrieved content. Latent-space methods reduce textual context overhead by representing memories as compact continuous states or soft tokens, such as AutoCompressor (Chevalier et al., [2023](https://arxiv.org/html/2605.30690#bib.bib90)), MemGen (Zhang et al., [2025a](https://arxiv.org/html/2605.30690#bib.bib20)), and M+ (Wang et al., [2025](https://arxiv.org/html/2605.30690#bib.bib22)). However, they typically allocate fixed latent capacity to each memory regardless of its relevance, redundancy, or downstream utility.
Meanwhile, reinforcement learning has become central to improving LLM reasoning, alignment, and agentic decision-making. Representative approaches range from PPO-style optimization (Ouyang et al., [2022](https://arxiv.org/html/2605.30690#bib.bib23); Liu et al., [2024a](https://arxiv.org/html/2605.30690#bib.bib24)) and DPO (Rafailov et al., [2023](https://arxiv.org/html/2605.30690#bib.bib25)) to PRMs (Lightman et al., [2023](https://arxiv.org/html/2605.30690#bib.bib26)), self-improvement methods such as SPIN (Chen et al., [2024](https://arxiv.org/html/2605.30690#bib.bib27)) and SCoRe (Kumar et al., [2024](https://arxiv.org/html/2605.30690#bib.bib28)), and group-relative optimization variants including GRPO (Shao et al., [2024](https://arxiv.org/html/2605.30690#bib.bib29)), Dr. GRPO (Liu et al., [2025](https://arxiv.org/html/2605.30690#bib.bib30)), GSPO (Zheng et al., [2025](https://arxiv.org/html/2605.30690#bib.bib31)), and Clip-Cov (Cui et al., [2025](https://arxiv.org/html/2605.30690#bib.bib32)). Despite their success, most RL frameworks remain memory-limited because useful trajectories, failure patterns, and task-specific strategies are mainly absorbed through parameter updates rather than summarized into reusable memories. In contrast, ElasticMem treats memory use as a learnable resource allocation problem: it maintains an external latent memory bank, retrieves memories conditioned on the agent’s reasoning state, and learns to allocate different numbers of latent memory tokens according to downstream utility. This allows ElasticMem to assign larger budgets to informative memories, compress peripheral memories, suppress unhelpful memories, and jointly optimize retrieval, allocation, and latent memory injection with task rewards.
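For readers unfamiliar with group-relative optimization, the core idea can be sketched as follows: several rollouts are sampled for the same query, and each rollout's task reward is normalized against the statistics of its own group. The snippet shows the commonly described GRPO-style normalization; it is a generic illustration, not ElasticMem's training code, and the reward values, the `eps` constant, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward within its group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # ASSUMPTION: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of four rollouts for one query with 0/1 task rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Rollouts that beat the group mean receive positive advantages and the rest receive negative ones, so retrieval and budget decisions that lead to higher task reward are reinforced without a separate value model.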
## 6 Conclusion
We presented ElasticMem, a memory-augmented LLM framework that treats memory as an elastic latent resource rather than a fixed textual or latent context. By coupling retrieval with the reasoner’s internal state, allocating variable latent budgets to retrieved memories, and optimizing the full memory-use process with downstream task rewards, ElasticMem learns to use memory according to task utility. Experiments on MemorySuite show that ElasticMem improves both memory-intensive question answering and embodied agentic decision-making while reducing token overhead compared with text-space memory methods. Ablation studies and qualitative analyses further confirm that adaptive memory allocation enables the model to emphasize useful evidence and suppress noisy or redundant memories. These results suggest that elastic latent memory is a promising direction for building more efficient, adaptive, and long-horizon LLM agents.
## References
- [1] (2023) Self-rag: learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations.
- [2] Y. Chen, Y. Wang, S. Zhu, H. Yu, T. Feng, M. Zhang, M. Patwary, and J. You (2025) Multi-agent evolve: llm self-improve through co-evolution. arXiv preprint arXiv:2510.23595.
- [3] Z. Chen, Y. Deng, H. Yuan, K. Ji, and Q. Gu (2024) Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335.
- [4] A. Chevalier, A. Wettig, A. Ajith, and D. Chen (2023) Adapting language models to compress contexts. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 3829–3846.
- [5] P. Chhikara, D. Khant, S. Aryan, T. Singh, and D. Yadav (2025) Mem0: building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413.
- [6] G. Cui, Y. Zhang, J. Chen, L. Yuan, Z. Wang, Y. Zuo, H. Li, Y. Fan, H. Chen, W. Chen, et al. (2025) The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617.
- [7] L. Dai, Y. Xu, J. Ye, H. Liu, and H. Xiong (2025) Seper: measure retrieval utility through the lens of semantic perplexity reduction. arXiv preprint arXiv:2503.01478.
- [8] Y. Du, B. Wang, Y. Xiang, Z. Wang, W. Huang, B. Xue, B. Liang, X. Zeng, F. Mi, H. Bai, et al. (2025) Memory-t1: reinforcement learning for temporal reasoning in multi-session agents. arXiv preprint arXiv:2512.20092.
- [9] J. Fang, X. Deng, H. Xu, Z. Jiang, Y. Tang, Z. Xu, S. Deng, Y. Yao, M. Wang, S. Qiao, et al. (2025) Lightmem: lightweight and efficient memory-augmented generation. arXiv preprint arXiv:2510.18866.
- [10] R. Fang, Y. Liang, X. Wang, J. Wu, S. Qiao, P. Xie, F. Huang, H. Chen, and N. Zhang (2025) Memp: exploring agent procedural memory. arXiv preprint arXiv:2508.06433.
- [11] C. Gan, D. Yang, B. Hu, H. Zhang, S. Li, Z. Liu, Y. Shen, L. Ju, Z. Zhang, J. Gu, et al. (2024) Similarity is not all you need: endowing retrieval augmented generation with multi layered thoughts. arXiv preprint arXiv:2405.19893.
- [12] J. Gao, L. Li, W. Li, Y. Fu, and B. Dai (2024) Smartrag: jointly learn rag-related tasks from the environment feedback. arXiv preprint arXiv:2410.18141.
- [13] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. (2025) Deepseek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
- [14] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. (2022) Lora: low-rank adaptation of large language models. ICLR 1 (2), pp. 3.
- [15] M. Hu, T. Chen, Q. Chen, Y. Mu, W. Shao, and P. Luo (2025) Hiagent: hierarchical working memory management for solving long-horizon agent tasks with large language model. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 32779–32798.
- [16] P. Jansen, M. Côté, T. Khot, E. Bransom, B. Dalvi Mishra, B. P. Majumder, O. Tafjord, and P. Clark (2024) Discoveryworld: a virtual environment for developing and evaluating automated scientific discovery agents. Advances in Neural Information Processing Systems 37, pp. 10088–10116.
- [17] B. Jiang, Y. Yuan, M. Shen, Z. Hao, Z. Xu, Z. Chen, Z. Liu, A. R. Vijjini, J. He, H. Yu, et al. (2025) Personamem-v2: towards personalized intelligence via learning implicit user personas and agentic memory. arXiv preprint arXiv:2512.06688.
- [18] P. Jiang, J. Lin, Z. Shi, Z. Wang, L. He, Y. Wu, M. Zhong, P. Song, Q. Zhang, H. Wang, et al. (2025) Adaptation of agentic ai. arXiv preprint arXiv:2512.16301.
- [19] J. Kang, M. Ji, Z. Zhao, and T. Bai (2025) Memory os of ai agent. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 25972–25981.
- [20] V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W. Yih (2020) Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pp. 6769–6781.
- [21] A. Kumar, V. Zhuang, R. Agarwal, Y. Su, J. D. Co-Reyes, A. Singh, K. Baumli, S. Iqbal, C. Bishop, R. Roelofs, et al. (2024) Training language models to self-correct via reinforcement learning. arXiv preprint arXiv:2409.12917.
- [22] LangChain (2024) LangMem. Note: [https://langchain-ai.github.io/langmem/](https://langchain-ai.github.io/langmem/)
- [23] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2023) Let’s verify step by step. In The Twelfth International Conference on Learning Representations.
- [24] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, et al. (2024) Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437.
- [25] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang (2024) Lost in the middle: how language models use long contexts. Transactions of the association for computational linguistics 12, pp. 157–173.
- [26] Z. Liu, C. Chen, W. Li, P. Qi, T. Pang, C. Du, W. S. Lee, and M. Lin (2025) Understanding r1-zero-like training: a critical perspective. arXiv preprint arXiv:2503.20783.
- [27] I. Loshchilov and F. Hutter (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
- [28] A. Maharana, D. Lee, S. Tulyakov, M. Bansal, F. Barbieri, and Y. Fang (2024) Evaluating very long-term conversational memory of llm agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13851–13870.
- [29] Q. Mi, Z. Ma, M. Yang, H. Li, Y. Wang, H. Zhang, and J. Wang (2026) ProcMEM: learning reusable procedural memory from experience via non-parametric ppo for llm agents. arXiv preprint arXiv:2602.01869.
- [30] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. (2022) Training language models to follow instructions with human feedback. Advances in neural information processing systems 35, pp. 27730–27744.
- [31] Z. Pan, Q. Wu, H. Jiang, X. Luo, H. Cheng, D. Li, Y. Yang, C. Lin, H. V. Zhao, L. Qiu, et al. (2025) On memory construction and retrieval for personalized conversational agents. arXiv preprint arXiv:2502.05589.
- [32] J. S. Park, J. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein (2023) Generative agents: interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology, pp. 1–22.
- [33] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn (2023) Direct preference optimization: your language model is secretly a reward model. Advances in neural information processing systems 36, pp. 53728–53741.
- [34] N. Reimers and I. Gurevych (2019) Sentence-bert: sentence embeddings using siamese bert-networks. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pp. 3982–3992.
- [35] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024) Deepseekmath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- [36] F. Shi, X. Chen, K. Misra, N. Scales, D. Dohan, E. H. Chi, N. Schärli, and D. Zhou (2023) Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pp. 31210–31227.
- [37] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao (2023) Reflexion: language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems 36, pp. 8634–8652.
- [38] M. Shridhar, X. Yuan, M. Côté, Y. Bisk, A. Trischler, and M. Hausknecht (2020) Alfworld: aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768.
- [39] H. Sun, Z. Zhang, and S. Zeng (2025) Preference-aware memory update for long-term llm agents. arXiv preprint arXiv:2510.09720.
- [40] M. Tavakoli, A. Salemi, C. Ye, M. Abdalla, H. Zamani, and J. R. Mitchell (2025) Beyond a million tokens: benchmarking and enhancing long-term memory in llms. arXiv preprint arXiv:2510.27246.
- [41] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar (2023) Voyager: an open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291.
- [42] Y. Wang, D. Krotov, Y. Hu, Y. Gao, W. Zhou, J. McAuley, D. Gutfreund, R. Feris, and Z. He (2025) M+: extending memoryllm with scalable long-term memory. arXiv preprint arXiv:2502.00592.
- [43] J. Wei, Z. Sun, S. Papay, S. McKinney, J. Han, I. Fulford, H. W. Chung, A. T. Passos, W. Fedus, and A. Glaese (2025) BrowseComp: a simple yet challenging benchmark for browsing agents. External Links: 2504.12516, [Link](https://arxiv.org/abs/2504.12516)
- [44] D. Wu, H. Wang, W. Yu, Y. Zhang, K. Chang, and D. Yu (2024) Longmemeval: benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813.
- [45] W. Xu, Z. Liang, K. Mei, H. Gao, J. Tan, and Y. Zhang (2025) A-mem: agentic memory for llm agents. arXiv preprint arXiv:2502.12110.
- [46] Y. Xu, J. Gao, X. Yu, Y. Xue, B. Bi, H. Shen, and X. Cheng (2025) Training a utility-based retriever through shared context attribution for retrieval-augmented language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 629–648.
- [47] S. Yan, J. Gu, Y. Zhu, and Z. Ling (2024) Corrective retrieval augmented generation.
- [48] S. Yan, X. Yang, Z. Huang, E. Nie, Z. Ding, Z. Li, X. Ma, J. Bi, K. Kersting, J. Z. Pan, et al. (2025) Memory-r1: enhancing large language model agents to manage and utilize memories via reinforcement learning. arXiv preprint arXiv:2508.19828.
- [49] J. Yang, C. E. Jimenez, A. L. Zhang, K. Lieret, J. Yang, X. Wu, O. Press, N. Muennighoff, G. Synnaeve, K. R. Narasimhan, et al. (2024) Swe-bench multimodal: do ai systems generalize to visual software domains? arXiv preprint arXiv:2410.03859.
- [50] G. Zhang, M. Fu, and S. Yan (2025) Memgen: weaving generative latent memory for self-evolving agents. arXiv preprint arXiv:2509.24704.
- [51] W. Zhang, K. Tang, H. Wu, M. Wang, Y. Shen, G. Hou, Z. Tan, P. Li, Y. Zhuang, and W. Lu (2024) Agent-pro: learning to evolve via policy-level reflection and optimization. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5348–5375.
- [52] Y. Zhang, J. Li, W. Zhang, P. Jia, X. Li, Y. Wang, D. Xu, Y. Wen, H. Guo, Y. Liu, et al. (2026) Evoking user memory: personalizing llm via recollection-familiarity adaptive retrieval. arXiv preprint arXiv:2603.09250.
- [53] Z. Zhang, Q. Dai, X. Bo, C. Ma, R. Li, X. Chen, J. Zhu, Z. Dong, and J. Wen (2025) A survey on the memory mechanism of large language model-based agents. ACM Transactions on Information Systems 43 (6), pp. 1–47.
- [54] C. Zheng, S. Liu, M. Li, X. Chen, B. Yu, C. Gao, K. Dang, Y. Liu, R. Men, A. Yang, et al. (2025) Group sequence policy optimization. arXiv preprint arXiv:2507.18071.
- [55] L. Zheng, R. Wang, X. Wang, and B. An (2023) Synapse: trajectory-as-exemplar prompting with memory for computer control. arXiv preprint arXiv:2306.07863.
- [56] W. Zhong, L. Guo, Q. Gao, H. Ye, and Y. Wang (2024) Memorybank: enhancing large language models with long-term memory. In Proceedings of the AAAI conference on artificial intelligence, Vol. 38, pp. 19724–19731.
## Appendix ALimitations, Future Work, and Broader Impact
Limitations\.WhileElasticMemdemonstrates strong performance across memory\-intensive QA and embodied agentic tasks, our evaluation focuses on theMemorySuitebenchmarks with Qwen2\.5\-3B and 7B backbones\. Although the consistent improvements across both scales suggest generalizability, validatingElasticMemon additional model families and more diverse interactive environments would further strengthen the conclusions\.
Future Work\.Beyond addressing the above scope boundaries, several directions merit investigation\. Extending the evaluation to larger\-scale memory corpora and more open\-ended interactive environments\[[49](https://arxiv.org/html/2605.30690#bib.bib4),[16](https://arxiv.org/html/2605.30690#bib.bib8),[43](https://arxiv.org/html/2605.30690#bib.bib9)\]would test the scalability of elastic latent memory\. Developing richer visualization and interpretability tools for the learned budget allocation would make the memory\-use process more transparent\. Additionally, exploring dynamic memory bank updates, where the memory bank itself evolves as the agent accumulates new experience, is a promising direction for agents continuous evolving\[[51](https://arxiv.org/html/2605.30690#bib.bib3),[18](https://arxiv.org/html/2605.30690#bib.bib7),[2](https://arxiv.org/html/2605.30690#bib.bib2)\]\.
Broader Impact\. By reducing the token overhead of memory\-augmented LLM agents through elastic latent allocation, ElasticMem has the potential to make memory\-intensive LLM applications more computationally accessible\. We do not foresee specific negative societal consequences beyond those generally associated with improving LLM agent capabilities\. As with all advances in LLM\-based agents, downstream deployments should incorporate appropriate safeguards for safety and reliability\.
## Appendix B Implementation Details
##### LoRA\-adapted reasoner $\theta$\.
We adapt the backbone reasoner with LoRA \[[14](https://arxiv.org/html/2605.30690#bib.bib68)\]\. Specifically, we apply LoRA to the query, key, value, and output projection matrices of all attention layers, i\.e\., q\_proj, k\_proj, v\_proj, and o\_proj\. We use LoRA rank $r=16$, scaling factor $\alpha=32$, and dropout $0.1$\. The total number of trainable LoRA parameters is approximately 14\.7M for the 3B backbone and 20\.2M for the 7B backbone\.
##### Latent memory bank $\mathcal{B}$\.
Each memory chunk $m_i$ is encoded once offline by the frozen base encoder $f_{\theta_0}$\. Following Eqs\. [4](https://arxiv.org/html/2605.30690#S3.E4)–[5](https://arxiv.org/html/2605.30690#S3.E5), we store the last hidden state as the retrieval key $k_i = H_i[-1] \in \mathbb{R}^{d}$ and the last $N_c = 20$ hidden states as the content cache $C_i = H_i[-N_c:] \in \mathbb{R}^{N_c \times d}$\. All retrieval keys are $L_2$\-normalized and retrieved with cosine similarity as in Eq\. [9](https://arxiv.org/html/2605.30690#S3.E9)\. The number of retrieved chunks $Z$ is dataset\-specific: $Z=9$ for PersonaMem\-32K, $Z=21$ for PersonaMem\-128K, $Z=20$ for LoCoMo and LongMemEval, and $Z=10$ for ALFWorld\. The memory bank is fixed during training and inference\.
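The bank layout and top\-$Z$ retrieval described above can be sketched in plain Python\. This is a minimal illustration under stated assumptions, not the authors’ implementation: hidden states are stand\-in lists of floats rather than LLM activations, and `l2_normalize`, `build_entry`, and `retrieve_top_z` are hypothetical helper names\.

```python
import math

N_C = 20  # content-cache length N_c reported in the text


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def build_entry(hidden_states, n_c=N_C):
    """Turn per-token hidden states H_i into (key, cache).

    key   = L2-normalized last hidden state, H_i[-1]
    cache = last n_c hidden states, H_i[-n_c:]
    """
    return l2_normalize(hidden_states[-1]), hidden_states[-n_c:]


def retrieve_top_z(query_state, keys, z):
    """Return indices of the z keys most similar to the query (cosine)."""
    q = l2_normalize(query_state)
    # Keys are unit-norm, so the dot product equals cosine similarity.
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    return sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:z]


# Toy usage: three "chunks" with 2-dimensional hidden states.
chunks = [
    [[0.0, 1.0], [1.0, 0.0]],  # last state points along x
    [[1.0, 0.0], [0.0, 2.0]],  # last state points along y
    [[0.5, 0.5], [3.0, 3.0]],  # last state points along the diagonal
]
bank = [build_entry(h) for h in chunks]
keys = [key for key, _ in bank]
top = retrieve_top_z([2.0, 0.1], keys, z=2)
```

Because the keys are normalized once offline, each query costs only one normalization and $N$ dot products in this sketch\.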
##### Latent projector $P_{\psi}$\.
The latent projector maps selected cached hidden states into the reasoner’s input embedding space\. We implement $P_{\psi}$ as a two\-layer MLP with GELU activation and layer normalization:

$$P_{\psi}(\widetilde{C}_{j})=\operatorname{LayerNorm}\left(W_{2}\,\operatorname{GELU}(W_{1}\widetilde{C}_{j})\right), \tag{26}$$

where $W_{1}$ maps $\mathbb{R}^{d}$ to an intermediate dimension $d'=2048$, and $W_{2}$ maps $\mathbb{R}^{d'}$ back to $\mathbb{R}^{d}$\. The projector is applied independently to each selected latent state\. It contains approximately 10\.5M trainable parameters for the 3B backbone and 18\.4M trainable parameters for the 7B backbone\.
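To make Eq\. 26 concrete, the following is a small forward\-pass sketch in plain Python\. It assumes toy dimensions, random weights, and a LayerNorm without learned gain or bias; none of these choices come from the paper, and `project` is a hypothetical name\.

```python
import math
import random


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matvec(weight, vec):
    """Multiply a (rows x cols) matrix stored as nested lists by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def layer_norm(vec, eps=1e-5):
    """LayerNorm without learned gain/bias (an assumption made for brevity)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def project(state, w1, w2):
    """P(c) = LayerNorm(W2 GELU(W1 c)) for a single cached hidden state c."""
    hidden = [gelu(h) for h in matvec(w1, state)]
    return layer_norm(matvec(w2, hidden))


# Toy sizes; the text uses d' = 2048 and the backbone's hidden size for d.
d, d_prime = 4, 8
rng = random.Random(0)
w1 = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d_prime)]
w2 = [[rng.gauss(0, 1) for _ in range(d_prime)] for _ in range(d)]

# Three selected cached states; the projector is applied to each independently.
cache = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(3)]
soft_tokens = [project(c, w1, w2) for c in cache]
```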
##### Budget policy $\pi_{\phi}$\.
The budget allocator is a lightweight Transformer encoder operating over the retrieved memory set\. Following Eq\. [13](https://arxiv.org/html/2605.30690#S3.E13), we use hidden dimension $d_{b}=256$, $4$ attention heads, $2$ Transformer layers, and dropout $0.1$\. The per\-chunk maximum budget is $B_{\max}=20$\. Thus, each retrieved chunk receives a discrete latent budget $b_{j}\in\{0,\ldots,20\}$ as defined in Eq\. [11](https://arxiv.org/html/2605.30690#S3.E11)\. The action $b_{j}=0$ allows the model to suppress a retrieved memory completely\. The budget policy contains approximately 2\.6M–3\.4M trainable parameters depending on the backbone size\.
##### GRPO training\.
We jointly optimize the LoRA\-adapted reasoner $\theta$, latent projector $P_{\psi}$, and budget policy $\pi_{\phi}$ with group\-relative policy optimization \[[35](https://arxiv.org/html/2605.30690#bib.bib29)\]\. We use group size $G=4$, clipping range $\epsilon=0.05$, and sampling temperature $\tau=1.0$ for trajectory generation\. The reward function $R(y,y^{\star})$ is task\-specific: exact\-match accuracy for MemorySuite\-QA tasks and binary task success for ALFWorld\. We use AdamW \[[27](https://arxiv.org/html/2605.30690#bib.bib70)\] with learning rate $2\times 10^{-5}$, weight decay $0.01$, and a cosine annealing schedule\. All tasks are trained for $10$ epochs with BF16 mixed precision\. The maximum generation length is set to $5$ tokens for QA tasks and $16$ tokens for ALFWorld\.
During evaluation, we use greedy decoding\. The retrieval\-control token is selected by argmax, budget allocation uses the most likely budget action under $\pi_{\phi}$, and answer generation is decoded greedily\.
##### Task\-specific prompts\.
For PersonaMem\-32K and PersonaMem\-128K, the task instruction $u$ constrains the model to output exactly one of four options: “\(a\)”, “\(b\)”, “\(c\)”, or “\(d\)”\. For LoCoMo and LongMemEval, the instruction enforces selection from ten options “\(a\)” through “\(j\)” and explicitly prohibits refusal responses\. For ALFWorld, the instruction directs the agent to output a single admissible command string verbatim\. The final model input follows Eq\. [18](https://arxiv.org/html/2605.30690#S3.E18):

$$X=[E(q);E(a);Z_{1};\ldots;Z_{Z};E(u)], \tag{27}$$

where $q$ is the query, $a$ is the retrieval\-control token, $Z_{j}$ are the projected soft memory tokens, and $u$ contains the task instruction and answer options, or the interaction history and admissible actions for ALFWorld\.
##### Token counting\.
For QA tasks, \#Tok\. denotes the number of tokens consumed by the task input, retrieved memory interface, and generated answer\. For ALFWorld, \#Tok\. is computed per game and includes all tokens consumed across interaction turns\. For text\-space baselines, retrieved memories are counted as natural\-language prompt tokens\. For ElasticMem and other latent\-space methods, soft memory tokens are counted as latent memory tokens, which reflects the effective memory capacity used by the model\.
##### Compute\.
All experiments are conducted on NVIDIA RTX A6000 GPUs with 48GB memory\. Training a single task\-specific model takes approximately 6–18 hours on one GPU, depending on dataset size and backbone scale\. Offline memory bank construction takes approximately 10–60 minutes per dataset, since each memory chunk is encoded once and reused throughout training and inference\.
## Appendix C Training Procedure of ElasticMem
Algorithm [1](https://arxiv.org/html/2605.30690#alg1) summarizes the complete training procedure of ElasticMem\. The procedure contains an offline memory\-bank construction phase and an online reward\-based optimization phase\. The offline phase encodes each memory chunk once and keeps the resulting memory bank fixed throughout training\. The online phase samples multiple memory\-use trajectories for each query, evaluates them with task rewards, and updates the trainable components using the GRPO objective\.
Algorithm 1 Training ElasticMem

1. Input: memory corpus $\mathcal{M}=\{m_{i}\}_{i=1}^{N}$, training queries $\mathcal{Q}$, reward function $R$, group size $G$, retrieved chunk number $Z$, per\-chunk maximum budget $B_{\max}$, clipping range $\epsilon$
2. Output: trained parameters $\theta$, $\psi$, $\phi$
3. // Offline phase: construct latent memory bank
4. for each memory chunk $m_{i}\in\mathcal{M}$ do
5. $H_{i}\leftarrow f_{\theta_{0}}(m_{i})$
6. $k_{i}\leftarrow H_{i}[-1]$ $\triangleright$ retrieval key
7. $C_{i}\leftarrow H_{i}[-N_{c}:]$ $\triangleright$ content cache
8. end for
9. $\mathcal{B}\leftarrow\{(m_{i},k_{i},C_{i})\}_{i=1}^{N}$
10. // Online phase: reward\-based optimization
11. for each training step do
12. Sample query\-target pair $(q,y^{\star})\sim\mathcal{Q}$
13. for $g=1,\ldots,G$ do
14. $a_{g}\sim p_{\theta}(\cdot\mid q)$
15. $h_{q,g}\leftarrow f_{\theta}(q,a_{g})[-1]$
16. $s_{i}\leftarrow\cos(h_{q,g},k_{i})$ for all $i\in[N]$
17. $I_{g}\leftarrow\operatorname{TopZ}_{i\in[N]}s_{i}$
18. Construct retrieved\-memory features $\{r_{g,j}\}_{j=1}^{Z}$ using Eq\. [12](https://arxiv.org/html/2605.30690#S3.E12)
19. $o_{g,1},\ldots,o_{g,Z}\leftarrow\operatorname{Transformer}_{\phi}(r_{g,1},\ldots,r_{g,Z})$
20. for $j=1,\ldots,Z$ do
21. Compute budget logits $\ell_{g,j}$ using Eq\. [14](https://arxiv.org/html/2605.30690#S3.E14)
22. Sample $b_{g,j}\sim\operatorname{Cat}(\operatorname{softmax}(\ell_{g,j}))$
23. if $b_{g,j}>0$ then
24. $\widetilde{C}_{g,j}\leftarrow C_{I_{g,j}}[-b_{g,j}:]$
25. $Z_{g,j}\leftarrow P_{\psi}(\widetilde{C}_{g,j})$
26. else
27. $Z_{g,j}\leftarrow\emptyset$
28. end if
29. end for
30. $\mathbf{b}_{g}\leftarrow(b_{g,1},\ldots,b_{g,Z})$
31. $X_{g}\leftarrow[E(q);E(a_{g});Z_{g,1};\cdots;Z_{g,Z};E(u)]$
32. $y_{g}\sim p_{\theta}(\cdot\mid X_{g})$
33. $r_{g}\leftarrow R(y_{g},y^{\star})$
34. $\tau_{g}\leftarrow(a_{g},I_{g},\mathbf{b}_{g},y_{g})$
35. end for
36. Compute advantages $\{A_{g}\}_{g=1}^{G}$ using Eq\. [22](https://arxiv.org/html/2605.30690#S3.E22)
37. Update $\theta,\psi,\phi$ by minimizing Eq\. [23](https://arxiv.org/html/2605.30690#S3.E23)
38. end for
Offline memory\-bank construction\. Lines 3–9 construct the latent memory bank\. Each memory chunk $m_{i}$ is encoded once by the frozen base LLM encoder $f_{\theta_{0}}$\. The final hidden state $H_{i}[-1]$ is stored as the retrieval key $k_{i}$, while the last $N_{c}$ hidden states $H_{i}[-N_{c}:]$ are stored as the content cache $C_{i}$\. The resulting memory bank $\mathcal{B}$ is fixed during training, so later updates only affect the trainable reasoner, budget policy, and projector\.
Trajectory sampling\. Lines 11–22 describe how ElasticMem samples memory\-use trajectories\. For each query\-target pair, the model samples $G$ trajectories\. In each trajectory, the reasoner first samples a retrieval\-control token $a_{g}$ and derives the query state $h_{q,g}$\. The query state retrieves the top\-$Z$ memory chunks by cosine similarity with the cached keys\. The budget policy then processes retrieved\-memory features and samples a discrete latent budget $b_{g,j}\in\{0,\ldots,B_{\max}\}$ for each retrieved chunk independently\.
Latent memory injection\. Lines 23–31 construct the soft memory tokens\. If $b_{g,j}>0$, ElasticMem selects the last $b_{g,j}$ cached states from $C_{I_{g,j}}$ and maps them through the latent projector $P_{\psi}$\. If $b_{g,j}=0$, the retrieved chunk contributes no latent tokens\. The final model input is formed by concatenating the query embedding, retrieval\-control token embedding, selected soft memory tokens, and task instruction\.
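The budgeted selection and input assembly can be illustrated with a short Python sketch\. It is a hedged illustration only: strings stand in for embedding vectors, the `project` callable stands in for $P_{\psi}$, and `select_latents` and `assemble_input` are hypothetical names\.

```python
def select_latents(cache, budget):
    """Return the last `budget` cached states; a budget of 0 yields no tokens."""
    if budget <= 0:
        return []  # note: cache[-0:] would return the whole cache instead
    return cache[-budget:]


def assemble_input(query_emb, control_emb, caches, budgets, instruction_emb, project):
    """Concatenate [E(q); E(a); Z_1; ...; Z_Z; E(u)] as one flat sequence."""
    soft_tokens = []
    for cache, budget in zip(caches, budgets):
        soft_tokens.extend(project(state) for state in select_latents(cache, budget))
    return query_emb + [control_emb] + soft_tokens + instruction_emb


# Toy usage: two retrieved chunks, the second one suppressed with budget 0.
caches = [["m0_s0", "m0_s1", "m0_s2"], ["m1_s0", "m1_s1", "m1_s2"]]
budgets = [2, 0]
model_input = assemble_input(
    ["q0", "q1"], "a", caches, budgets, ["u0"], project=lambda s: f"P({s})"
)
```

The explicit zero\-budget branch matters in Python, since slicing with `-0` selects the entire cache rather than nothing\.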
Reward\-based update\. Lines 32–37 evaluate sampled trajectories and update the model\. Each trajectory receives a task reward $R(y_{g},y^{\star})$\. The rewards within the group are normalized into advantages using Eq\. [22](https://arxiv.org/html/2605.30690#S3.E22)\. Finally, the LoRA\-adapted reasoner $\theta$, latent projector $\psi$, and budget policy $\phi$ are updated by minimizing the GRPO objective in Eq\. [23](https://arxiv.org/html/2605.30690#S3.E23)\.
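As an illustration of group normalization, the sketch below uses the common GRPO form, in which each reward is centered by the group mean and scaled by the group standard deviation\. Eq\. 22 is not reproduced in this appendix, so the exact form \(population standard deviation, the small constant `eps`\) is an assumption, and `group_relative_advantages` is a hypothetical name\.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Center rewards by the group mean and scale by the group std (assumed form)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# A group of G = 4 trajectories with binary rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every trajectory receives the same reward, all advantages vanish.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Under this form, a group whose trajectories all succeed or all fail contributes no learning signal, which is one reason binary rewards are usually paired with sampling at nonzero temperature\.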
## Appendix D Dataset Descriptions
We describe all evaluation datasets in MemorySuite below, categorized by their corresponding subsets\. MemorySuite consists of two complementary groups: MemorySuite\-QA and MemorySuite\-Agentic\. MemorySuite\-QA evaluates long\-term memory recall and reasoning in question\-answering settings, while MemorySuite\-Agentic evaluates memory\-augmented agentic decision\-making in embodied interactive household environments\. For all datasets, we report both task performance and the number of consumed tokens \(\#Tok\.\) to assess memory efficiency\.
Table 3: Detailed summary of datasets used in MemorySuite\. We categorize datasets by subset, task type, and evaluation metric\.

| Subset | Dataset | Task | Metric |
| --- | --- | --- | --- |
| MemorySuite\-QA | PersonaMem\-32K | Persona\-based Memory QA | Accuracy / \#Tok\. |
| MemorySuite\-QA | PersonaMem\-128K | Persona\-based Memory QA | Accuracy / \#Tok\. |
| MemorySuite\-QA | LoCoMo | Long\-Conversation Memory QA | Accuracy / \#Tok\. |
| MemorySuite\-QA | LongMemEval | Long\-Term Interactive Memory QA | Accuracy / \#Tok\. |
| MemorySuite\-Agentic | ALFWorld\-Seen | Seen Household Task Completion | SR / \#Tok\. |
| MemorySuite\-Agentic | ALFWorld\-Unseen | Unseen Household Task Completion | SR / \#Tok\. |

### D\.1 MemorySuite\-QA
PersonaMem \[[17](https://arxiv.org/html/2605.30690#bib.bib1)\] is a personalization benchmark designed to assess whether language models can infer evolving user profiles and generate personalized responses across multi\-session interactions\. The benchmark is constructed via a synthetic dialogue generation pipeline that simulates realistic, temporally evolving conversations between users and chatbots, grounded in over 1,000 distinct user personas spanning diverse demographic attributes, preferences, and backgrounds\. Each persona is associated with a long chat history in which user preferences are implicitly revealed across a series of task\-oriented interactions\. We evaluate on the PersonaMem\-32K and PersonaMem\-128K variants, which differ in the maximum context length of the provided conversation history\. Each question is posed as a 4\-way multiple\-choice problem \(MC\-4\), requiring the model to select the response that best aligns with the user’s implicitly expressed persona\.
LoCoMo \[[28](https://arxiv.org/html/2605.30690#bib.bib10)\] is a benchmark for evaluating very long\-term conversational memory, constructed via a machine–human pipeline that generates multi\-session dialogues between LLM\-based virtual agents grounded on persona descriptions and temporal event graphs\. Each conversation spans up to 32 sessions with an average of 600 turns and 16K tokens\. Human annotators verify and edit the generated dialogues for long\-range consistency and causal coherence\. We adopt the question answering subset, where questions require temporal and causal reasoning over long\-range dialogue history and are posed in a 10\-way multiple\-choice format \(MC\-10; [https://huggingface\.co/datasets/Percena/locomo\-mc10](https://huggingface.co/datasets/Percena/locomo-mc10)\), probing the model’s ability to track evolving personal information across sessions\.
LongMemEval \[[44](https://arxiv.org/html/2605.30690#bib.bib11)\] is a comprehensive benchmark evaluating five core long\-term memory abilities of chat assistants: information extraction, multi\-session reasoning, temporal reasoning, knowledge updates, and abstention\. It comprises 500 carefully curated questions embedded in freely extensible, timestamped user–assistant chat histories, supporting context lengths ranging from 128K tokens \(longmemeval\_s\) to approximately 1\.5M tokens \(longmemeval\_m\)\. Questions are posed in a 10\-way multiple\-choice format \(MC\-10; [https://huggingface\.co/datasets/Percena/lme\-mc10](https://huggingface.co/datasets/Percena/lme-mc10)\)\.
### D\.2 MemorySuite\-Agentic
ALFWorld\-Seen \[[38](https://arxiv.org/html/2605.30690#bib.bib12)\] is a text\-based interactive household benchmark where agents interpret natural\-language observations and issue sequential actions to achieve household tasks\. The Seen split \(valid\_seen\) contains 140 episodes across six task types \(e\.g\., pick\-and\-place, pick\-heat\-then\-place, look\-at\-in\-light\)\. Room layouts overlap with training, so agents face familiar spatial configurations but novel task instances, isolating instruction\-level generalization\.
ALFWorld\-Unseen \[[38](https://arxiv.org/html/2605.30690#bib.bib12)\] includes the same six task types but evaluates on valid\_unseen, comprising 134 episodes in entirely new room layouts with unseen object arrangements, providing a stricter out\-of\-distribution test\. For both splits, we report task success rate \(SR\) and average number of consumed tokens \(\#Tok\.\)\.
##### ALFWorld training setup\.
Unlike the QA tasks where each training sample is a single question–answer pair, ALFWorld requires learning from multi\-step expert trajectories in an interactive environment\. We adopt an imitation\-learning–based GRPO formulation with a two\-stage data pipeline\.
Stage 1: Memory bank construction\. We start from 3,553 expert trajectories collected via an oracle planner on the ALFWorld training environments, of which 3,541 are successful\. We split these into a *memory pool* $\mathcal{M}$ \(80%, 2,832 trajectories\) and an *IL pool* $\mathcal{T}$ \(20%, 709 trajectories\), using a fixed seed for reproducibility\. Each trajectory in the memory pool is summarized into a procedural *skill card* using an LLM \(Gemini\), following a structured prompt that asks for a 150–250 word procedural description including the task category, a generalized step\-by\-step strategy, and typical object–receptacle associations\. This produces 2,832 skill cards \(634 pick\-two, 623 pick\-and\-place, 534 clean\-then\-place, 431 cool\-then\-place, 368 heat\-then\-place, 242 look\-at\-in\-light\)\. Each skill card is encoded by the frozen base LLM encoder $f_{\theta_{0}}$ into a retrieval key and content cache, forming the latent memory bank $\mathcal{B}$ with $N=2{,}832$ entries and $Z=10$ retrieved cards per query\.
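The pool sizes above can be reproduced with a seeded shuffle followed by a floored 80% cut\. The sketch below is an assumption about the mechanics: the seed value, the placeholder trajectory IDs, and the `split_pools` helper are all hypothetical\.

```python
import random


def split_pools(trajectories, memory_fraction=0.8, seed=0):
    """Shuffle with a fixed seed, then split into (memory pool, IL pool)."""
    shuffled = list(trajectories)
    random.Random(seed).shuffle(shuffled)
    cut = int(memory_fraction * len(shuffled))  # floor of 80%
    return shuffled[:cut], shuffled[cut:]


successful = [f"traj_{i}" for i in range(3541)]  # placeholder IDs
memory_pool, il_pool = split_pools(successful)

# Per-category skill-card counts reported in the text.
skill_card_counts = {
    "pick-two": 634,
    "pick-and-place": 623,
    "clean-then-place": 534,
    "cool-then-place": 431,
    "heat-then-place": 368,
    "look-at-in-light": 242,
}
total_cards = sum(skill_card_counts.values())
```

Flooring 0\.8 × 3,541 gives 2,832, which matches both the reported memory\-pool size and the sum of the per\-category skill\-card counts\.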
Stage 2: Per\-step imitation with GRPO\. Each trajectory in the IL pool $\mathcal{T}$ is expanded into per\-step training samples\. At step $t$ of a trajectory, the training sample consists of:
- Query $q$: the task objective \(e\.g\., “heat some apple and put it in countertop”\), which remains fixed across all steps of the same episode;
- Task suffix $u$: a rendered prompt containing the truncated interaction history \(last 5 action–observation pairs\), the current observation, the list of admissible actions, and the system instruction directing the agent to output one admissible command verbatim;
- Correct answer $y^{\star}$: the expert’s action at step $t$, JSON\-encoded together with the admissible action list to enable snap\-to\-admissible reward computation\.
This expansion produces 4,260 per\-step samples from 709 trajectories\. The GRPO reward \(Eq\. [21](https://arxiv.org/html/2605.30690#S3.E21)\) uses a strict deployment\-aligned criterion: the model’s generated action is first snapped to the nearest admissible command \(by normalized string matching\), and the reward is $1.0$ if the snapped action exactly matches the expert action, and $0.0$ otherwise\. This eliminates the train–test gap that partial\-credit reward functions would introduce\.
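The per\-step expansion and the snap\-to\-admissible reward can be sketched as follows\. This is a minimal illustration, not the authors’ code: the similarity measure \(`difflib.SequenceMatcher`\), the normalization rule, the JSON field names, and the layout of history entries are all assumptions\.

```python
import difflib
import json

HISTORY_WINDOW = 5  # last 5 action-observation pairs, as stated in the text


def normalize(text):
    """Lowercase and collapse whitespace before comparing commands."""
    return " ".join(text.lower().split())


def snap_to_admissible(generated, admissible):
    """Map a generated string to the most similar admissible command."""
    target = normalize(generated)
    return max(
        admissible,
        key=lambda cmd: difflib.SequenceMatcher(None, target, normalize(cmd)).ratio(),
    )


def step_reward(generated, answer_json):
    """Return 1.0 only if the snapped action equals the expert action."""
    answer = json.loads(answer_json)
    snapped = snap_to_admissible(generated, answer["admissible"])
    return 1.0 if snapped == answer["expert_action"] else 0.0


def expand_trajectory(objective, steps):
    """Turn one expert trajectory into per-step training samples."""
    samples, history = [], []
    for step in steps:
        samples.append({
            "query": objective,                    # fixed across the episode
            "history": history[-HISTORY_WINDOW:],  # truncated interaction history
            "observation": step["observation"],
            "admissible": step["admissible"],
            "answer": json.dumps({
                "expert_action": step["action"],
                "admissible": step["admissible"],
            }),
        })
        history.append({"observation": step["observation"], "action": step["action"]})
    return samples


steps = [
    {"observation": "You are in the kitchen.",
     "admissible": ["go to fridge 1", "go to countertop 1"],
     "action": "go to fridge 1"},
    {"observation": "The fridge 1 is closed.",
     "admissible": ["open fridge 1", "go to countertop 1"],
     "action": "open fridge 1"},
]
samples = expand_trajectory("heat some apple and put it in countertop", steps)
reward_hit = step_reward("Open  Fridge 1", samples[1]["answer"])
reward_miss = step_reward("go to countertop 1", samples[1]["answer"])
```

Storing the admissible list next to the expert action keeps the reward self\-contained, so it can be computed without access to the environment\.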
Evaluation protocol\. At test time, the agent interacts with the ALFWorld environment autoregressively for up to 30 steps per episode\. At each step, the agent retrieves $Z=10$ skill cards from the shared memory bank $\mathcal{B}$ using the task objective as the query, allocates elastic latent budgets via the trained policy $\pi_{\phi}$, and generates an action conditioned on the soft memory tokens, interaction history \(last 5 steps\), current observation, and admissible actions\. The generated action is snapped to the nearest admissible command before execution\. An episode is considered successful if the agent achieves a cumulative reward of $\geq 1.0$ before reaching the step limit\.
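The control flow of this protocol can be sketched with a toy environment\. Everything below is hypothetical scaffolding: `ToyEnv` is not the ALFWorld API, the policies replace the retrieval, budget allocation, and generation stages with trivial rules, and the fallback to the first admissible command is only a stand\-in for snapping\.

```python
MAX_STEPS = 30      # step limit per episode, as stated in the text
HISTORY_WINDOW = 5  # last 5 steps of interaction history


def run_episode(env, policy, objective):
    """Roll out one episode; succeed if cumulative reward reaches 1.0 in time."""
    observation, admissible = env.reset()
    history, total_reward = [], 0.0
    for _ in range(MAX_STEPS):
        action = policy(objective, history[-HISTORY_WINDOW:], observation, admissible)
        if action not in admissible:  # stand-in for snap-to-admissible
            action = admissible[0]
        history.append((observation, action))
        observation, admissible, reward, done = env.step(action)
        total_reward += reward
        if total_reward >= 1.0:
            return True
        if done:
            break
    return False


class ToyEnv:
    """Hypothetical two-step environment used only to exercise the loop."""

    def reset(self):
        self.opened = False
        return "The fridge 1 is closed.", ["open fridge 1", "look"]

    def step(self, action):
        if action == "open fridge 1" and not self.opened:
            self.opened = True
            return "The fridge 1 is open.", ["take apple 1", "look"], 0.0, False
        if action == "take apple 1" and self.opened:
            return "You take the apple 1.", ["look"], 1.0, True
        options = ["take apple 1", "look"] if self.opened else ["open fridge 1", "look"]
        return "Nothing happens.", options, 0.0, False


def first_action_policy(objective, history, observation, admissible):
    """Always pick the first admissible command."""
    return admissible[0]


def look_policy(objective, history, observation, admissible):
    """Never make progress; used to hit the step limit."""
    return "look"


solved = run_episode(ToyEnv(), first_action_policy, "take the apple")
stuck = run_episode(ToyEnv(), look_policy, "take the apple")
```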
## Appendix E Dataset Statistics
In this section, we present detailed statistics for each dataset\. The statistics for MemorySuite\-QA and MemorySuite\-Agentic are provided in Table [4](https://arxiv.org/html/2605.30690#A5.T4) and Table [5](https://arxiv.org/html/2605.30690#A5.T5), respectively\. For MemorySuite\-QA, PersonaMem variants are split 80/10/10 by shared\_context\_id; LoCoMo is partitioned by conversation index \(first 6/middle 2/last 2 conversations\); and LongMemEval follows a 60/20/20 random split by question\. For MemorySuite\-Agentic, we adopt the official splits: ALFWorld provides separate Seen \(valid\_seen\) and Unseen \(valid\_unseen\) test environments, both sharing the same training set of expert demonstrations\. For each dataset, we additionally report $Z$, the number of top\-$Z$ memory chunks retrieved and supplied to the model during inference, which directly governs the memory retrieval budget and token consumption\.
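Table 4: MemorySuite\-QA Data Statistics\. $Z$ denotes the number of top\-$Z$ memory chunks retrieved per query\.

| Dataset | Train | Val | Test | Total | $Z$ |
| --- | --- | --- | --- | --- | --- |
| PersonaMem\-32K | 489 | 50 | 50 | 589 | 9 |
| PersonaMem\-128K | 2,221 | 273 | 233 | 2,727 | 21 |
| LoCoMo | 885 | 341 | 314 | 1,540 | 20 |
| LongMemEval | 300 | 100 | 100 | 500 | 20 |

Table 5: MemorySuite\-Agentic Data Statistics\. $Z$ denotes the number of top\-$Z$ memory chunks retrieved per agent step\.

| Split | ALFWorld\-Seen | ALFWorld\-Unseen | $Z$ |
| --- | --- | --- | --- |
| Train | 3,553 | 3,553 | 10 |
| Test | 140 | 134 | 10 |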
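The LoCoMo and LongMemEval partitions can be expressed in a few lines of Python\. This sketch assumes ten LoCoMo conversations \(implied by the 6/2/2 partition\), integer IDs as placeholders, and an arbitrary seed; `split_by_conversation` and `random_split` are hypothetical names\.

```python
import random


def split_by_conversation(conversation_ids):
    """First 6 / middle 2 / last 2 conversations -> train / val / test."""
    ordered = sorted(conversation_ids)
    return ordered[:6], ordered[6:8], ordered[8:10]


def random_split(items, percents=(60, 20, 20), seed=0):
    """Random train/val/test split by item; the seed is an assumption."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * percents[0] // 100
    n_val = len(shuffled) * percents[1] // 100
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


locomo_train, locomo_val, locomo_test = split_by_conversation(range(10))
lme_train, lme_val, lme_test = random_split(range(500))
```

Splitting LoCoMo by conversation rather than by question keeps every question about a given dialogue in a single partition\.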
## Appendix F Prompt Usage
This section describes the prompt templates used by all memory baselines across our evaluation datasets\. The baselines differ in how they *build* memory from conversation sessions; details for each baseline’s memory\-construction prompts are provided in Section [F\.1](https://arxiv.org/html/2605.30690#A6.SS1)\. Each dataset employs a distinct answering prompt: LoCoMo and LongMemEval use 10\-choice multiple\-choice prompts \(Sections [F\.3](https://arxiv.org/html/2605.30690#A6.SS3) and [F\.4](https://arxiv.org/html/2605.30690#A6.SS4)\), PersonaMem\-32K and PersonaMem\-128K use 4\-choice prompts \(Section [F\.2](https://arxiv.org/html/2605.30690#A6.SS2)\), and ALFWorld uses a task\-specific prompt described in Section [F\.5](https://arxiv.org/html/2605.30690#A6.SS5)\.
### F\.1 Shared Memory Construction Prompts
The following memory\-construction prompts are shared across all evaluation datasets\. Each baseline uses the same prompts for building its memory store, regardless of whether the downstream task is LongMemEval, LoCoMo, PersonaMem or ALFWorld\.
#### Mem0
Mem0 \[[5](https://arxiv.org/html/2605.30690#bib.bib16)\] runs a two\-step pipeline per session turn\. Step 1 \(Table [6](https://arxiv.org/html/2605.30690#A6.T6)\) extracts factual statements from the conversation as a JSON list\. Step 2 \(Table [7](https://arxiv.org/html/2605.30690#A6.T7)\) compares each new fact against existing entries in a FAISS store and decides whether to ADD, UPDATE, DELETE, or make no change \(NONE\)\. At inference, the top\-$k$ stored facts \(with timestamps\) are retrieved by cosine similarity and prepended as a bullet list\.
Table 6: Mem0: Fact Extraction Prompt \(Step 1 of memory construction\)\.

You are a Personal Information Organizer, specialized in accurately storing facts, user memories, and preferences\. Your primary role is to extract relevant pieces of information from conversations and organize them into distinct, manageable facts\.

Types of Information to Remember:

1. Store Personal Preferences: Keep track of likes, dislikes, and specific preferences\.
2. Maintain Important Personal Details: Remember significant personal information like names, relationships, and important dates\.
3. Track Plans and Intentions: Note upcoming events, trips, goals, and any plans the user has shared\.
4. Monitor Health and Wellness Preferences: Keep a record of dietary restrictions, fitness routines, and other wellness\-related information\.

\[… other categories omitted for brevity …\]

Here are some few shot examples:

Input: Hi, my name is John\. I am a software engineer\.

Output: \{"facts" : \["Name is John", "Is a Software engineer"\]\}

Return the facts and preferences in a json format as shown above\.

Remember the following:

- Today’s date is \{today\}\.
- Do not return anything from the custom few shot example prompts provided above\.
- Create the facts based on the user and assistant messages only\.
- Make sure to return the response in the format mentioned in the examples\.

Following is a conversation between the user and the assistant\. Extract the relevant facts and preferences about the user, if any\.

Table 7: Mem0: Memory Update Decision Prompt \(Step 2 of memory construction\)\.

You are a smart memory manager which controls the memory of a system\.

You can perform four operations: \(1\) add into the memory, \(2\) update the memory, \(3\) delete from the memory, and \(4\) no change\.

Compare newly retrieved facts with the existing memory\. For each new fact, decide whether to:

- ADD: Add it to the memory as a new element
- UPDATE: Update an existing memory element
- DELETE: Delete an existing memory element
- NONE: Make no change if the fact is already present or irrelevant

\[… guidelines for each operation with examples omitted for brevity …\]

You must return your response in the following JSON structure only:

\{"memory" : \[\{"id" : "<ID\>", "text" : "<Content\>", "event" : "<ADD\|UPDATE\|DELETE\|NONE\>"\}\]\}
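To show how a response in the JSON structure of Table 7 could be consumed, here is a hedged Python sketch that applies the four events to a simple dictionary store\. It is not Mem0’s implementation: the dictionary stands in for the FAISS\-backed store, and `apply_memory_events` is a hypothetical name\.

```python
import json


def apply_memory_events(store, response_json):
    """Apply ADD / UPDATE / DELETE / NONE events to a dict mapping id -> text."""
    for item in json.loads(response_json)["memory"]:
        event, mem_id = item["event"], item["id"]
        if event in ("ADD", "UPDATE"):
            store[mem_id] = item["text"]
        elif event == "DELETE":
            store.pop(mem_id, None)
        elif event != "NONE":
            raise ValueError(f"unknown event: {event}")
    return store


store = {"0": "Name is John", "1": "Likes pizza", "2": "Lives in Paris"}
response = json.dumps({
    "memory": [
        {"id": "0", "text": "Name is John", "event": "NONE"},
        {"id": "1", "text": "Loves cheese pizza", "event": "UPDATE"},
        {"id": "2", "text": "Lives in Paris", "event": "DELETE"},
        {"id": "3", "text": "Is a software engineer", "event": "ADD"},
    ]
})
updated = apply_memory_events(store, response)
```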
#### LangMem
LangMem uses a single LLM call per session to extract personal facts in a “FACT:” prefixed format \(Table [8](https://arxiv.org/html/2605.30690#A6.T8)\)\. Facts are stored in a LangGraph InMemoryStore with date prefixes and retrieved by embedding similarity at query time\.

Table 8: LangMem: Fact Extraction Prompt\.

Extract all personal facts about the user from this conversation\. Output one fact per line using the format "FACT: <content\>"\. Be concise but complete\. Only output FACT lines, nothing else\.

Conversation: \{conversation\}

Facts:
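A response in this line\-based format can be parsed with a few lines of Python\. The sketch below is an assumption about post\-processing, not LangMem’s code: `parse_facts` is a hypothetical helper, and tolerating a lowercase prefix is a design choice made here for robustness\.

```python
def parse_facts(llm_output):
    """Collect the content of every non-empty 'FACT:' line."""
    facts = []
    for line in llm_output.splitlines():
        line = line.strip()
        if line.upper().startswith("FACT:"):
            content = line[len("FACT:"):].strip()
            if content:
                facts.append(content)
    return facts


raw = "FACT: User's name is John\nnot a fact\nfact: Works as a software engineer\nFACT:   \n"
facts = parse_facts(raw)
```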
#### MemoryBank
MemoryBank \[[56](https://arxiv.org/html/2605.30690#bib.bib13)\] creates two levels of summary per session\. A content summary \(Table [9](https://arxiv.org/html/2605.30690#A6.T9)\) captures the key events of the dialogue, while a personality summary \(Table [10](https://arxiv.org/html/2605.30690#A6.T10)\) records the user’s traits and suggests response strategies\. Globally, these are aggregated into an overall history and overall personality summary that persist across sessions\. At inference, the most relevant per\-session summary is retrieved by FAISS cosine similarity and concatenated with the personality and overall\-history blocks\.

Table 9: MemoryBank: Content Summary Prompt \(applied to each session\)\.

Please summarize the following dialogue as concisely as possible, extracting the main themes and key information\. If there are multiple key events, you may summarize them separately\.

Dialogue content: \{dialogue\}

Summarization:

Table 10: MemoryBank: Personality and Strategy Summary Prompt\.

Based on the following dialogue, please summarize \{user\_name\}’s personality traits and emotions, and devise response strategies based on your speculation\.

Dialogue content: \{dialogue\}

\{user\_name\}’s personality traits, emotions, and \{boot\_name\}’s response strategy are:
#### MemoryOS
MemoryOS \[[19](https://arxiv.org/html/2605.30690#bib.bib17)\] organizes memory in three tiers: short\-term \(STM, within\-session\), mid\-term \(MTM, cross\-episode topic segments\), and long\-term \(LPM, persistent profile\)\. When the STM overflows, three prompts fire in sequence: \(1\) a continuity check \(Table [11](https://arxiv.org/html/2605.30690#A6.T11)\) to decide whether to merge or split page boundaries, \(2\) a meta\-info update \(Table [12](https://arxiv.org/html/2605.30690#A6.T12)\) to maintain a running dialogue summary, and \(3\) a multi\-summary \(Table [13](https://arxiv.org/html/2605.30690#A6.T13)\) to extract subtopic clusters that form MTM segments\. At inference, the most similar MTM pages are retrieved by cosine similarity and formatted as \[obs\]/\[action\] episodes\.

Table 11: MemoryOS: Continuity Check Prompt\.

Determine if these two conversation pages are continuous \(true continuation without topic shift\)\. Return ONLY "true" or "false"\.

Previous Page:

User: \{prev\_user\}

A: \{prev\_agent\}

Current Page:

User: \{curr\_user\}

A: \{curr\_agent\}

Continuous?

Table 12: MemoryOS: Meta\-Info Update Prompt\.

Update the conversation meta\-summary by incorporating the new dialogue while maintaining continuity\.

Guidelines:

1. Start from the previous meta\-summary \(if exists\)
2. Add/update information based on the new dialogue
3. Keep it concise \(1\-2 sentences max\)
4. Maintain context coherence

Previous Meta\-summary: \{last\_meta\}

New Dialogue: \{new\_dialogue\}

Updated Meta\-summary:

Table 13: MemoryOS: Multi\-Summary Prompt \(subtopic extraction\)\.

Please analyze the following dialogue and generate extremely concise subtopic summaries, if applicable, with a maximum of two themes\. Each summary should be very brief – just a few words for the theme and content\.

Format as JSON array:

\[\{"theme": "Brief theme", "keywords": \["key1", "key2"\], "content": "summary"\}\]

Conversation content: \{text\}
#### A\-MEM
A\-MEM \[[45](https://arxiv.org/html/2605.30690#bib.bib14)\] creates structured notes for each session\. An analysis prompt \(Table [14](https://arxiv.org/html/2605.30690#A6.T14)\) extracts keywords, context, and tags\. An evolution prompt \(Table [15](https://arxiv.org/html/2605.30690#A6.T15)\) then decides whether the new note should be linked \(STRENGTHEN\) or whether neighboring notes’ context and tags should be updated \(UPDATE\_NEIGHBOR\)\. At inference, a keyword\-expanded query is issued against the FAISS note store\.

Table 14: A\-MEM: Content Analysis Prompt\.

Analyze the following content and provide:

1. KEYWORDS: The most important keywords \(nouns, verbs, key concepts\)\. Order from most to least important\. At least three keywords\.
2. CONTEXT: One sentence summarizing the main topic, key points, and purpose\.
3. TAGS: Broad categories/themes for classification \(domain, format, type\)\. At least three tags\.

Respond using EXACTLY this format \(one section per header\):

KEYWORDS: keyword1, keyword2, keyword3, …

CONTEXT: A single sentence summarizing the content\.

TAGS: tag1, tag2, tag3, …

Content for analysis: \{content\}

Table 15: A\-MEM: Memory Evolution Decision Prompt\.

You are an AI memory evolution agent\. Analyze the new memory note and its nearest neighbors to decide if evolution is needed\.

New memory:

- Context: \{context\}
- Content: \{content\}
- Keywords: \{keywords\}

Nearest neighbor memories: \{nearest\_neighbors\_memories\}

Based on the relationships between the new memory and its neighbors, decide:

- NO\_EVOLUTION: The memory stands alone, no changes needed\.
- STRENGTHEN: The new memory should be linked to some neighbors and its tags updated\.
- UPDATE\_NEIGHBOR: The neighbors’ context/tags should be updated based on new understanding\.
- STRENGTHEN\_AND\_UPDATE: Both strengthen and update neighbors\.

Respond using EXACTLY this format:

DECISION: <one of NO\_EVOLUTION, STRENGTHEN, UPDATE\_NEIGHBOR, STRENGTHEN\_AND\_UPDATE\>

REASON: <brief explanation\>
#### LightMem
LightMem \[[9](https://arxiv.org/html/2605.30690#bib.bib15)\] first compresses each session with LLMLingua\-2 \(or a Jaccard\-shingle fallback\) and buffers compressed turns in a short\-term memory \(STM\)\. When the STM token budget is exceeded, it fires the extraction prompt \(Table [16](https://arxiv.org/html/2605.30690#A6.T16)\) to produce atomic facts with source identifiers\. Periodically, a consolidation pass \(Table [17](https://arxiv.org/html/2605.30690#A6.T17)\) reviews existing long\-term entries against newly extracted facts and issues update/delete/ignore decisions\. At inference, the top\-$k$ long\-term facts are retrieved by cosine similarity and formatted with ISO timestamps\.

Table 16: LightMem: Fact Extraction Prompt \(fires on STM overflow\)\.

You are a Personal Information Extractor\. Your task is to extract all possible facts or information about the user from a conversation\.

Important Instructions:

1. You MUST process every user message in order, one by one\. For each message, decide whether it contains any factual information\.
   - If yes → extract it and rephrase into a standalone sentence\.
   - If no, such as pure greeting, filler, or irrelevant remark, → skip it\.
   - Do NOT skip just because the information looks minor or unimportant\.
2. Perform light contextual completion so that each fact is a clear standalone statement\. Examples: “user: Bought apples yesterday” → “User bought apples yesterday\.”
3. Output format: \{"data": \[\{"source\_id": "<source\_id\>", "fact": "<complete fact with ALL specific details\>"\}\]\}

Table 17: LightMem: Memory Consolidation Prompt\.

You are a memory management assistant\. Your task is to decide whether the target memory should be updated, deleted, or ignored based on the candidate source memories\.

Decision rules:

1. Update: If the target and candidate memories describe essentially the same fact but are not fully consistent, update by integrating additional information\.
2. Delete: If the target and candidate memories contain a direct conflict, delete the target memory\.
3. Ignore: If unrelated, no action is needed\.

The output must be a JSON object:

\{"action": "update" \| "delete" \| "ignore", "new\_memory": \{…\} // only required when action = "update"\}
#### MeMP
MeMP \[[10](https://arxiv.org/html/2605.30690#bib.bib18)\] learns *procedural* memory: after each session it distills the conversation into a natural\-language workflow paragraph using the prompt in Table [18](https://arxiv.org/html/2605.30690#A6.T18)\. Workflows are stored in a FAISS index keyed by session query embeddings and retrieved by cosine similarity at inference, then injected as task guidelines\. On a failed attempt, the workflow is refined with a separate adjustment prompt\.
Table 18: MeMP: Workflow Generation Prompt\.

You are provided with a query and a trajectory taken to solve the query\. The trajectory consists of multiple steps of thought, action and observation\.

Your task is to generate a workflow based on critical steps to help solve similar queries in the future\.

A critical step is one that has a significant impact on fulfilling the query, the step action belongs to the set \[go to, take from, put in/on, open, close, use, clean with, heat with, cool with, examine, look\], and the action’s outcome is successful and contributes positively to achieving the query\.

Notice: Write the workflow as a natural, coherent paragraph \(not as a bullet list or numbered steps\)\. Use clear, concise language to describe what actions should be taken and in what general order\.

—–EXAMPLE WORKFLOW—\-

To solve this query, begin by identifying the most likely receptacles where the target object can be found and visit them one by one\. After locating and taking the object, perform any required transformation such as cleaning at a sinkbasin, heating with a microwave, or cooling with a fridge\. Finally, go to the destination receptacle and put the object in/on it to complete the task\.

—–EXAMPLE END—\-

Query: \{query\}

Trajectory: \{trajectory\}

Output the workflow without any explanation or context:
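The retrieval step described for MeMP, workflows keyed by query embeddings and ranked by cosine similarity, can be sketched without FAISS\. The example below is an illustrative stand\-in, not the baseline's code: it does a brute\-force scan in pure Python, and the names `cosine` and `retrieve_workflows` as well as the two\-dimensional toy embeddings are hypothetical\.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_workflows(query_vec, index, k=2):
    """Return the k workflow texts whose key embeddings best match the query.

    `index` is a list of (key_embedding, workflow_text) pairs; a FAISS index
    would replace this linear scan in a real system.
    """
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]


# Toy index: embeddings are made up for illustration.
index = [
    ([1.0, 0.0], "Find the object, then place it at the destination."),
    ([0.0, 1.0], "Heat the object with a microwave before placing it."),
    ([0.9, 0.1], "Search likely receptacles one by one."),
]
top = retrieve_workflows([1.0, 0.0], index, k=2)
print(top)
```

The retrieved texts would then be injected into the answering prompt as task guidelines\.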
#### ElasticMem
ElasticMem learns *episodic* skills from both successful and failed trajectories\. For MemorySuite\-QA, skills are extracted after each question attempt via Table [19](https://arxiv.org/html/2605.30690#A6.T19) and Table [20](https://arxiv.org/html/2605.30690#A6.T20), where successful patterns are abstracted using generic placeholders \(\[Entity\], \[Attribute\], \[Time\_Period\]\) to maximize transferability, and failure lessons diagnose the trigger condition and the erroneous reasoning step\. For ALFWorld, each completed or failed episode is distilled into a reusable skill using separate success and failure prompts \(Table [21](https://arxiv.org/html/2605.30690#A6.T21) and Table [22](https://arxiv.org/html/2605.30690#A6.T22)\)\. Successful skills capture task category, concrete step\-by\-step strategies, and typical object–location associations; failure lessons identify the specific mistake and the corrective action\. All extracted skills are stored in a retrieval index and fetched by embedding similarity at inference time\.
Table 19: ElasticMem: QA/Search Skill Extraction Prompt \(Correct Answer\)\.

system: You are an expert at distilling problem\-solving experiences into concise, reusable lessons\. Be brief and generalizable\.

user: A search/QA question was answered correctly\.

Question: \{question\}

Answer: \{answer\}

Extract a reusable search skill \(2–3 sentences\):

\- planning\_pattern: Abstract the logic using generic terms \[Entity\], \[Attribute\], \[Time\_Period\]

\- What key strategy led to success?

Do NOT include specific names, numbers, or answers\. Focus on the transferable strategy\.

Output format: SKILL: \[your skill text\]

Table 20: ElasticMem: QA/Search Skill Extraction Prompt \(Incorrect Answer\)\.

system: You are an expert at distilling problem\-solving experiences into concise, reusable lessons\. Be brief and generalizable\.

user: A search/QA question was answered incorrectly\.

Question: \{question\}

Incorrect answer: \{answer\}

Extract a reusable lesson \(2–3 sentences\):

\- trigger\_condition: What kind of question caused the error?

\- bad\_action: What went wrong?

Do NOT include specific names\. Focus on the transferable lesson\.

Output format: SKILL: \[your lesson text\]

Table 21: ElasticMem: ALFWorld Skill Extraction Prompt \(Successful Trajectory\)\.

system: You are an expert at analyzing household robot trajectories\. Extract specific, actionable lessons from the provided trajectory\.

user: An ALFWorld household task was completed successfully\.

Task: \{task\}

Full trajectory \(action → observation\): \{trajectory\}

Extract a reusable skill \(3–5 sentences\)\. Include:

1\. The general task category \(pick\_and\_place, heat\_then\_place, clean\_then\_place, cool\_then\_place, examine\_in\_light, pick\_two\)

2\. The concrete step\-by\-step strategy that worked

3\. Common locations where target objects are found \(e\.g\. “soapbar is usually on countertop, bathtubbasin, or shelf”\)

Be specific\. Use actual object/location types \(countertop, sinkbasin, microwave\)\.

Output format: SKILL: \[your skill text\]

Table 22: ElasticMem: ALFWorld Skill Extraction Prompt \(Failed Trajectory\)\.

system: You are an expert at analyzing household robot trajectories\. Extract specific, actionable lessons from the provided trajectory\.

user: An ALFWorld household task failed after all steps\.

Task: \{task\}

Trajectory: \{trajectory\}

Extract a reusable lesson \(3–5 sentences\)\. Include:

1\. The general task category

2\. What specific mistake was made

3\. What the agent should have done differently

Output format: SKILL: \[your lesson text\]
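All four extraction prompts end with the same `SKILL:` output marker, so a single parser can serve them\. The sketch below is one plausible way to post\-process such a response and is not taken from the ElasticMem codebase: `parse_skill` and `placeholders_used` are hypothetical helper names, and checking for the generic placeholders is an illustrative sanity check, not a documented step\.

```python
PLACEHOLDERS = ("[Entity]", "[Attribute]", "[Time_Period]")


def parse_skill(llm_output: str):
    """Return the text after the first 'SKILL:' marker, or None if absent/empty."""
    _, marker, rest = llm_output.partition("SKILL:")
    text = rest.strip()
    return text if marker and text else None


def placeholders_used(skill: str):
    """List which generic placeholders from the QA prompts appear in a skill."""
    return [p for p in PLACEHOLDERS if p in skill]


# Made-up model response for illustration.
raw = "Here is the lesson.\nSKILL: Locate [Entity] first, then compare its [Attribute]."
skill = parse_skill(raw)
print(skill)
print(placeholders_used(skill))
```

A response without the marker yields `None`, which a caller could treat as a signal to discard the sample or re\-prompt\.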
### F\.2 PersonaMem Answering Prompt
PersonaMem\-32K and PersonaMem\-128K use a 4\-choice MC format \(a–d\) instead of the 10\-choice format used by LongMemEval\. The system message sets the role as: “You are a careful assistant answering 4\-choice multiple\-choice questions grounded in a long persona\-conversation\.” The user prompt \(Table [23](https://arxiv.org/html/2605.30690#A6.T23)\) presents the question, retrieved memory chunks, an instruction line, and four options\.
Table 23: PersonaMem: Shared MC Answering Prompt \(all baselines\)\.

QUESTION: \{question\}

RETRIEVED MEMORY \(relevant chunks from prior conversation\): \{retrieved\_text\}

Answer with exactly one of the four options below, formatted as a single token like “\(a\)”, “\(b\)”, “\(c\)”, or “\(d\)”\. Do not output any other text\.

OPTIONS:

\(a\) \{choice\_0\}

\(b\) \{choice\_1\}

\(c\) \{choice\_2\}

\(d\) \{choice\_3\}

ANSWER:
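Filling the Table 23 template is a plain string substitution\. The following sketch shows one way to do it, under stated assumptions: the line breaks in `PERSONAMEM_TEMPLATE` are a reconstruction from the flattened table, straight quotes replace the typographic ones, and `build_prompt` is a hypothetical helper, not the evaluation harness's function\.

```python
# Layout reconstructed from Table 23; exact whitespace is an assumption.
PERSONAMEM_TEMPLATE = (
    "QUESTION: {question}\n"
    "RETRIEVED MEMORY (relevant chunks from prior conversation):\n"
    "{retrieved_text}\n"
    "Answer with exactly one of the four options below, formatted as a "
    'single token like "(a)", "(b)", "(c)", or "(d)". Do not output '
    "any other text.\n"
    "OPTIONS:\n"
    "(a) {choice_0}\n(b) {choice_1}\n(c) {choice_2}\n(d) {choice_3}\n"
    "ANSWER:"
)


def build_prompt(question: str, retrieved_text: str, choices: list) -> str:
    """Fill the 4-choice template; raises if the option count is wrong."""
    if len(choices) != 4:
        raise ValueError("PersonaMem uses exactly four options")
    fields = {f"choice_{i}": c for i, c in enumerate(choices)}
    return PERSONAMEM_TEMPLATE.format(
        question=question, retrieved_text=retrieved_text, **fields
    )


# Toy inputs for illustration.
prompt = build_prompt("Q?", "- some memory", ["w", "x", "y", "z"])
print(prompt)
```

Because every baseline shares this prompt, only `retrieved_text` differs between systems, which isolates the effect of the memory module\.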
### F\.3 LoCoMo Answering Prompt
LoCoMo uses a 10\-choice multiple\-choice format where the model selects one of ten lettered options \(a\)–\(j\)\. All baselines share the same answering prompt shown in Table [24](https://arxiv.org/html/2605.30690#A6.T24): the question, the baseline\-retrieved memory chunks under a “Retrieved context” header, a strict\-format instruction, the ten options, and an “Answer:” marker\.
Table 24: LoCoMo: Shared MC Answering Prompt \(all baselines\)\.

Question: \{question\}

Retrieved context \(from the conversation history\): \{retrieved\_context\}

You MUST pick exactly one option from \(a\) to \(j\) – one of them is guaranteed to be correct\. Do NOT say "not answerable", "I don’t know", or refuse\. If uncertain, make your best guess\. Output exactly one token in the form \(a\), \(b\), \(c\), \(d\), \(e\), \(f\), \(g\), \(h\), \(i\), or \(j\)\.

Options:

\(a\) \{choice\_0\}

\(b\) \{choice\_1\}

…

\(j\) \{choice\_9\}

Answer:
### F\.4 LongMemEval Answering Prompt
All baselines share the same MC answering prompt on LongMemEval \(Table [25](https://arxiv.org/html/2605.30690#A6.T25)\), which prepends retrieved memory text to the question and asks the model to select one of ten lettered options\.
Table 25: LongMemEval: Shared MC Answering Prompt \(all baselines\)\.

\[Relevant memories:\] \{retrieved\_memories\}

\{question\}

\(a\) \{choice\_0\}

\(b\) \{choice\_1\}

…

\(j\) \{choice\_9\}

You MUST pick exactly one option from \(a\) to \(j\) – one of them is guaranteed to be correct\. Do NOT say "not answerable", "I don’t know", or refuse\. If uncertain, make your best guess\. Output exactly one token in the form \(a\), \(b\), \(c\), \(d\), \(e\), \(f\), \(g\), \(h\), \(i\), or \(j\)\.

Answer: \(
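Since the LongMemEval prompt ends with an open parenthesis, the model's continuation is expected to start with the option letter, while the PersonaMem and LoCoMo prompts expect a full token such as \(c\)\. A tolerant parser for both shapes might look like the sketch below\. The function name `parse_choice` and its leniency rules \(optional parentheses, case\-insensitive\) are assumptions; the paper does not describe how responses are scored\.

```python
import re

# Optional "(", then a single letter a-j that is not part of a longer word.
_CHOICE_RE = re.compile(r"\(?\s*([a-j])\b")


def parse_choice(completion: str, n_options: int = 10):
    """Extract a leading option letter, or None if the output is malformed.

    Accepts "c)", "(c)", or "c"; rejects letters beyond the first
    `n_options` options and words that merely start with a valid letter.
    """
    valid = "abcdefghij"[:n_options]
    match = _CHOICE_RE.match(completion.strip().lower())
    if match and match.group(1) in valid:
        return match.group(1)
    return None


print(parse_choice("c)"))          # continuation after the prefilled "("
print(parse_choice("(J)"))         # full token, upper case
print(parse_choice("e)", 4))       # out of range for a 4-choice task
```

Matching only at the start of the completion keeps the check strict: free\-form text that mentions an option later on is treated as a format violation rather than guessed at\.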
### F\.5 ALFWorld Answering Prompt
ALFWorld baselines act inside a TextWorld game loop: at each step the agent receives the task goal, the retrieved memory block, an interaction history, the current observation, and the admissible\-action list, then must output one admissible command verbatim \(Table [26](https://arxiv.org/html/2605.30690#A6.T26)\)\.
Table 26: ALFWorld: Shared Step Prompt \(all baselines\)\.

\[SYSTEM\] You are controlling a text\-based ALFWorld environment\. Choose the NEXT action as ONE admissible command string\. Output only the command, copied verbatim from the admissible list\.

\[USER\] Task: \{objective\}

\{retrieved\_memory\_block\}

Interaction history so far: \{history\}

Current observation: \{current\_obs\}

Admissible actions: \{admissible\}

Action:
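The step prompt requires the command to be copied verbatim from the admissible list, so the game loop needs some way to handle outputs that are not\. The sketch below illustrates one possible guard\. It is an assumption throughout: the paper does not say what happens on an invalid action, and the case\-insensitive match and the `look` fallback in the hypothetical `select_action` helper are choices made here for illustration only\.

```python
def select_action(model_output: str, admissible: list, fallback: str = "look") -> str:
    """Map raw model text to one admissible command.

    Order of preference: exact match after stripping whitespace, then a
    case-insensitive match, then a fallback (assumed policy, not from the paper).
    """
    if not admissible:
        raise ValueError("admissible action list is empty")
    candidate = model_output.strip()
    if candidate in admissible:
        return candidate
    # Lenient match: recover commands that differ only in letter case.
    by_lower = {a.lower(): a for a in admissible}
    if candidate.lower() in by_lower:
        return by_lower[candidate.lower()]
    return fallback if fallback in admissible else admissible[0]


# Toy admissible list for illustration.
actions = ["go to fridge 1", "open fridge 1", "look"]
print(select_action(" open fridge 1\n", actions))
print(select_action("fly away", actions))
```

Whatever the fallback policy, it should be identical for all baselines so that differences in success rate reflect the memory block rather than output handling\.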
## Appendix G Case Studies
This section presents successful case studies for each memory baseline across our evaluation datasets\. Each table shows the question with lettered choices \(correct answer bolded\), the ground truth, the retrieved memory block \(abbreviated for baselines with large stores\), and the model’s response\. We present case studies for LongMemEval \(Section [G\.4](https://arxiv.org/html/2605.30690#A7.SS4)\), PersonaMem\-32K \(Section [G\.1](https://arxiv.org/html/2605.30690#A7.SS1)\), and PersonaMem\-128K \(Section [G\.2](https://arxiv.org/html/2605.30690#A7.SS2)\)\.
### G\.1 PersonaMem\-32K
This subsection presents one successful example per baseline on PersonaMem\-32K, drawn from the 7B model evaluation results\. Each table shows the question with four lettered choices \(correct answer bolded\), the ground truth, the retrieved memory block \(abbreviated for baselines with large stores\), and the model’s response\. Case studies for Mem0, LangMem, and LightMem are in Tables [27](https://arxiv.org/html/2605.30690#A7.T27)–[29](https://arxiv.org/html/2605.30690#A7.T29); MemoryOS and MemoryBank in Tables [30](https://arxiv.org/html/2605.30690#A7.T30)–[31](https://arxiv.org/html/2605.30690#A7.T31); A\-MEM and MeMP in Tables [32](https://arxiv.org/html/2605.30690#A7.T32)–[33](https://arxiv.org/html/2605.30690#A7.T33)\.
Table 27: Mem0’s case study in PersonaMem\-32K\.

Question: After attending a writing workshop, I discovered I actually enjoy creatively articulating my thoughts on music\. It was exhilarating to engage with fellow aspiring writers, exchanging ideas and techniques that helped unlock previously untapped aspects of my creativity\. The writing exercises we did were both challenging and inspiring, pushing me to find new ways to express the emotional and technical nuances of music\. I had always had an affinity for music, but this workshop gave me the confidence to put my feelings into words, which has been incredibly fulfilling\. It felt like I was discovering a new dimension of myself, one that combined my love of music with a passion for writing\.

\(a\) I remember when you initially mentioned disliking writing music reviews, perhaps because it might have seemed daunting to articulate your thoughts on music in writing\. However, it’s wonderful to see how your experience at the writing workshop … ✓

\(b\) I remember when you initially mentioned disliking writing music reviews, perhaps because it might have seemed daunting to articulate your thoughts on music in writing\. However, it’s surprising that your enthusiasm for writing only increased af…

\(c\) I remember when you initially mentioned being indifferent to writing music reviews, as you had never considered stewing over your thoughts on music in writing before\. However, it’s wonderful to see how your experience at the writing workshop h…

\(d\) I remember when you initially mentioned enjoying writing music reviews, perhaps because it might have seemed exciting to articulate your thoughts on music in writing\. However, it’s wonderful to see how your experience at the writing workshop h…

Ground Truth: \(a\) I remember when you initially mentioned disliking writing music reviews, perhaps because it might have seemed daunting to articulate your thoughts on music in writing\. However, it’s wonderful to see how your experience at the writing workshop …

Retrieved Memories:

\#\# Relevant memories from past experience:

\- \[2026\-04\-27\] Researches themes and curates playlists for each podcast episode

\- \[2026\-04\-27\] Felt unsure of own contributions during collaboration

\- \[2026\-04\-27\] Shared stories about individual musical journeys in meetings

\- \[2026\-04\-27\] Eager to apply constructive criticism techniques in a collaborative setting

\- \[2026\-04\-27\] Eager to apply constructive criticism techniques in a collaborative setting

\- \[2026\-04\-27\] Found a music documentary very informative and eye\-opening

\- \[2026\-04\-27\] Attended a music festival on 2026\-04\-27

\[…3 additional lines omitted…\]

Response: \(a\)

Table 28: LangMem’s case study in PersonaMem\-32K\.

Question: I’m looking to find something that can really add a new dimension to my classic rock listening experience\. Any suggestions?

\(a\) You might really enjoy the sound of traditional Pacific Islander music played on native instruments\. While not classic rock, exploring your roots through music can add a personal layer to your listening experience\. The rhythmic beats and cultu…

\(b\) You might enjoy the crisp convenience of streaming ’Led Zeppelin IV’ in high definition, ensuring you have the album instantly at your fingertips with no need for additional equipment\. The sure quality assurance of digital music allows ’Stairw…

\(c\) Consider embarking on a vibrant journey with live recordings from famous classic rock concerts\. Capturing the raw energy and spontaneity of legendary performances, live albums offer an immediacy that studio versions may not have, added with cr…

\(d\) You might love the experience of spinning ’Led Zeppelin IV’ on vinyl\. It’s an album that not only defined an era but also carries a warmth and depth in its sound that digital formats often miss\. The artwork on the cover is iconic, and the tact… ✓

Ground Truth: \(d\) You might love the experience of spinning ’Led Zeppelin IV’ on vinyl\. It’s an album that not only defined an era but also carries a warmth and depth in its sound that digital formats often miss\. The artwork on the cover is iconic, and the tact…

Retrieved Memories:

\#\# Relevant memories:

\- \[assistant\] Assistant: Missing out on activities like a music documentary night can certainly feel like a lost opportunity, especially when they provide such rich environments for connection and inspiration\.

\- \[assistant\] Assistant: Definitely\! It brings a unique warmth to music listening, enhancing the frequencies that often get lost in more compressed digital formats\. The way vinyl can accentuate the nuances of a recording creates a richer auditory experience that many enthusiasts adore\. When you p…

\- \[assistant\] Assistant: I’m so glad to hear that you enjoyed those fresh sounds\! Music has such a powerful impact on our mood and overall well\-being, often providing a much\-needed escape from the routines of everyday life\. Listening to new genres or artists can unlock entirely new perspectives a…

\- \[assistant\] Assistant: Absolutely\! Music has a unique way of connecting us to new experiences\. It serves as a bridge between different cultures, allowing us deeper insight into lives and traditions that we may never encounter in person\. This connection is particularly profound when you encounte…

\- \[assistant\] Assistant: Visiting a record store can be a profoundly rewarding experience, often akin to stepping into a treasure trove filled with the sounds of the past and the promise of future discoveries\. The tactile nature of vinyl, from its grooved surface to the large, colorful album art,…

\- \[assistant\] Assistant: I can completely understand why you find karaoke nights so enjoyable, as they often foster an environment of joy and connection among friends and even strangers\. Participating in such an event is much more than just singing; it’s about the shared laughter, the playful com…

Response: \(d\)

Table 29: LightMem’s case study in PersonaMem\-32K\.

Question: I recently joined a forum discussion about humor in music\.

\(a\) I remember you mentioning how you enjoy engaging in online music discussions\. That’s great, forums can be a wonderful way to connect with people\. ✓

\(b\) I seem to recall you saying you shy away from online music discussions\. It’s interesting to see you’ve decided to dive in now\.

\(c\) I remember you talking about how you enjoy participating in travel forums\. It’s great that you’re finding ways to connect through different topics\.

\(d\) That’s nice to hear\! Joining forums can offer new insights\. Engaging in discussions about humor in music sounds like a fascinating experience\.

Ground Truth: \(a\) I remember you mentioning how you enjoy engaging in online music discussions\. That’s great, forums can be a wonderful way to connect with people\.

Retrieved Memories:

Long\-term memory \(relevant facts\):

\- \[2026\-04\-27T10:53:54\.691 Mon\] User passed on a music documentary night\.

\- \[2026\-04\-27T10:58:46\.191 Mon\] Writing exercises pushed user to find new ways to express the emotional and technical nuances of music\.

\- \[2026\-04\-27T10:53:06\.497 Mon\] User experienced frustration when attempts did not match the sound in his head\.

\- \[2026\-04\-27T10:54:08\.119 Mon\] User values authentic dialogues about music over abstract opinions\.

\- \[2026\-04\-27T10:56:49\.256 Mon\] Pooling collective knowledge helps uncover different styles of playing\.

\- \[2026\-04\-27T10:56:41\.318 Mon\] User shares classic tracks each week to connect with others\.

\- \[2026\-04\-27T10:53:34\.691 Mon\] User is happy with their eclectic taste and does not need validation from current trends\.

\[…3 additional lines omitted…\]

Response: \(a\)

Table 30: MemoryOS’s case study in PersonaMem\-32K\.

Question: I’ve seen a lot of people getting into DIY craft projects lately\. I’m a bit unsure, but should I give it a try?

\(a\) It sounds like you’re considering exploring DIY craft projects\. Trying out such activities can be a fun and rewarding experience, especially if you’re eager to learn new skills\. You may enjoy the satisfaction of creating something tangible wit…

\(b\) It sounds like you’re considering exploring DIY craft projects\. Trying out such activities can be a fun and rewarding experience, providing a nice break from everyday routines\. If you’re interested in exploring something new, crafts could be a…

\(c\) It sounds like you’re considering exploring DIY craft projects\. Trying out such activities can be a fun and rewarding experience, especially if you’re looking to incorporate a creative twist into something new\. If you enjoyed the engaging natu… ✓

\(d\) It sounds like you’re considering exploring DIY craft projects\. Trying out such activities can be a fun and rewarding experience, especially if you like activities that offer room for creativity\. Engaging in crafts could be a fulfilling way to…

Ground Truth: \(c\) It sounds like you’re considering exploring DIY craft projects\. Trying out such activities can be a fun and rewarding experience, especially if you’re looking to incorporate a creative twist into something new\. If you enjoyed the engaging natu…

Retrieved Memories:

\[Mid\-term – retrieved cross\-episode pages\]

\- \[obs\] \[user\] Music documentaries can be incredibly inspiring, as they often delve into the lives of musicians, their creative processes, and the challenges they face along their journeys\. However, choosing not to attend such events can stem from a variety of reasons, including feeling unworthy … \[action\] \[next\_obs\] meta: User expresses interest in attending music documentary nights, acknowledging their potential to be inspiring yet recognizing the fear of not fitting in due to insecurities and self\-doubt\.

\- \[obs\] \[user\] User: On , I launched a new podcast series focusing on music’s role in cultural identities\. This project has me delving into how different genres of music shape and reflect the societies they come from, which adds a rich layer of meaning to my understanding of cultural expression\. … \[action\] \[next\_obs\]

\[…142 additional lines omitted…\]

Response: \(c\)

Table 31: MemoryBank’s case study in PersonaMem\-32K\.

Question: I’m looking to find something that can really add a new dimension to my classic rock listening experience\. Any suggestions?

\(a\) You might really enjoy the sound of traditional Pacific Islander music played on native instruments\. While not classic rock, exploring your roots through music can add a personal layer to your listening experience\. The rhythmic beats and cultu…

\(b\) You might enjoy the crisp convenience of streaming ’Led Zeppelin IV’ in high definition, ensuring you have the album instantly at your fingertips with no need for additional equipment\. The sure quality assurance of digital music allows ’Stairw…

\(c\) Consider embarking on a vibrant journey with live recordings from famous classic rock concerts\. Capturing the raw energy and spontaneity of legendary performances, live albums offer an immediacy that studio versions may not have, added with cr…

\(d\) You might love the experience of spinning ’Led Zeppelin IV’ on vinyl\. It’s an album that not only defined an era but also carries a warmth and depth in its sound that digital formats often miss\. The artwork on the cover is iconic, and the tact… ✓

Ground Truth: \(d\) You might love the experience of spinning ’Led Zeppelin IV’ on vinyl\. It’s an album that not only defined an era but also carries a warmth and depth in its sound that digital formats often miss\. The artwork on the cover is iconic, and the tact…

Retrieved Memories:

\#\# MemoryBank context

Agent personality / response strategy \(from past tasks\):

Summary:

User’s Personality:

Kai is passionate, reflective, supportive, curious, and empathetic\. He experiences joy, inspiration, nostalgia, confidence, and reflection, particularly in his love for music, especially jazz and Pacific Islander genres\.

\[…29 additional lines omitted…\]

Response: \(d\)

Table 32: A\-MEM’s case study in PersonaMem\-32K\.

Question: After attending a writing workshop, I discovered I actually enjoy creatively articulating my thoughts on music\. It was exhilarating to engage with fellow aspiring writers, exchanging ideas and techniques that helped unlock previously untapped aspects of my creativity\. The writing exercises we did were both challenging and inspiring, pushing me to find new ways to express the emotional and technical nuances of music\. I had always had an affinity for music, but this workshop gave me the confidence to put my feelings into words, which has been incredibly fulfilling\. It felt like I was discovering a new dimension of myself, one that combined my love of music with a passion for writing\.

\(a\) I remember when you initially mentioned disliking writing music reviews, perhaps because it might have seemed daunting to articulate your thoughts on music in writing\. However, it’s wonderful to see how your experience at the writing workshop … ✓

\(b\) I remember when you initially mentioned disliking writing music reviews, perhaps because it might have seemed daunting to articulate your thoughts on music in writing\. However, it’s surprising that your enthusiasm for writing only increased af…

\(c\) I remember when you initially mentioned being indifferent to writing music reviews, as you had never considered stewing over your thoughts on music in writing before\. However, it’s wonderful to see how your experience at the writing workshop h…

\(d\) I remember when you initially mentioned enjoying writing music reviews, perhaps because it might have seemed exciting to articulate your thoughts on music in writing\. However, it’s wonderful to see how your experience at the writing workshop h…

Ground Truth: \(a\) I remember when you initially mentioned disliking writing music reviews, perhaps because it might have seemed daunting to articulate your thoughts on music in writing\. However, it’s wonderful to see how your experience at the writing workshop …

Retrieved Memories:

talk start time: task=personamem\-mc\|step=141

memory content: task: answer multiple\-choice questions about a long persona\-grounded conversation \| obs: \[user\] User: A few days later, I reluctantly watched a music documentary DVD I was gifted\. \| action: \| result: memory context: The content describes…

talk start time: task=personamem\-mc\|step=106

memory content: task: answer multiple\-choice questions about a long persona\-grounded conversation \| obs: \[user\] User: This experience opened my eyes to the power of visual storytelling in understanding music; it was a fantastic experience\. The documentar…

talk start time: task=personamem\-mc\|step=106

memory content: task: answer multiple\-choice questions about a long persona\-grounded conversation \| obs: \[user\] User: This experience opened my eyes to the power of visual storytelling in understanding music; it was a fantastic experience\. The documentar…

talk start time: task=personamem\-mc\|step=152

memory content: task: answer multiple\-choice questions about a long persona\-grounded conversation \| obs: \[system\] Current user persona:

Name: Kai

Gender Identity: Transgender male

Racial Identity: Pacific Islander

Kai, born in 1983, is a devoted fan of Louie Anderson’s comedy, captivated by Anderson’s unique blend of humor and heart\. Growing up in a multicultural community, his Pacific Islander heritage has deeply influenced his appreciation for storytelling and humor that resonates across cultures\. Embrac…

\[…36 additional lines omitted…\]

Response: \(a\)

Table 33: MeMP’s case study in PersonaMem\-32K\.

Question: I’m looking to find something that can really add a new dimension to my classic rock listening experience\. Any suggestions?

\(a\) You might really enjoy the sound of traditional Pacific Islander music played on native instruments\. While not classic rock, exploring your roots through music can add a personal layer to your listening experience\. The rhythmic beats and cultural stories told through these melodies hold a deep e…

\(b\) You might enjoy the crisp convenience of streaming ‘Led Zeppelin IV’ in high definition, ensuring you have the album instantly at your fingertips with no need for additional equipment\. The sure quality assurance of digital music allows ‘Stairway to Heaven’ to…

\(c\) Consider embarking on a vibrant journey with live recordings from famous classic rock concerts\. Capturing the raw energy and spontaneity of legendary performances, live albums offer an immediacy that studio versions may not have, added with crowd reactions and improvised solos…

\(d\) You might love the experience of spinning ‘Led Zeppelin IV’ on vinyl\. It’s an album that not only defined an era but also carries a warmth and depth in its sound that digital formats often miss\. The artwork on the cover is iconic, and the tactile connection of placing the needle down will make ‘Stairway to Heaven’ feel like an epic journey through your speakers\. ✓

Ground Truth: \(d\)

Retrieved Memories:

Here are some guidelines of how to solve the similar task:

\- task\_name: answer multiple\-choice questions about a long persona\-grounded conversation

guidelines: To solve this query, begin by examining the conversation transcript to understand the context and key points discussed\. Recall Kai’s background as a classic rock enthusiast who values tactile, authentic experiences over digital convenience\. Identify which option aligns with his established preference for physical media and nostalgic engagement with music\. Cross\-reference with his earlier statements about preferring vinyl warmth over streaming clarity\. Finally, select the answer that best captures his love for the analog listening ritual\.

\- task\_name: recommend music experience for vinyl collector

guidelines: To recommend a music experience, first identify the user’s preferred format \(vinyl, CD, streaming\)\. If the user has expressed love for physical media, prioritize options involving tangible interaction such as record spinning, liner note reading, or album art appreciation\. Consider the warmth and depth of analog sound as a key differentiator from digital formats\. Match the recommendation to artists and albums the user has previously mentioned enjoying\.

\- task\_name: identify user preference evolution across conversation

guidelines: To track preference evolution, begin by scanning the full conversation history for mentions of the target topic\. Note the chronological order of statements and identify any shifts in sentiment\. Pay attention to phrases like “I used to think” or “after trying X, I now prefer Y” as markers of preference change\. Summarize the trajectory from initial stance to current preference\.

\[…198 additional lines omitted…\]

Response: \(d\)
### G\.2 PersonaMem\-128K
This subsection presents one successful example per baseline on PersonaMem\-128K, drawn from the 7B model evaluation results\. Each table shows the question with four lettered choices \(correct answer bolded\), the ground truth, the retrieved memory block \(abbreviated for baselines with large stores\), and the model’s response\. Case studies for Mem0, LangMem, and LightMem are in Tables [34](https://arxiv.org/html/2605.30690#A7.T34)–[36](https://arxiv.org/html/2605.30690#A7.T36); MemoryOS and MemoryBank in Tables [37](https://arxiv.org/html/2605.30690#A7.T37)–[38](https://arxiv.org/html/2605.30690#A7.T38); A\-MEM and MeMP in Tables [39](https://arxiv.org/html/2605.30690#A7.T39)–[40](https://arxiv.org/html/2605.30690#A7.T40)\.
Table 34: Mem0’s case study in PersonaMem\-128K\.

Question: Recently, I was at a yoga retreat\.

\(a\) Ah, I remember you mentioned your dislike for yoga before\. It’s great to revisit activities to confirm our true preferences\.

\(b\) Ah, I remember you mentioned your appreciation for yoga deepened during that retreat\. It’s great to have experiences that expand our horizons\. ✓

\(c\) Ah, I remember you mentioned your appreciation for yoga deepened during that retreat\. It’s great to have experiences that expand our horizons\.

\(d\) Oh, you went on a yoga retreat? That sounds unique\! What was it like engaging in new activities?

Ground Truth: \(b\) Ah, I remember you mentioned your appreciation for yoga deepened during that retreat\. It’s great to have experiences that expand our horizons\.

Retrieved Memories:

\#\# Relevant memories from past experience:

\- \[2026\-04\-29\] Likes a positive tone and clear, concise wording

\- \[2026\-04\-29\] Discovering patterns and connections through journal entries

\- \[2026\-04\-29\] Reflects on podcast conversations during daily routines

\- \[2026\-04\-29\] Considers volunteering more after positive feedback from a mentoring experience

\- \[2026\-04\-29\] Hosted a webinar for aspiring journalists on an earlier date

\- \[2026\-04\-29\] Considers starting a journal to document thoughts and insights after podcast episodes

\- \[2026\-04\-29\] Sense of community blossomed during home decoration workshop interactions

\[…3 additional lines omitted…\]

Response: \(b\)

Table 35: LangMem’s case study in PersonaMem\-128K\.

Question: Recently, I was at a yoga retreat\.

\(a\) Ah, I remember you mentioned your dislike for yoga before\. It’s great to revisit activities to confirm our true preferences\.

\(b\) Ah, I remember you mentioned your appreciation for yoga deepened during that retreat\. It’s great to have experiences that expand our horizons\. ✓

\(c\) Ah, I remember you mentioned your appreciation for yoga deepened during that retreat\. It’s great to have experiences that expand our horizons\.

\(d\) Oh, you went on a yoga retreat? That sounds unique\! What was it like engaging in new activities?

Ground Truth: \(b\) Ah, I remember you mentioned your appreciation for yoga deepened during that retreat\. It’s great to have experiences that expand our horizons\.

Retrieved Memories:

\#\# Relevant memories:

\- \[user\] User: On , I attended an interactive workshop focused on building romantic relationships\.

\- \[user\] User: Well, on , I joined a local singles group to connect with other individuals seeking friendships and relationships\.

\- \[user\] User: I attempted to create a collection of thrifted home decor items\.

\- \[user\] User: On the plus side, I embarked on a hiking trip in the mountains recently, and it was wonderful\! The scenery was breathtaking, with majestic peaks towering over lush valleys, and the sound of rustling leaves added to the peaceful atmosphere\. I could feel the invigorating air fill my …

\- \[user\] User: I tried visiting a local coffee shop to meet new people, but I felt quite uncomfortable\. Upon entering, I noticed the bustling activity around me, which was overwhelming rather than inviting\. The chatter of groups engaging in lively conversations only highlighted my own sense of is…

\- \[user\] User: It’s really disappointing when the energy in a book club fizzles out like that\. Sometimes, you go in with high hopes, especially with a theme as rich as finance, thinking that the conversations will be enlightening and invigorating\. Instead, it can feel like a chore if the discussi…

Response: \(b\)

Table 36: LightMem’s case study in PersonaMem\-128K\.

Question: Recently, I was at a yoga retreat\.

\(a\) Ah, I remember you mentioned your dislike for yoga before\. It’s great to revisit activities to confirm our true preferences\.

\(b\) Ah, I remember you mentioned your appreciation for yoga deepened during that retreat\. It’s great to have experiences that expand our horizons\. ✓

\(c\) Ah, I remember you mentioned your appreciation for yoga deepened during that retreat\. It’s great to have experiences that expand our horizons\.

\(d\) Oh, you went on a yoga retreat? That sounds unique\! What was it like engaging in new activities?

Ground Truth: \(b\) Ah, I remember you mentioned your appreciation for yoga deepened during that retreat\. It’s great to have experiences that expand our horizons\.

Retrieved Memories:

Long\-term memory \(relevant facts\):

\- \[2026\-04\-29T03:11:45\.721 Wed\] User’s original sentence was: ’This tale opens with an indistinct voice narrating something resembling a fable\.’

\- \[2026\-04\-29T02:57:06\.153 Wed\] Each item User clicks on seems promising but often falls short of User’s imagination\.

\- \[2026\-04\-29T03:14:49\.212 Wed\] User shared knowledge developed through both academic study and practical experience\.

\- \[2026\-04\-29T03:12:00\.626 Wed\] User asked about the Fibonacci sequence\.

\- \[2026\-04\-29T02:59:24\.732 Wed\] Having a space to slow down and articulate dating and connection moments feels invaluable to user\.

\- \[2026\-04\-29T02:51:45\.747 Wed\] User mentioned black holes\.

\- \[2026\-04\-29T03:15:40\.891 Wed\] Current user is female\.

\[…3 additional lines omitted…\]

Response: \(b\)

Table 37: MemoryOS’s case study in PersonaMem\-128K\.

Question: I joined a game with some friends last week\.

\(a\) I remember you mentioned enjoying individual sports like tennis before\. It sounds like a nice way to spend some time together\.

\(b\) I remember you mentioned not enjoying team sports like soccer before\. It sounds like a nice way to spend some time together\.

\(c\) It’s great to hear about new experiences, though I don’t recall us discussing a preference for team sports\. I hope you had a fun time\.

\(d\) I remember you mentioned enjoying team sports like soccer before\. It sounds like a nice way to spend some time together\. ✓

Ground Truth: \(d\) I remember you mentioned enjoying team sports like soccer before\. It sounds like a nice way to spend some time together\.

Retrieved Memories:

\[Mid\-term – retrieved cross\-episode pages\]

\- \[obs\] \[assistant\] Understanding emotional availability in a person, whether in romantic, familial, or friendship contexts, is crucial for building healthy and fulfilling relationships\. Emotional availability refers to the ability and willingness to be open, present, and responsive to another pe…

\#\#\# 1\. Consistent Communication

\- Open and Honest Conversations: They engage in open and honest conversations, sharing their thoughts and feelings appropriately\.

\- Willingness to Listen: They actively listen, showing empathy and understanding, which indicates they value and respect your perspective\.

\- Consistency: They maintain communication consistently rather than sending mixed messages\. Their words align with their actions\.

\[…509 additional lines omitted…\]

Response: \(d\)

Table 38: MemoryBank’s case study in PersonaMem\-128K\.

Question: Recently, I was at a yoga retreat\.

\(a\) Ah, I remember you mentioned your dislike for yoga before\. It’s great to revisit activities to confirm our true preferences\.

\(b\) Ah, I remember you mentioned your appreciation for yoga deepened during that retreat\. It’s great to have experiences that expand our horizons\. ✓

\(c\) Ah, I remember you mentioned your appreciation for yoga deepened during that retreat\. It’s great to have experiences that expand our horizons\.

\(d\) Oh, you went on a yoga retreat? That sounds unique\! What was it like engaging in new activities?

Ground Truth: \(b\) Ah, I remember you mentioned your appreciation for yoga deepened during that retreat\. 
It’s great to have experiences that expand our horizons\.Retrieved Memories:\#\# MemoryBank contextMost relevant past memories \(from tasks: personamem\-mc\):Conversation content on personamem\-mc:\[\|\|User\|\|\]: Observation: \[assistant\] Assistant: Attending live sporting events always adds an extra layer of excitement, doesn’t it? There’s something utterly unique about being part of a thrumming crowd that collectively gasps, cheers, or groans, creating a symphony of sound that greatly enhances …Conversation content on personamem\-mc:\[\|\|User\|\|\]: Observation: \[assistant\] Assistant: That’s exciting\! What made you want to get back into collecting? Your experience at the collector’s fair sounds incredibly fulfilling, as it often takes such immersive environments to truly inspire creativity and rekindle old passions\. It’s fascinatin…Conversation content on personamem\-mc:\[…15 additional lines omitted…\]Response: \(b\)Table 39:A\-MEM’s case study in PersonaMem\-128K\.Question: I joined a game with some friends last week\.\(a\) I remember you mentioned enjoying individual sports like tennis before\. It sounds like a nice way to spend some time together\.\(b\) I remember you mentioned not enjoying team sports like soccer before\. It sounds like a nice way to spend some time together\.\(c\) It’s great to hear about new experiences, though I don’t recall us discussing a preference for team sports\. I hope you had a fun time\.\(d\) I remember you mentioned enjoying team sports like soccer before\. It sounds like a nice way to spend some time together\.✓Ground Truth: \(d\) I remember you mentioned enjoying team sports like soccer before\. 
It sounds like a nice way to spend some time together\.Retrieved Memories:talk start time:task=personamem\-mc\|step=597memory content: task: answer multiple\-choice questions about a long persona\-grounded conversation \| obs: \[user\] \[Original\_Sentence\]: Overall, the narrative lacks any form of structure or excitement, turning the tale into a ramble more than anything else…\.talk start time:task=personamem\-mc\|step=583memory content: task: answer multiple\-choice questions about a long persona\-grounded conversation \| obs: \[user\] \[Original\_Sentence\]: Anyway, to sum it up, some parts of this are about the gods, but then it suddenly shifts, and we’re introduced to a hero …talk start time:task=personamem\-mc\|step=241memory content: task: answer multiple\-choice questions about a long persona\-grounded conversation \| obs: \[user\] User: I feel much better knowing I’m supporting eco\-friendly businesses\. It’s reassuring to think that my purchasing choices are contributing …talk start time:task=personamem\-mc\|step=187memory content: task: answer multiple\-choice questions about a long persona\-grounded conversation \| obs: \[user\] User: This closing feels abrupt\. I’d like to end on a positive note\. \| action: \| result: memory context: The user suggests modifying an abrup…talk start time:task=personamem\-mc\|step=101memory content: task: answer multiple\-choice questions about a long persona\-grounded conversation \| obs: \[system\] Current user persona: Name: Maria GonzalezGender Identity: FemaleRacial Identity: Hispanic\[…194 additional lines omitted…\]Response: \(d\)Table 40:MeMP’s case study in PersonaMem\-128K\.Question: I joined a game with some friends last week\.\(a\) I remember you mentioned enjoying individual sports like tennis before\. It sounds like a nice way to spend some time together\.\(b\) I remember you mentioned not enjoying team sports like soccer before\. 
It sounds like a nice way to spend some time together\.\(c\) It’s great to hear about new experiences, though I don’t recall us discussing a preference for team sports\. I hope you had a fun time\.\(d\) I remember you mentioned enjoying team sports like soccer before\. It sounds like a nice way to spend some time together\.✓Ground Truth: \(d\)Retrieved Memories:Here are some guidelines of how to solve the similar task:\- task\_name: answer multiple\-choice questions about a long persona\-grounded conversationguidelines: To solve this query, begin by examining the conversation transcript containing the dialogue\. Identify key themes and specific details relevant to the multiple\-choice questions\. Use knowledge of Maria Gonzalez’s background and her stated hobbies, including her preference for team sports such as soccer, to select the answer that best reflects her previously expressed interests\. Pay attention to temporal markers indicating when preferences were stated versus when they may have changed\.\- task\_name: recall user preferences from extended dialogueguidelines: To recall user preferences, first retrieve all mentions of the target topic \(e\.g\., sports, hobbies\) from the stored conversation segments\. Cross\-reference multiple mentions to identify consistent patterns\. Prioritize the most recent stated preference when earlier and later statements conflict\. For sports\-related queries, distinguish between watching, playing, and following as distinct preference dimensions\.\- task\_name: resolve conflicting user statements across sessionsguidelines: When a user’s statements appear contradictory across different conversation segments, apply temporal ordering to determine the most current stance\. Consider context\-dependent preferences \(e\.g\., “I enjoy soccer with friends” vs “I prefer individual workouts at the gym” may both be true in different contexts\)\. 
Weight statements made with stronger emotional language or personal anecdotes more heavily than casual mentions\.\- task\_name: summarize persona traits from multi\-session dialogueguidelines: To build a comprehensive persona summary, aggregate observations from all available conversation sessions\. Categorize traits into demographics, preferences, habits, social relationships, and aspirations\. For each trait, note the supporting evidence \(direct statement, implied behavior, or third\-party mention\)\. Flag any traits that evolved over time and record both the initial and current state\. Ensure the summary covers at least the following domains: entertainment, food, fitness, career, relationships, and personal values\.\- task\_name: match user hobby to social contextguidelines: When determining which hobby or activity a user would engage in socially, retrieve their stated preferences for group activities versus solo activities\. Consider the social context described in the query \(friends, family, colleagues\) and match it to previously expressed social preferences\. Team sports indicate a preference for collaborative, social physical activity, while individual sports suggest a preference for personal challenge and self\-paced engagement\.\[…487 additional lines omitted…\]Response: \(d\)
### G\.3 LoCoMo
This subsection presents one successful example per baseline on LoCoMo, drawn from the 7B\-agent evaluation results\. Case studies for Mem0, LangMem, and LightMem are in Tables[41](https://arxiv.org/html/2605.30690#A7.T41)–[43](https://arxiv.org/html/2605.30690#A7.T43); MemoryOS and MemoryBank in Tables[44](https://arxiv.org/html/2605.30690#A7.T44)–[45](https://arxiv.org/html/2605.30690#A7.T45); A\-MEM and MeMP in Tables[46](https://arxiv.org/html/2605.30690#A7.T46)–[47](https://arxiv.org/html/2605.30690#A7.T47)\.
Table 41: Mem0’s case study in LoCoMo\.

Question: What kinds of things did Evan have broken?
\(a\) His bookshelf and his desk lamp\.
\(b\) His laptop and his tablet computer\.
\(c\) His artwork and his musical instruments\.
\(d\) His camera and his video camera\.
\(e\) His gaming console and his headphones\.
\(f\) His old Prius and his new Prius\. ✓
\(g\) His tools and his gardening equipment\.
\(h\) His phone and his smartwatch\.
\(i\) His television and his sound system\.
\(j\) His vintage motorcycle and his bicycle\.

Ground Truth: \(f\) His old Prius and his new Prius\.

Retrieved Memories:
\#\# Relevant memories from past experience:
\- \[2026\-05\-04\] Name is Evan
\- \[2026\-05\-04\] Family motto is ‘Bring it on Home’
\- \[2026\-05\-04\] Reading ‘The Great Gatsby’
\- \[2026\-05\-04\] Watched The Godfather
\- \[2026\-05\-04\] Exploring other ways to stay active and travel due to car issues
\- \[2026\-05\-04\] Apologized to partner for an accident involving rose bushes
\- \[2026\-05\-04\] Family means a lot to Evan
\[… additional fact entries omitted …\]

Response: \(f\)

Table 42: LangMem’s case study in LoCoMo\.

Question: Which hobby did Sam take up in May 2023?
\(a\) Sam began volunteering at the local animal shelter in May 2023\.
\(b\) Sam joined a book club in May 2023\.
\(c\) In May 2023, Sam began a pottery class\.
\(d\) Sam decided to try hiking in May 2023\.
\(e\) Sam took up photography as a hobby in May 2023\.
\(f\) painting ✓
\(g\) Sam started learning to play the guitar in May 2023\.
\(h\) Gardening was Sam’s new passion in May 2023\.
\(i\) May 2023 saw Sam starting a new woodworking project\.
\(j\) In May 2023, Sam started taking online coding courses\.

Ground Truth: \(f\) painting

Retrieved Memories:
Relevant facts about the user from past conversation:
\- Sam plans to try painting and will provide updates\.
\- Sam wants to know about basic supplies for painting\.
\- Sam is looking for suggestions for low\-impact exercises\.
\- Sam is interested in setting goals and exercising with a partner\.
\- Sam is considering trying an outdoor activity to de\-stress\.
\- Sam has experience attending art exhibits\.
\- Sam is willing to provide yoga tips\.
\[… additional FACT entries omitted …\]

Response: \(f\)

Table 43: LightMem’s case study in LoCoMo\.

Question: What kind of car does Evan drive?
\(a\) Nissan Altima
\(b\) Tesla Model 3
\(c\) Toyota Camry
\(d\) Ford F\-150
\(e\) Subaru Outback
\(f\) Prius ✓
\(g\) Hyundai Sonata
\(h\) Jeep Wrangler
\(i\) Chevrolet Malibu
\(j\) Honda Civic

Ground Truth: \(f\) Prius

Retrieved Memories:
Long\-term memory \(relevant facts\):
\- \[2026\-05\-01T05:06 Fri\] User Evan recently bought a new Prius which has broken down\.
\- \[2026\-05\-01T05:05 Fri\] Evan inquires about Evan’s recent activities\.
\- \[2026\-05\-01T05:05 Fri\] Evan mentions finishing a painting that helps him relax\.
\- \[2026\-05\-01T05:07 Fri\] User had a memorable experience on a camping trip last summer\.
\- \[2026\-05\-01T05:04 Fri\] User plans to try painting and will update Evan on how it goes\.
\- \[2026\-05\-01T05:05 Fri\] User mentioned meeting an incredible Canadian woman during a recent trip to Canada\.
\[… additional LTM entries omitted …\]

Response: \(f\) Prius

Table 44: MemoryOS’s case study in LoCoMo\.

Question: What kind of car does Evan drive?
\(a\) Nissan Altima
\(b\) Tesla Model 3
\(c\) Toyota Camry
\(d\) Ford F\-150
\(e\) Subaru Outback
\(f\) Prius ✓
\(g\) Hyundai Sonata
\(h\) Jeep Wrangler
\(i\) Chevrolet Malibu
\(j\) Honda Civic

Ground Truth: \(f\) Prius

Retrieved Memories:
\[Mid\-term – retrieved cross\-episode pages\]
\- \[obs\] \[assistant\] \[SAM\]: It’s tough, but I’m sticking with it\.
\[action\] \[user\] \[EVAN\]: Nice work, Sam\! Proud of you sticking to it\. Have you noticed any positive changes?
meta: User expressed mixed feelings about their health, feeling concerned but motivated for positive changes\. Assistant encouraged taking things one step at a time; user appreciated the support\. User announced starting a diet and living healthier\.
\- \[obs\] \[user\] \[EVAN\]: This is a contemporary figurative painting that I’ve finished few days ago, emphasizing the emotional …
\[… many additional obs/action pairs omitted …\]

Response: \(f\) Prius

Table 45: MemoryBank’s case study in LoCoMo\.

Question: Where has Evan been on roadtrips with his family?
\(a\) Boundary Waters Canoe Area
\(b\) New England, Adirondacks
\(c\) Florida Keys, Everglades
\(d\) Rockies, Jasper ✓
\(e\) California Coast, Yosemite
\(f\) Colorado Springs, Denver
\(g\) Acadia National Park, Maine
\(h\) Yellowstone, Grand Teton
\(i\) Shenandoah, Great Smoky Mountains
\(j\) Olympic Peninsula, Oregon

Ground Truth: \(d\) Rockies, Jasper

Retrieved Memories:
\#\# MemoryBank context
Agent personality / response strategy \(from past tasks\):
The user exhibits supportive, empathetic, and positive traits, sharing personal experiences and responding positively to encouragement and suggestions\.
Response strategy: Active listening, encouragement, sharing experiences, offering support, maintaining positivity\.
Summary of your past tasks \(overall\_history\):
\[… extensive overall\-history summary omitted …\]

Response: \(d\) Rockies, Jasper

Table 46: A\-MEM’s case study in LoCoMo\.

Question: What kind of car does Evan drive?
\(a\) Nissan Altima
\(b\) Tesla Model 3
\(c\) Toyota Camry
\(d\) Ford F\-150
\(e\) Subaru Outback
\(f\) Prius ✓
\(g\) Hyundai Sonata
\(h\) Jeep Wrangler
\(i\) Chevrolet Malibu
\(j\) Honda Civic

Ground Truth: \(f\) Prius

Retrieved Memories:
talk start time: task=conv\-49\|\|step=78
memory content: task: LoCoMo conversation conv\-49\|\|obs: \[assistant\] \[SAM\]: Wow, that pic is great\! Do you often spend time in places like this?\|\|action: \[user\] \[EVAN\]: Thanks\! I love being in places like this – it brings back memories of road tripping in my trusty car\.\|\|result:
memory context: The conversation revolves around a photograph that evokes nostalgic memories of past road trips\.
memory keywords: \[‘picture’, ‘memories’, ‘road trip’\]
memory tags: \[‘picture’, ‘memories’, ‘road trip’, ‘joy’, ‘motivation’, ‘health’\]
\[… additional retrieved notes omitted …\]

Response: \(f\)

Table 47: MeMP’s case study in LoCoMo\.

Question: Which type of vacation would Evan prefer with his family, walking tours in metropolitan cities or camping trip in the outdoors?
\(a\) A cruise to explore tropical islands\.
\(b\) Staying at a resort with spa treatments\.
\(c\) camping trip in the outdoors ✓
\(d\) Skiing and snowboarding in the mountains\.
\(e\) A week\-long stay at a family\-friendly resort\.
\(f\) A road trip with national park visits\.
\(g\) Relaxing beach vacation with water sports\.
\(h\) A wildlife safari in Africa\.
\(i\) Exploring European cities and culture\.
\(j\) Visiting historical landmarks and museums\.

Ground Truth: \(c\) camping trip in the outdoors

Retrieved Memories:
\(MeMP retrieves no procedural guidelines for this LoCoMo question: the conversational QA task has no analogue in the procedural\-task corpus, so the agent answered from prompt knowledge alone\.\)

Response: \(c\)
### G\.4 LongMemEval
This subsection presents one successful example per baseline on LongMemEval, drawn from the 7B model evaluation results \(7B\_full100\.json\)\. Case studies for Mem0, LangMem, and LightMem are in Tables[48](https://arxiv.org/html/2605.30690#A7.T48)–[50](https://arxiv.org/html/2605.30690#A7.T50); MemoryOS and MemoryBank in Tables[51](https://arxiv.org/html/2605.30690#A7.T51)–[52](https://arxiv.org/html/2605.30690#A7.T52); A\-MEM and MeMP in Tables[53](https://arxiv.org/html/2605.30690#A7.T53)–[54](https://arxiv.org/html/2605.30690#A7.T54)\.
Table 48: Mem0’s case study in LongMemEval\.

Question: What health issue did I initially think was just a cold?
\(a\) Sinusitis
\(b\) Asthma
\(c\) Pharyngitis
\(d\) Mononucleosis
\(e\) Influenza
\(f\) Laryngitis
\(g\) Tonsillitis
\(h\) bronchitis ✓
\(i\) Rhinitis
\(j\) Pneumonia

Ground Truth: \(h\) bronchitis

Retrieved Memories:
\#\# Relevant memories from past experience:
\- \[2026\-04\-28\] Recently had a bad case of bronchitis
\- \[2026\-04\-28\] Dealing with health issues and wants to boost immune system
\- \[2026\-04\-28\] Wants specific foods or spices to boost immune system
\- \[2026\-04\-28\] Interested in immune\-boosting foods and spices
\- \[2026\-04\-28\] Dealing with health issues and wants to avoid overexertion
\- \[2026\-04\-28\] Is concerned about avoiding allergens in food
\- \[2026\-04\-28\] Asks about apps or resources to identify allergens
\[…8 additional entries omitted…\]

Response: h\)

Table 49: LangMem’s case study in LongMemEval\.

Question: Can you suggest some accessories that would complement my current photography setup?
\(a\) Older model or low\-end accessories that limit photography
\(b\) Low\-resolution lenses or basic tripods without significant enhancement
\(c\) Third\-party or subpar quality items not integrating well with the system
\(d\) Video\-oriented gear or items from rival brands
\(e\) Smartphone camera add\-ons rather than professional equipment
\(f\) Drone attachments or action camera gear incompatible with Sony setup
\(g\) Sony\-compatible accessories or high\-quality photography gear ✓
\(h\) Canon\-compatible accessories or mid\-range photography gear
\(i\) Nikon\-related equipment or affordable items conflicting with current setup
\(j\) Generic accessories or budget\-friendly gear without quality improvement

Ground Truth: \(g\) Sony\-compatible accessories or high\-quality photography gear

Retrieved Memories:
\[2023/05/27 \(Sat\) 03:28\] I am looking to upgrade my camera flash\.
\[2023/05/27 \(Sat\) 03:28\] My camera is Sony A7R IV\.
\[2023/05/25 \(Thu\) 18:31\] The user wants to organize the family photos for easy access\.
\[2023/05/26 \(Fri\) 14:52\] I’m interested in foodie spots and cultural insights\.
\[2023/05/25 \(Thu\) 18:31\] The user took a large number of photos during the family reunion\.
\[2023/05/29 \(Mon\) 20:36\] Planning a trip to Grand Teton National Park next month\.
\[… 14 additional entries omitted …\]

Response: g\)

Table 50: LightMem’s case study in LongMemEval\.

Question: How many engineers do I lead when I just started my new role as Senior Software Engineer? How many engineers do I lead now?
\(a\) When you just started your new role as Senior Software Engineer, you led 5 engineers\. Now, you lead 6 engineers\.
\(b\) When you just started your new role as Senior Software Engineer, you led 4 engineers\. Now, you lead 3 engineers\.
\(c\) When you just started your new role as Senior Software Engineer, you led 6 engineers\. Now, you lead 7 engineers\.
\(d\) When you just started your new role as Senior Software Engineer, you led 1 engineers\. Now, you lead 2 engineers\.
\(e\) When you just started your new role as Senior Software Engineer, you led 10 engineers\. Now, you lead 11 engineers\.
\(f\) When you just started your new role as Senior Software Engineer, you led 7 engineers\. Now, you lead 8 engineers\.
\(g\) When you just started your new role as Senior Software Engineer, you led 8 engineers\. Now, you lead 9 engineers\.
\(h\) When you just started your new role as Senior Software Engineer, you led 4 engineers\. Now, you lead 5 engineers ✓
\(i\) When you just started your new role as Senior Software Engineer, you led 3 engineers\. Now, you lead 4 engineers\.
\(j\) When you just started your new role as Senior Software Engineer, you led 2 engineers\. Now, you lead 3 engineers\.

Ground Truth: \(h\) When you just started your new role as Senior Software Engineer, you led 4 engineers\. Now, you lead 5 engineers

Retrieved Memories:
Long\-term memory \(relevant facts\):
\- \[2026\-04\-27T16:21:52 Mon\] User leads a team of 4 engineers in their new role as Senior Software Engineer\.
\- \[2026\-04\-27T16:23:15 Mon\] User is leading a team of 5 engineers\.
\- \[2026\-04\-27T16:23:10 Mon\] User is a Senior Software Engineer who leads a team of five engineers\.
\- \[2026\-04\-27T16:23:03 Mon\] The agent has been enjoying their role as Senior Software Engineer for a while\.
\- \[2026\-04\-27T16:21:44 Mon\] The user plans to invite their manager Rachel to join a group of 5 people\.
\[…15 additional entries omitted…\]

Response: h\)

Table 51: MemoryOS’s case study in LongMemEval\.

Question: What breed is my dog?
\(a\) Siberian Husky
\(b\) Labrador Retriever
\(c\) Border Collie
\(d\) Yorkshire Terrier
\(e\) Boxer
\(f\) French Bulldog
\(g\) Beagle
\(h\) Dachshund
\(i\) German Shepherd
\(j\) Golden Retriever ✓

Ground Truth: \(j\) Golden Retriever

Retrieved Memories:
\[Mid\-term – retrieved cross\-episode pages\]
\- \[obs\] I’m thinking of getting Max a new collar with a nice name tag\. Do you have any recommendations for a good collar brand or type that would suit a Golden Retriever like Max?
\- \[action\] A new collar with a nice name tag is a great idea\! Golden Retrievers like Max deserve a comfortable, durable, and stylish collar\. Here are some recommendations: Yellow Dog Design, Ruffwear, …
\- \[obs\] I noticed Max has been scratching a lot lately\. Could this be related to allergies?
\- \[action\] Yes, Golden Retrievers can be prone to skin allergies\. Possible causes include environmental allergens, food sensitivities, or flea allergy dermatitis\. \[…\]
\[… many additional obs/action pairs omitted …\]

Response: j\)

Table 52: MemoryBank’s case study in LongMemEval\.

Question: How many days ago did I attend a networking event?
\(a\) 21 days
\(b\) 31 days
\(c\) 22 days
\(d\) 26 days\. 27 days \(including the last day\) is also acceptable\. ✓
\(e\) 25 days
\(f\) 28 days
\(g\) 29 days
\(h\) 24 days
\(i\) 30 days
\(j\) 23 days

Ground Truth: \(d\) 26 days\. 27 days \(including the last day\) is also acceptable\.

Retrieved Memories:
\#\# MemoryBank context
Agent personality / response strategy \(from past tasks\):
\#\#\# Summary of User’s Personality and Response Strategy
User’s Personality Traits: Diligent, Patient, Detail\-Oriented, Supportive\.
Response Strategy: Compliance \(follow detailed instructions accurately\), Efficiency \(generate relevant content quickly\), Adaptability, Clarification, Verification, Optimization\.
Summary of your past tasks \(abbreviated\):
The agent assisted with writing tasks, travel planning, networking follow\-ups, and scheduling\. On 2023/05/29, the user attended a networking event at a conference\.
\[…extensive overall\-history summary omitted…\]

Response: d\) 26

Table 53: A\-MEM’s case study in LongMemEval\.

Question: Can you suggest some accessories that would complement my current photography setup?
\(a\) Older model or low\-end accessories that limit photography
\(b\) Low\-resolution lenses or basic tripods without significant enhancement
\(c\) Third\-party or subpar quality items not integrating well with the system
\(d\) Video\-oriented gear or items from rival brands
\(e\) Smartphone camera add\-ons rather than professional equipment
\(f\) Drone attachments or action camera gear incompatible with Sony setup
\(g\) Sony\-compatible accessories or high\-quality photography gear ✓
\(h\) Canon\-compatible accessories or mid\-range photography gear
\(i\) Nikon\-related equipment or affordable items conflicting with current setup
\(j\) Generic accessories or budget\-friendly gear without quality improvement

Ground Truth: \(g\) Sony\-compatible accessories or high\-quality photography gear

Retrieved Memories:
task: 2023/05/27 \(Sat\) 03:28 \| obs: As a Sony camera user, I’ve been thinking about upgrading my camera bag\. Do you have any recommendations for bags specifically designed for Sony cameras? \| action: I’m happy to help\. As a Sony camera user, there are many great camera bags: Lowepro ProTactic 450 AW II, Peak Design Everyday Backpack, …
task: 2023/05/27 \(Sat\) 03:28 \| obs: I am looking to upgrade my camera flash\. My camera is Sony A7R IV\. \| action: For the Sony A7R IV, I recommend the Sony HVL\-F60RM2 or the Godox V860III\-S, both compatible with Sony’s TTL system\.
\[…extensive additional conversation logs omitted…\]

Response: g\) The user would

Table 54: MeMP’s case study in LongMemEval\.

Question: How many projects have I completed since starting painting classes?
\(a\) 9
\(b\) 4
\(c\) 5 ✓
\(d\) 7
\(e\) 1
\(f\) 10
\(g\) 6
\(h\) 8
\(i\) 3
\(j\) 2

Ground Truth: \(c\) 5

Retrieved Memories:
Here are some guidelines of how to solve the similar task:
\- task\_name: I’m looking for some inspiration for my next painting project\. I’ve been stuck on what to paint next\. By the way, I just finished my 5th project since starting painting classes, and I’m feeling pretty accomplished\!
guidelines: To solve this query, begin by congratulating the user on their recent accomplishments\. Next, suggest various themes and ideas for their next painting project, such as nature escapes, cityscapes, abstract expressions\. For projects counting questions, use the stated count \(5th project = 5 total\)\.
\[…additional task guidelines omitted…\]

Response: c\) 5
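The Response fields in these case studies come in several surface forms, for example (f), (f) Prius, h), d) 26, and g) The user would. The paper does not state how such responses are matched to the ground truth, so the following is only a minimal sketch, assuming that matching is done on the leading option letter. The helper names `option_letter` and `is_correct` are hypothetical and are not taken from the ElasticMem code.

```python
import re
from typing import Optional

# Assumption: an answer starts with an option letter a-j, optionally
# preceded by "(" and always followed by ")", e.g. "(f) Prius" or "h)".
_OPTION = re.compile(r"^\s*\(?([a-j])\)", re.IGNORECASE)


def option_letter(text: str) -> Optional[str]:
    """Return the leading option letter in lowercase, or None if absent."""
    match = _OPTION.match(text)
    return match.group(1).lower() if match else None


def is_correct(response: str, ground_truth: str) -> bool:
    """Compare a response and a ground-truth string by option letter only."""
    predicted = option_letter(response)
    return predicted is not None and predicted == option_letter(ground_truth)


if __name__ == "__main__":
    # Response / ground-truth pairs copied from the case studies above.
    print(is_correct("h)", "(h) bronchitis"))
    print(is_correct("g) The user would", "(g) Sony-compatible accessories or high-quality photography gear"))
    print(is_correct("(f) Prius", "(f) Prius"))
```

Matching on the letter alone tolerates truncated or partially echoed answer text, at the cost of ignoring any contradiction between the letter and the text that follows it.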
### G\.5 ALFWorld
This subsection presents one successful example per baseline on ALFWorld\-seen, drawn from the 7B\-agent evaluation results\. Each table shows the task goal, the retrieved memory excerpt, the agent’s action trace, and the final outcome\. Case studies for Mem0, LangMem, and LightMem are in Tables[55](https://arxiv.org/html/2605.30690#A7.T55)–[57](https://arxiv.org/html/2605.30690#A7.T57); MemoryOS and MemoryBank in Tables[58](https://arxiv.org/html/2605.30690#A7.T58)–[59](https://arxiv.org/html/2605.30690#A7.T59); A\-MEM and MeMP in Tables[60](https://arxiv.org/html/2605.30690#A7.T60)–[61](https://arxiv.org/html/2605.30690#A7.T61)\.
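To illustrate how an action trace relates to the reported step count, the sketch below splits a trace on the arrow separator and counts the actions. The helper `split_trace` is hypothetical, and the assumption that Steps equals the number of listed actions holds only for traces printed in full. A trace that abbreviates several actions into one entry (for example, a single entry covering a range of drawers) lists fewer actions than its reported count.

```python
from typing import List


def split_trace(trace: str) -> List[str]:
    """Split an arrow-separated action trace into individual actions.

    Assumption: actions are separated by the arrow character and a
    trailing check mark, if present, only marks task success.
    """
    cleaned = trace.replace("✓", "")
    return [step.strip() for step in cleaned.split("→") if step.strip()]


if __name__ == "__main__":
    # Trace from the Mem0 ALFWorld case study (reported Steps = 4).
    trace = (
        "go to diningtable 1 (found mug 1, recalled from past trace) → "
        "take mug 1 → go to sidetable 1 → move mug 1 to sidetable 1 ✓"
    )
    for index, step in enumerate(split_trace(trace), start=1):
        print(index, step)
```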
Table 55: Mem0’s case study in ALFWorld\.

Task: put a mug in sidetable\.

Outcome: Score = 1\.0, Steps = 4

Retrieved Memories:
\#\# Relevant memories from past experience:
\- \[2026\-05\-06\] Summary of agent’s execution history
Task Objective: Find two kettles and place them on the dining table\.
Progress Status: 100% complete\.
1\. go to stoveburner 3 → located kettle 2
2\. take kettle 2 from stoveburner 3 → holding kettle 2
3\. go to diningtable 1 → arrived; saw mug 1 among items
4\. move kettle 2 to diningtable 1 → task progressed\.
\[… N additional fact entries omitted …\]

Action Trace: go to diningtable 1 \(found mug 1, recalled from past trace\) → take mug 1 → go to sidetable 1 → move mug 1 to sidetable 1 ✓

Table 56: LangMem’s case study in ALFWorld\.

Task: put a mug in sidetable\.

Outcome: Score = 1\.0, Steps = 7

Retrieved Memories:
Relevant facts about the user from past conversation:
\- There is a keychain 1 on sidetable 1\.
\- There is a statue 1 on sidetable 1\.
\- The user is aware of a keychain 1 located on sidetable 1\.
\- There is a saltshaker 1 located on cabinet 6\.
\- The user is holding candle 1\.
\- There is a keychain 2 on armchair 1\.
\- The user has picked up pillow 1\.
\[… additional FACT entries omitted …\]

Action Trace: go to sidetable 1 \(keychain only\) → go to sidetable 2 \(no mug\) → go to sidetable 1 \(re\-check\) → go to diningtable 1 \(found mug 1\) → take mug 1 → go to sidetable 1 → move mug 1 to sidetable 1 ✓

Table 57: LightMem’s case study in ALFWorld\.

Task: put a mug in sidetable\.

Outcome: Score = 1\.0, Steps = 16

Retrieved Memories:
Long\-term memory \(relevant facts\):
\- \[2026\-04\-29T16:43 Wed\] User moved cd 2 to shelf 3\.
\- \[2026\-04\-28T19:53 Tue\] User moves the cooled mug to the coffeemachine\.
\- \[2026\-04\-29T17:08 Wed\] User moved cellphone 3 to bed 1\.
\- \[2026\-04\-29T16:43 Wed\] Task is to put some cd on shelf\.
\- \[2026\-05\-06T01:37 Wed\] The agent moved plate 1 to countertop 1\.
\- \[2026\-04\-29T17:08 Wed\] The agent goes to the sidetable and finds a candle there, then takes it\.
\- \[2026\-04\-28T05:54 Tue\] User took the tissuebox 2 from the desk\.
\[… additional LTM facts omitted …\]

Action Trace: go to sidetable 1 \(keychain only\) → go to sidetable 2 \(no mug\) → explore drawer 1…drawer 4 → go to diningtable 1 \(found mug 1\) → take mug 1 → go to sidetable 1 → move mug 1 to sidetable 1 ✓

Table 58: MemoryOS’s case study in ALFWorld\.

Task: cool some lettuce and put it in countertop\.

Outcome: Score = 1\.0, Steps = 7

Retrieved Memories:
\#\# MTM \(top\-similar page; multi\-summary JSON, prompts\.py:73\-\-87\):
\[\{"theme": "cool\-then\-place", "keywords": \["fridge", "cool", "countertop"\], "content": "open fridge 1, cool item, place on countertop 1"\}\]
\#\# MTM running meta\-summary \(prompts\.py:217\-\-232\):
*Agent has been completing kitchen “cool / heat / clean” tasks; fridge 1 is the canonical cooling appliance and countertop 1 the canonical drop\-off receptacle for cooled produce\.*
\#\# LPM \[User Private Dat\] \(prompts\.py:171\-\-196; durable env facts under the original label, top\-K by cosine\):
\- fridge 1 contents: lettuce 1, tomato 2, apple 1
\- countertop 1 role: canonical drop\-off receptacle for cooled produce
\- microwave 1 status: broken in this layout \(open returns “nothing happens”\)
\- drawer 2 contents: knife 1; drawer 3 contents: fork 1
\- \(If no private data found, write “None”\.\)
\#\# LPM \[Assistant Knowledge\] \(durable agent tips, same prompt\):
\- Assistant cools an item by issuing cool X with fridge 1 after open fridge 1, at fridge 1
\- Assistant places a cooled item by issuing move X to <recep\> at the destination
\- Assistant avoids re\-opening fridge 1 once opened in the same trajectory, at fridge 1
\- \(If no assistant knowledge found, write “None”\.\)
\#\# LPM user\_profile \(51\-dim procedural\-agent taxonomy, top reflected dims; prompts\.py:91\-\-168 adapted, see memory\.py:268\-\-334\):
\- *\[Spatial Reasoning\]* Container Use Frequency \(High\); Compartment Opening Tendency \(High\); Container Closing Habit \(Medium\); Search Order Stability \(High\)\.
\- *\[Action Composition\]* Property\-Modification Composition \(High\); Pre\-Condition Checking \(High\); Sub\-Goal Decomposition \(High\); Final\-State Verification \(Medium\)\.
\- *\[Reasoning Style\]* Procedural\-Knowledge Use \(High\); Memory\-of\-Past\-Errors Use \(Medium\); Goal\-Adherence Stability \(High\)\.
\- \(Dimensions not reflected in the trajectory are omitted\.\)
\[… remaining LPM entries omitted …\]

Action Trace: go to fridge 1 → open fridge 1 \(found lettuce 1, tomato 2, apple 1\) → take lettuce 1 from fridge 1 → cool lettuce 1 with fridge 1 → close fridge 1 → go to countertop 1 → move lettuce 1 to countertop 1 ✓

Table 59: MemoryBank’s case study in ALFWorld\.

Task: put a clean cloth in cabinet\.

Outcome: Score = 1\.0, Steps = 12

Retrieved Memories:
\#\# MemoryBank context
Agent personality / response strategy \(from past tasks\):
The agent is a highly methodical, task\-oriented, and literal\-minded state machine\. It operates with robotic efficiency, strictly adhering to programmed constraints and “admissible” action sets\.
\[… extensive personality / response strategy summary omitted …\]
Summary of your past tasks \(overall\_history\):
Successful Tasks: 15 tasks involving object manipulation \(heating, cooling, cleaning, relocation\)\.
\[… relevant past episode summaries omitted …\]

Action Trace: go to cabinet 1 → open cabinet 1 \(saw soapbottles\) → go to cabinet 2 \(empty\) → go to cabinet 3 \(empty\) → open cabinet 4 \(found cloth 1\) → take cloth 1 → go to sinkbasin 1 → clean cloth 1 → move cloth 1 to cabinet 1 ✓

Table 60: A\-MEM’s case study in ALFWorld\.

Task: cool some plate and put it in countertop\.

Outcome: Score = 1\.0, Steps = 8

Retrieved Memories:
talk start time: task=conv\|step=1
memory content: task: cool some plate and put it in countertop\.\|\|obs: You arrive at countertop 1 … a creditcard 3, a pan 1, a papertowelroll 1, a spatula 1, and a winebottle 1\.\|\|action: go to countertop 2\|\|result: … a plate 1 …
memory context: The user is navigating through a kitchen environment to locate a plate for the purpose of cooling it on a countertop\.
memory keywords: \[‘plate’, ‘countertop’, ‘cooling’, ‘task’, ‘navigation’\]
memory tags: \[‘Pick\-and\-place workflow’, ‘Cooling tasking’, ‘Receptacle interaction’, …\]
\[… additional retrieved notes omitted …\]

Action Trace: go to countertop 1 \(no plate\) → go to countertop 2 \(found plate 1\) → take plate 1 → go to fridge 1 → open fridge 1 → cool plate 1 with fridge 1 → go to countertop 1 → move plate 1 to countertop 1 ✓

Table 61: MeMP’s case study in ALFWorld\.

Task: put a mug in sidetable\.

Outcome: Score = 1\.0, Steps = 4

Retrieved Memories:
Here are some guidelines of how to solve the similar task:
\- task\_name: cool some plate and put it in countertop\.
guidelines: To solve this query, begin by identifying the location of the target plate by visiting various countertops or storage areas\.
Once the plate is located, take it and proceed to the fridge\. …\- task\_name: put some pillow on armchairguidelines: To solve this query, begin by searching the nearby furniture, such as sofas or other seating areas, to locate a pillow\. Once you have identified and taken the pillow, navigate to the target armchair\. Finally, place the pillow on the armchair to complete the task\.\[… additional procedural guidelines omitted …\]Action Trace:go to diningtable 1 \(found mug 1\)→\\totake mug 1→\\togo to sidetable 1→\\tomove mug 1 to sidetable 1✓\\checkmarkSimilar Articles