Scaling Self-Evolving Agents via Parametric Memory

arXiv cs.AI 06/04/26, 04:00 AM Papers
Summary
Researchers from Alibaba/Qwen and Peking University introduce TMEM, a self-evolving parametric memory framework that uses online LoRA weight updates to let LLM agents genuinely learn from experience within a single episode, rather than relying solely on prompt-space memory. TMEM outperforms summary-based and retrieval-based baselines across multiple benchmarks including LoCoMo, LongMemEval-S, and CL-Bench.
arXiv:2606.04536v1 Announce Type: new Abstract: Existing memory-augmented LLM agents store past experience exclusively in prompt space, as textual summaries or retrieved passages, while keeping model parameters frozen throughout a rollout. Such agents can \emph{look up} what they have seen but cannot \emph{learn from} it: their policy is unchanged by experience, and any information dropped from the context is permanently lost. We introduce \texttt{TMEM}, a self-evolving parametric memory framework in which the agent not only compresses history into explicit memory but also absorbs distilled supervision into fast LoRA weights $\Delta_t$ via lightweight online updates, genuinely altering its future behavior within a single episode. We formalize this as an agentic decision process with fast-weight rollout dynamics: actions are sampled from $\pi_{\theta_0+\Delta_t}$, while extraction actions produce supervision that updates $\Delta_t$ for subsequent decisions. This view makes the extraction policy directly optimizable by RL: training $\theta_0$ improves not only task actions but also the quality of the data used for online LoRA adaptation. We further propose SVD-based initialization of the LoRA subspace to accelerate online convergence. Experiments on LoCoMo, LongMemEval-S, multi-objective search, and CL-Bench show that \texttt{TMEM} consistently outperforms summary-based and retrieval-based baselines across different model scales.
# Scaling Self-Evolving Agents via Parametric Memory
Source: [https://arxiv.org/html/2606.04536](https://arxiv.org/html/2606.04536)
Tao Ren∗, Weiyao Luo, Hui Yang, Rongzhi Zhu, Xiang Huang, Yuchuan Wu✉, Bingxue Chou, Jieping Ye, Jiafeng Liang✉, Yongbin Li✉, Yijie Peng✉

Qwen-Character Team, Alibaba Group; Peking University

###### Abstract

Existing memory-augmented LLM agents store past experience exclusively in prompt space, as textual summaries or retrieved passages, while keeping model parameters frozen throughout a rollout. Such agents can *look up* what they have seen but cannot *learn from* it: their policy is unchanged by experience, and any information dropped from the context is permanently lost. We introduce TMEM, a self-evolving parametric memory framework in which the agent not only compresses history into explicit memory but also absorbs distilled supervision into fast LoRA weights $\Delta_t$ via lightweight online updates, genuinely altering its future behavior within a single episode. We formalize this as an agentic decision process with fast-weight rollout dynamics: actions are sampled from $\pi_{\theta_0+\Delta_t}$, while extraction actions produce supervision that updates $\Delta_t$ for subsequent decisions. This view makes the extraction policy directly optimizable by RL: training $\theta_0$ improves not only task actions but also the quality of the data used for online LoRA adaptation. We further propose SVD-based initialization of the LoRA subspace to accelerate online convergence. Experiments on LoCoMo, LongMemEval-S, multi-objective search, and CL-Bench show that TMEM consistently outperforms summary-based and retrieval-based baselines across different model scales.

✉ Corresponding authors: jfliang@ir.hit.edu.cn, pengyijie@pku.edu.cn; ∗ Email to: rtkenny@stu.pku.edu.cn

## 1 Introduction

*“We are who we are because of what we learn and what we remember\.”*— Eric Kandel

Long\-horizon LLM agents are expected to operate over experiences that far exceed a single context window: multi\-session conversations, extended web\-search trajectories, iterative tool use, and evolving user preferences\(Li et al\.,[2026a](https://arxiv.org/html/2606.04536#bib.bib9);[b](https://arxiv.org/html/2606.04536#bib.bib10); Ren et al\.,[2025a](https://arxiv.org/html/2606.04536#bib.bib20); Yang et al\.,[2026b](https://arxiv.org/html/2606.04536#bib.bib35)\)\. In these settings, memory should do more than preserve a compressed record of past tokens; it should change how future actions are produced\. A useful analogy comes from biological memory: learning and memory are coupled because experience is ultimately reflected in the same substrate that performs computation\(Liang et al\.,[2025](https://arxiv.org/html/2606.04536#bib.bib12);[2026](https://arxiv.org/html/2606.04536#bib.bib13)\)\. Current LLM agents largely break this coupling\. They either keep more history in the prompt or store past interactions in external memories such as summaries and retrieval indices, while the underlying model parameters remain frozen throughout the rollout\. As a result, experience influences future behavior only when it is explicitly brought back as input tokens\.

The simplest way to preserve experience is to keep the full interaction history in context, but this strategy fails along two axes\. Computationally, attention cost grows quickly with sequence length, making long episodes expensive\(Yu et al\.,[2025](https://arxiv.org/html/2606.04536#bib.bib37); Ren et al\.,[2025b](https://arxiv.org/html/2606.04536#bib.bib21);[c](https://arxiv.org/html/2606.04536#bib.bib22); Li et al\.,[2026c](https://arxiv.org/html/2606.04536#bib.bib11)\)\. Statistically, raw histories are dominated by redundant tool outputs, repeated dialogue, and task\-irrelevant details, so the relevant evidence becomes sparse inside a noisy prompt\. Even when the necessary information is technically present, a single forward pass must both locate it and use it correctly\. Once the context is truncated or compressed, any omitted evidence no longer has a path to affect the policy\.

Existing memory agents address this bottleneck mainly through prompt\-space memory\. Summary\-based methods periodically compress the history into textual states, reducing context length but introducing a lossy bottleneck that can discard fine\-grained facts\. Retrieval\-based methods preserve more verbatim evidence in an external index, but their effectiveness depends on embedding quality, query formulation, and index maintenance\. These approaches are useful, yet they share a structural limitation: the learned policy itself does not change during the episode\. The agent can consult stored experience, but it cannot internalize that experience into the computation used for later decisions\. This raises a central question: can an LLM agent write useful experience into its own parameters at test time, so that memory shapes the policy rather than merely occupying the prompt?

We introduce TMEM, a self-evolving parametric memory framework for answering this question. During a rollout, the agent maintains a working context $h_t$, optional explicit memory $m_t$, and fast parametric memory $\Delta_t$ represented by LoRA weights. When the context budget is reached, the agent enters a memory-writing mode: it distills the current session into grounded QA-style supervision and applies a lightweight online SFT update to $\Delta_t$. Subsequent actions are then sampled from the adapted policy $\pi_{\theta_0+\Delta_t}$, where the base parameters $\theta_0$ remain fixed within the rollout. In this way, distilled experience can influence future reasoning through fast weights, without requiring the same evidence to be repeatedly reinserted into the prompt.

This formulation also changes what should be optimized. Memory extraction is not an auxiliary preprocessing step; it is an action whose quality determines the data used for later online adaptation. We therefore formalize agent execution as a fast-weight rollout in which ordinary task actions, memory-writing actions, explicit-memory updates, and LoRA updates are part of one decision process. During RL training, $\theta_0$ is optimized across rollouts while gradients are stopped through the online update operator. The resulting objective trains the base model not only to solve tasks, but also to produce supervision that makes its own future fast-weight updates useful.

Our contributions are as follows:

1. We formalize TMEM as a fast-weight rollout process whose policy depends on working context, explicit memory, and online LoRA weights. Setting $\Delta_t \equiv 0$ recovers purely explicit-memory agents as special cases, unifying existing context-management designs under a single framework.
2. We propose SVD-based initialization for the LoRA projection matrix, anchoring online updates to high-energy directions of the pretrained weights. This accelerates few-step adaptation and reduces per-trigger compute by fixing the projection matrix and updating only the coefficient matrix.
3. We derive a stop-gradient policy optimization objective over fast-weight rollouts, allowing outcome rewards to improve both task behavior and the extraction of supervision used by online LoRA updates.
4. We validate TMEM on LoCoMo, LongMemEval-S, multi-objective search, and CL-Bench, showing consistent gains over summary-based and retrieval-based memory baselines across Qwen3-4B and Qwen3-8B backbones.

## 2 Related Works

#### Summarization\-based context management\.

Summarization-based methods compress long interaction histories into concise textual memories that fit within a fixed context budget. Early work explores heuristic or recurrent-style compression, such as RecurrentGPT’s language-state simulation and MemoryBank’s forgetting-inspired memory update (Zhou et al., [2023](https://arxiv.org/html/2606.04536#bib.bib41); Zhong et al., [2024](https://arxiv.org/html/2606.04536#bib.bib40)). Later methods learn when and how to compress context: MemAgent performs chunk-wise memory overwriting for long-context conversational tasks without tool calls (Yu et al., [2025](https://arxiv.org/html/2606.04536#bib.bib37)), while MEM1 applies the same idea to search-agent tasks that invoke an external search tool (Zhou et al., [2025](https://arxiv.org/html/2606.04536#bib.bib42)); SUPO and Context-Folding further optimize summary use for long-horizon agents (Lu et al., [2025](https://arxiv.org/html/2606.04536#bib.bib15); Sun et al., [2025](https://arxiv.org/html/2606.04536#bib.bib25)). Recent systems further improve extraction quality or scalability through proactive correction and adaptive stopping (Yang et al., [2026a](https://arxiv.org/html/2606.04536#bib.bib34); Wang et al., [2026](https://arxiv.org/html/2606.04536#bib.bib29)). However, their memory still lives entirely in context space, whose compression capacity is limited: once fine-grained evidence is omitted from a summary, later reasoning cannot recover it.

#### Retrieval\-augmented context management\.

Retrieval-based methods preserve history in external stores and retrieve relevant snippets when needed. Standard RAG first augments generation with dense retrieval over documents (Lewis et al., [2020](https://arxiv.org/html/2606.04536#bib.bib8)); agent systems then extend this idea to interactive memory, including Generative Agents’ reflection streams and MemGPT’s managed context cache (Park et al., [2023](https://arxiv.org/html/2606.04536#bib.bib19); Packer et al., [2023](https://arxiv.org/html/2606.04536#bib.bib18)). More structured memory systems organize stored experience with entity graphs, scalable long-term stores, RL-trained memory management, or agentic linking, such as GraphRAG, Mem0, Memory-R1, and A-MEM (Research, [2024](https://arxiv.org/html/2606.04536#bib.bib23); Chhikara et al., [2025](https://arxiv.org/html/2606.04536#bib.bib6); Yan et al., [2025](https://arxiv.org/html/2606.04536#bib.bib33); Xu et al., [2026](https://arxiv.org/html/2606.04536#bib.bib32)). Retrieval avoids aggressive summarization, but it depends heavily on retriever quality: irrelevant or missed passages directly affect the policy, and constructing a high-quality memory RAG store with embeddings, metadata, and update rules is often time-consuming.

#### Test\-time training\.

Test-time training (TTT) adapts model parameters at inference time to handle distribution shifts or instance-specific information (Sun et al., [2020](https://arxiv.org/html/2606.04536#bib.bib26)). Recent LLM-oriented work scales this idea to long contexts or language tasks: end-to-end TTT adapts to long-context inputs, LaCT improves chunk-level update efficiency, and methods such as TLM and TT-SI use unlabeled test data or self-generated data for on-the-fly improvement (Tandon et al., [2025](https://arxiv.org/html/2606.04536#bib.bib27); Zhang et al., [2025](https://arxiv.org/html/2606.04536#bib.bib38); Hu et al., [2025](https://arxiv.org/html/2606.04536#bib.bib7); Acikgoz et al., [2025](https://arxiv.org/html/2606.04536#bib.bib1)). Doc-to-LoRA further internalizes documents into LoRA weights with a hypernetwork (Charakorn et al., [2026](https://arxiv.org/html/2606.04536#bib.bib3)). However, existing TTT methods are not optimized for agentic memory: they typically train on the whole context or document rather than selectively learning from agent-relevant experience, making updates heavy and outside the agent’s decision process. We instead treat fast weights $\Delta_t$ as part of the rollout dynamics and update them only from distilled memory.

#### Self\-evolving agents\.

Self-evolving agents progressively improve their behavior by reusing feedback, failures, or self-generated experience. Early systems mainly evolve through non-parametric experience accumulation: Reflexion stores verbal self-critiques, SPRING converts papers into game-playing strategies, and Voyager grows a code-based skill library across Minecraft episodes (Shinn et al., [2023](https://arxiv.org/html/2606.04536#bib.bib24); Wu et al., [2023](https://arxiv.org/html/2606.04536#bib.bib31); Wang et al., [2024](https://arxiv.org/html/2606.04536#bib.bib28)). Later methods strengthen this loop with distilled experience or self-generated training data, such as ExpeL’s textual lessons and SPIN’s self-play fine-tuning (Zhao et al., [2024](https://arxiv.org/html/2606.04536#bib.bib39); Chen et al., [2024](https://arxiv.org/html/2606.04536#bib.bib5)). Recent work further targets autonomous agent improvement: UI-Voyager learns GUI control from failed trajectories, while LSE trains models to refine their own test-time contexts with improvement-based rewards (Lin et al., [2026](https://arxiv.org/html/2606.04536#bib.bib14); Chen et al., [2026](https://arxiv.org/html/2606.04536#bib.bib4)). Unlike these methods, which mostly evolve across episodes, tasks, or outer-loop updates, TMEM performs *intra-episode parametric self-evolution*: distilled experience is written into fast LoRA weights $\Delta_t$ during a single rollout, so the policy itself changes in real time rather than only consulting external memories or revised prompts.

## 3 Agentic Decision Process with Parametric Memory

We formalize TMEM as an agentic decision process in which the policy can change during a rollout through fast LoRA updates. Each episode starts from a task prompt $q \sim D$ and contains at most $T$ model-generation events. Let $\mathcal{V}$ denote the finite token vocabulary and $\mathcal{V}^{*}$ the set of finite token sequences. At generation event $t \in \{1, \ldots, T\}$, the agent maintains a working context $h_t \in \mathcal{V}^{*}$, an explicit textual memory $m_t \in \mathcal{V}^{*}$, and fast parametric memory $\Delta_t$ represented by LoRA weights. The generated output is sampled from the adaptive policy

$$a_t \sim \pi_{\theta_0+\Delta_t}(\cdot \mid c_t), \qquad c_t \in \{(q, h_t, m_t),\ (q, h_t, m_t, d)\}, \tag{1}$$

where $a_t$ denotes the model-generated action. Its semantics are determined by the conditioning context: under the ordinary context $(q, h_t, m_t)$ it is a task/tool action or final response, while under the extraction context $(q, h_t, m_t, d)$, in which $d$ is a memory-writing prompt, it is a memory-writing action such as a summary, QA pairs, distilled facts, or instruction–response examples. The base parameters $\theta_0$ are fixed within a rollout but optimized across RL training; $\Delta_t$ changes within the rollout and acts as fast-weight memory. Purely explicit-memory agents are recovered by setting $\Delta_t \equiv 0$.

Memory-writing prompt $d$ for QA-pair extraction:

> Task: Generate grounded SFT QA pairs from the current session.
>
> Given the problem to solve, previous conversation history. Now you should create high-quality supervised fine-tuning (SFT) QA pairs grounded on the history.
>
> Requirements:
>
> 1. Generate QA pairs adaptively based on how much useful information is present in the session.
>    - If the session contains rich, concrete facts, generate more QA pairs.
>    - If the session has limited useful evidence, generate fewer QA pairs.
>    - If there is no usable evidence, return an empty JSON array.
> 2. You can generate QA pairs that capture the lessons learned from the session to help improve future interactions, such as preferences, plans, events, and temporal details, rather than just factual questions.
> 3. Each question must be answerable using explicit information from the session.
> 4. Each answer must be concise, factual, and directly supported by the session.
> 5. Cover diverse types when possible: who/what/when/where, preferences, plans, events, and temporal details.
> 6. Avoid duplicate or near-duplicate QA pairs, and keep wording natural and clear.
>
> Return ONLY a JSON array. Each item must be: `{"instruction": "<question>", "output": "<answer>"}`
>
> Output the generated SFT QA pairs in the specified JSON format. Do not include any explanations or additional text.

Figure 1: Memory-writing prompt $d$ used when the working context exceeds the preset length. The prompt instructs the agent to extract grounded SFT QA pairs from the current session for online LoRA updates.

### 3.1 Fast-Weight Rollout Dynamics

Let $L_{\max}$ be the context budget and let $\ell(\cdot)$ denote token length. For compactness, write $s_t = (q, h_t, m_t, \Delta_t)$ and let $\{(s_t, a_t)\}_{t=1}^{T}$ be the complete sequence of model-generation events in a rollout. Memory operations are triggered at indices $\{t_i\}_{i=1}^{N}$ where $\ell(h_{t_i}) + \ell(m_{t_i}) > L_{\max}$ before the final response is emitted. We set $t_0 = 0$ and $t_{N+1} = T$, with empty ranges ignored. Between two consecutive boundaries, for $i = 1, \ldots, N+1$ and $t = t_{i-1}+1, \ldots, t_i - 1$, the agent performs normal interaction:

$$\begin{aligned}
a_t &\sim \pi_{\theta_0+\Delta_t}(\cdot \mid q, h_t, m_t), && \text{(2)}\\
o_t &\sim \mathbb{P}_{env}(\cdot \mid q, h_t, m_t, a_t), && \text{(3)}\\
h_{t+1} &= (h_t, a_t, o_t), \qquad m_{t+1} = m_t, \qquad \Delta_{t+1} = \Delta_t, && \text{(4)}
\end{aligned}$$

where $o_t$ denotes the environment feedback or observation returned after executing $a_t$, such as a tool result, user response, or task-state update, and $h_{t+1} = (h_t, a_t, o_t)$ denotes concatenation. At each trigger index $t_i$ for $1 \le i \le N$, the agent enters an extraction mode by appending the memory-writing prompt $d$ in Figure [1](https://arxiv.org/html/2606.04536#S3.F1), which asks it to distill the accumulated context into QA-pair supervision:

$$a_{t_i} \sim \pi_{\theta_0+\Delta_{t_i}}(\cdot \mid q, h_{t_i}, m_{t_i}, d), \tag{5}$$

where the appended prompt $d$ makes $a_{t_i}$ a memory-writing action rather than a normal environment action. The explicit memory and fast weights are then updated according to the memory strategy:

$$(m_{t_i+1}, \Delta_{t_i+1}, h_{t_i+1}) = \begin{cases} (\varnothing,\ \mathcal{T}(\Delta_{t_i}, a_{t_i}),\ \varnothing), & \text{TMEM}, \\ (a_{t_i},\ \Delta_{t_i},\ \varnothing), & \text{summary-only baseline}. \end{cases} \tag{6}$$

Here $\mathcal{T}$ is a lightweight online SFT/LoRA update that writes the extracted supervision into fast weights. In TMEM, the extraction action $a_{t_i}$ is structured as QA-style supervision: it is absorbed into $\Delta_{t_i+1}$ through $\mathcal{T}$, while no textual memory is carried forward. In the summary-only baseline, the fast weights remain unchanged and the extracted summary $a_{t_i}$ is carried forward as explicit prompt-space memory. At the final boundary $t_{N+1} = T$, the action $a_T$ is the terminal task response sampled from the ordinary context $(q, h_T, m_T)$; it receives the episode reward and produces no subsequent environment transition.
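The rollout dynamics in Eqs. (2) to (6) can be sketched as a plain loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `policy`, `env_step`, and `online_update` are hypothetical callables standing in for the LLM, the environment, and the operator $\mathcal{T}$; `token_len` counts whitespace tokens instead of using a real tokenizer; fast weights are modelled as a plain list; and the final response is simply forced at event $T$.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RolloutState:
    history: List[str]   # working context h_t
    memory: str          # explicit textual memory m_t
    fast_weights: list   # stand-in for the LoRA fast weights Delta_t


def token_len(text: str) -> int:
    # Hypothetical length function: whitespace tokens, not a real tokenizer.
    return len(text.split())


def rollout(question, policy, env_step, online_update, l_max, max_events,
            use_tmem=True, d="<memory-writing prompt>"):
    """Run at most `max_events` generation events; returns (final, state, trace)."""
    if max_events < 1:
        raise ValueError("max_events must be at least 1")
    state = RolloutState(history=[], memory="", fast_weights=[])
    trace = []
    for event in range(1, max_events + 1):
        if event == max_events:
            # Terminal response a_T, sampled under the ordinary context (q, h_T, m_T).
            final = policy(question, state.history, state.memory, state.fast_weights, None)
            trace.append(("final", final))
            return final, state, trace
        used = token_len(" ".join(state.history)) + token_len(state.memory)
        if used > l_max:
            # Trigger: extraction action under (q, h, m, d), as in Eq. (5).
            a = policy(question, state.history, state.memory, state.fast_weights, d)
            trace.append(("extract", a))
            if use_tmem:
                # Eq. (6), TMEM branch: clear text, write supervision into fast weights.
                state = RolloutState([], "", online_update(state.fast_weights, a))
            else:
                # Eq. (6), summary-only branch: carry the summary forward as text.
                state = RolloutState([], a, state.fast_weights)
            continue
        # Ordinary interaction, Eqs. (2) to (4).
        a = policy(question, state.history, state.memory, state.fast_weights, None)
        o = env_step(a)
        trace.append(("act", a))
        state.history += [a, o]


# Toy stand-ins, purely for illustration.
def toy_policy(q, h, m, fw, d):
    return "qa" if d is not None else "act word word"

def toy_env(a):
    return "obs word word"

def toy_update(fw, a):
    return fw + [a]

final, state, trace = rollout("q", toy_policy, toy_env, toy_update, l_max=10, max_events=8)
```

With `use_tmem=False` the same loop reproduces the summary-only branch: the extraction output is kept in `state.memory` and `state.fast_weights` never changes.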

![Refer to caption](https://arxiv.org/html/2606.04536v1/x1.png)

Figure 2: Architecture of TMEM.

### 3.2 Parametric Memory Initialization via SVD

Standard LoRA initializes $\Delta_0 = BA$ with $B = 0$ and $A$ drawn randomly, so that the adapted model starts identical to the base model $\theta_0$ while the low-rank subspace explored during training is arbitrary. In our setting, each TTT trigger has only a small number of gradient steps, so random subspaces waste early updates discovering where useful adaptation should occur. We instead initialize the projection matrix $A$ from the pretrained weight spectrum and train only the coefficient matrix $B$ online.

In TMEM, the LoRA adapters are applied only to the feed-forward network (FFN) projections in the Transformer backbone. For each selected target weight $W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$, let $W = U \Sigma V^{\top}$ and denote the top-$r$ right singular vectors and singular values by $V_r \in \mathbb{R}^{d_{\text{in}} \times r}$ and $\Sigma_r \in \mathbb{R}^{r \times r}$. We set

$$B_0 = 0, \qquad A_0 = \Sigma_r V_r^{\top}. \tag{7}$$

Because $B_0 = 0$, the initial update is $\Delta_0 = B_0 A_0 = 0$, and the rollout begins from the original base policy. The difference from standard LoRA is that $A_0$ already spans the rank-$r$ subspace associated with the largest singular values of $W$. During online TTT, we keep $A = A_0$ fixed and update only $B$, so the few available gradient steps learn coefficients within a pretrained, high-energy subspace rather than jointly searching for both the subspace and the coefficients.
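The initialization in Eq. (7) can be illustrated with NumPy. This is a minimal sketch assuming a single dense weight matrix; the function name `svd_lora_init` and the random stand-in for $W$ are hypothetical, and a real adapter would wrap the FFN projections of the backbone instead.

```python
import numpy as np


def svd_lora_init(W: np.ndarray, r: int):
    """Return (B0, A0) with B0 = 0 and A0 = Sigma_r V_r^T, following Eq. (7)."""
    # Reduced SVD: S holds singular values in descending order, Vt has shape (k, d_in).
    _, S, Vt = np.linalg.svd(W, full_matrices=False)
    A0 = S[:r, None] * Vt[:r, :]        # (r, d_in): top-r right singular vectors, scaled
    B0 = np.zeros((W.shape[0], r))      # (d_out, r): zero, so Delta_0 = B0 @ A0 = 0
    return B0, A0


rng = np.random.default_rng(0)
W = rng.standard_normal((8, 12))        # stand-in for one pretrained projection matrix
B0, A0 = svd_lora_init(W, r=3)
delta0 = B0 @ A0                        # the initial low-rank update

# Forward pass of the adapted layer; identical to W @ x while B is still zero.
x = rng.standard_normal(12)
y = W @ x + B0 @ (A0 @ x)
```

Only `B0` would be trained online in this sketch; `A0` stays frozen, so each trigger updates `d_out * r` coefficients per adapted matrix.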

###### Theorem 1 (Approximation advantage of SVD initialization).

Under the setup and assumptions in Appendix [A](https://arxiv.org/html/2606.04536#A1), SVD initialization achieves no larger approximation error than random Gaussian initialization:

$$\mathcal{E}(A_{\text{SVD}}) \le \mathbb{E}\bigl[\mathcal{E}(A_{\text{rand}})\bigr],$$

and the inequality is strict whenever $\rho_r(\Delta^{*}, W) > r / d_{\text{in}}$.

This theorem justifies using SVD to initialize the LoRA subspace: before any online update, it already provides a better \(or equal\) low\-rank approximation target than a random subspace in expectation\.

## 4 Policy Optimization with Fast-Weight Rollouts

The fast-weight rollout couples two kinds of actions sampled from the same policy: ordinary actions that interact with the environment and extraction actions, triggered by the memory prompt $d$, that become training data for later LoRA updates. We optimize the base model parameters $\theta_0$ so that both kinds of actions improve the final task reward. During each sampled rollout, $\theta_0$ is held fixed while $\Delta_t$ evolves through $\mathcal{T}$; across RL updates, $\theta_0$ is trained to become a better initialization for this self-evolving process.

Let $\tau$ denote the full rollout, including normal actions, observations, extraction actions, explicit-memory updates, and fast-weight updates. We use an outcome reward $R(\tau)$ and optimize

$$J(\theta_0) = \mathbb{E}_{\tau \sim p_{\theta_0, \mathcal{T}, \mathbb{P}_{env}}}\left[R(\tau)\right]. \tag{8}$$

In our formulation, the LoRA fast weights $\Delta_t$ are test-time memory states produced by the online operator $\mathcal{T}$, rather than RL parameters to optimize directly. During RL, we therefore treat $\mathcal{T}$ as part of the rollout transition and apply stop-gradient through the update path that produces $\Delta_t$. This design makes the optimization target explicit: train the base parameters $\theta_0$ to produce better ordinary actions and better QA-style extraction actions, so that the induced test-time updates are more useful for later decisions. Using a segment-wise likelihood-ratio decomposition over the fast-weight rollout, define the boundary context

$$\bar{c}_i = \begin{cases} (q, h_{t_i}, m_{t_i}, d), & 1 \le i \le N, \\ (q, h_T, m_T), & i = N+1. \end{cases}$$

The corresponding policy gradient estimate is

$$\nabla_{\theta_0} J(\theta_0) \approx \mathbb{E}_{\tau}\left[R(\tau)\left(\sum_{i=1}^{N+1}\left[\sum_{t=t_{i-1}+1}^{t_i-1} \nabla_{\theta_0} \log \pi_{\theta_0+\operatorname{sg}(\Delta_t)}(a_t \mid q, h_t, m_t) + \nabla_{\theta_0} \log \pi_{\theta_0+\operatorname{sg}(\Delta_{t_i})}(a_{t_i} \mid \bar{c}_i)\right]\right)\right], \tag{9}$$

where $\operatorname{sg}(\cdot)$ denotes stop-gradient and $d$ is the extraction prompt used to elicit QA-pair supervision. For each segment, the inner sum covers ordinary actions before the next boundary, and the boundary log-probability trains either the extraction action at $t_i$ when $i \le N$ or the final response when $i = N+1$. A short derivation is provided in Appendix [B](https://arxiv.org/html/2606.04536#A2).

Gradients from normal actions improve reasoning and tool-use behavior, while gradients from extraction actions improve the model’s ability to produce supervision that can be absorbed by $\mathcal{T}$. Since later rewards are generated under adapted policies $\pi_{\theta_0+\Delta_t}$, the RL signal favors base models that are not only capable at inference time but also easy to specialize through a small number of online LoRA updates.
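The index bookkeeping behind the segment-wise sum in Eq. (9) can be checked with a short sketch. It assumes per-event log-probabilities have already been computed with the fast weights treated as constants (the stop-gradient); the helper names `segment_events` and `reinforce_surrogate` are hypothetical, and the surrogate is the plain REINFORCE form without baselines or clipping.

```python
def segment_events(T, triggers):
    """Split events 1..T into ordinary and boundary indices as in Eq. (9).

    `triggers` are the sorted trigger indices t_1 < ... < t_N, all smaller than T.
    """
    bounds = [0] + list(triggers) + [T]          # t_0 = 0 and t_{N+1} = T
    segments = []
    for i in range(1, len(bounds)):
        ordinary = list(range(bounds[i - 1] + 1, bounds[i]))   # t_{i-1}+1 .. t_i-1
        kind = "extraction" if i < len(bounds) - 1 else "final"
        segments.append({"ordinary": ordinary, "boundary": bounds[i], "kind": kind})
    return segments


def reinforce_surrogate(reward, logprobs, T, triggers):
    """Scalar whose gradient w.r.t. theta_0 matches the single-sample term in Eq. (9).

    `logprobs[t]` is log pi(a_t | context_t) for event t; here they are plain floats,
    so this only demonstrates which terms enter the sum.
    """
    total = 0.0
    for seg in segment_events(T, triggers):
        total += sum(logprobs[t] for t in seg["ordinary"])   # inner sum over ordinary actions
        total += logprobs[seg["boundary"]]                    # extraction or final action
    return reward * total


segments = segment_events(T=7, triggers=[3, 5])
logprobs = {t: -0.5 * t for t in range(1, 8)}
value = reinforce_surrogate(reward=1.0, logprobs=logprobs, T=7, triggers=[3, 5])
```

Every generation event appears exactly once, either as an ordinary action or as a boundary action, so the surrogate equals the reward times the sum of all event log-probabilities.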

## 5 Experiments

### 5.1 Experimental Setup

We evaluate four memory strategies (no memory, summary-based memory, retrieval-based memory, and TMEM) with Qwen3-4B and Qwen3-8B backbones across four task families: LoCoMo, LongMemEval-S, multi-objective search, and CL-Bench. Within each benchmark, all memory strategies use the same backbone model, while the remaining task-specific settings follow the original benchmark protocols. We report three-run evaluations under fixed settings and use the run average as the main score.

#### Baselines\.

The no-memory baseline keeps only the current working context (Yao et al., [2022](https://arxiv.org/html/2606.04536#bib.bib36)). The summary-based baseline uses prompt-space memory rewriting and is instantiated differently by task family: for conversational and context-learning tasks (LoCoMo, LongMemEval-S, and CL-Bench), which require no external tool calls, we follow MemAgent (Yu et al., [2025](https://arxiv.org/html/2606.04536#bib.bib37)), which performs chunk-wise memory overwriting over long dialogue and document contexts; for search-agent tasks (multi-objective search), which require an active search tool, we follow MEM1 (Zhou et al., [2025](https://arxiv.org/html/2606.04536#bib.bib42)), which maintains a running compressed memory across search steps. The retrieval-based baseline uses A-MEM-style external memory retrieval (Xu et al., [2026](https://arxiv.org/html/2606.04536#bib.bib32)). These baselines do not update model parameters during an evaluation episode. For all methods, the input examples, answer templates, and scoring rules are shared. When reporting efficiency, we include the cost of memory-specific operations, including retrieval/index access for retrieval memory, summary rewriting for summary memory, and online LoRA updates for TMEM.

#### Metrics\.

For QA-style benchmarks (LoCoMo and LongMemEval), we report token-level F1 and Exact Match (EM) after lowercasing, stripping punctuation, and normalizing whitespace. Scores are computed per instance and then averaged over the evaluation set. For multi-objective search, the final answer is parsed as an ordered answer list; each position is matched to the corresponding objective, and episode-level F1/EM is averaged over positions before aggregating across episodes. For CL-Bench, we use Qwen3-Max as an LLM-as-judge to score each response against the provided rubric, and we report rubric-based accuracy on the filtered evaluation set; category columns are subset accuracies, while the total column is instance-level accuracy over all 289 retained examples. Unless otherwise noted, each reported table number is the mean over 3 independent runs, and the $\pm$ term denotes the standard deviation across runs.
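The QA metrics described above can be sketched as follows. This is an illustrative implementation rather than the benchmarks' official scorers: the function names are hypothetical, punctuation stripping covers ASCII punctuation only, and scoring a missing list position as an empty prediction is an assumption.

```python
import string
from collections import Counter


def normalize(text: str) -> str:
    # Lowercase, strip ASCII punctuation, and collapse whitespace.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    p, g = normalize(pred).split(), normalize(gold).split()
    if not p or not g:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())   # multiset token overlap
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def score_answer_list(preds, golds):
    """Position-wise F1/EM for one multi-objective episode, averaged over positions."""
    # Assumption: positions missing from the prediction are scored as empty answers.
    padded = list(preds[:len(golds)]) + [""] * (len(golds) - len(preds))
    f1 = sum(token_f1(p, g) for p, g in zip(padded, golds)) / len(golds)
    em = sum(exact_match(p, g) for p, g in zip(padded, golds)) / len(golds)
    return f1, em


f1_example = token_f1("Eiffel Tower, Paris", "eiffel tower")
episode_f1, episode_em = score_answer_list(["Paris", "1990"], ["paris", "1989"])
```

Benchmark-level numbers would then be the mean of these per-instance or per-episode scores over the evaluation set.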

#### Conversational memory\.

We use LoCoMo (Maharana et al., [2024](https://arxiv.org/html/2606.04536#bib.bib17)) and LongMemEval-S (Wu et al., [2024](https://arxiv.org/html/2606.04536#bib.bib30)) to evaluate long-term conversational memory. For LoCoMo, we use the official LoCoMo-10 QA setting, where each example provides a very long multi-session dialogue history and a memory question. For LongMemEval-S, we use the released 500-instance evaluation set under the full-haystack protocol: the model receives the full timestamped chat history and the memory question, not oracle evidence sessions.

#### Search\-agent memory\.

We evaluate search-oriented memory with the multi-objective search protocol used by MEM1 (Zhou et al., [2025](https://arxiv.org/html/2606.04536#bib.bib42)). Here an objective is an independent QA subgoal rather than an optimization objective: each episode combines 4 or 8 QA questions into one long-horizon search task. The agent searches for unresolved subquestions over multiple steps under the source task protocol and must finally output all answers in the original order using a fixed answer-list format. We compute F1 and EM on the ordered answer list.

#### Context learning\.

We evaluate context learning on CL-Bench (Luo et al., [2025](https://arxiv.org/html/2606.04536#bib.bib16)) after filtering the released benchmark. The motivation for filtering is discriminability: when we run the full benchmark on 4B and 8B models, near-floor accuracy makes it difficult to distinguish memory strategies from each other or from the no-memory baseline. To construct a subset with sufficient signal, we first ran Qwen3-30A3B on every instance and scored each response against the provided rubric conditions; we then retained the instances that Qwen3-30A3B answered at least partially correctly, yielding a set whose difficulty falls within the capability range of the target models rather than being too hard for all of them. The selection never uses the outputs of any 4B or 8B method, so instances are not chosen because a target model or memory strategy performs well on them. The resulting set contains 289 instances across Domain Knowledge Reasoning, Procedural Task Execution, Rule System Application, and Empirical Discovery & Simulation. Each instance provides a system prompt, task context, user task, and rubric; we use Qwen3-Max as an LLM-as-judge to evaluate whether the response satisfies the rubric conditions, and an answer is counted as correct only if it passes the rubric judgment. We use this filtered CL-Bench set as a discriminative context-learning testbed.

#### TMEM implementation\.

Across all tasks, the method uses the same online LoRA module construction and TTT training hyperparameters. We set the LoRA rank to $r=6$ and attach LoRA adapters only to the feed-forward network (FFN) projection matrices (gate\_proj, up\_proj, and down\_proj) in the last 4 Transformer layers; attention projections are left unchanged. The SVD-initialized projection matrix $A=A_{0}$ is frozen, and only the coefficient matrix $B$ is updated online. Fast-weight updates are cumulative across triggers within an episode: each trigger starts from the current $B$ rather than resetting to zero. At each memory trigger, the extracted JSON QA supervision is parsed into instruction–answer pairs and used for online SFT with SGD, learning rate $5\times 10^{-4}$, 5 epochs, and batch size 16. The trigger counter uses the token length of the current working context plus explicit memory; after a trigger, the working context is cleared, and in TMEM the extracted content is retained through $\Delta_{t}$ rather than as prompt text. We set the trigger budget to $L_{\max}=4096$ for LoCoMo, $L_{\max}=12288$ for LongMemEval-S, $L_{\max}=8192$ for the search-agent tasks, and $L_{\max}=4096$ for CL-Bench.
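The trigger logic described above can be illustrated with a small control-flow sketch. Everything named here (`EpisodeState`, `count_tokens`, the `extract` and `online_update` callbacks) is hypothetical, the whitespace tokenizer is a stand-in for the model tokenizer, and firing when the count reaches $L_{\max}$ (rather than strictly exceeds it) is an assumption.

```python
from dataclasses import dataclass, field


def count_tokens(text: str) -> int:
    # Stand-in tokenizer (whitespace split); a real run would use the model tokenizer.
    return len(text.split())


@dataclass
class EpisodeState:
    l_max: int
    working_context: list = field(default_factory=list)
    explicit_memory: str = ""
    fast_weight_updates: int = 0  # stands in for the cumulative LoRA state B

    def used_tokens(self) -> int:
        """Token length of the working context plus explicit memory."""
        context_tokens = sum(count_tokens(t) for t in self.working_context)
        return context_tokens + count_tokens(self.explicit_memory)

    def observe(self, turn: str, extract, online_update) -> bool:
        """Append a turn; fire a memory trigger once the budget is reached."""
        self.working_context.append(turn)
        if self.used_tokens() < self.l_max:
            return False
        qa_pairs = extract(self.working_context, self.explicit_memory)
        online_update(qa_pairs)        # continues from the current B, no reset
        self.fast_weight_updates += 1
        self.working_context.clear()   # extracted content now lives in the fast weights
        return True
```

The point of the sketch is the ordering: extraction reads the full working context, the online update runs on the extracted pairs, and only then is the context cleared.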

#### RL data construction and training configuration\.

For the RL phase, we directly use the existing RL data and task protocols from prior memory-agent work. Conversational RL follows the long-context QA data used by MemAgent (Yu et al., [2025](https://arxiv.org/html/2606.04536#bib.bib37)), while search-agent RL follows the multi-objective task data used by MEM1 (Zhou et al., [2025](https://arxiv.org/html/2606.04536#bib.bib42)). Training uses the stop-gradient outcome-reward objective described in Section 4: final task reward trains both ordinary response tokens and memory-writing tokens, while gradients are not back-propagated through the online LoRA optimization. We keep the same RL data sources, task formats, train/evaluation separation, rollout budget, and update schedule across memory strategies so that post-RL comparisons differ only in the memory mechanism being optimized. We train with GRPO for 200 update steps using batch size 64, mini-batch size 16, learning rate $10^{-6}$, maximum response length 1024 tokens, and $n=8$ rollouts per prompt. For search-agent tasks, each episode allows up to 10 search turns.
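For readers unfamiliar with GRPO, the group-relative advantage at its core can be sketched as below. This is the commonly used formulation (reward minus group mean, divided by group standard deviation), not code from the paper; the reward values are made up for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize outcome rewards within one prompt's rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt with n = 8 rollouts, each scored by a final task reward (illustrative values).
rewards = [1.0, 0.0, 0.0, 1.0, 0.5, 0.0, 1.0, 0.5]
advantages = group_relative_advantages(rewards)
```

Under the outcome-reward objective described above, a rollout's advantage would be shared by all of its trained tokens, ordinary response tokens and memory-writing tokens alike.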

### 5.2 Evaluation before the RL

We first evaluate each memory strategy before applying the RL phase. This isolates the effect of the memory mechanism itself across four settings: no memory, summary-based memory, retrieval-based memory, and TMEM.

Table 1: F1 and Exact Match (EM) on conversational memory benchmarks (LoCoMo and LongMemEval-S). **Bold** indicates the best result per column.

| Method | Model | LoCoMo F1 | LoCoMo EM | LongMemEval-S F1 | LongMemEval-S EM |
| --- | --- | --- | --- | --- | --- |
| No Memory | Qwen3-4B | 23.33 ± 0.41 | 8.30 ± 0.22 | 5.30 ± 0.19 | 0.00 ± 0.00 |
| No Memory | Qwen3-8B | 18.48 ± 0.39 | 3.12 ± 0.18 | 3.50 ± 0.16 | 0.00 ± 0.00 |
| MemAgent/MEM1 | Qwen3-4B | 20.74 ± 0.36 | 11.48 ± 0.29 | 36.45 ± 0.44 | 23.20 ± 0.38 |
| MemAgent/MEM1 | Qwen3-8B | 24.26 ± 0.35 | 15.90 ± 0.31 | 31.66 ± 0.42 | 17.40 ± 0.35 |
| AMEM | Qwen3-4B | 25.66 ± 0.34 | 11.33 ± 0.28 | 29.22 ± 0.41 | 17.76 ± 0.36 |
| AMEM | Qwen3-8B | 20.92 ± 0.33 | 8.64 ± 0.26 | 28.99 ± 0.40 | 18.40 ± 0.34 |
| TMEM | Qwen3-4B | 25.72 ± 0.32 | 15.40 ± 0.27 | 41.24 ± 0.45 | **25.54 ± 0.39** |
| TMEM | Qwen3-8B | **26.75 ± 0.31** | **20.24 ± 0.29** | **41.87 ± 0.43** | 25.42 ± 0.37 |

Table 2: F1 and Exact Match (EM) on multi-objective search memory benchmarks (4-objective and 8-objective). **Bold** indicates the best result per column.

| Method | Model | 4-objective F1 | 4-objective EM | 8-objective F1 | 8-objective EM |
| --- | --- | --- | --- | --- | --- |
| No Memory | Qwen3-4B | 18.26 ± 0.37 | 9.35 ± 0.28 | 18.22 ± 0.35 | 9.28 ± 0.27 |
| No Memory | Qwen3-8B | 19.30 ± 0.34 | 10.23 ± 0.25 | 17.89 ± 0.32 | 9.12 ± 0.24 |
| MemAgent/MEM1 | Qwen3-4B | 24.35 ± 0.33 | 14.72 ± 0.30 | 22.18 ± 0.31 | 13.36 ± 0.28 |
| MemAgent/MEM1 | Qwen3-8B | 25.23 ± 0.31 | 15.11 ± 0.29 | 22.87 ± 0.30 | 13.42 ± 0.27 |
| AMEM | Qwen3-4B | 24.82 ± 0.32 | 15.13 ± 0.29 | 22.65 ± 0.30 | 13.87 ± 0.27 |
| AMEM | Qwen3-8B | 25.36 ± 0.30 | 16.22 ± 0.28 | 23.45 ± 0.29 | 14.12 ± 0.26 |
| TMEM | Qwen3-4B | **26.74 ± 0.28** | 16.26 ± 0.25 | 24.51 ± 0.27 | **15.03 ± 0.24** |
| TMEM | Qwen3-8B | 26.36 ± 0.27 | **16.45 ± 0.24** | **25.12 ± 0.26** | **15.03 ± 0.23** |

#### Conversational and search memory results.

Table [1](https://arxiv.org/html/2606.04536#S5.T1) reports conversational F1 and EM on LoCoMo and LongMemEval-S, while Table [2](https://arxiv.org/html/2606.04536#S5.T2) reports search F1 and EM on the 4-objective and 8-objective splits. On conversational memory, TMEM obtains the strongest overall results. The LongMemEval-S gains are especially clear: over the best explicit-memory baseline, TMEM improves by +4.79 F1 / +2.34 EM with Qwen3-4B and +10.21 F1 / +7.02 EM with Qwen3-8B. On LoCoMo, the F1 gap for Qwen3-4B is small (25.72 vs. 25.66), but the EM gain is larger (+3.92 over the best baseline), and Qwen3-8B shows larger improvements on both F1 and EM. On search memory, TMEM improves over the strongest explicit-memory baseline by +1.92 F1 / +1.13 EM on the 4-objective split and +1.86 F1 / +1.16 EM on the 8-objective split for Qwen3-4B. For Qwen3-8B, the gains remain positive but smaller on the 4-objective split (+1.00 F1 / +0.23 EM), so we treat that case as directional rather than decisive.

Table 3: Accuracy (%) on the filtered CL-Bench evaluation set across four knowledge categories: DK (Domain Knowledge), ED (Empirical Discovery), PT (Procedural Task), RS (Rule System). **Bold** indicates the best result per column.

| Method | Model | DK | ED | PT | RS | Total |
| --- | --- | --- | --- | --- | --- | --- |
| No Memory | Qwen3-4B | **34.96 ± 0.72** | 10.00 ± 0.58 | 25.71 ± 0.81 | **36.36 ± 0.76** | 30.45 ± 0.63 |
| No Memory | Qwen3-8B | **34.96 ± 0.69** | 6.67 ± 0.54 | 35.71 ± 0.79 | 34.85 ± 0.73 | 32.18 ± 0.61 |
| MemAgent | Qwen3-4B | 26.02 ± 0.74 | 10.00 ± 0.56 | 30.00 ± 0.77 | 33.33 ± 0.74 | 26.99 ± 0.60 |
| MemAgent | Qwen3-8B | 33.33 ± 0.70 | 6.67 ± 0.53 | 38.57 ± 0.75 | 34.85 ± 0.72 | 32.18 ± 0.59 |
| AMEM | Qwen3-4B | 28.46 ± 0.71 | 10.00 ± 0.55 | 25.71 ± 0.79 | 33.33 ± 0.73 | 26.99 ± 0.58 |
| AMEM | Qwen3-8B | 31.71 ± 0.68 | 10.00 ± 0.56 | 30.00 ± 0.76 | 30.30 ± 0.71 | 28.72 ± 0.57 |
| TMEM | Qwen3-4B | **34.96 ± 0.67** | **16.67 ± 0.62** | 34.29 ± 0.74 | 34.85 ± 0.70 | 32.87 ± 0.56 |
| TMEM | Qwen3-8B | 34.15 ± 0.66 | 10.00 ± 0.55 | **42.86 ± 0.71** | 34.85 ± 0.69 | **33.91 ± 0.55** |

#### CL\-Bench results\.

Table [3](https://arxiv.org/html/2606.04536#S5.T3) reports rubric accuracy on the filtered CL-Bench evaluation set. TMEM remains strongest overall: it achieves 32.87% for Qwen3-4B and 33.91% for Qwen3-8B, improving over the strongest baseline in each case by +2.42 and +1.73 points, respectively. For Qwen3-4B that baseline is no memory (30.45%), since both explicit-memory baselines score 26.99%; for Qwen3-8B, no memory and MemAgent tie at 32.18%. The category-level pattern is still mixed: TMEM has the largest gain on Empirical Discovery for Qwen3-4B and the best Procedural Task score for Qwen3-8B, while the no-memory baseline remains competitive on Rule System. We therefore treat these numbers as evidence on the filtered answerable set rather than as a claim about the full CL-Bench distribution. We also evaluate on the full unfiltered CL-Bench split, and detailed results are reported in Appendix Table [8](https://arxiv.org/html/2606.04536#A3.T8).

### 5.3 Evaluation after the RL

![Refer to caption](https://arxiv.org/html/2606.04536v1/x2.png) (a) LoCoMo
![Refer to caption](https://arxiv.org/html/2606.04536v1/x3.png) (b) LongMemEval-S

Figure 3: F1 and Exact Match (EM) improvements after the RL phase on LoCoMo and LongMemEval-S. Stacked bars show the base score (dark) and RL gain (light) for each method.

#### Conversational RL gains.

Figure [3](https://arxiv.org/html/2606.04536#S5.F3) summarizes the post-RL uplift on LoCoMo and LongMemEval-S. RL improves both F1 and EM for all three memory methods, but TMEM receives the largest absolute gains. On LoCoMo, TMEM improves by +5.12 F1 and +5.02 EM, exceeding MemAgent / MEM1 (+2.62 F1, +2.46 EM) and AMEM (+2.79 F1, +3.54 EM). On LongMemEval-S, TMEM again obtains the strongest gains (+4.92 F1, +4.84 EM), compared with +3.26 / +3.23 for MemAgent / MEM1 and +3.34 / +2.66 for AMEM. This pattern suggests that RL benefits are larger when the memory mechanism can adapt fast weights during the episode; all gains are computed from the corresponding three-run averaged scores.

![Refer to caption](https://arxiv.org/html/2606.04536v1/x4.png) (a) 4-objective
![Refer to caption](https://arxiv.org/html/2606.04536v1/x5.png) (b) 8-objective

Figure 4: F1 and Exact Match (EM) improvements after the RL phase on search-agent benchmarks. Stacked bars show the base score (dark) and RL gain (light) for each method.

#### Search\-agent RL gains\.

Figure [4](https://arxiv.org/html/2606.04536#S5.F4) shows the same ordering on search-agent tasks. After RL, TMEM reaches 31.52 F1 / 21.17 EM on the 4-objective split and 29.43 F1 / 19.95 EM on the 8-objective split. These scores improve over its pre-RL version by +4.78 / +4.91 and +4.92 / +4.92 in F1 / EM, respectively, and exceed the strongest explicit-memory baseline after RL by +4.34 / +3.29 on the 4-objective split and +4.36 / +2.63 on the 8-objective split. The gains are consistent with the conversational results: optimizing the memory-writing behavior is helpful when the agent must preserve evidence for several independent subgoals.

### 5.4 Ablation

We ablate four design choices that are central to the fast\-weight rollout design using Qwen3\-4B\. In each experiment, we vary one factor while keeping all others at their default values\. These ablations focus on the quantitative controls for which we have complete results: trigger budget, SVD initialization, supervision form, and whether the SVD subspace is frozen\.

#### Effect of the context budget $L_{\max}$.

Figure [5](https://arxiv.org/html/2606.04536#S5.F5) reports sensitivity to the context budget $L_{\max}$ on LoCoMo and LongMemEval-S. We vary $L_{\max}$ while keeping rank, learning rate, TTT epochs, and extraction policy fixed.

![Refer to caption](https://arxiv.org/html/2606.04536v1/x6.png) (a) LoCoMo
![Refer to caption](https://arxiv.org/html/2606.04536v1/x7.png) (b) LongMemEval-S

Figure 5: Effect of context budget $L_{\max}$ on TMEM with Qwen3-4B. Solid bars denote F1 and hatched bars denote EM.

Both datasets exhibit a consistent bell-curve pattern: performance peaks at an intermediate $L_{\max}$ and falls on either side of it. An excessively small $L_{\max}$ triggers memory extraction too frequently, causing aggressive compression that can discard fine-grained details before they are absorbed into $\Delta_{t}$. An excessively large $L_{\max}$ allows the working context to grow unwieldy, making the extraction prompt harder to ground and reducing the quality of the QA pairs written into parametric memory. The optimal budgets differ across benchmarks ($L_{\max}=4096$ for LoCoMo and $L_{\max}=12288$ for LongMemEval-S), which is consistent with LongMemEval-S requiring a larger working window before each TTT update.

#### SVD initialization vs\. random initialization\.

Table [4](https://arxiv.org/html/2606.04536#S5.T4) compares SVD-based subspace initialization against standard random initialization on both benchmarks.

Table 4: Ablation on initialization strategy (Qwen3-4B) across LoCoMo and LongMemEval-S.

| Initialization | LoCoMo F1 (%) | LoCoMo EM (%) | LongMemEval-S F1 (%) | LongMemEval-S EM (%) |
| --- | --- | --- | --- | --- |
| Random Gaussian | 24.26 | 12.15 | 38.54 | 22.33 |
| SVD (ours) | 25.72 | 15.40 | 41.24 | 25.54 |

SVD initialization consistently outperforms random Gaussian initialization on both benchmarks. The gain is most pronounced on EM (+3.25 on LoCoMo, +3.21 on LongMemEval-S), suggesting that anchoring the LoRA subspace to the principal directions of the pretrained weights improves precise recall rather than just broader coverage. This aligns with the theoretical prediction of Theorem [1](https://arxiv.org/html/2606.04536#Thmtheorem1): by pre-selecting the row space that captures the largest singular values of $W$, online TTT updates only need to learn the coefficient matrix $B$, which is useful in the few-gradient-step regime available at each trigger.
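A minimal NumPy sketch of the SVD-initialized adapter is given below, using $A = \Sigma_r V_r^{\top}$ as in the appendix. Initializing $B$ to zero and using a squared-error loss in place of the SFT loss are assumptions made to keep the example short; the function names are hypothetical.

```python
import numpy as np


def svd_init_lora(weight: np.ndarray, rank: int):
    """Return (A, B) with A = Sigma_r V_r^T (to be frozen) and B set to zero."""
    _, singular_values, vt = np.linalg.svd(weight, full_matrices=False)
    a = singular_values[:rank, None] * vt[:rank]   # shape (rank, d_in)
    b = np.zeros((weight.shape[0], rank))          # shape (d_out, rank), assumed zero init
    return a, b


def sgd_step_on_b(weight, a, b, x, target, lr=5e-4):
    """One SGD step on B for the loss 0.5 * ||(W + B A) x - target||^2."""
    error = (weight + b @ a) @ x - target          # shape (d_out,)
    grad_b = np.outer(error, a @ x)                # gradient of the loss w.r.t. B
    return b - lr * grad_b
```

Only `b` changes between calls, so repeated steps accumulate in the same coefficient matrix while the row space fixed by `a` stays put.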

#### Supervision form for $\mathcal{T}$.

Table [5](https://arxiv.org/html/2606.04536#S5.T5) studies how the supervision signal used by $\mathcal{T}$ affects downstream memory quality across both benchmarks.

Table 5: Ablation on supervision form (Qwen3-4B) across LoCoMo and LongMemEval-S.

| Supervision signal | LoCoMo F1 (%) | LoCoMo EM (%) | LongMemEval-S F1 (%) | LongMemEval-S EM (%) |
| --- | --- | --- | --- | --- |
| Raw next-token prediction | 21.19 | 10.74 | 10.37 | 7.62 |
| Free-form summary | 24.86 | 14.28 | 35.44 | 20.18 |
| QA pairs (default) | 25.72 | 15.40 | 41.24 | 25.54 |

The choice of supervision signal has a substantial effect, particularly on LongMemEval-S. Raw next-token prediction performs the worst by a large margin (10.37 F1 on LongMemEval-S), as it trains the model to reproduce the raw context verbatim rather than distill task-relevant knowledge. Free-form summary improves over this baseline by providing higher-level abstractions, but still falls short of QA pairs, which explicitly encode question–answer structure aligned with downstream memory queries. The gap between free-form summary and QA pairs is especially large on LongMemEval-S (+5.80 F1, +5.36 EM), where many questions require precise factual recall over long dialogue histories.
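To make the QA-pair supervision form concrete, the sketch below parses extracted JSON into instruction–answer pairs. The schema (a list of objects with `question` and `answer` keys) and the function name are assumptions for illustration; the source does not specify the JSON layout.

```python
import json


def parse_qa_supervision(raw_json: str):
    """Turn extracted JSON into (instruction, answer) pairs, skipping bad items."""
    try:
        items = json.loads(raw_json)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    pairs = []
    for item in items:
        if not isinstance(item, dict):
            continue
        # Hypothetical keys; a real extractor may use different field names.
        question = str(item.get("question", "")).strip()
        answer = str(item.get("answer", "")).strip()
        if question and answer:
            pairs.append((question, answer))
    return pairs
```

Dropping malformed items rather than raising keeps a single bad extraction from aborting the online update for the rest of the batch.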

#### Freezing the SVD subspace\.

Table 6: Ablation on freezing SVD subspace (Qwen3-4B) across LoCoMo and LongMemEval-S.

| Update mode | LoCoMo F1 (%) | LoCoMo EM (%) | LongMemEval-S F1 (%) | LongMemEval-S EM (%) |
| --- | --- | --- | --- | --- |
| Train $B$ only (freeze $A$) | 25.72 | 15.40 | 41.24 | 25.54 |
| Train $A$ + $B$ | 25.89 | 15.58 | 41.51 | 25.66 |

Table [6](https://arxiv.org/html/2606.04536#S5.T6) shows that allowing both $A$ and $B$ to be updated yields only marginal improvements over freezing $A$ (+0.17 F1 / +0.18 EM on LoCoMo; +0.27 F1 / +0.12 EM on LongMemEval-S). This is consistent with the SVD-initialized subspace already covering most useful few-step adaptation directions in these benchmarks. Freezing $A$ reduces the number of online-updated LoRA factors and preserves nearly the same accuracy, making it the preferred default in our experiments.
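The reduction in online-updated parameters from freezing $A$ can be estimated with simple arithmetic. The sketch below uses the stated rank 6, three FFN projections, and last 4 layers; the model dimensions are hypothetical placeholders, not the actual Qwen3 sizes.

```python
def lora_param_counts(d_model, d_ff, rank=6, layers=4):
    """Count LoRA parameters on gate_proj, up_proj, and down_proj."""
    # gate_proj / up_proj map d_model -> d_ff; down_proj maps d_ff -> d_model.
    shapes = [(d_ff, d_model), (d_ff, d_model), (d_model, d_ff)]  # (d_out, d_in)
    a_params = sum(rank * d_in for _, d_in in shapes) * layers
    b_params = sum(d_out * rank for d_out, _ in shapes) * layers
    return {"A": a_params, "B": b_params, "A+B": a_params + b_params}


# Hypothetical dimensions chosen only for illustration.
counts = lora_param_counts(d_model=2048, d_ff=8192)
```

With these placeholder sizes, training only $B$ updates 442,368 parameters instead of 737,280, that is, 60% of the full adapter.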

### 5.5 Efficiency comparison with memory strategies

Table [7](https://arxiv.org/html/2606.04536#S5.T7) compares the computational and memory overhead of TMEM against existing memory strategies: no memory, retrieval-based (A-MEM), and summary-based (MemAgent), across both LoCoMo and LongMemEval-S. Each row uses Qwen3-4B under the same local evaluation pipeline and batch size one. Wall-clock time is measured per episode and includes memory operations performed during the episode as well as SVD adapter initialization on the first episode; GPU memory is the peak allocated memory during evaluation.

Table 7: Efficiency comparison across memory strategies (Qwen3-4B) on LoCoMo and LongMemEval-S. Wall-clock time and GPU memory are measured per episode under the same evaluation pipeline.

| Method | LoCoMo Time (s/ep) | LoCoMo GPU Mem (GB) | LongMemEval-S Time (s/ep) | LongMemEval-S GPU Mem (GB) |
| --- | --- | --- | --- | --- |
| No memory | 1.69 | 46.3 | 8.73 | 78.5 |
| Retrieval (A-MEM) | 2.03 | 13.4 | 15.33 | 13.2 |
| Summary (MemAgent) | 0.94 | 11.2 | 5.12 | 12.7 |
| TMEM (ours) | 1.53 | 21.3 | 6.55 | 22.9 |

Several observations stand out. First, the no-memory baseline incurs the highest GPU memory cost (46.3 GB on LoCoMo, 78.5 GB on LongMemEval-S) because the raw context grows throughout the episode. Second, retrieval-based A-MEM is the slowest on LongMemEval-S (15.33 s/ep), consistent with the cost of maintaining and querying an external memory store over a large history. Third, summary-based MemAgent achieves the lowest wall-clock time and GPU footprint but also lower accuracy than TMEM in the main tables. TMEM sits between summary and retrieval in wall-clock time (1.53 s/ep on LoCoMo, 6.55 s/ep on LongMemEval-S), while its peak GPU memory (21.3 GB and 22.9 GB) is higher than that of both prompt-space methods: the per-trigger online LoRA update adds overhead relative to summary memory. TMEM nonetheless remains faster than retrieval on LongMemEval-S and substantially lighter than the raw-context no-memory baseline.

## 6 Discussion

#### Relation to prior memory agents\.

The fast-weight rollout view separates working context $h_{t}$, textual memory $m_{t}$, and fast weights $\Delta_{t}$. Existing context-management agents are recovered by keeping $\Delta_{t}\equiv 0$ and using only the textual update $g(m_{t},a_{t})$, so the policy remains a frozen model conditioned on rewritten or retrieved prompt-space memory. TMEM changes the rollout by using the memory prompt $d$ to turn the boundary action $a_{t_{i}}$ into QA-style supervision and then applying $\mathcal{T}$ to update $\Delta_{t_{i}+1}$; subsequent actions are therefore sampled from $\pi_{\theta_{0}+\Delta_{t_{i}+1}}$, not only from a frozen model conditioned on compressed text.

#### Practical trade\-off\.

The main control knobs are the trigger budget $L_{\max}$ and the update operator $\mathcal{T}$. A small $L_{\max}$ triggers frequent extraction and online updates but risks over-compressing local evidence; a large $L_{\max}$ reduces update cost but makes the boundary extraction harder and the prompt more expensive. The role of $\mathcal{T}$ is to decide how strongly the extracted supervision is written into $\Delta_{t}$. Thus, TMEM trades modest online-training overhead for a memory channel that can influence later reasoning without repeatedly placing all past evidence back into the prompt.

## 7 Conclusion

We presented TMEM, a self-evolving parametric memory framework that treats test-time LoRA updates as part of the agent's rollout dynamics rather than as an external post-processing step. By jointly modeling working context $h_{t}$, explicit memory $m_{t}$, and fast weights $\Delta_{t}$, the framework unifies prompt-space memory methods with a parametric memory channel that can directly alter future decisions within an episode. This perspective leads to a stop-gradient policy optimization objective in which outcome rewards improve both ordinary task actions and memory-writing actions, so the base model learns to generate supervision that is useful for subsequent online adaptation. Across conversational memory, search-agent memory, and context-learning benchmarks, TMEM consistently outperforms summary-based and retrieval-based baselines while maintaining practical efficiency. Overall, the results support a simple claim: for long-horizon agents, memory is most effective when it can both be read from context and written into fast model parameters at test time.

## References

- Acikgoz et al\. \(2025\)Emre Can Acikgoz, Cheng Qian, Heng Ji, Dilek Hakkani\-Tür, and Gokhan Tur\.Self\-improving llm agents at test\-time\.*arXiv preprint arXiv:2510\.07841*, 2025\.
- Aghajanyan et al\. \(2021\)Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer\.Intrinsic dimensionality explains the effectiveness of language model fine\-tuning\.*Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics*, 2021\.
- Charakorn et al\. \(2026\)Rujikorn Charakorn, Edoardo Cetin, Shinnosuke Uesaka, and Robert Tjarko Lange\.Doc\-to\-lora: Learning to instantly internalize contexts\.*arXiv preprint arXiv:2602\.15902*, 2026\.
- Chen et al\. \(2026\)Xiaoyin Chen, Canwen Xu, Yite Wang, Boyi Liu, Zhewei Yao, and Yuxiong He\.Learning to self\-evolve\.*arXiv preprint arXiv:2603\.18620*, 2026\.
- Chen et al\. \(2024\)Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu\.Self\-play fine\-tuning converts weak language models to strong language models\.In*Proceedings of the 41st International Conference on Machine Learning*, 2024\.
- Chhikara et al\. \(2025\)Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav\.Mem0: Building production\-ready ai agents with scalable long\-term memory\.*arXiv preprint arXiv:2504\.19413*, 2025\.
- Hu et al\. \(2025\)Jinwu Hu, Zhitian Zhang, Guohao Chen, Xutao Wen, Chao Shuai, Wei Luo, Bin Xiao, Yuanqing Li, and Mingkui Tan\.Test\-time learning for large language models\.*arXiv preprint arXiv:2505\.20633*, 2025\.
- Lewis et al\. \(2020\)Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen\-tau Yih, Tim Rocktäschel, et al\.Retrieval\-augmented generation for knowledge\-intensive nlp tasks\.*Advances in neural information processing systems*, 33:9459–9474, 2020\.
- Li et al\. \(2026a\)Guanghao Li, Wenhao Jiang, Mingfeng Chen, Yan Li, Hao Yu, Shuting Dong, Tao Ren, Ming Tang, and Chun Yuan\.Scout: Teaching pre\-trained language models to enhance reasoning via flow chain\-of\-thought\.*Advances in Neural Information Processing Systems*, 38:95340–95364, 2026a\.
- Li et al\. \(2026b\)Zehao Li, Tao Ren, Zishi Zhang, Xi Chen, and Yijie Peng\.Optimal low\-rank stochastic gradient estimation for llm training\.*arXiv preprint arXiv:2603\.20632*, 2026b\.
- Li et al\. \(2026c\)Zhongyi Li, Wan Tian, Jingyu Chen, Kangyao Huang, Huiming Zhang, Hui Yang, Tao Ren, Jinyang Jiang, Yijie Peng, Yiran Ban, and Fuzhen Zhuang\.Adaptive robust estimator for multi\-agent reinforcement learning\.*arXiv preprint arXiv:2603\.21574*, 2026c\.
- Liang et al\. \(2025\)Jiafeng Liang, Hao Li, Chang Li, Jiaqi Zhou, Shixin Jiang, Zekun Wang, Changkai Ji, Zhen Zhu, Rui Liu, Tao Ren, and Jie Fu\.Ai meets brain: Memory systems from cognitive neuroscience to autonomous agents\.*arXiv preprint arXiv:2512\.23343*, 2025\.
- Liang et al\. \(2026\)Jiafeng Liang, Zhihao Zhu, Zihan Zhang, Baoqi Ren, Shixin Jiang, Runxuan Liu, Tao Ren, Ming Liu, See\-Kiong Ng, and Bing Qin\.Perception without engagement: Dissecting the causal discovery deficit in lmms\.*arXiv preprint arXiv:2605\.09422*, 2026\.
- Lin et al\. \(2026\)Zichuan Lin, Feiyu Liu, Yijun Yang, Jiafei Lyu, Yiming Gao, Yicheng Liu, Zhicong Lu, Yangbin Yu, Mingyu Yang, Junyou Li, Deheng Ye, and Jie Jiang\.Ui\-voyager: A self\-evolving gui agent learning via failed experience\.*arXiv preprint arXiv:2603\.24533*, 2026\.
- Lu et al\. \(2025\)Miao Lu, Weiwei Sun, Weihua Du, Zhan Ling, Xuesong Yao, Kang Liu, and Jiecao Chen\.Scaling llm multi\-turn rl with end\-to\-end summarization\-based context management\.*arXiv preprint arXiv:2510\.06727*, 2025\.
- Luo et al\. \(2025\)Zhiwei Luo et al\.Cl\-bench: Evaluating continual learning capabilities of language model agents\.*arXiv preprint arXiv:2504\.18978*, 2025\.
- Maharana et al\. \(2024\)Adyasha Maharana, Dong\-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang\.Locomo: Long\-context conversational memory benchmark\.*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics*, 2024\.
- Packer et al\. \(2023\)Charles Packer, Vivian Fang, Shishir G Patil, Kevin Lin, Sarah Wooders, and Joseph E Gonzalez\.Memgpt: towards llms as operating systems\.*arXiv preprint arXiv:2310\.08560*, 2023\.
- Park et al\. \(2023\)Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein\.Generative agents: Interactive simulacra of human behavior\.In*Proceedings of the 36th annual acm symposium on user interface software and technology*, pp\. 1–22, 2023\.
- Ren et al\. \(2025a\)Tao Ren, Jinyang Jiang, Hui Yang, Wan Tian, Minhao Zou, Guanghao Li, Zishi Zhang, Qinghao Wang, Siyuan Qin, Yuxiang Zhao, and Rui Tao\.Riskpo: Risk\-based policy optimization via verifiable reward for llm post\-training\.*arXiv preprint arXiv:2510\.00911*, 2025a\.
- Ren et al\. \(2025b\)Tao Ren, Zishi Zhang, Jinyang Jiang, Guanghao Li, Zeliang Zhang, Mingqian Feng, and Yijie Peng\.Flops: Forward learning with optimal sampling\.In*International Conference on Learning Representations*, volume 2025, pp\. 88219–88237, may 2025b\.
- Ren et al\. \(2025c\)Tao Ren, Zishi Zhang, Jinyang Jiang, Zehao Li, Shentao Qin, Yi Zheng, Guanghao Li, Qi Sun, Yijie Li, Jiafeng Liang, and Xia Li\.Half\-order fine\-tuning for diffusion model: A recursive likelihood ratio optimizer\.*arXiv preprint arXiv:2502\.00639*, 2025c\.
- Research \(2024\)Microsoft Research\.Graphrag: Unlocking llm discovery on narrative private data, 2024\.
- Shinn et al\. \(2023\)Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao\.Reflexion: Language agents with verbal reinforcement learning\.In*Advances in Neural Information Processing Systems*, volume 36, 2023\.
- Sun et al\. \(2025\)Weiwei Sun, Miao Lu, Zhan Ling, Kang Liu, Xuesong Yao, Yiming Yang, and Jiecao Chen\.Scaling long\-horizon LLM agent via context\-folding\.*arXiv preprint arXiv:2510\.11967*, 2025\.
- Sun et al\. \(2020\)Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt\.Test\-time training with self\-supervision for generalization under distribution shifts\.In*International conference on machine learning*, pp\. 9229–9248\. PMLR, 2020\.
- Tandon et al\. \(2025\)Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel Rød, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin, Jed McCaleb, Yejin Choi, and Yu Sun\.End\-to\-end test\-time training for long context\.*arXiv preprint arXiv:2512\.23675*, 2025\.
- Wang et al\. \(2024\)Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar\.Voyager: An open\-ended embodied agent with large language models\.*Transactions on Machine Learning Research*, 2024\.
- Wang et al\. \(2026\)Xinyu Wang, Mingze Li, Peng Lu, Xiao\-Wen Chang, Lifeng Shang, Jinping Li, Fei Mi, Prasanna Parthasarathi, and Yufei Cui\.Infmem: Learning system\-2 memory control for long\-context agent\.*arXiv preprint arXiv:2602\.02704*, 2026\.
- Wu et al\. \(2024\)Di Wu et al\.Longmemeval: Benchmarking chat assistants on long\-term interactive memory\.*arXiv preprint arXiv:2410\.10813*, 2024\.
- Wu et al\. \(2023\)Yue Wu, So Yeon Min, Shrimai Prabhumoye, Yonatan Bisk, Ruslan Salakhutdinov, Amos Azaria, Tom M\. Mitchell, and Yuanzhi Li\.Spring: Studying papers and reasoning to play games\.In*Advances in Neural Information Processing Systems*, volume 36, 2023\.
- Xu et al\. \(2026\)Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang\.A\-mem: Agentic memory for llm agents\.*Advances in Neural Information Processing Systems*, 38:17577–17604, 2026\.
- Yan et al\. \(2025\)Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Jinhe Bi, Kristian Kersting, Jeff Z Pan, et al\.Memory\-r1: Enhancing large language model agents to manage and utilize memories via reinforcement learning\.*arXiv preprint arXiv:2508\.19828*, 2025\.
- Yang et al\. \(2026a\)Chengyuan Yang, Zequn Sun, Wei Wei, and Wei Hu\.Beyond static summarization: Proactive memory extraction for llm agents\.*arXiv preprint arXiv:2601\.04463*, 2026a\.
- Yang et al\. \(2026b\)Hui Yang, Tao Ren, Jinyang Jiang, Wan Tian, and Yijie Peng\.Omni\-masked gradient descent: Memory\-efficient optimization via mask traversal with improved convergence\.*arXiv preprint arXiv:2603\.05960*, 2026b\.
- Yao et al\. \(2022\)Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao\.React: Synergizing reasoning and acting in language models\.*arXiv preprint arXiv:2210\.03629*, 2022\.
- Yu et al\. \(2025\)Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya\-Qin Zhang, Wei\-Ying Ma, Jingjing Liu, Mingxuan Wang, et al\.Memagent: Reshaping long\-context llm with multi\-conv rl\-based memory agent\.*arXiv preprint arXiv:2507\.02259*, 2025\.
- Zhang et al\. \(2025\)Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T Freeman, and Hao Tan\.Test\-time training done right\.*arXiv preprint arXiv:2505\.23884*, 2025\.
- Zhao et al\. \(2024\)Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong\-Jin Liu, and Gao Huang\.Expel: Llm agents are experiential learners\.In*Proceedings of the AAAI Conference on Artificial Intelligence*, volume 38, pp\. 19632–19642, 2024\.doi:10\.1609/aaai\.v38i17\.29936\.
- Zhong et al\. \(2024\)Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang\.Memorybank: Enhancing large language models with long\-term memory\.In*Proceedings of the AAAI conference on artificial intelligence*, volume 38, pp\. 19724–19731, 2024\.
- Zhou et al\. \(2023\)Wangchunshu Zhou, Yuchen Eleanor Jiang, Peng Cui, Tiannan Wang, Zhenxin Xiao, Yifan Hou, Ryan Cotterell, and Mrinmaya Sachan\.Recurrentgpt: Interactive generation of \(arbitrarily\) long text\.*arXiv preprint arXiv:2305\.13304*, 2023\.
- Zhou et al\. \(2025\)Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, and Paul Pu Liang\.Mem1: Learning to synergize memory and reasoning for efficient long\-horizon agents\.*arXiv preprint arXiv:2506\.15841*, 2025\.

## Appendix A Approximation Advantage of the SVD Row Space

We formalize the claim that initializing the LoRA projection matrix via the truncated SVD of the pretrained weight yields no larger an approximation error than random Gaussian initialization (strictly smaller whenever $\rho_{r} > r/d_{\text{in}}$), under a mild spectral alignment condition.

#### Setup\.

Let $W\in\mathbb{R}^{d_{\text{out}}\times d_{\text{in}}}$ be a pretrained weight matrix with SVD $W=U\Sigma V^{\top}$. Write $V_{r}=[v_{1},\dots,v_{r}]\in\mathbb{R}^{d_{\text{in}}\times r}$ for the top-$r$ right singular vectors. We assume $r\leq\operatorname{rank}(W)$ (so that the top-$r$ singular values are strictly positive) and that the optimal full-rank weight perturbation $\Delta^{*}\in\mathbb{R}^{d_{\text{out}}\times d_{\text{in}}}$ for the downstream task satisfies $\Delta^{*}\neq 0$. Under the LoRA parameterization $\Delta=BA$ with a fixed projection $A\in\mathbb{R}^{r\times d_{\text{in}}}$ ($\operatorname{rank}(A)=r$) and a learnable coefficient matrix $B\in\mathbb{R}^{d_{\text{out}}\times r}$, the best achievable approximation error is

$$\mathcal{E}(A) = \min_{B\in\mathbb{R}^{d_{\text{out}}\times r}} \|\Delta^{*}-BA\|_{F}^{2}.$$

Define the *spectral alignment coefficient*

$$\rho_{r}(\Delta^{*},W) = \frac{\|\Delta^{*}V_{r}\|_{F}^{2}}{\|\Delta^{*}\|_{F}^{2}} \in [0,\,1],$$

which measures the fraction of energy of $\Delta^{*}$ that falls into the top-$r$ right singular subspace of $W$.

###### Assumption 1 (Spectral alignment).

The optimal downstream perturbation satisfies $\rho_r(\Delta^{*}, W) \geq \dfrac{r}{d_{\text{in}}}$.

Assumption [1](https://arxiv.org/html/2606.04536#Thmassumption1) requires only that $\Delta^{*}$ concentrates at least as much energy on the top-$r$ subspace of $W$ as would a uniformly random $r$-dimensional subspace in expectation. Since pretrained weight spectra decay rapidly (often as a power law $\sigma_j \propto j^{-\alpha}$) and fine-tuning updates empirically concentrate on the leading singular directions (Aghajanyan et al., [2021](https://arxiv.org/html/2606.04536#bib.bib2)), the condition $\rho_r \gg r/d_{\text{in}}$ holds comfortably in practice.
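As a hedged illustration of the coefficient defined above, the following NumPy sketch computes $\rho_r$ for two synthetic perturbations and compares them with the threshold $r/d_{\text{in}}$ from Assumption 1. The function name `spectral_alignment`, the matrix sizes, and the toy perturbations are assumptions made for illustration only; they are not taken from the paper's code or experiments.

```python
import numpy as np

def spectral_alignment(delta_star: np.ndarray, W: np.ndarray, r: int) -> float:
    """rho_r = ||Delta* V_r||_F^2 / ||Delta*||_F^2 for the top-r right singular vectors of W."""
    # np.linalg.svd returns V^T with rows ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    V_r = Vt[:r].T                                   # shape (d_in, r)
    num = np.linalg.norm(delta_star @ V_r, "fro") ** 2
    den = np.linalg.norm(delta_star, "fro") ** 2
    return float(num / den)

rng = np.random.default_rng(0)
d_out, d_in, r = 32, 64, 4                           # toy sizes (assumption)
W = rng.standard_normal((d_out, d_in))
_, _, Vt = np.linalg.svd(W, full_matrices=False)

# Toy perturbation lying entirely in the top-r right singular subspace of W.
aligned = rng.standard_normal((d_out, r)) @ Vt[:r]
# Toy perturbation built from the next r singular directions, orthogonal to that subspace.
orthogonal = rng.standard_normal((d_out, r)) @ Vt[r:2 * r]

rho_aligned = spectral_alignment(aligned, W, r)
rho_orthogonal = spectral_alignment(orthogonal, W, r)
baseline = r / d_in                                  # threshold in Assumption 1
```

In this toy setting the first perturbation satisfies the assumption with the largest possible margin and the second violates it, which brackets the range of behavior the assumption is meant to exclude or admit.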

#### Formal statement of Theorem [1](https://arxiv.org/html/2606.04536#Thmtheorem1).

Let $A_{\text{SVD}} = \Sigma_r V_r^{\top}$ be the SVD initialization derived from $W$, where $\Sigma_r = \operatorname{diag}(\sigma_1, \dots, \sigma_r)$ collects the top-$r$ singular values, and let $A_{\text{rand}} \in \mathbb{R}^{r \times d_{\text{in}}}$ be a random initialization whose rows are drawn i.i.d. from $\mathcal{N}(0, \sigma^2 I_{d_{\text{in}}})$. Under Assumption [1](https://arxiv.org/html/2606.04536#Thmassumption1),

$$\mathcal{E}(A_{\text{SVD}}) \leq \mathbb{E}\bigl[\mathcal{E}(A_{\text{rand}})\bigr],$$

with equality if and only if $\rho_r(\Delta^{*}, W) = r/d_{\text{in}}$.

###### Proof\.

*Reduction to row-space projection.* For fixed full-rank $A$, the least-squares optimum is $B^{*} = \Delta^{*} A^{\top} (A A^{\top})^{-1}$, giving residual

$$\Delta^{*} - B^{*} A = \Delta^{*} \bigl(I - A^{\top} (A A^{\top})^{-1} A\bigr) = \Delta^{*} P_{\operatorname{row}(A)}^{\perp},$$

where $\operatorname{row}(A) \subseteq \mathbb{R}^{d_{\text{in}}}$ denotes the row space of $A$ (the span of its $r$ row vectors),

$$P_{\operatorname{row}(A)} = A^{\top} (A A^{\top})^{-1} A$$

is the orthogonal projection onto $\operatorname{row}(A)$, and $P_{\operatorname{row}(A)}^{\perp} = I - P_{\operatorname{row}(A)}$ projects onto its orthogonal complement. Since $P_{\operatorname{row}(A)}$ and $P_{\operatorname{row}(A)}^{\perp}$ are complementary orthogonal projections, the Pythagorean theorem for the Frobenius norm gives

$$\mathcal{E}(A) = \|\Delta^{*} P_{\operatorname{row}(A)}^{\perp}\|_F^2 = \|\Delta^{*}\|_F^2 - \|\Delta^{*} P_{\operatorname{row}(A)}\|_F^2. \tag{10}$$

Note that $\mathcal{E}(A)$ depends on $A$ only through its row space.
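The reduction above can be checked numerically. The sketch below, written under the assumption of small random toy matrices, computes $\mathcal{E}(A)$ once by solving the least-squares problem directly and once through the projection identity in equation 10, and then confirms that left-multiplying $A$ by an invertible matrix (which preserves its row space) leaves the error unchanged. The helper names are hypothetical.

```python
import numpy as np

def error_via_least_squares(delta_star: np.ndarray, A: np.ndarray) -> float:
    """E(A) = min_B ||Delta* - B A||_F^2, solved as a least-squares problem."""
    # Solve A^T B^T ~= Delta*^T for B^T; lstsq minimizes the Frobenius residual.
    Bt, *_ = np.linalg.lstsq(A.T, delta_star.T, rcond=None)
    return float(np.linalg.norm(delta_star - Bt.T @ A, "fro") ** 2)

def error_via_projection(delta_star: np.ndarray, A: np.ndarray) -> float:
    """E(A) = ||Delta*||_F^2 - ||Delta* P_row(A)||_F^2 (equation 10)."""
    P = A.T @ np.linalg.solve(A @ A.T, A)            # orthogonal projector onto row(A)
    return float(np.linalg.norm(delta_star, "fro") ** 2
                 - np.linalg.norm(delta_star @ P, "fro") ** 2)

rng = np.random.default_rng(0)
d_out, d_in, r = 16, 40, 5                           # toy sizes (assumption)
delta_star = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in))

e_lstsq = error_via_least_squares(delta_star, A)
e_proj = error_via_projection(delta_star, A)

# M @ A has the same row space as A whenever M is invertible,
# so the approximation error should be identical.
M = rng.standard_normal((r, r)) + 5.0 * np.eye(r)
e_mixed = error_via_projection(delta_star, M @ A)
```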

*Approximation error under SVD initialization.* Substituting $A_{\text{SVD}} = \Sigma_r V_r^{\top}$ into the projection formula, and using $V_r^{\top} V_r = I_r$ (orthonormality of the right singular vectors):

$$P_{\operatorname{row}(A_{\text{SVD}})} = V_r \Sigma_r \bigl(\Sigma_r V_r^{\top} V_r \Sigma_r\bigr)^{-1} \Sigma_r V_r^{\top} = V_r \Sigma_r \Sigma_r^{-2} \Sigma_r V_r^{\top} = V_r V_r^{\top}.$$

Substituting into equation [10](https://arxiv.org/html/2606.04536#A1.E10):

$$\|\Delta^{*} V_r V_r^{\top}\|_F^2 = \operatorname{tr}\bigl(V_r^{\top} \Delta^{*\top} \Delta^{*} V_r\bigr) = \sum_{j=1}^{r} \|\Delta^{*} v_j\|^2 = \rho_r \|\Delta^{*}\|_F^2.$$

Hence

$$\mathcal{E}(A_{\text{SVD}}) = (1 - \rho_r) \|\Delta^{*}\|_F^2. \tag{11}$$

*Approximation error under Gaussian random initialization.* Substituting $A_{\text{rand}}$ into the projection formula gives

$$P_{\operatorname{row}(A_{\text{rand}})} = A_{\text{rand}}^{\top} (A_{\text{rand}} A_{\text{rand}}^{\top})^{-1} A_{\text{rand}}.$$

To simplify, let $A_{\text{rand}}^{\top} = QR$ be the thin QR decomposition, where $Q \in \mathbb{R}^{d_{\text{in}} \times r}$ has orthonormal columns and $R \in \mathbb{R}^{r \times r}$ is invertible. Then $A_{\text{rand}} = R^{\top} Q^{\top}$, so $A_{\text{rand}} A_{\text{rand}}^{\top} = R^{\top} R$ and

$$P_{\operatorname{row}(A_{\text{rand}})} = Q R (R^{\top} R)^{-1} R^{\top} Q^{\top} = Q Q^{\top}.$$

When the rows of $A_{\text{rand}}$ are i.i.d. $\mathcal{N}(0, \sigma^2 I)$, the column span of $Q$ (i.e. the row space of $A_{\text{rand}}$) is distributed according to the Haar measure on the Grassmannian $\operatorname{Gr}(r, d_{\text{in}})$, by rotational invariance of the Gaussian distribution. By the symmetry of the Haar measure,

$$\mathbb{E}[Q Q^{\top}] = \frac{r}{d_{\text{in}}} I_{d_{\text{in}}}.$$

Taking the expectation of the projected energy in equation [10](https://arxiv.org/html/2606.04536#A1.E10):

$$\mathbb{E}\bigl[\|\Delta^{*} Q Q^{\top}\|_F^2\bigr] = \mathbb{E}\bigl[\operatorname{tr}\bigl(Q Q^{\top} \Delta^{*\top} \Delta^{*} Q Q^{\top}\bigr)\bigr] = \mathbb{E}\bigl[\operatorname{tr}\bigl(\Delta^{*\top} \Delta^{*} Q Q^{\top}\bigr)\bigr] = \operatorname{tr}\bigl(\Delta^{*\top} \Delta^{*} \cdot \mathbb{E}[Q Q^{\top}]\bigr) = \frac{r}{d_{\text{in}}} \|\Delta^{*}\|_F^2,$$

where the second equality uses the idempotency $(Q Q^{\top})^2 = Q Q^{\top}$ and the cyclic property of trace, and the third exchanges $\operatorname{tr}$ and $\mathbb{E}$ by linearity. Hence

$$\mathbb{E}[\mathcal{E}(A_{\text{rand}})] = \bigl(1 - r/d_{\text{in}}\bigr) \|\Delta^{*}\|_F^2. \tag{12}$$

*Comparing the two initializations.* Subtracting equation [11](https://arxiv.org/html/2606.04536#A1.E11) from equation [12](https://arxiv.org/html/2606.04536#A1.E12):

$$\mathbb{E}[\mathcal{E}(A_{\text{rand}})] - \mathcal{E}(A_{\text{SVD}}) = \Bigl(\rho_r - \frac{r}{d_{\text{in}}}\Bigr) \|\Delta^{*}\|_F^2 \geq 0,$$

where the inequality follows from Assumption [1](https://arxiv.org/html/2606.04536#Thmassumption1). Equality holds if and only if $\rho_r = r/d_{\text{in}}$. ∎
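The two closed forms in equations 11 and 12 can be sanity-checked with a small Monte Carlo experiment. The sketch below is only an illustration under stated assumptions: the matrix sizes, the synthetic $\Delta^{*}$ (constructed to lie mostly in the top-$r$ right singular subspace of a random $W$, so that Assumption 1 holds), and the number of trials are all arbitrary choices, not values from the paper.

```python
import numpy as np

def approx_error(delta_star: np.ndarray, A: np.ndarray) -> float:
    """E(A) via equation 10, using a thin QR factorization of A^T."""
    Q, _ = np.linalg.qr(A.T)                         # columns of Q span row(A)
    # ||Delta* Q Q^T||_F equals ||Delta* Q||_F because Q has orthonormal columns.
    return float(np.linalg.norm(delta_star, "fro") ** 2
                 - np.linalg.norm(delta_star @ Q, "fro") ** 2)

rng = np.random.default_rng(0)
d_out, d_in, r = 24, 48, 6                           # toy sizes (assumption)
W = rng.standard_normal((d_out, d_in))
_, S, Vt = np.linalg.svd(W, full_matrices=False)

# Synthetic Delta*: a component in the top-r subspace of W plus small dense noise.
delta_star = (rng.standard_normal((d_out, r)) @ Vt[:r]
              + 0.1 * rng.standard_normal((d_out, d_in)))
total = np.linalg.norm(delta_star, "fro") ** 2
rho = np.linalg.norm(delta_star @ Vt[:r].T, "fro") ** 2 / total

# SVD initialization A_SVD = Sigma_r V_r^T and its error (equation 11).
A_svd = np.diag(S[:r]) @ Vt[:r]
err_svd = approx_error(delta_star, A_svd)

# Monte Carlo estimate of the expected error under Gaussian initialization (equation 12).
n_trials = 2000
err_rand = float(np.mean([
    approx_error(delta_star, rng.standard_normal((r, d_in)))
    for _ in range(n_trials)
]))

svd_ratio = err_svd / total                          # should match 1 - rho
rand_ratio = err_rand / total                        # should be close to 1 - r / d_in
```

The Gaussian scale $\sigma$ is set to 1 here because, as the proof notes, the error depends on $A$ only through its row space, so rescaling the rows has no effect.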

## Appendix B Derivation of Stop-Gradient Policy Optimization

Let $\tau$ denote a complete fast-weight rollout. The stochastic components of the rollout are actions $a_t$, environment observations, and the randomness of the online update operator $\mathcal{T}$. Let $\{t_i\}_{i=1}^{N}$ be the memory-trigger indices, with $t_0 = 0$ and $t_{N+1} = T$. Define

$$\bar{c}_i = \begin{cases} (q, h_{t_i}, m_{t_i}, d), & 1 \leq i \leq N, \\ (q, h_T, m_T), & i = N+1. \end{cases}$$

Under the stop-gradient convention, the distribution of $\Delta_t$ produced by $\mathcal{T}$ is treated as fixed when differentiating with respect to $\theta_0$. The trajectory likelihood can therefore be written, up to terms independent of $\theta_0$, as

$$p(\tau \mid \theta_0) \propto \prod_{i=1}^{N+1} \left[ \prod_{t=t_{i-1}+1}^{t_i - 1} \pi_{\theta_0 + \operatorname{sg}(\Delta_t)}(a_t \mid q, h_t, m_t) \cdot \pi_{\theta_0 + \operatorname{sg}(\Delta_{t_i})}(a_{t_i} \mid \bar{c}_i) \right]$$

where $d$ is the extraction prompt used to elicit QA-pair supervision. This is the fast-weight analogue of a segment-wise rollout decomposition: each segment contains ordinary rollout actions followed by a boundary action, which is a memory-writing action for $i \leq N$ and the final task response for $i = N+1$.

For the outcome-reward objective

$$J(\theta_0) = \mathbb{E}_{\tau \sim p_{\theta_0, \mathcal{T}, \mathbb{P}_{env}}} \left[ R(\tau) \right],$$

the log-derivative trick gives

$$\nabla_{\theta_0} J(\theta_0) = \mathbb{E}_{\tau} \left[ R(\tau) \nabla_{\theta_0} \log p(\tau \mid \theta_0) \right] \tag{13}$$

$$\nabla_{\theta_0} J(\theta_0) \approx \mathbb{E}_{\tau} \left[ R(\tau) \left( \sum_{i=1}^{N+1} \left[ \sum_{t=t_{i-1}+1}^{t_i - 1} \nabla_{\theta_0} \log \pi_{\theta_0 + \operatorname{sg}(\Delta_t)}(a_t \mid q, h_t, m_t) + \nabla_{\theta_0} \log \pi_{\theta_0 + \operatorname{sg}(\Delta_{t_i})}(a_{t_i} \mid \bar{c}_i) \right] \right) \right]. \tag{14}$$

The approximation consists exactly of stopping gradients through $\mathcal{T}$ and through the fast-weight values it produces. Thus, gradients are assigned segment by segment to ordinary task behavior, to the QA-pair extraction actions that provide online LoRA supervision, and to the terminal response, while avoiding back-propagation through the online optimization itself.
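To make the stop-gradient estimator concrete, here is a minimal NumPy sketch of a single-trajectory version of equation 14. It is a toy under loudly stated assumptions: the policy is a linear-softmax model over feature vectors rather than an LLM, `online_update` is a hypothetical stand-in for the operator $\mathcal{T}$ (one supervised step on the fast weights), and the trigger steps and reward are made up. The point it illustrates is only that each $\Delta_t$ is recorded as a constant, so the gradient with respect to $\theta_0$ never flows through the update rule.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

def log_prob(theta0, delta, x, a) -> float:
    """log pi_{theta0 + delta}(a | x) for a linear-softmax toy policy."""
    return float(np.log(softmax((theta0 + delta) @ x)[a]))

def sg_grad_log_prob(theta0, delta, x, a) -> np.ndarray:
    """Gradient of log pi w.r.t. theta0 with delta held constant (stop-gradient)."""
    p = softmax((theta0 + delta) @ x)
    onehot = np.zeros_like(p)
    onehot[a] = 1.0
    return np.outer(onehot - p, x)

def online_update(theta0, delta, x, y, lr=0.5) -> np.ndarray:
    """Hypothetical stand-in for T: one log-likelihood ascent step on the fast weights.
    The logits depend on theta0 + delta, so the gradient w.r.t. delta has the same form."""
    return delta + lr * sg_grad_log_prob(theta0, delta, x, y)

rng = np.random.default_rng(0)
n_actions, dim, T = 3, 4, 6                          # toy sizes (assumption)
theta0 = rng.standard_normal((n_actions, dim))
xs = rng.standard_normal((T, dim))                   # toy features standing in for (q, h_t, m_t)
triggers = {2, 4}                                    # hypothetical memory-trigger steps t_i

delta = np.zeros_like(theta0)
deltas, actions = [], []
for t in range(T):
    deltas.append(delta.copy())                      # value of Delta_t used at step t
    a = int(rng.choice(n_actions, p=softmax((theta0 + delta) @ xs[t])))
    actions.append(a)
    if t in triggers:                                # boundary action supplies supervision
        delta = online_update(theta0, delta, xs[t], a)

R = 1.0                                              # toy outcome reward R(tau)
# Single-sample estimate of equation 14: sum of per-step score functions times R.
grad = R * sum(sg_grad_log_prob(theta0, d, x, a)
               for d, x, a in zip(deltas, xs, actions))
```

A finite-difference check with the recorded `deltas` frozen recovers the same gradient, which is what distinguishes this estimator from differentiating through `online_update`.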

## Appendix C Unfiltered CL-Bench Accuracy

For completeness, we report rubric accuracy on the unfiltered CL\-Bench split\. Following the main text presentation, this appendix table reports only accuracy rates\.

Table 8: Accuracy (%) on the unfiltered CL-Bench split across DK, ED, PT, and RS. Bold indicates the best result per column.

| Method | Model | DK | ED | PT | RS | Total |
| --- | --- | --- | --- | --- | --- | --- |
| No Memory | Qwen3-4B | 6.49 | 1.51 | 3.82 | 4.24 | 4.63 |
| No Memory | Qwen3-8B | 6.49 | 1.01 | 5.31 | 4.06 | 4.90 |
| MemAgent | Qwen3-4B | 4.83 | 1.51 | 4.46 | 3.89 | 4.11 |
| MemAgent | Qwen3-8B | 6.18 | 1.01 | 5.73 | 4.06 | 4.90 |
| AMEM | Qwen3-4B | 5.28 | 1.51 | 3.82 | 3.89 | 4.11 |
| AMEM | Qwen3-8B | 5.88 | 1.51 | 4.46 | 3.53 | 4.37 |
| TMEM | Qwen3-4B | 6.49 | 2.51 | 5.10 | 4.06 | 5.00 |
| TMEM | Qwen3-8B | 6.33 | 1.51 | 6.37 | 4.06 | 5.16 |

The unfiltered split sizes are DK=663, ED=199, PT=471, RS=566, and Total=1899.
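The Total column can be related to the per-split accuracies through the split sizes. The sketch below is a consistency check, not the authors' evaluation script: it assumes that each reported percentage corresponds to an integer number of correct items in its split and that Total is the pooled accuracy over all 1899 items. The variable and function names are illustrative.

```python
# Split sizes as reported for the unfiltered CL-Bench split.
sizes = {"DK": 663, "ED": 199, "PT": 471, "RS": 566}

# (method, model) -> ([DK, ED, PT, RS] accuracies in %, reported Total in %), from Table 8.
rows = {
    ("No Memory", "Qwen3-4B"): ([6.49, 1.51, 3.82, 4.24], 4.63),
    ("No Memory", "Qwen3-8B"): ([6.49, 1.01, 5.31, 4.06], 4.90),
    ("MemAgent", "Qwen3-4B"): ([4.83, 1.51, 4.46, 3.89], 4.11),
    ("MemAgent", "Qwen3-8B"): ([6.18, 1.01, 5.73, 4.06], 4.90),
    ("AMEM", "Qwen3-4B"): ([5.28, 1.51, 3.82, 3.89], 4.11),
    ("AMEM", "Qwen3-8B"): ([5.88, 1.51, 4.46, 3.53], 4.37),
    ("TMEM", "Qwen3-4B"): ([6.49, 2.51, 5.10, 4.06], 5.00),
    ("TMEM", "Qwen3-8B"): ([6.33, 1.51, 6.37, 4.06], 5.16),
}

def implied_total(accs):
    """Pooled accuracy implied by per-split percentages.
    Assumption: each percentage is an integer count of correct items over the split size."""
    correct = [round(a * n / 100) for a, n in zip(accs, sizes.values())]
    return round(100 * sum(correct) / sum(sizes.values()), 2)

recomputed = {key: implied_total(accs) for key, (accs, _) in rows.items()}
mismatches = [key for key, (_, total) in rows.items()
              if abs(recomputed[key] - total) > 1e-9]
```

Under these assumptions every recomputed total agrees with the reported one, so `mismatches` is empty.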

## Appendix D Prompts Used in the Tasks

**System prompt for LoCoMo and LongMemEval**

> You are given one problem to solve, previous extracted QA pairs and one conversation session.
>
> Your task is to create high-quality supervised fine-tuning (SFT) QA pairs grounded ONLY in this session.
>
> Question:
>
> <question\> \{prompt\} </question\>
>
> Previous extracted QA pairs:
>
> <qa\_history\> \{qa\_history\} </qa\_history\>
>
> Session:
>
> <session\> \{chunk\} </session\>

Figure 6: System prompt template used in LoCoMo and LongMemEval for extracting grounded SFT QA pairs from the current session.

**System prompt for search-task iterative reasoning**

> You will answer some complex questions through iterative reasoning, memory updates, and web searches.
>
> At each step, you can see the question, previous interaction history in <history\> ... <history\>, search query in <search\> ... </search\>, and the returned information in <information\> ... </information\> (except the first step where you will be given only the question).
>
> Then choose one of the following actions:
>
> - If any question remains unanswered, issue a single query for one question inside <search\> ... </search\>. The query should consist of keywords or a short phrase. Only search one question at a time.
> - If all questions are answered, provide the final answers, separated by semicolons, within <answer\> answer1; answer2; ... </answer\>. The answers must be concise, contain only essential words, and avoid any explanations.
>
> Your output should be either <search\> ... </search\> or <answer\> ... </answer\>.
>
> <question\> \{question\} </question\>
>
> <history\> \{history\} </history\>
>
> <search\> \{search\} </search\>
>
> <information\> \{information\} </information\>

Figure 7: System prompt template used in the search task for iterative reasoning, grounded memory updates, and step-by-step web search.
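As a rough sketch of how templates like these might be used programmatically, the Python snippet below fills the placeholder portion of the extraction prompt and parses a search-task output into either a query or a list of answers. Everything here is an assumption for illustration: the constant `EXTRACTION_TEMPLATE` reproduces only the tagged tail of the prompt, the function `parse_search_step` is hypothetical, and the paper does not describe its parsing code.

```python
import re

# Abbreviated tail of the extraction prompt (placeholders as in Figure 6).
EXTRACTION_TEMPLATE = (
    "Question:\n<question> {prompt} </question>\n"
    "Previous extracted QA pairs:\n<qa_history> {qa_history} </qa_history>\n"
    "Session:\n<session> {chunk} </session>"
)

def parse_search_step(output: str):
    """Return ("answer", [answers]) or ("search", query) from a model output.

    Assumes the output follows the format requested in Figure 7:
    either <search> ... </search> or <answer> a1; a2; ... </answer>.
    """
    m = re.search(r"<answer>(.*?)</answer>", output, flags=re.DOTALL)
    if m:
        answers = [a.strip() for a in m.group(1).split(";") if a.strip()]
        return "answer", answers
    m = re.search(r"<search>(.*?)</search>", output, flags=re.DOTALL)
    if m:
        return "search", m.group(1).strip()
    raise ValueError("output contains neither <search> nor <answer>")

# Toy values for the placeholders (hypothetical).
filled = EXTRACTION_TEMPLATE.format(
    prompt="What did the charity race raise awareness for?",
    qa_history="[]",
    chunk="(session text)",
)

step = parse_search_step("<search> 1996 loose adaptation Romeo Juliet James Gunn </search>")
final = parse_search_step(
    "<answer> orange; Tromeo and Juliet; John John Florence; British </answer>"
)
```

Checking for `<answer>` before `<search>` is a design choice in this sketch so that a final answer takes precedence if a malformed output happens to contain both tags.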
## Appendix E Case Study

### E.1 LoCoMo

**LoCoMo case example**

- Question: What did the charity race raise awareness for?
- Model answer: Mental health
- Ground truth: mental health

[
{"instruction": "What is Caroline’s current plan to help with mental health?", "output": "She is looking into counseling and mental health jobs"},
{"instruction": "Why does Melanie run?", "output": "To de-stress and clear her mind"},
{"instruction": "What activity does Melanie do to improve her mental health?", "output": "Running"},
{"instruction": "What charity event did Melanie participate in last Saturday?", "output": "A charity race for mental health"},
{"instruction": "What did Melanie realize about self-care after the charity race?", "output": "That self-care is really important and helps her better look after her family"},
{"instruction": "What activities does Melanie carve out for herself each day to refresh and stay present?", "output": "Running, reading, or playing her violin"},
{"instruction": "What career areas is Caroline interested in exploring?", "output": "Counseling or working in mental health"}
]

Figure 8: A LoCoMo case with retrieved memory QA pairs and the final question. The model answer matches the ground truth.

**LoCoMo case example (adoption planning)**

- Question: What are Caroline’s plans for the summer?
- Model answer: researching adoption with an agency that supports LGBTQ+ families
- Ground truth: researching adoption agencies

[
{"instruction": "When did Caroline pass the adoption agency interviews?", "output": "Last Friday"},
{"instruction": "What does Caroline hope to achieve through adoption?", "output": "To build a safe and loving home for kids who haven’t had a roof over their heads"},
{"instruction": "What did Caroline do this week?", "output": "Caroline applied to adoption agencies."},
{"instruction": "What event did Caroline attend last Friday?", "output": "A council meeting for adoption"},
{"instruction": "What is Caroline’s goal in the adoption process?", "output": "To give kids who need it a loving home"},
{"instruction": "Why did Caroline choose the adoption agency she is researching?", "output": "Because it helps LGBTQ+ folks with adoption, and its inclusivity and support spoke to her"},
{"instruction": "What does Caroline feel about the challenges of being a single parent in the adoption process?", "output": "She feels it will be tough, but she is up for the challenge"}
]

Figure 9: A LoCoMo case focused on adoption planning. The prediction is semantically aligned with the ground truth.

**LoCoMo case example (family camping preference)**

- Question: Would Melanie be more interested in going to a national park or a theme park?
- Model answer: national park
- Ground truth: National park; she likes the outdoors

[
{"instruction": "What did Melanie say she loves most about camping with her family?", "output": "It’s a chance to be present and together. We bond over stories, campfires and nature."},
{"instruction": "What did Melanie say about the importance of family after the accident?", "output": "Family’s super important to me. They mean the world to me and I’m so thankful to have them."},
{"instruction": "What is Melanie’s family’s summer tradition?", "output": "A family camping trip"},
{"instruction": "What was Melanie’s best camping memory?", "output": "Seeing the Perseid meteor shower last year"},
{"instruction": "What did Melanie do with her family last week?", "output": "She took her family camping in the mountains."},
{"instruction": "What activities did Melanie and her family do during their camping trip?", "output": "They explored nature, roasted marshmallows around the campfire, and went on a hike."}
]

Figure 10: A LoCoMo case about family-outdoor preference. The model answer matches the labeled preference.
### E.2 LongMemEval

**LongMemEval case QA pairs as JSON (part 1)**

- Question: What is the total amount of money I earned from selling my products at the markets?
- Model answer: \$225 + \$150 + \$120 = \$495
- Ground truth: \$495

[
{"instruction": "How much money did I earn from selling my homemade jam at the Homemade and Handmade Market on May 29th?", "output": "\$225"},
{"instruction": "What date did I sell 15 jars of my homemade jam at the Homemade and Handmade Market?", "output": "May 29th"},
{"instruction": "What is the name of the gift basket I decided to use for my products?", "output": "Harvest Delight"},
{"instruction": "What is a description I used for my ’Harvest Delight’ gift basket on my website or social media?", "output": "Indulge in the flavors of the harvest season with our ’Harvest Delight’ gift basket, filled with artisanal goodies like our signature homemade jam, warm spice blends, and crunchy nuts."},
{"instruction": "How much money did I earn from selling 20 potted herb plants at the Summer Solstice Market?", "output": "\$150"},
{"instruction": "What was the price per potted herb plant sold at the Summer Solstice Market?", "output": "\$7.5"},
{"instruction": "What is the name of the market where I sold 20 potted herb plants?", "output": "Summer Solstice Market"},
{"instruction": "What are some popular herbal tea flavors mentioned in the session?", "output": "Peppermint, Chamomile, Hibiscus, Lemon balm, Rosehip, Ginger, Turmeric"},
{"instruction": "What are some market trends in the herbal tea industry according to the session?", "output": "Wellness and self-care, functional ingredients like adaptogens and probiotics, fruit and floral infusions, caffeine-free options, sustainable and eco-friendly packaging"},
{"instruction": "What are some market trends in the spice blend industry according to the session?", "output": "Global inspiration, wellness-focused blends, vegan and vegetarian options, artisanal and small-batch blends, DIY and customization"},
{"instruction": "What is a recommended approach for highlighting health benefits in marketing for herbal teas?", "output": "Focus on functional benefits, avoid medical claims, use credible sources, be transparent about ingredients and sourcing, and emphasize holistic benefits"},
{"instruction": "What are some potential rewards in the proposed loyalty program?", "output": "Discounts on future purchases, free shipping, free products, early access to new products, special promotions, invitations to events"},
{"instruction": "What is the point system for the loyalty program as described in the session?", "output": "1 point for every dollar spent, with bonus points of 10 points for every \$50 spent"},
{"instruction": "What are some ways to promote the loyalty program according to the session?", "output": "Email marketing, social media, in-store promotions, a dedicated webpage, early sign-up offers, paid ads, checkout signage, and staff training"},
{"instruction": "How much money did I earn from selling 12 bunches of fresh organic herbs at the farmers’ market on May 15th?", "output": "\$120"}
]

Figure 11: LongMemEval market-earnings case in JSON form (part 1), containing QA pairs 1-15.

**LongMemEval case QA pairs as JSON (part 2)**

[
{"instruction": "What are some popular herb-based products I could sell at the Harvest Festival Market according to the session?", "output": "Herbal teas, herb-infused oils, herb salts, herbal jellies and jams, herbal potpourri, skincare products, spice blends, vinegars, syrups, and gift baskets"},
{"instruction": "What are some tips for drying fresh herbs for use in herbal teas and infused oils?", "output": "Air drying, using a dehydrator, oven drying around 150F for 1-2 hours, or using a desiccant in an airtight container"},
{"instruction": "What are some pricing strategies for herbal teas and infused oils mentioned in the session?", "output": "Competitor research, cost-based pricing, value-based pricing, tiered pricing, bundle deals, and seasonal pricing"},
{"instruction": "What is a recommended price range for herbal teas and infused oils according to the session?", "output": "Herbal teas: \$5-15 per ounce; infused oils: \$10-30 per ounce"},
{"instruction": "What are some herbs suitable for use in herbal skincare products?", "output": "Calendula, chamomile, and lavender"}
]

Figure 12: LongMemEval market-earnings case in JSON form (part 2), containing QA pairs 16-20.

**LongMemEval long-dialogue handling experience (4-item JSON)**

[
{"instruction": "When answering a long-memory question about total earnings, what should the model do first?", "output": "Identify all separate revenue events related to the question before doing any calculation."},
{"instruction": "How should the model answer when multiple QA pairs contain separate amounts from different markets?", "output": "It should extract each relevant amount, verify that each amount refers to a distinct sale, then sum them."},
{"instruction": "Which QA pairs are necessary to answer the market earnings question in this case?", "output": "The QA pairs showing \$225 from homemade jam, \$150 from potted herb plants, and \$120 from fresh organic herbs."},
{"instruction": "What should the model avoid when using long conversation QA pairs for a numerical answer?", "output": "It should avoid using unrelated product, marketing, pricing, or loyalty-program details unless they contribute directly to the requested total."}
]

Figure 13: Long-dialogue handling experience extracted from this LongMemEval case, shown separately as a 4-item JSON array.
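The aggregation described in Figure 13 (collect the distinct revenue events, ignore unrelated amounts, then sum) can be mimicked outside the model with a few lines of Python. This is only an illustrative sketch: in TMEM the QA pairs serve as supervision for online LoRA updates, whereas the keyword filter and the function `total_earnings` below are hypothetical and operate on a hand-picked subset of the pairs shown above.

```python
import re

# A subset of the QA pairs from Figure 11, including two distractor amounts.
qa_pairs = [
    {"instruction": "How much money did I earn from selling my homemade jam at the "
                    "Homemade and Handmade Market on May 29th?", "output": "$225"},
    {"instruction": "What was the price per potted herb plant sold at the Summer "
                    "Solstice Market?", "output": "$7.5"},
    {"instruction": "How much money did I earn from selling 20 potted herb plants at "
                    "the Summer Solstice Market?", "output": "$150"},
    {"instruction": "What is the point system for the loyalty program as described in "
                    "the session?",
     "output": "1 point for every dollar spent, with bonus points of 10 points for "
               "every $50 spent"},
    {"instruction": "How much money did I earn from selling 12 bunches of fresh organic "
                    "herbs at the farmers' market on May 15th?", "output": "$120"},
]

def total_earnings(pairs):
    """Sum dollar amounts from pairs whose question asks about earnings.

    Heuristic (assumption): a revenue event is a pair whose instruction contains
    "earn" and whose output is a bare dollar amount such as "$225".
    """
    amounts = []
    for pair in pairs:
        asks_about_earnings = "earn" in pair["instruction"].lower()
        match = re.fullmatch(r"\$(\d+(?:\.\d+)?)", pair["output"].strip())
        if asks_about_earnings and match:
            amounts.append(float(match.group(1)))
    return amounts, sum(amounts)

amounts, total = total_earnings(qa_pairs)
```

The per-plant price and the loyalty-program amount are skipped by the filter, so only the three sales contribute to the total, matching the ground-truth answer for this case.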
### E.3 Multi-Objective Search

**Multi-Objective Search QA pairs as JSON (part 1)**

- Question: What color clothing do people of the Netherlands wear during Oranjegekte or to celebrate the national holiday Koningsdag?; What was the name of the 1996 loose adaptation of William Shakespeare’s \\"Romeo & Juliet\\" written by James Gunn?; What American professional Hawaiian surfer born 18 October 1992 won the Rip Curl Pro Portugal?; What nationality were social anthropologists Alfred Gell and Edmund Leach
- Model answer: orange for the Netherlands; Tromeo and Juliet; John John Florence; British
- Ground truth: "orange; Tromeo and Juliet; John John Florence; British"

[
{"instruction": "What color clothing do people of the Netherlands wear during Oranjegekte or to celebrate Koningsdag?", "output": "orange"},
{"instruction": "What was the name of the 1996 loose adaptation of William Shakespeare’s \\"Romeo & Juliet\\" written by James Gunn?", "output": "Tromeo and Juliet"},
{"instruction": "What American professional Hawaiian surfer born 18 October 1992 won the Rip Curl Pro Portugal?", "output": "John John Florence"},
{"instruction": "What nationality were social anthropologists Alfred Gell and Edmund Leach?", "output": "British"},
{"instruction": "What is Oranjegekte?", "output": "Oranjegekte is a Netherlands phenomenon associated with major sporting events and Koningsdag, often involving orange clothing and decorations."},
{"instruction": "Why is orange clothing associated with Koningsdag?", "output": "Orange is associated with the Dutch royal family, the House of Orange-Nassau."},
{"instruction": "What evidence identifies Tromeo and Juliet as the correct answer?", "output": "The evidence says Tromeo and Juliet is a 1996 loose adaptation of Romeo & Juliet and that James Gunn co-wrote the screenplay."},
{"instruction": "If a query about a 1996 Romeo and Juliet adaptation returns both Romeo + Juliet and Tromeo and Juliet, how should the model choose?", "output": "Prefer the result that matches all constraints, especially James Gunn and loose adaptation; that points to Tromeo and Juliet."},
{"instruction": "What search query is effective for identifying the James Gunn Romeo and Juliet adaptation?", "output": "1996 loose adaptation Romeo Juliet James Gunn"},
{"instruction": "What evidence supports John John Florence as the surfer answer?", "output": "The retrieved text says John John Florence was born October 18, 1992, is a Hawaiian professional surfer, and won the Meo Rip Curl Pro in Portugal."},
{"instruction": "What evidence supports Alfred Gell and Edmund Leach being British?", "output": "The retrieved documents describe Alfred Gell as a British social anthropologist and Edmund Leach as a British social anthropologist."},
]

Figure 14: The first half of the search-task QA history in original JSON form.

**Multi-Objective Search QA pairs as JSON (part 2)**

[
{"instruction": "In a packed multi-question search task, should the model search all remaining questions at once?", "output": "No. Search one question at a time so each retrieved passage can be tied to a specific answer."},
{"instruction": "What should the model store after each useful search result?", "output": "Store a concise QA pair that captures the question, the answer, and only directly supported information."},
{"instruction": "How should the model handle near-duplicate retrieved documents?", "output": "Use duplicates as confirmation, but avoid adding duplicate QA pairs unless the new document adds a useful constraint."},
{"instruction": "How should the final answer be formatted for a packed question with four subquestions?", "output": "Return four concise answers in the original question order, separated by semicolons."}
]

Figure 15: The second half of the search-task QA history in original JSON form.