Auto-Dreamer: Learning Offline Memory Consolidation for Language Agents
Summary
Auto-Dreamer is a learned offline memory consolidator for language agents. It decouples fast per-session memory acquisition from slow cross-session consolidation, achieves higher task success with smaller memory banks, and generalizes to held-out environments.
Cached at: 05/21/26, 06:34 AM
# Auto-Dreamer: Learning Offline Memory Consolidation for Language Agents
Source: [https://arxiv.org/html/2605.20616](https://arxiv.org/html/2605.20616)
Chongrui Ye¹\*, Yuxiang Liu¹\*, Yu Wang², Haofei Yu¹, Yining Zhao¹, Ge Liu¹, Julian McAuley², Jiaxuan You¹

¹University of Illinois Urbana\-Champaign, ²University of California San Diego

\*Equal contribution\. Order determined by coin flip; both authors reserve the right to list themselves as first author\.
###### Abstract
Language agents increasingly operate over streams of related tasks, yet existing memory systems struggle to convert accumulated experience into reusable knowledge\. Retrieval\-augmented and structured memory methods record per\-session observations effectively, but often couple acquisition and consolidation into a single online process, leaving the agent without a global view across sessions to discover recurring patterns, abstract shared procedures, or prune redundant entries\. Inspired by complementary learning systems theory, we propose Auto\-Dreamer, a learned offline consolidator for language\-agent memory\. Auto\-Dreamer decouples fast per\-session memory acquisition from slow cross\-session consolidation\. Given a selected working region of a typed memory bank, the consolidator treats the region as read\-only evidence, performs bounded tool\-use to inspect entries and provenance\-linked source trajectories, and synthesizes a fresh compact replacement set that abstracts across sessions and supersedes the original region\. We train Auto\-Dreamer via GRPO, using end\-to\-end agent performance as the reward signal to learn how to consolidate memories acquired through fast online experience\. Trained on ScienceWorld trajectories alone, Auto\-Dreamer outperforms fixed, RL\-trained, and prompted memory baselines on ScienceWorld by 7 points while using an active memory bank 12× smaller than that of the strongest baseline, and continues to lead on held\-out ALFWorld and WebArena without retraining, using 6× less memory than the strongest baseline on ALFWorld\.
## 1 Introduction
Language agents are increasingly deployed over streams of related tasks rather than isolated interactions\[[27](https://arxiv.org/html/2605.20616#bib.bib31),[33](https://arxiv.org/html/2605.20616#bib.bib32)\]\. In such settings, long\-term memory is not merely a retrieval cache for past entities or user preferences; it is the mechanism by which an agent converts raw experience into reusable procedures, environment knowledge, and behavioral priors that improve future decision making\. A memory system must therefore solve two distinct problems: it must rapidly acquire useful information from each new trajectory, and it must periodically reorganize accumulated experience into a form that is compact, non\-redundant, and useful for future tasks\.
Recent work has made substantial progress on individual components of language\-agent memory\[[10](https://arxiv.org/html/2605.20616#bib.bib8),[11](https://arxiv.org/html/2605.20616#bib.bib9)\], including retrieval\-augmented episodic stores\[[43](https://arxiv.org/html/2605.20616#bib.bib46),[21](https://arxiv.org/html/2605.20616#bib.bib12)\], structured memory systems\[[29](https://arxiv.org/html/2605.20616#bib.bib11),[34](https://arxiv.org/html/2605.20616#bib.bib13)\], procedural skill libraries\[[6](https://arxiv.org/html/2605.20616#bib.bib14),[26](https://arxiv.org/html/2605.20616#bib.bib48),[20](https://arxiv.org/html/2605.20616#bib.bib25)\], reflection\-based methods\[[23](https://arxiv.org/html/2605.20616#bib.bib49),[16](https://arxiv.org/html/2605.20616#bib.bib50)\], and RL\-trained memory managers\[[37](https://arxiv.org/html/2605.20616#bib.bib15),[32](https://arxiv.org/html/2605.20616#bib.bib16),[35](https://arxiv.org/html/2605.20616#bib.bib17),[30](https://arxiv.org/html/2605.20616#bib.bib29)\]\. Despite this progress, two challenges remain\. First, a consolidation problem: existing methods typically couple acquisition and consolidation into a single online update process, so each update is made with limited evidence from the current session\. This makes it difficult to discover recurring patterns, abstract reusable procedures that generalize across sessions, resolve contradictions, or prune redundant entries\. Second, a memory\-utility problem: RL\-trained memory methods optimize online construction or retrieval rather than offline consolidation under an explicit downstream utility objective, so they do not directly learn which memories are load\-bearing, which entries are redundant, or how to trade off success against memory compactness\.
We take inspiration from the complementary learning systems \(CLS\) theory of human memory, in which a fast hippocampal system encodes individual episodes and a slower neocortical system gradually extracts shared structure across episodes\[[18](https://arxiv.org/html/2605.20616#bib.bib1),[12](https://arxiv.org/html/2605.20616#bib.bib2),[17](https://arxiv.org/html/2605.20616#bib.bib33)\]\. We adopt CLS not as a biological claim about language models, but as an operational design principle for separating fast acquisition from slow cross\-session consolidation\. We introduce Auto\-Dreamer, a learned offline consolidator for language\-agent memory\. \(Auto\-Dreamer is distinct from the Dreamer family of world models\[[7](https://arxiv.org/html/2605.20616#bib.bib4),[8](https://arxiv.org/html/2605.20616#bib.bib5)\]; our method operates on memory entries and source trajectories, not on latent environment dynamics\.\) Auto\-Dreamer is the slow\-timescale counterpart to a fast per\-session writer\. Given a typed memory bank produced by the writer, it performs a multi\-step tool\-use rollout: searching memory, inspecting candidate entries, retrieving raw source trajectories for provenance, and synthesizing new entries that abstract across sessions\. Its core operation is *region rewriting*: the consolidator treats a selected working region as read\-only evidence and synthesizes a fresh replacement set that supersedes the original region\. This replacement semantics makes compactness structural rather than auxiliary: old entries do not persist by default, and information survives only if it is re\-synthesized into the replacement set\. As a result, abstraction, deduplication, contradiction resolution, and omission\-based forgetting become default behaviors\.
We train Auto\-Dreamer with GRPO\[[22](https://arxiv.org/html/2605.20616#bib.bib3)\] using a composite reward that combines downstream task performance with a counterfactual utility term estimated by random memory masking, which penalizes redundant entries while rewarding load\-bearing ones\. The task agent and the per\-session writer remain fixed throughout training, isolating the contribution of the consolidator\.
We evaluate Auto\-Dreamer in two regimes: continual\-memory deployment, where the bank starts empty and grows over the task stream, and fixed\-bank consolidation, where a pre\-built bank is rewritten once\. The results support three conclusions\. First, Auto\-Dreamer improves task success while maintaining substantially smaller active memory banks: in continual deployment, it achieves 41\.1% success on ScienceWorld\[[28](https://arxiv.org/html/2605.20616#bib.bib7)\], 7 points above the strongest baseline with 12× less memory; 60\.2% on held\-out ALFWorld\[[24](https://arxiv.org/html/2605.20616#bib.bib6)\] with 6× less memory than the strongest baseline; and 52\.3% on held\-out WebArena\[[44](https://arxiv.org/html/2605.20616#bib.bib10)\], leading all baselines\. Second, the learned consolidator transfers beyond its training distribution: although trained only on ScienceWorld trajectories, it improves performance on held\-out ALFWorld and WebArena without further updates, including across a writer\-backbone shift from Qwen3\-14B\[[25](https://arxiv.org/html/2605.20616#bib.bib65)\] to Gemini\-3\.1\-flash\-lite\-preview\[[4](https://arxiv.org/html/2605.20616#bib.bib63)\]\. Third, controlled fixed\-bank experiments and ablations show that the gains come from offline consolidation itself: region rewriting improves the quality of a given memory bank, while the counterfactual utility term suppresses redundant memories without sacrificing task performance\.
Our contributions are summarized as follows:
- A two\-timescale formulation of language\-agent memory\. We distinguish fast per\-session acquisition from slow cross\-session consolidation and formulate the latter as a learned decision problem over accumulated evidence\.
- Region rewriting as a compactness\-inducing consolidation primitive\. We formulate offline consolidation as provenance\-grounded region rewriting: a selected working region is treated as read\-only evidence and replaced by a synthesized replacement set\. This differs from per\-entry CRUD by making cross\-session abstraction, deduplication, and omission\-based forgetting the default update semantics\.
- RL training with region\-local credit\. Because region rewriting produces a self\-contained replacement set, we can evaluate it directly and assign local credit from downstream task performance without supervised memory labels\. We further use counterfactual masking to favor load\-bearing memories and suppress redundant entries, improving the task utility of the compact bank\.
## 2 Related works
Memory systems for language agents\. A growing body of work designs memory architectures for language agents\. Early systems organize memory around atomic units or flat stores, such as A\-MEM\[[34](https://arxiv.org/html/2605.20616#bib.bib13)\], Mem0\[[2](https://arxiv.org/html/2605.20616#bib.bib27)\], MemOS\[[13](https://arxiv.org/html/2605.20616#bib.bib58)\], and SimpleMem\[[15](https://arxiv.org/html/2605.20616#bib.bib53)\]\. More recent work introduces richer typed memory spanning episodic, semantic, and procedural stores, including EverMemOS\[[9](https://arxiv.org/html/2605.20616#bib.bib59)\], MIRIX\[[29](https://arxiv.org/html/2605.20616#bib.bib11)\], Nemori\[[19](https://arxiv.org/html/2605.20616#bib.bib52)\], and PlugMem\[[36](https://arxiv.org/html/2605.20616#bib.bib61)\]\. A complementary line focuses on extracting reusable procedures or strategies from trajectories: Memp\[[6](https://arxiv.org/html/2605.20616#bib.bib14)\] and Voyager\[[26](https://arxiv.org/html/2605.20616#bib.bib48)\] build procedural skill libraries, ExpeL\[[42](https://arxiv.org/html/2605.20616#bib.bib26)\] extracts cross\-task insights from successful and failed trajectories, ReasoningBank\[[20](https://arxiv.org/html/2605.20616#bib.bib25)\] distills high\-level reasoning strategies, and ReMem\[[32](https://arxiv.org/html/2605.20616#bib.bib16)\] studies test\-time memory evolution\. These systems improve how agents store experience, but their memory updates are governed by prompted heuristics applied within or immediately after each session, without explicit cross\-session consolidation\.
RL\-trained memory managers\. Recent work explores training language models to construct memory using reinforcement learning\. MEM1\[[45](https://arxiv.org/html/2605.20616#bib.bib54)\] and MemAgent\[[38](https://arxiv.org/html/2605.20616#bib.bib55)\] train models to update simple, text\-only memories\. Memory\-R1\[[35](https://arxiv.org/html/2605.20616#bib.bib17)\], Learn\-to\-Memorize\[[41](https://arxiv.org/html/2605.20616#bib.bib56)\], REMEMBER\[[39](https://arxiv.org/html/2605.20616#bib.bib57)\], and Mem\-$\alpha$\[[30](https://arxiv.org/html/2605.20616#bib.bib29)\] introduce richer memory representations and teach agents to manage complex memory systems through interaction and feedback\. However, these methods primarily focus on teaching the model to extract and organize knowledge from its input, rather than on improving downstream agentic task performance\. Later methods bridge this gap: UMEM\[[37](https://arxiv.org/html/2605.20616#bib.bib15)\] jointly trains memory extraction and management with GRPO under an online single\-step interface, and MemRL\[[40](https://arxiv.org/html/2605.20616#bib.bib62)\] trains the agent to retrieve the correct memory at decision time\. Nevertheless, all of these operate online: memory updates are interleaved with task execution, so consolidation evidence is limited to the current session\. Auto\-Dreamer addresses a complementary problem, operating offline over a bank accumulated across many sessions with access to the full memory bank and raw source trajectories\.
Offline computation and sleep\-time memory\. Sleep\-time compute\[[14](https://arxiv.org/html/2605.20616#bib.bib18)\] pre\-computes over persistent context before queries arrive, amortizing reasoning across future interactions\. LightMem\[[5](https://arxiv.org/html/2605.20616#bib.bib19)\] combines an online writer with periodic offline consolidation, but implements consolidation as a fixed prompted pipeline with per\-entry CRUD decisions\. Auto\-Dreamer instead performs *region rewriting*: it treats a selected working region as read\-only evidence, then uses a learned multi\-step tool\-using consolidator to synthesize a fresh compact replacement set that abstracts across sessions and supersedes the original region\. The replacement set is grounded in re\-readable source trajectories and trained with downstream task reward\.
## 3 Preliminaries
Figure 1: Memory primitives and operations\. \(A\) The memory bank $\mathcal{B}$ holds typed entries \(semantic or procedural\); each entry has a short name $n_i$, a body $s_i$, and provenance links to source trajectories in the trajectory log $\mathcal{T}$\. \(B\) The read operator retrieves the top\-$K$ entries by cosine similarity between a frozen sentence encoder $\phi$ applied to the query and to each entry’s name\-body text\. \(C\) The write operator applies a learnable consolidator $C_\theta$ to a working region $\mathcal{R} \subseteq \mathcal{B}$ and its provenance\-linked trajectories $\mathcal{T}_{\mathcal{R}}$, producing a replacement set $\mathcal{S}$ that supersedes $\mathcal{R}$ in the post\-consolidation bank $\mathcal{B}^{\star}$\.

Task setup\. A frozen task agent operates over a stream of sessions $\tau$, each yielding an action–observation trace and final outcome\. The agent has access to a typed long\-term memory bank $\mathcal{B}$ through a fixed retriever, and a trajectory log $\mathcal{T}$ that records raw sessions for provenance\. Offline consolidation leaves the task agent, retriever, and memory schema fixed; it transforms $\mathcal{B}$ into a post\-consolidation bank $\mathcal{B}^{\star}$\. Raw trajectories in $\mathcal{T}$ are not retrieved by the task agent at decision time but remain available to the consolidator as provenance evidence\.

Memory bank\. A memory entry is a typed textual abstraction, with a short name $n$, a body $s$, and provenance links to source trajectories in $\mathcal{T}$\. Each entry is either *semantic* \(factual environment knowledge, e\.g\., “the toiletpaperhanger is typically on the bathroom wall”\) or *procedural* \(reusable how\-to skills, e\.g\., “to cool an object, place it in the fridge, wait, then retrieve it”\)\. The memory bank $\mathcal{B}$ is a set of such entries\.
Memory operations\. The bank supports two complementary operations: an online read operator used by the frozen task agent, and an offline write operator learned by the consolidator:

$$\mathrm{Read}(q;\mathcal{B}) = \mathrm{Top}\text{-}K_{e\in\mathcal{B}}\,\cos\bigl(\phi(q),\phi(n_e \oplus s_e)\bigr), \qquad \mathrm{Write}(\mathcal{B},\mathcal{R},\mathcal{T}_{\mathcal{R}}) = (\mathcal{B}\setminus\mathcal{R}) \cup C_\theta(\mathcal{R},\mathcal{T}_{\mathcal{R}}) \tag{1}$$

Here $\phi$ is a frozen sentence encoder, $K$ is the size of the largest ranked prefix that fits the token budget, $\oplus$ denotes string concatenation, $\mathcal{R} \subseteq \mathcal{B}$ is the rewritten memory region, and $\mathcal{T}_{\mathcal{R}}$ collects the source trajectories linked from entries in $\mathcal{R}$, which the consolidator accesses via provenance\. The consolidator output $C_\theta(\mathcal{R},\mathcal{T}_{\mathcal{R}}) = \mathcal{S}$ is the replacement set, yielding $\mathcal{B}^{\star} = (\mathcal{B}\setminus\mathcal{R}) \cup \mathcal{S}$\. We refer to this write operation as *region rewriting*: entries in $\mathcal{R}$ are treated as read\-only evidence rather than edited in place, and entries outside $\mathcal{R}$ are left unchanged\.
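The two operators in Eq\. 1 can be sketched in a few lines of Python\. This is only an illustration under stated assumptions: the `Entry` class, the bag\-of\-words `embed` function \(a toy stand\-in for the frozen sentence encoder\), and the whitespace token count are all hypothetical and are not taken from the paper\.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical typed memory entry: name, body, type, provenance links."""
    name: str
    body: str
    kind: str               # "semantic" or "procedural"
    provenance: tuple = ()  # ids of source trajectories


def embed(text):
    """Toy stand-in for the frozen encoder phi: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def read(query, bank, token_budget):
    """Rank entries by cosine similarity to the query and keep the
    longest ranked prefix whose total size fits the token budget."""
    q = embed(query)
    ranked = sorted(
        bank, key=lambda e: cosine(q, embed(e.name + " " + e.body)), reverse=True
    )
    picked, used = [], 0
    for e in ranked:
        cost = len((e.name + " " + e.body).split())  # whitespace-token proxy
        if used + cost > token_budget:
            break
        picked.append(e)
        used += cost
    return picked


def write(bank, region, replacement):
    """Region rewriting: drop the region, add the replacement set.
    Entries outside the region are carried over untouched."""
    return (set(bank) - set(region)) | set(replacement)
```

Note that `write` never edits an entry in place: anything in the region that is not re\-expressed in the replacement set simply disappears from the bank, which is the omission\-based forgetting described above\.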
Evaluation tasks\. At training time, the consolidator is scored on an evaluation task set $\mathcal{V}$ randomly drawn from the training environment, disjoint from the trajectories used to construct training memory regions and from all tasks reported in Section [5](https://arxiv.org/html/2605.20616#S5)\.
## 4 Methodology
Figure 2: Auto\-Dreamer overview\. \(A\) A frozen writer appends typed entries from each trajectory $\tau_t$ to the memory bank $\mathcal{B}$\. \(B\) Every $k$ sessions, the consolidator $C_\theta$ rewrites a working region $\mathcal{R}$ into a replacement set $\mathcal{S}$ via a tool\-use rollout over memory and provenance trajectories\. \(C\) Training: $G$ group rollouts produce candidates $\{\mathcal{S}_g\}$, scored on evaluation tasks $\mathcal{V}$; GRPO updates $\theta$ using reward $r_g = U_{\mathcal{V}}(\mathcal{S}_g) + \alpha\, r_{\mathrm{cf}}(\mathcal{S}_g;\mathcal{V})$\. \(D\) The counterfactual term $r_{\mathrm{cf}}$ compares each $\mathcal{S}_g$ against masked variants, assigning high credit to useful entries, low credit to duplicates, and negative credit to harmful entries\.

Auto\-Dreamer instantiates the offline consolidation problem \(Section [3](https://arxiv.org/html/2605.20616#S3)\) with a learned tool\-using policy\. A fixed online writer provides fast acquisition by extracting typed memories from individual sessions\. A learned offline consolidator provides slow consolidation by rewriting selected memory regions after experience has accumulated\. The task agent, retriever, writer, memory schema, and token budget are all fixed; only the consolidator parameters are updated during training, and only the memory bank is updated after deployment\. At a consolidation event, a region selector provides $\mathcal{R} \subseteq \mathcal{B}$\. The consolidator $C_\theta$ performs a bounded tool\-use rollout: searching the bank, inspecting candidate entries, retrieving provenance\-linked source trajectories, and emitting synthesized entries\. The synthesized entries form a replacement set $\mathcal{S}$, which replaces $\mathcal{R}$ in the bank\.
### 4\.1 Designing Two\-timescale Memory for Auto\-Dreamer
Fast online acquisition\. After each session, a prompted writer language model emits typed memory entries that record potentially useful experience, each storing a provenance pointer to the source trajectory\. The writer is intentionally local and append\-only: it does not search the existing bank, compare entries across sessions, or rewrite old memories\. This favors plasticity: new experience is recorded immediately and cheaply\. Slower cross\-session operations \(merging, abstraction, correction, compression, forgetting\) are delegated to the consolidator\. The writer prompt and schema are given in Appendix [E](https://arxiv.org/html/2605.20616#A5)\.
Slow consolidation as region rewriting\. A bank produced by repeated local writing is useful but inefficient\. It may contain duplicate procedures, overspecific rules, stale facts, and partial observations whose common structure is only visible across sessions\. Auto\-Dreamer addresses this by learning to rewrite active memory regions\. Given a region $\mathcal{R} \subseteq \mathcal{B}$ and its provenance\-linked trajectories $\mathcal{T}_{\mathcal{R}}$, the consolidator performs a bounded tool\-use rollout over a fixed turn budget, with each step conditioned on $(\mathcal{R}, \mathcal{T}_{\mathcal{R}})$ and the history of previous tool calls and observations \(tool interface in Appendix [E\.2](https://arxiv.org/html/2605.20616#A5.SS2)\)\. The rollout ends when the policy emits `terminate` or reaches the budget\. The synthesized entries form the replacement set $\mathcal{S}$, and the bank is updated by replacing $\mathcal{R}$ with $\mathcal{S}$ according to Eq\. [1](https://arxiv.org/html/2605.20616#S3.E1)\.
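The control flow of such a bounded rollout might look like the sketch below\. The actual tool interface is specified in the paper's appendix and is not reproduced here, so the tool names \(`search`, `inspect`, `get_trajectory`, `emit`\), the `policy` callable, and the dictionary\-based region are all hypothetical placeholders chosen for illustration\.

```python
def rollout(region, trajectories, policy, max_turns):
    """Run a bounded tool-use rollout and return the replacement set.

    region:        dict entry_id -> text, treated as read-only evidence
    trajectories:  dict trajectory_id -> raw trajectory text (provenance)
    policy:        callable (region, history) -> (tool_name, argument);
                   stands in for the learned consolidator
    max_turns:     fixed turn budget
    """
    # Hypothetical read-only tools; none of them mutates the region.
    tools = {
        "search": lambda arg: [
            eid for eid, text in region.items() if arg.lower() in text.lower()
        ],
        "inspect": lambda arg: region.get(arg),
        "get_trajectory": lambda arg: trajectories.get(arg),
    }
    history, replacement = [], []
    for _ in range(max_turns):
        name, arg = policy(region, history)
        if name == "terminate":
            break  # the policy decided it is done before the budget ran out
        if name == "emit":
            replacement.append(arg)  # a synthesized entry joins the replacement set
            observation = "ok"
        elif name in tools:
            observation = tools[name](arg)
        else:
            observation = f"unknown tool: {name}"
        history.append((name, arg, observation))
    return replacement
```

The only way information reaches the output is through an explicit `emit` call, so a rollout that inspects ten entries and emits one yields a one\-entry replacement set\.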
This provenance\-grounded region\-replacement semantics is central to Auto\-Dreamer’s compactness\. In CRUD\-style memory managers, existing memories persist unless the controller explicitly edits or deletes them, so consolidation is expressed as many local retention decisions\. In region rewriting, the unit of rewriting is a region rather than an individual entry: the old entries serve as evidence, and only information re\-synthesized into the replacement set remains active\. This makes abstraction, deduplication, contradiction resolution, and omission\-based forgetting the default behaviors of the operator\. As a result, compactness arises from the primitive itself, while learning determines which compact abstractions are most useful for downstream tasks\.
### 4\.2 Deploying Auto\-Dreamer via Online Memory Acquisition
In the online setting, the trained consolidator updates the memory bank $\mathcal{B}$ using the Write operator from Eq\. [1](https://arxiv.org/html/2605.20616#S3.E1)\. We trigger consolidation every $k$ sessions and define the working region $\mathcal{R}$ as the union of entries newly written during the interval and older entries retrieved by the task agent during the same interval\. This working region bridges online memory acquisition and offline consolidation: newly written entries provide fresh evidence, while recently retrieved entries identify older memories currently interacting with the task agent’s behavior\. The consolidator treats $\mathcal{R}$ as read\-only evidence and synthesizes a replacement set $\mathcal{S}$, yielding $\mathcal{B}^{\star} = (\mathcal{B}\setminus\mathcal{R}) \cup \mathcal{S}$\. Entries outside $\mathcal{R}$ are left unchanged but may enter a future working region if retrieved in subsequent tasks\.
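The deployment loop described above can be sketched as follows\. The function and parameter names \(`deploy`, `run_session`, `consolidate`\) and the dictionary\-based bank are assumptions made for illustration; `run_session` stands in for the frozen task agent plus writer, and `consolidate` for the trained consolidator\.

```python
def deploy(sessions, k, run_session, consolidate):
    """Continual-memory deployment with consolidation every k sessions.

    run_session(task, bank) -> (new_entries, retrieved_ids)
        new_entries:   dict entry_id -> text written after the session
        retrieved_ids: ids of bank entries the agent retrieved
    consolidate(region) -> dict entry_id -> text (the replacement set)
    """
    bank = {}
    written, retrieved = set(), set()
    for t, task in enumerate(sessions, start=1):
        new_entries, used_ids = run_session(task, bank)
        bank.update(new_entries)      # fast, append-only acquisition
        written |= set(new_entries)
        retrieved |= set(used_ids)
        if t % k == 0:
            # Working region: newly written entries plus older entries
            # retrieved during the same interval.
            region = {
                eid: bank[eid] for eid in (written | retrieved) if eid in bank
            }
            replacement = consolidate(region)
            # Region rewriting: entries outside the region are untouched.
            bank = {eid: txt for eid, txt in bank.items() if eid not in region}
            bank.update(replacement)
            written, retrieved = set(), set()
    return bank
```

An entry that is never retrieved after it has been consolidated stays in the bank as is; it re\-enters a working region only if a later session retrieves it\.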
### 4\.3 Training Auto\-Dreamer via Offline Memory Consolidation
We train the consolidator on regions sampled from an offline corpus of agent trajectories\. We first collect trajectories from the training environments, run the fixed writer once, and store the resulting entries together with their provenance links\. At each training step, we sample $J$ support trajectories $\{\tau^{(j)}\}_{j=1}^{J}$ and form the working region $\mathcal{R}$ from their prewritten entries, with corresponding source trajectories $\mathcal{T}_{\mathcal{R}}$ accessible through provenance links\. The consolidator samples $G$ tool\-use rollouts over $(\mathcal{R}, \mathcal{T}_{\mathcal{R}})$; rollout $g$ produces a candidate replacement set $\mathcal{S}_g$\. For direct credit assignment, training evaluates each replacement set in a local bank consisting only of the synthesized entries: $\mathcal{B}^{\star}_g = \mathcal{S}_g$\.
Local evaluation for credit assignment\. Region rewriting turns consolidation into a locally evaluable operator\-learning problem\. Each rollout produces a self\-contained replacement set, so we evaluate it in a local bank consisting only of the synthesized memories, $\mathcal{B}^{\star}_g = \mathcal{S}_g$\. This aligns the unit of credit assignment with the unit produced by the policy: reward is assigned directly to the replacement set, rather than to a full bank whose performance may be explained by memories not produced by the current rollout\. In this way, we learn a region\-local improvement operator\. At deployment, the same learned operator rewrites a selected working region, while entries outside the working region are left unchanged, as in Eq\. [1](https://arxiv.org/html/2605.20616#S3.E1)\. The consolidation interface, provenance tools, schema, frozen writer, retriever, task agent, and memory\-token budget are shared across training and deployment\. Our continual\-memory deployment experiments in Section [5\.2](https://arxiv.org/html/2605.20616#S5.SS2) show that this region\-local objective transfers to persistent\-bank composition\.
We train the consolidator $C_\theta$ with GRPO\[[22](https://arxiv.org/html/2605.20616#bib.bib3)\]\. The writer, retriever, task agent, and memory\-token budget are fixed across rollouts; only the consolidator parameters $\theta$ are updated\. The consolidator is initialized from Qwen3\-14B\. Per\-environment data construction, rollout budgets, and optimization hyperparameters are given in Appendix [B\.3](https://arxiv.org/html/2605.20616#A2.SS3)\.
Reward design\. For rollout $g$, the reward combines downstream performance with a counterfactual estimate of memory utility:

$$r_g = U_{\mathcal{V}}(\mathcal{S}_g) + \alpha\, r_{\mathrm{cf}}(\mathcal{S}_g;\mathcal{V}). \tag{2}$$

The two terms are defined as

$$U_{\mathcal{V}}(\mathcal{S}) = \frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}} \mathrm{Return}(v;\mathcal{S}), \qquad r_{\mathrm{cf}}(\mathcal{S}_g;\mathcal{V}) = U_{\mathcal{V}}(\mathcal{S}_g) - \mathbb{E}_{\widetilde{\mathcal{S}}\sim q_\rho(\cdot\mid\mathcal{S}_g)}\bigl[U_{\mathcal{V}}(\widetilde{\mathcal{S}})\bigr]. \tag{3}$$

Here $\alpha$ weights the counterfactual term, and the distribution $q_\rho(\widetilde{\mathcal{S}}\mid\mathcal{S}_g)$ masks a fixed fraction $\rho$ of entries from $\mathcal{S}_g$ uniformly at random\. In practice, the expectation in $r_{\mathrm{cf}}$ is estimated with $M_g$ Monte Carlo samples\. The counterfactual term measures the expected performance drop under random ablation of the synthesized replacement set: masking load\-bearing entries lowers performance, masking redundant entries has little effect because duplicate information remains, and masking harmful entries can improve performance, making $r_{\mathrm{cf}}$ negative\. In the GRPO update, this favors replacement sets whose entries improve downstream utility with minimal redundancy\.
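A minimal Monte Carlo sketch of Eqs\. 2 and 3 follows\. The `rollout_return` callable is a hypothetical stand\-in for running the frozen task agent on one evaluation task with a given local bank, and rounding the masked count to at least one entry is an assumption, since the text only states that a fixed fraction of entries is masked\.

```python
import random


def utility(entries, tasks, rollout_return):
    """U_V(S): mean return over the evaluation tasks with local bank S."""
    return sum(rollout_return(v, entries) for v in tasks) / len(tasks)


def counterfactual_term(entries, tasks, rollout_return,
                        rho=0.5, num_samples=8, rng=None):
    """r_cf: utility of the full set minus the mean utility of randomly
    masked variants, estimated with num_samples Monte Carlo draws."""
    rng = rng or random.Random(0)
    entries = list(entries)
    if not entries:
        return 0.0
    full = utility(entries, tasks, rollout_return)
    # Assumption: mask round(rho * n) entries, but always at least one.
    n_masked = min(len(entries), max(1, round(rho * len(entries))))
    masked_utils = []
    for _ in range(num_samples):
        dropped = set(rng.sample(range(len(entries)), n_masked))
        kept = [e for i, e in enumerate(entries) if i not in dropped]
        masked_utils.append(utility(kept, tasks, rollout_return))
    return full - sum(masked_utils) / num_samples


def reward(entries, tasks, rollout_return, alpha=1.0, **mask_kwargs):
    """r_g = U_V(S_g) + alpha * r_cf(S_g; V)."""
    return utility(entries, tasks, rollout_return) + alpha * counterfactual_term(
        entries, tasks, rollout_return, **mask_kwargs
    )
```

With a toy return that is 1 when the fact a task needs is in the bank, two distinct load\-bearing entries give a positive counterfactual term, while two copies of the same entry give exactly zero because the surviving copy still solves the task\.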
Algorithm 1: Auto\-Dreamer Consolidator Training

1. Require: offline pool of trajectories with prewritten entries; frozen task agent; evaluation set $\mathcal{V}$; support size $J$; group size $G$; rollout budget $T_{\max}$
2. for each training step do
3. Sample $J$ support trajectories $\{\tau^{(j)}\}_{j=1}^{J}$ from the offline pool
4. Form working region $\mathcal{R} \leftarrow$ entries written from $\{\tau^{(j)}\}$, with provenance trajectories $\mathcal{T}_{\mathcal{R}}$
5. for $g = 1, \ldots, G$ do
6. $\mathcal{S}_g \leftarrow C_\theta(\mathcal{R}, \mathcal{T}_{\mathcal{R}})$ \(tool\-use rollout, up to $T_{\max}$ steps\)
7. $r_g \leftarrow U_{\mathcal{V}}(\mathcal{S}_g) + \alpha\, r_{\mathrm{cf}}(\mathcal{S}_g;\mathcal{V})$ \(Eq\. [2](https://arxiv.org/html/2605.20616#S4.E2), local bank $\mathcal{B}^{\star}_g = \mathcal{S}_g$\)
8. end for
9. Update $\theta$ via GRPO using $\{r_g\}_{g=1}^{G}$
10. end for
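Steps 5 to 9 of Algorithm 1 can be illustrated with the group\-relative advantage that GRPO is commonly described as using, where each reward is normalized by the mean and standard deviation of its group\. The paper does not spell out its exact GRPO variant here, so the normalization below, the `eps` constant, and the function names are assumptions; the policy\-gradient update itself \(clipping, KL regularization, optimizer step\) is omitted\.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r_g - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def training_step(region, sample_rollout, reward_fn, group_size):
    """One step of the inner loop, without the parameter update.

    sample_rollout(region) -> candidate replacement set (one rollout)
    reward_fn(candidate)   -> scalar reward, e.g. Eq. 2 on a local bank
    Returns (candidate, advantage) pairs that a GRPO update would consume.
    """
    candidates = [sample_rollout(region) for _ in range(group_size)]
    rewards = [reward_fn(s) for s in candidates]
    return list(zip(candidates, group_advantages(rewards)))
```

Because advantages are computed within a group that shares the same working region, a rollout is credited only for being better or worse than its siblings on identical evidence\.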
## 5 Experiments
We evaluate Auto\-Dreamer along three axes\. Section [5\.2](https://arxiv.org/html/2605.20616#S5.SS2) studies realistic continual\-memory deployment, showing improved task success over per\-session memory baselines with a compact memory bank\. Section [5\.3](https://arxiv.org/html/2605.20616#S5.SS3) isolates the consolidation operator in a fixed\-bank setting and shows gains over prompted offline baselines\. Section [5\.4](https://arxiv.org/html/2605.20616#S5.SS4) ablates the key design choices, highlighting the roles of offline consolidation and the counterfactual utility reward\.
### 5\.1 Experimental Settings
Tasks\. We evaluate Auto\-Dreamer on three benchmarks that differ in domain and difficulty: ALFWorld\[[24](https://arxiv.org/html/2605.20616#bib.bib6)\] \(household instruction\-following\), ScienceWorld\[[28](https://arxiv.org/html/2605.20616#bib.bib7)\] \(text\-based science experiments\), and WebArena\[[44](https://arxiv.org/html/2605.20616#bib.bib10)\] \(web navigation; shopping, shopping\_admin, gitlab\)\.
Models\. The frozen task agent is shared across all methods within a domain: Qwen3\.5\-9B on ALFWorld and ScienceWorld, and Gemini\-3\-flash\-preview\[[3](https://arxiv.org/html/2605.20616#bib.bib64)\] on WebArena\. For RL\-trained memory baselines, we use the released 4B Mem\-$\alpha$ checkpoint and a Qwen3\-14B UMEM model trained on ALFWorld trajectories\. All other baselines and Auto\-Dreamer’s per\-session writer use Qwen3\-14B on ALFWorld and ScienceWorld, and Gemini\-3\.1\-flash\-lite\-preview\[[4](https://arxiv.org/html/2605.20616#bib.bib63)\] on WebArena, ensuring no baseline is disadvantaged by a weaker memory\-generation LLM\. The Auto\-Dreamer consolidator $C_\theta$ is initialized from Qwen3\-14B, trained on ScienceWorld trajectories only, and applied without further updates on all three domains, including across the writer\-backbone shift on WebArena\.
Baselines\. We compare against ten baselines spanning six families: no memory \(No memory\); reflective and insight extraction \(Reflexion\[[23](https://arxiv.org/html/2605.20616#bib.bib49)\], ExpeL\[[42](https://arxiv.org/html/2605.20616#bib.bib26)\]\); workflow and procedural memory \(AWM\[[31](https://arxiv.org/html/2605.20616#bib.bib28)\], Memp\[[6](https://arxiv.org/html/2605.20616#bib.bib14)\]\); structured stores \(ReasoningBank\[[20](https://arxiv.org/html/2605.20616#bib.bib25)\], Mem0\[[2](https://arxiv.org/html/2605.20616#bib.bib27)\]\); two\-timescale prompted memory \(LightMem\[[5](https://arxiv.org/html/2605.20616#bib.bib19)\], the closest architectural counterpart to our method\); and RL\-trained writers \(Mem\-$\alpha$\[[30](https://arxiv.org/html/2605.20616#bib.bib29)\], UMEM\[[37](https://arxiv.org/html/2605.20616#bib.bib15)\]\)\. Detailed descriptions and per\-baseline implementation are in Appendix [B\.1](https://arxiv.org/html/2605.20616#A2.SS1)\.
Metrics\. We report task success rate \(SR, %\), macro\-averaged over task families within each domain, and final active memory\-bank size in tokens \(\#Tok\)\. For continual\-memory deployment, we additionally report the normalized area under the cumulative success curve \(AUC $\in [0,1]$\) over the task stream\.
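The two success metrics can be sketched as below\. Macro\-averaging follows directly from the description above\. The exact normalization of the AUC is not given in this section, so `normalized_auc` implements one plausible reading \(the mean running success rate over the task stream\) and should be treated as an assumption rather than the paper's definition\.

```python
from collections import defaultdict


def macro_success_rate(results):
    """SR in percent, macro-averaged over task families.

    results: iterable of (task_family, success) pairs.
    """
    by_family = defaultdict(list)
    for family, ok in results:
        by_family[family].append(1.0 if ok else 0.0)
    per_family = [sum(v) / len(v) for v in by_family.values()]
    return 100.0 * sum(per_family) / len(per_family)


def normalized_auc(outcomes):
    """Assumed definition: average over the stream of the running
    success rate (successes so far / tasks so far), which lies in [0, 1]."""
    successes, area = 0, 0.0
    for t, ok in enumerate(outcomes, start=1):
        successes += int(ok)
        area += successes / t
    return area / len(outcomes)
```

Under this reading, two streams with the same final success rate can differ in AUC: succeeding early in the stream scores higher than succeeding late\.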
Table 1: Memory evaluation across continual deployment and controlled consolidation\. Panel A reports continual\-memory deployment where memory starts empty and is updated from trajectories collected during evaluation\. Panel B isolates the consolidation operator by giving each method the same fixed initial bank $\mathcal{B}_0$ and evaluating on held\-out tasks with a frozen task agent\. For SR and AUC, higher is better; for \#Tok\., lower is better\. Bold and underline denote the best and second\-best results\.

\(A\) Continual\-memory deployment

| Method | ALFWorld SR | ALFWorld \#Tok\. | ALFWorld AUC | ScienceWorld SR | ScienceWorld \#Tok\. | ScienceWorld AUC | WebArena SR | WebArena \#Tok\. | WebArena AUC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Baseline* | | | | | | | | | |
| No memory | 30\.8 | 0 | 0\.287 | 28\.7 | 0 | 0\.295 | 44\.6 | 0 | 0\.527 |
| *Reflective / insight extraction* | | | | | | | | | |
| Reflexion\[[23](https://arxiv.org/html/2605.20616#bib.bib49)\] | 49\.2 | 11,967 | 0\.475 | 29\.6 | 49,936 | 0\.306 | 46\.4 | 8,148 | 0\.567 |
| ExpeL\[[42](https://arxiv.org/html/2605.20616#bib.bib26)\] | 33\.6 | 2,042 | 0\.306 | 28\.3 | 11,628 | 0\.289 | 50\.8 | 3,371 | 0\.576 |
| *Workflow / procedural memory* | | | | | | | | | |
| AWM\[[31](https://arxiv.org/html/2605.20616#bib.bib28)\] | 32\.8 | 16,846 | 0\.306 | 30\.2 | 3,877 | 0\.311 | 52\.0 | 890 | 0\.597 |
| Memp\[[6](https://arxiv.org/html/2605.20616#bib.bib14)\] | 31\.4 | 6,731 | 0\.297 | 30\.4 | 11,531 | 0\.306 | 50\.8 | 4,973 | 0\.574 |
| *Structured stores* | | | | | | | | | |
| ReasoningBank\[[20](https://arxiv.org/html/2605.20616#bib.bib25)\] | 31\.1 | 42,784 | 0\.285 | 30\.9 | 155,117 | 0\.324 | 49\.8 | 24,362 | 0\.581 |
| Mem0\[[2](https://arxiv.org/html/2605.20616#bib.bib27)\] | 30\.6 | 119,013 | 0\.279 | 26\.8 | 89,854 | 0\.281 | 50\.3 | 43,355 | 0\.571 |
| *Two\-timescale prompted* | | | | | | | | | |
| LightMem\[[5](https://arxiv.org/html/2605.20616#bib.bib19)\] | 31\.2 | 130,001 | 0\.288 | 28\.1 | 272,074 | 0\.287 | 52\.0 | 370,874 | 0\.567 |
| *RL\-trained writers* | | | | | | | | | |
| Mem\-$\alpha$\[[30](https://arxiv.org/html/2605.20616#bib.bib29)\] | 57\.4 | 125,635 | 0\.566 | 30\.0 | 344,599 | 0\.297 | † | | |
| UMEM\[[37](https://arxiv.org/html/2605.20616#bib.bib15)\] | 58\.4 | 62,947 | 0\.564 | 34\.1 | 80,918 | 0\.353 | † | | |
| *Ours* | | | | | | | | | |
| Auto\-Dreamer | 60\.2 | 10,954 | 0\.585 | 41\.1 | 6,947 | 0\.411 | 52\.3 | 927 | 0\.628 |

\(B\) Bank consolidation: Control study

| Method | ALFWorld SR | ALFWorld \#Tok\. | ScienceWorld SR | ScienceWorld \#Tok\. |
| --- | --- | --- | --- | --- |
| *Baseline* | | | | |
| No memory | 24\.5 | 0 | 29\.6 | 0 |
| *Reflective / insight extraction* | | | | |
| Reflexion\[[23](https://arxiv.org/html/2605.20616#bib.bib49)\] | 46\.3 | 455 | 31\.0 | 661 |
| ExpeL\[[42](https://arxiv.org/html/2605.20616#bib.bib26)\] | 70\.1 | 2,942 | 38\.1 | 808 |
| *Workflow / procedural memory* | | | | |
| AWM\[[31](https://arxiv.org/html/2605.20616#bib.bib28)\] | 66\.9 | 758 | 33\.6 | 103 |
| Memp\[[6](https://arxiv.org/html/2605.20616#bib.bib14)\] | 67\.3 | 927 | 40\.7 | 351 |
| *Structured stores* | | | | |
| ReasoningBank\[[20](https://arxiv.org/html/2605.20616#bib.bib25)\] | 43\.8 | 3,021 | 25\.0 | 3,221 |
| Mem0\[[2](https://arxiv.org/html/2605.20616#bib.bib27)\] | 66\.1 | 4,953 | 42\.0 | 1,609 |
| *Two\-timescale prompted* | | | | |
| LightMem\[[5](https://arxiv.org/html/2605.20616#bib.bib19)\] | 40\.6 | 243 | 31\.8 | 215 |
| *RL\-trained writers* | | | | |
| Mem\-$\alpha$\[[30](https://arxiv.org/html/2605.20616#bib.bib29)\] | 56\.6 | 1,676 | 25\.7 | 1,142 |
| UMEM\[[37](https://arxiv.org/html/2605.20616#bib.bib15)\] | 54\.2 | 5,184 | 29\.3 | 1,298 |
| *Ours* | | | | |
| Auto\-Dreamer | 72\.7 | 634 | 44\.3 | 539 |

† UMEM and Mem\-$\alpha$ rely on small open\-source memory\-optimizer models whose context windows cannot accommodate WebArena’s accessibility\-tree observations together with the memory bank\. We do not retrain with larger backbones because doing so departs from the published configurations and exceeds our compute budget\.
### 5\.2 Main Results
We evaluate Auto\-Dreamer in the continual\-memory deployment regime, where memory starts empty and is updated from trajectories collected during evaluation\. After each completed session, the writer adds entries; methods with offline consolidation invoke their updater at a fixed cadence of $k$ sessions, with values reported in Appendix [B\.2](https://arxiv.org/html/2605.20616#A2.SS2)\.
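The deployment protocol above can be sketched as a simple loop\. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_session`, `write`, and `consolidate` are hypothetical callables, and `consolidate` here receives the whole bank, whereas Auto\-Dreamer rewrites a selected working region\.

```python
from typing import Callable, List

Entry = str
Bank = List[Entry]


def run_stream(
    tasks: List[str],
    run_session: Callable[[str, Bank], str],   # hypothetical: returns a trajectory
    write: Callable[[str], List[Entry]],       # hypothetical per-session writer
    consolidate: Callable[[Bank], Bank],       # hypothetical offline consolidator
    k: int,
) -> Bank:
    bank: Bank = []                            # memory starts empty
    for i, task in enumerate(tasks, start=1):
        trajectory = run_session(task, bank)
        bank.extend(write(trajectory))         # fast acquisition after every session
        if i % k == 0:
            bank = consolidate(bank)           # slow consolidation every k sessions
    return bank
```

Setting `k` larger than the stream length disables consolidation in this sketch, which corresponds to a writer\-only configuration\.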
Auto\-Dreamer improves success while keeping memory compact\. Table [1](https://arxiv.org/html/2605.20616#S5.T1) shows that Auto\-Dreamer achieves the strongest overall continual\-memory performance across all three domains\. On ScienceWorld, it reaches 41\.1% SR, improving over the strongest baseline UMEM by 7\.0 points \(34\.1%\) and over the strongest prompted baseline ReasoningBank by 10\.2 points \(30\.9%\)\. Its AUC also improves over UMEM, from 0\.353 to 0\.411, indicating that the gain appears throughout the stream rather than only at the end\. The same pattern holds on ALFWorld, where Auto\-Dreamer achieves 60\.2% SR compared with 58\.4% for UMEM and 57\.4% for Mem\-$\alpha$, and on WebArena, where it obtains the highest SR despite the longer\-horizon tasks and noisier accessibility\-tree observations\.
These gains are achieved with substantially lower retrieval\-time memory cost\. On ScienceWorld, Auto\-Dreamer uses 6\.9k memory tokens, compared with 80\.9k for UMEM and 155\.1k for ReasoningBank\. On WebArena, it uses only 927 tokens, compared with 370k for LightMem and 43\.4k for Mem0, while still obtaining the best SR and AUC\. Figure [3\(a\)](https://arxiv.org/html/2605.20616#S5.F3.sf1) summarizes this tradeoff: baselines that approach Auto\-Dreamer’s success typically require much larger banks, whereas compact baselines generally sacrifice success\. Auto\-Dreamer therefore occupies the favorable region of the success–cost plane\.
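The success–cost comparison can be checked directly from the ScienceWorld columns of Table 1, Panel A\. The sketch below uses a hypothetical helper \(`pareto_frontier` is not from the paper\) that keeps the methods no other method beats on both SR and bank size, and computes the bank\-size ratio against UMEM\.

```python
# (SR %, final bank size in tokens) from Table 1, Panel A, ScienceWorld.
SCIENCEWORLD = {
    "No memory": (28.7, 0),
    "Reflexion": (29.6, 49_936),
    "ExpeL": (28.3, 11_628),
    "AWM": (30.2, 3_877),
    "Memp": (30.4, 11_531),
    "ReasoningBank": (30.9, 155_117),
    "Mem0": (26.8, 89_854),
    "LightMem": (28.1, 272_074),
    "Mem-alpha": (30.0, 344_599),
    "UMEM": (34.1, 80_918),
    "Auto-Dreamer": (41.1, 6_947),
}


def pareto_frontier(points):
    """Names of methods not beaten on both axes (higher SR, fewer tokens)."""
    frontier = []
    best_sr = float("-inf")
    # Walk from the smallest bank to the largest; ties go to the higher SR first.
    ordered = sorted(points.items(), key=lambda kv: (kv[1][1], -kv[1][0]))
    for name, (sr, _tokens) in ordered:
        if sr > best_sr:          # strictly better than every smaller bank
            frontier.append(name)
            best_sr = sr
    return frontier


frontier = pareto_frontier(SCIENCEWORLD)
ratio = SCIENCEWORLD["UMEM"][1] / SCIENCEWORLD["Auto-Dreamer"][1]
print(frontier, round(ratio, 1))
```

On these numbers the frontier contains only No memory, AWM, and Auto\-Dreamer, and UMEM’s final bank is roughly 11\.6 times larger than Auto\-Dreamer’s\.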
The learned consolidator transfers across task domains and writer backbones\. Auto\-Dreamer’s consolidator is trained only on ScienceWorld trajectories, yet it improves continual\-memory performance on held\-out ALFWorld and WebArena without additional updates\. This demonstrates transfer not only across task domains, but also across memory\-acquisition backbones: the same trained consolidator is paired with a Qwen3\-14B writer on ALFWorld and ScienceWorld and with a Gemini\-3\.1\-flash\-lite\-preview writer on WebArena\. This supports the domain\- and writer\-agnostic design of the consolidation interface in §[4\.1](https://arxiv.org/html/2605.20616#S4.SS1): the consolidator operates over typed textual memory entries and provenance\-linked trajectory excerpts, rather than environment\-specific state representations, action symbols, or writer\-specific hidden states\.
The online results also test the training–deployment approximation in §[4\.3](https://arxiv.org/html/2605.20616#S4.SS3)\. Although training evaluates rewritten regions locally for credit assignment, Panel A evaluates repeated composition into a persistent, growing bank\. Auto\-Dreamer’s gains show that the locally trained rewrite operator remains effective under retrieval competition and interaction with older memories\.
Figure 3: Memory efficiency, reward ablation, and consolidator analysis\. Panel titles: \(a\) success–cost tradeoff; \(b\) bank growth, colors as in \(a\); \(c\) dropout ablation: score; \(d\) dropout ablation: bank size; \(e\) provenance fan\-in distribution; \(f\) consolidation cadence\. \(a\) Auto\-Dreamer lies on the Pareto frontier of task success versus retrieval\-time memory cost\. \(b\) Auto\-Dreamer maintains a compact memory bank while most baseline methods grow monotonically as the task stream lengthens\. \(c,d\) The counterfactual utility reward preserves task performance while bounding bank growth during training\. \(e\) Provenance fan\-in distribution peaks at fan\-in $=5$ on both ScienceWorld and ALFWorld, indicating that synthesized entries typically draw on multiple source memories rather than being one\-to\-one copies\. \(f\) Consolidation cadence sweep: ScienceWorld performs best at the main cadence, while ALFWorld is comparatively robust across settings\.
### 5\.3 Discussion on Bank Consolidation
The continual\-memory deployment in §[5\.2](https://arxiv.org/html/2605.20616#S5.SS2) measures end\-to\-end performance with many entangled factors: writer quality, retrieval competition over a growing bank, consolidation cadence, and downstream agent decisions over many sessions\. We complement it with a controlled study that isolates the consolidation operator itself\.
Setup\. Each method receives an identical fixed initial bank $\mathcal{B}_0$ constructed in advance from a sampled pool of trajectories run through the same fixed writer\. The consolidator is invoked exactly once on $\mathcal{B}_0$, producing a consolidated bank $\mathcal{B}^{\star}$\. We then evaluate the frozen task agent equipped with $\mathcal{B}^{\star}$ on a held\-out task set drawn from the same environment family\. Methods that lack a consolidation step \(e\.g\., Reflexion, AWM\) are evaluated directly on $\mathcal{B}_0$\. This setup mirrors the training distribution of $C_\theta$ \(§[4\.3](https://arxiv.org/html/2605.20616#S4.SS3)\) and decouples consolidation quality from the dynamics of an evolving stream\. We use ALFWorld and ScienceWorld; WebArena is omitted because the long\-horizon, stateful nature of web tasks precludes constructing a meaningful fixed\-bank evaluation that does not collapse into a stream\.
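The controlled protocol reduces to a single consolidation call followed by frozen\-agent evaluation\. A minimal sketch, assuming hypothetical `consolidate` and `solve` callables and reporting a plain \(not macro\-averaged\) success rate for brevity:

```python
from typing import Callable, List, Optional

Bank = List[str]


def controlled_eval(
    b0: Bank,
    consolidate: Optional[Callable[[Bank], Bank]],  # None for methods without consolidation
    solve: Callable[[str, Bank], bool],             # hypothetical frozen task agent
    heldout_tasks: List[str],
) -> float:
    # The consolidator runs exactly once on a copy, so the shared B0 is never mutated.
    b_star = consolidate(list(b0)) if consolidate is not None else list(b0)
    successes = [solve(task, b_star) for task in heldout_tasks]
    return 100.0 * sum(successes) / len(successes)
```

Passing `None` for `consolidate` corresponds to methods that are evaluated directly on the initial bank\.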
Auto\-Dreamer also leads in the controlled regime\. Panel B of Table [1](https://arxiv.org/html/2605.20616#S5.T1) again places Auto\-Dreamer first: it reaches 72\.7% on ALFWorld and 44\.3% on ScienceWorld, leading the strongest baseline on each \(ExpeL at 70\.1%, Mem0 at 42\.0%\) by 2\.6 and 2\.3 points respectively, with comparable or smaller memory cost\. On ScienceWorld, the margin is narrower than in continual deployment \(2\.3 vs\. 7\.0 points\), which is consistent with the controlled regime removing the cumulative retrieval\-competition effects that further amplify Auto\-Dreamer’s advantage at deployment time\.
### 5\.4 Ablation Studies
Ablating offline consolidation\. We ablate offline consolidation with two variants: *writer\-only*, which keeps the same task agent, retriever, schema, and per\-session writer but disables consolidation; and *untrained*, which uses the same region rewriting mechanism, tool\-use rollout, and provenance grounding but without GRPO training\.
Table [2](https://arxiv.org/html/2605.20616#S5.T2) shows that the untrained pipeline already accounts for most of the bank\-size reduction over writer\-only \(6–11$\times$ smaller across all three domains\), supporting the view that region rewriting is the primary compactness mechanism: replacing a region with a synthesized set induces compression before learning\. GRPO training then changes the quality and selectivity of the rewrite policy\. On ScienceWorld and WebArena, training adds substantial SR gains \(\+9\.7pp and \+5\.7pp\); on ALFWorld, the aggregate gain is smaller \(\+1\.0pp\) and the active bank grows\. Family\-level analysis reveals heterogeneous effects within ALFWorld: training helps on multi\-step manipulation tasks but hurts on the single location\-sensitive family look\_at\_obj\_in\_light, where task\-specific location details are lost under abstraction\. Excluding this family widens the trained\-vs\-untrained SR margin on the remaining five categories to $+4.9$pp, and we analyze this failure mode as Pattern 3 in §[5\.5](https://arxiv.org/html/2605.20616#S5.SS5)\.
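The effect of excluding one family follows from the macro\-averaged SR metric\. As a back\-of\-the\-envelope check, assuming six equally weighted ALFWorld families and treating the rounded margins quoted above as exact, the trained\-vs\-untrained margin implied for the excluded family can be recovered as follows \(the helper name is hypothetical, and the result is derived rather than reported\):

```python
def implied_family_margin(margin_all: float, margin_rest: float, n_families: int) -> float:
    """Margin on the one excluded family under equal-weight macro-averaging.

    Rearranges: margin_all = (margin_excluded + (n - 1) * margin_rest) / n
    """
    return n_families * margin_all - (n_families - 1) * margin_rest


# +1.0pp over all six families, +4.9pp over the remaining five (values from the text).
gap = implied_family_margin(margin_all=1.0, margin_rest=4.9, n_families=6)
print(round(gap, 1))
```

Under these assumptions the implied margin on the excluded family is about $-18.5$pp, which illustrates how a single family can mask most of the gain on the other five\.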
Table 2: Effect of offline consolidation\. Untrained Auto\-Dreamer uses the same consolidator pipeline \(region rewriting, tool\-use rollout, provenance grounding\) but without RL updates; comparing trained vs\. untrained isolates the contribution of GRPO training\.

Ablating counterfactual utility reward\. We next train Auto\-Dreamer with and without the counterfactual utility term $r_{\mathrm{cf}}$\. Figures [3\(c\)](https://arxiv.org/html/2605.20616#S5.F3.sf3) and [3\(d\)](https://arxiv.org/html/2605.20616#S5.F3.sf4) report the resulting training dynamics\. Without $r_{\mathrm{cf}}$, the consolidator continues to improve raw environment score, but the bank grows rapidly over training\. With $r_{\mathrm{cf}}$, bank size first grows and then shrinks as training proceeds, while task performance remains competitive with the unshaped variant\. This supports the intended role of counterfactual utility: it encourages compact, load\-bearing memory banks without a measurable cost in task success\.
Ablating consolidation cadence\. We sweep the consolidation cadence $k \in \{1, 5, 20\}$ on ALFWorld and ScienceWorld, holding all other factors fixed\. Figure [3\(f\)](https://arxiv.org/html/2605.20616#S5.F3.sf6) reports $\Delta$SR relative to the main\-experiment cadence \($k=8$ on ALFWorld, $k=10$ on ScienceWorld\)\. ScienceWorld shows a larger cadence effect, with the main cadence outperforming both more frequent and less frequent consolidation\. Performance is lower when consolidation is too frequent \($k=1$\), suggesting that very small intervals provide insufficient cross\-session evidence for useful abstraction\. Performance also drops when consolidation is too infrequent \($k=20$\), consistent with the consolidator being asked to process a larger and noisier working region that strains its context and tool\-use budget\. ALFWorld is more robust to cadence, with a range of roughly 5pp; $k=5$ slightly outperforms the main setting by 1\.1pp\.
### 5\.5 Qualitative Analysis
Multi\-source consolidation\. Figure [3\(e\)](https://arxiv.org/html/2605.20616#S5.F3.sf5) reports the distribution of *provenance fan\-in*, the number of source entries cited by each synthesized output\. On both ScienceWorld and ALFWorld, the distribution peaks at fan\-in $=5$, indicating that synthesized entries typically draw on multiple source memories rather than being one\-to\-one copies\. We next examine three qualitative patterns: two successful patterns drawn from matched tasks where the same agent succeeds with the Auto\-Dreamer bank but fails with the writer\-only bank, and one failure pattern where abstraction discards task\-specific details\.
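Fan\-in can be computed directly from provenance links\. The sketch below assumes a hypothetical mapping from each synthesized entry id to the list of source entry ids it cites; this data layout is an illustration and is not taken from the paper\.

```python
from collections import Counter
from typing import Dict, List


def fan_in_histogram(provenance: Dict[str, List[str]]) -> Counter:
    """Map each fan-in value (distinct cited sources) to the number of synthesized entries."""
    return Counter(len(set(sources)) for sources in provenance.values())


# Toy provenance table: synthesized entry id -> cited source entry ids.
example = {
    "e1": ["s1", "s2", "s3"],
    "e2": ["s1"],
    "e3": ["s4", "s5", "s5", "s6"],   # the repeated citation counts once
}
hist = fan_in_histogram(example)
mode = hist.most_common(1)[0][0]
```

A fan\-in of 1 corresponds to a one\-to\-one copy or rewrite, so a mode well above 1 indicates that entries are synthesized from several sources\.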
Case Study: Slot Abstraction \(find\-entity, task 90\)

Task: Find a living thing and place it in the red box in the kitchen\.

| Writer memory \(3 hardcoded entries\) | Auto\-Dreamer memory |
| --- | --- |
| ✗ “move to blue box in living room” | ✓ “move to designated container \(yellow, red, purple, or orange box\)” |
| ✗ “move to green box in bathroom” | ✓ “teleport to target room \(bedroom, workshop, etc\.\)” |
| ✗ “move to green box in bathroom” \(duplicate\) | ✓ “living things typically outside, greenhouse, or art studio” |
| Task agent \(fail, 8 steps\): | Task agent \(success, 6 steps\): |
| ✗ No memory matches \(red, kitchen\); agent searches workshop freezer \(no living thing inside\) | ✓ Goes to greenhouse, finds cherry tree \(a living thing\) |
| ✗ Wanders to kitchen, focuses on an orange \(a fruit, not living\) | ✓ Carries cherry tree to kitchen, places in red box |
| ✗ Score $-1.0$ | ✓ Score $1.0$ |
Pattern 2: filtering wrong and contradicting entries via abstraction\. Writer entries from different past tasks can encode mutually contradictory claims about the same task element, or contain phrasing errors carried over from an LLM\-generated trace\. The agent, treating writer memory as authoritative, can act on a wrong entry and become stuck\. Rather than adjudicating among conflicting specifics or propagating phrasing errors, the consolidator drops these entries and emits a higher\-level rule that preserves the shared task structure while leaving instance\-specific answers to in\-context reasoning\.
Case Study: Filtering Wrong Entries \(lifespan\-compare, task 61\)

Task: Find the animal with the longest, then the shortest, lifespan\.

| Writer memory \(contradictory \+ wrong\) | Auto\-Dreamer memory |
| --- | --- |
| ✗ “longest is crocodile” | ✓ “compare lifespans of the listed candidates” |
| ✗ “longest is sea turtle” \(contradicts above\) | ✓ “focus on longest first, then shortest” |
| ✗ “shortest is mouse” | ✓ “focus the adult life stage” |
| ✗ “focus on adult and baby elephant” \(writer error: trace only focused adult\) | |
| Task agent \(fail, 18 steps\): | Task agent \(success, 3 steps\): |
| ✗ Follows the elephant entry; outputs action “focus on adult adult elephant” | ✓ Identifies elephant as the longest\-lived candidate, focuses on it |
| ✗ Env: “no such entity”; agent retries 16 times, never adapts | ✓ Identifies dragonfly as the shortest\-lived, focuses on it |
| ✗ Step cap reached, score $-1.0$ | ✓ Score $1.0$ |
Pattern 3 \(failure mode\): over\-compression of locally useful details\. We compare Auto\-Dreamer with the untrained variant\. Both perform region rewriting; they differ only in whether the consolidator policy is trained\. Although the trained consolidator improves performance in most task families, it underperforms the untrained consolidator on look\_at\_obj\_in\_light\. In this task, the concrete locations of the target object and the light source can help disambiguate where to search and where to examine the object\. These details are episodic in form, but can still be useful for the current task\. The trained consolidator tends to replace such task\-specific paths with generic procedural entries, while the untrained consolidator retains more task\-specific details\. This suggests that the learned policy can over\-compress specific facts that are locally useful\. Quantitatively, this single category drives the bulk of the ALFWorld trained\-vs\-untrained gap in Table [2](https://arxiv.org/html/2605.20616#S5.T2): removing look\_at\_obj\_in\_light widens the SR margin from $+1.0$pp over all families to $+4.9$pp on the remaining five \($65.8$ vs\. $60.9$\)\.
Case Study: Over\-Abstraction \(look\_at\_obj\_in\_light, task 308\)

Task: Examine the alarmclock under the desklamp \(alarmclock on desk\-2, desklamp on desk\-1\)\.

| Auto\-Dreamer \(untrained\) memory | Auto\-Dreamer memory |
| --- | --- |
| ✓ “alarmclock on desk\-2, desklamp on desk\-1; take then go” | ✗ “Key considerations for interacting with light sources” |
| ✓ “avoid repeated examine of desk\-1 when target is on desk\-2” | ✗ “Step\-by\-step process for examining an object” |
| Task agent \(success\): | Task agent \(fail\): |
| ✓ Takes alarmclock to desk\-1, examines under desklamp | ✗ Lacks location info; loops examining wrong desk |
| ✓ Score $1.0$ | ✗ Score $-1.0$ |
## 6 Conclusion
We presented Auto\-Dreamer, a two\-timescale memory system that pairs fast per\-session writing with learned offline region rewriting\. By separating online acquisition from slow cross\-session consolidation, Auto\-Dreamer matches or exceeds the strongest memory baselines on ALFWorld, ScienceWorld, and WebArena while maintaining substantially smaller active memory banks\. A consolidator trained only on ScienceWorld transfers to held\-out domains and across a writer\-backbone shift, supporting the view that consolidation over textual memory entries and source trajectories can be domain\- and writer\-agnostic\. Several extensions follow naturally\. The current consolidator rewrites one working region per event; future work could maintain longer\-range bank structure, jointly optimize retrieval and consolidation, or handle multimodal source trajectories\. More broadly, offline learned consolidation may be useful whenever agent experience accumulates faster than it can be reorganized in\-session\.
## References
- \[1\] \(2025\-11\) DecisionFlow: advancing large language model as principled decision maker\. In Findings of the Association for Computational Linguistics: EMNLP 2025, pp\. 16668–16692\. External Links: ISBN 979\-8\-89176\-335\-7\.
- \[2\] P\. Chhikara, D\. Khant, S\. Aryan, T\. Singh, and D\. Yadav \(2025\) Mem0: building production\-ready AI agents with scalable long\-term memory\. In European Conference on Artificial Intelligence \(ECAI\)\.
- \[3\] Google DeepMind \(2025\-12\) Gemini 3 Flash model card\. Technical report, Google DeepMind\. External Links: [Link](https://deepmind.google/models/model-cards/gemini-3-flash/)\.
- \[4\] Google DeepMind \(2026\-03\) Gemini 3\.1 Flash\-Lite model card\. Technical report, Google DeepMind\. External Links: [Link](https://deepmind.google/models/model-cards/gemini-3-1-flash-lite/)\.
- \[5\] J\. Fang, X\. Deng, H\. Xu, Z\. Jiang, Y\. Tang, Z\. Xu, S\. Deng, Y\. Yao, M\. Wang, S\. Qiao, H\. Chen, and N\. Zhang \(2025\) LightMem: lightweight and efficient memory\-augmented generation\. arXiv preprint arXiv:2510\.18866\.
- \[6\] R\. Fang, Y\. Liang, X\. Wang, J\. Wu, S\. Qiao, P\. Xie, F\. Huang, H\. Chen, and N\. Zhang \(2025\) Memp: exploring agent procedural memory\. arXiv preprint arXiv:2508\.06433\.
- \[7\] D\. Hafner, T\. Lillicrap, J\. Ba, and M\. Norouzi \(2020\) Dream to control: learning behaviors by latent imagination\. In International Conference on Learning Representations \(ICLR\)\.
- \[8\] D\. Hafner, J\. Pasukonis, J\. Ba, and T\. Lillicrap \(2023\) Mastering diverse domains through world models\. arXiv preprint arXiv:2301\.04104\.
- \[9\] C\. Hu, X\. Gao, Z\. Zhou, D\. Xu, Y\. Bai, X\. Li, H\. Zhang, T\. Li, C\. Zhang, L\. Bing, et al\. \(2026\) EverMemOS: a self\-organizing memory operating system for structured long\-horizon reasoning\. arXiv preprint arXiv:2601\.02163\.
- \[10\] Y\. Hu, S\. Liu, Y\. Yue, G\. Zhang, B\. Liu, F\. Zhu, J\. Lin, H\. Guo, S\. Dou, Z\. Xi, et al\. \(2025\) Memory in the age of ai agents\. arXiv preprint arXiv:2512\.13564\.
- \[11\] W\. Huang, W\. Zhang, Y\. Liang, Y\. Bei, Y\. Chen, T\. Feng, X\. Pan, Z\. Tan, Y\. Wang, T\. Wei, et al\. \(2026\) Rethinking memory mechanisms of foundation agents in the second half\. arXiv preprint arXiv:2602\.06052\.
- \[12\] D\. Kumaran, D\. Hassabis, and J\. L\. McClelland \(2016\) What learning systems do intelligent agents need? Complementary learning systems theory updated\. Trends in Cognitive Sciences 20\(7\), pp\. 512–534\. External Links: [Document](https://dx.doi.org/10.1016/j.tics.2016.05.004)\.
- \[13\] Z\. Li, C\. Xi, C\. Li, D\. Chen, B\. Chen, S\. Song, S\. Niu, H\. Wang, J\. Yang, C\. Tang, et al\. \(2025\) Memos: a memory os for ai system\. arXiv preprint arXiv:2507\.03724\.
- \[14\] K\. Lin, C\. Snell, Y\. Wang, C\. Packer, S\. Wooders, I\. Stoica, and J\. E\. Gonzalez \(2025\) Sleep\-time compute: beyond inference scaling at test\-time\. arXiv preprint arXiv:2504\.13171\.
- \[15\] J\. Liu, Y\. Su, P\. Xia, S\. Han, Z\. Zheng, C\. Xie, M\. Ding, and H\. Yao \(2026\) SimpleMem: efficient lifelong memory for llm agents\. arXiv preprint arXiv:2601\.02553\.
- \[16\] A\. Madaan, N\. Tandon, P\. Gupta, S\. Hallinan, L\. Gao, S\. Wiegreffe, U\. Alon, N\. Dziri, S\. Prabhumoye, Y\. Yang, et al\. \(2023\) Self\-refine: iterative refinement with self\-feedback\. Advances in neural information processing systems 36, pp\. 46534–46594\.
- \[17\] J\. L\. McClelland, B\. L\. McNaughton, and A\. K\. Lampinen \(2020\) Integration of new information in memory: new insights from a complementary learning systems perspective\. Philosophical Transactions of the Royal Society B: Biological Sciences 375\(1799\)\.
- \[18\] J\. L\. McClelland, B\. L\. McNaughton, and R\. C\. O’Reilly \(1995\) Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory\. Psychological Review 102\(3\), pp\. 419–457\. External Links: [Document](https://dx.doi.org/10.1037/0033-295X.102.3.419)\.
- \[19\] J\. Nan, W\. Ma, W\. Wu, and Y\. Chen \(2025\) Nemori: self\-organizing agent memory inspired by cognitive science\. arXiv preprint arXiv:2508\.03341\.
- \[20\] S\. Ouyang, J\. Yan, I\. Hsu, Y\. Chen, K\. Jiang, Z\. Wang, R\. Han, L\. T\. Le, S\. Daruki, X\. Tang, V\. Tirumalashetty, G\. Lee, M\. Rofouei, H\. Lin, J\. Han, C\. Lee, and T\. Pfister \(2025\) ReasoningBank: scaling agent self\-evolving with reasoning memory\. arXiv preprint arXiv:2509\.25140\.
- \[21\] C\. Packer, S\. Wooders, K\. Lin, V\. Fang, S\. G\. Patil, I\. Stoica, and J\. E\. Gonzalez \(2023\) MemGPT: towards LLMs as operating systems\. arXiv preprint arXiv:2310\.08560\.
- \[22\] Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. K\. Li, Y\. Wu, and D\. Guo \(2024\) DeepSeekMath: pushing the limits of mathematical reasoning in open language models\. arXiv preprint arXiv:2402\.03300\.
- \[23\] N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\) Reflexion: language agents with verbal reinforcement learning\. Advances in neural information processing systems 36, pp\. 8634–8652\.
- \[24\] M\. Shridhar, X\. Yuan, M\. Côté, Y\. Bisk, A\. Trischler, and M\. Hausknecht \(2021\) ALFWorld: aligning text and embodied environments for interactive learning\. In International Conference on Learning Representations \(ICLR\)\.
- \[25\] Qwen Team \(2025\) Qwen3 technical report\. External Links: 2505\.09388, [Link](https://arxiv.org/abs/2505.09388)\.
- \[26\] G\. Wang, Y\. Xie, Y\. Jiang, A\. Mandlekar, C\. Xiao, Y\. Zhu, L\. Fan, and A\. Anandkumar \(2023\) Voyager: an open\-ended embodied agent with large language models\. arXiv preprint arXiv:2305\.16291\.
- \[27\] L\. Wang, C\. Ma, X\. Feng, Z\. Zhang, H\. Yang, J\. Zhang, Z\. Chen, J\. Tang, X\. Chen, Y\. Lin, W\. X\. Zhao, Z\. Wei, and J\. Wen \(2024\) A survey on large language model based autonomous agents\. Frontiers Comput\. Sci\. 18\(6\), pp\. 186345\. External Links: [Link](https://doi.org/10.1007/s11704-024-40231-1), [Document](https://dx.doi.org/10.1007/S11704-024-40231-1)\.
- \[28\] R\. Wang, P\. Jansen, M\. Côté, and P\. Ammanabrolu \(2022\) ScienceWorld: is your agent smarter than a 5th grader? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)\.
- \[29\] Y\. Wang and X\. Chen \(2025\) MIRIX: multi\-agent memory system for LLM\-based agents\. arXiv preprint arXiv:2507\.07957\.
- \[30\] Y\. Wang, R\. Takanobu, Z\. Liang, Y\. Mao, Y\. Hu, J\. McAuley, and X\. Wu \(2025\) Mem\-$\alpha$: learning memory construction via reinforcement learning\. arXiv preprint arXiv:2509\.25911\.
- \[31\] Z\. Z\. Wang, J\. Mao, D\. Fried, and G\. Neubig \(2024\) Agent workflow memory\. arXiv preprint arXiv:2409\.07429\.
- \[32\] T\. Wei, N\. Sachdeva, B\. Coleman, Z\. He, Y\. Bei, X\. Ning, M\. Ai, Y\. Li, J\. He, E\. H\. Chi, C\. Wang, S\. Chen, F\. Pereira, W\. Kang, and D\. Z\. Cheng \(2025\) Evo\-Memory: benchmarking LLM agent test\-time learning with self\-evolving memory\. arXiv preprint arXiv:2511\.20857\.
- \[33\] Z\. Xi, W\. Chen, X\. Guo, W\. He, Y\. Ding, B\. Hong, M\. Zhang, J\. Wang, S\. Jin, E\. Zhou, R\. Zheng, X\. Fan, X\. Wang, L\. Xiong, Y\. Zhou, W\. Wang, C\. Jiang, Y\. Zou, X\. Liu, Z\. Yin, S\. Dou, R\. Weng, W\. Qin, Y\. Zheng, X\. Qiu, X\. Huang, Q\. Zhang, and T\. Gui \(2025\) The rise and potential of large language model based agents: a survey\. Sci\. China Inf\. Sci\. 68\(2\)\. External Links: [Link](https://doi.org/10.1007/s11432-024-4222-0), [Document](https://dx.doi.org/10.1007/S11432-024-4222-0)\.
- \[34\] W\. Xu, Z\. Liang, K\. Mei, H\. Gao, J\. Tan, and Y\. Zhang \(2025\) A\-MEM: agentic memory for LLM agents\. In Advances in Neural Information Processing Systems \(NeurIPS\)\.
- \[35\] S\. Yan, X\. Yang, Z\. Huang, E\. Nie, Z\. Ding, Z\. Li, X\. Ma, J\. Bi, K\. Kersting, J\. Z\. Pan, H\. Schütze, V\. Tresp, and Y\. Ma \(2025\) Memory\-R1: enhancing large language model agents to manage and utilize memories via reinforcement learning\. arXiv preprint arXiv:2508\.19828\.
- \[36\] K\. Yang, Z\. Chen, X\. He, J\. Jiang, M\. Galley, C\. Wang, J\. Gao, J\. Han, and C\. Zhai \(2026\) PlugMem: a task\-agnostic plugin memory module for llm agents\. arXiv preprint arXiv:2603\.03296\.
- \[37\] Y\. Ye, H\. Jiang, F\. Jiang, T\. Lan, Y\. Du, B\. Fu, X\. Shi, Q\. Jia, L\. Wang, and W\. Luo \(2026\) UMEM: unified memory extraction and management framework for generalizable memory\. arXiv preprint arXiv:2602\.10652\.
- \[38\] H\. Yu, T\. Chen, J\. Feng, J\. Chen, W\. Dai, Q\. Yu, Y\. Zhang, W\. Ma, J\. Liu, M\. Wang, et al\. \(2025\) MemAgent: reshaping long\-context llm with multi\-conv rl\-based memory agent\. arXiv preprint arXiv:2507\.02259\.
- \[39\] D\. Zhang, L\. Chen, S\. Zhang, H\. Xu, Z\. Zhao, and K\. Yu \(2023\) Large language models are semi\-parametric reinforcement learning agents\. Advances in Neural Information Processing Systems 36, pp\. 78227–78239\.
- \[40\] S\. Zhang, J\. Wang, R\. Zhou, J\. Liao, Y\. Feng, Z\. Li, Y\. Zheng, W\. Zhang, Y\. Wen, Z\. Li, et al\. \(2026\) Memrl: self\-evolving agents via runtime reinforcement learning on episodic memory\. arXiv preprint arXiv:2601\.03192\.
- \[41\] Z\. Zhang, Q\. Dai, R\. Li, X\. Bo, X\. Chen, and Z\. Dong \(2025\) Learn to memorize: optimizing llm\-based agents with adaptive memory framework\. arXiv preprint arXiv:2508\.16629\.
- \[42\] A\. Zhao, D\. Huang, Q\. Xu, M\. Lin, Y\. Liu, and G\. Huang \(2024\) ExpeL: LLM agents are experiential learners\. In Proceedings of the AAAI Conference on Artificial Intelligence\.
- \[43\] W\. Zhong, L\. Guo, Q\. Gao, H\. Ye, and Y\. Wang \(2024\) Memorybank: enhancing large language models with long\-term memory\. In Proceedings of the AAAI conference on artificial intelligence, Vol\. 38, pp\. 19724–19731\.
- \[44\] S\. Zhou, F\. F\. Xu, H\. Zhu, X\. Zhou, R\. Lo, A\. Sridhar, X\. Cheng, T\. Ou, Y\. Bisk, D\. Fried, U\. Alon, and G\. Neubig \(2024\) WebArena: a realistic web environment for building autonomous agents\. In The Twelfth International Conference on Learning Representations \(ICLR\)\. External Links: [Link](https://arxiv.org/abs/2307.13854)\.
- \[45\] Z\. Zhou, A\. Qu, Z\. Wu, S\. Kim, A\. Prakash, D\. Rus, J\. Zhao, B\. K\. H\. Low, and P\. P\. Liang \(2025\) MEM1: learning to synergize memory and reasoning for efficient long\-horizon agents\. arXiv preprint arXiv:2506\.15841\.
## Appendix A Limitations
Evaluation scope\. Our evaluation is restricted to three text\-based agent environments sharing an LLM\-mediated interface \(ALFWorld, ScienceWorld, WebArena\)\. We make no claims about transfer to settings with structured state representations, non\-textual observations \[[1](https://arxiv.org/html/2605.20616#bib.bib60)\], or domains where memory must encode visual or multimodal evidence\.
Writer and schema dependence\. The consolidator operates over entries written by a fixed prompted writer with a specific semantic/procedural schema\. We hold these constant to isolate the consolidator, but robustness to alternative writers, schemas, or noisier provenance links remains untested\. Information missed by the writer cannot generally be recovered by the consolidator unless source trajectories make it salient\.
Retrieval\-budget sensitivity\. Our main evaluation uses top\-$K{=}3$ retrieval with a token cap on retrieved entries\. This regime favors compact banks; methods that benefit from larger retrieval budgets may rank differently under looser constraints\. Characterizing how the Pareto frontier shifts with the retrieval budget is left to future work\.
Surrogate training objective\. The local\-bank training objective \(§[4\.3](https://arxiv.org/html/2605.20616#S4.SS3)\) is a surrogate for deployment\-time bank composition\. While our online experiments show that the surrogate transfers, the formal relationship between local\-bank ranking and full\-bank ranking is not characterized\.
Variance\. We report point estimates without seed or task\-order variance\. Several margins in Table [1](https://arxiv.org/html/2605.20616#S5.T1), particularly on ALFWorld, are small enough that variance estimates would be useful for interpretation\.
## Appendix B Experimental Details
### B\.1 Baseline Details
We compare Auto\-Dreamer against ten memory mechanisms for LLM agents, spanning six families\. For each baseline we describe what is written into memory, how it is retrieved, and how it is consolidated\. Unless otherwise noted, we use the original authors’ prompts and hyperparameters, and pair each method with the same backbone task agent for fairness\.
##### No memory\.
A memoryless baseline in which the task agent receives only the current observation and the task instruction\. It serves as a reference point that isolates the contribution of any memory mechanism on top of the underlying policy\.
#### B\.1\.1 Reflective / Insight Extraction
##### Reflexion \[[23](https://arxiv.org/html/2605.20616#bib.bib49)\]\.
After each trajectory, the agent generates a free\-form natural\-language *reflection* that diagnoses failures and proposes corrections\. Reflections are appended to a per\-task buffer and prepended to the prompt on subsequent attempts\. There is no cross\-task generalization or structured retrieval: memory is task\-local and grows monotonically until truncated by the context budget\.
##### ExpeL \[[42](https://arxiv.org/html/2605.20616#bib.bib26)\]\.
ExpeL distills successful and failed trajectories into a small set of high\-level *insights* \(rules of thumb\) and a pool of in\-context exemplars\. At inference time, the most relevant insights and exemplars are retrieved by similarity to the current task and inserted into the prompt\. Compared to Reflexion, ExpeL transfers across tasks and emphasizes compact, generalizable rules over per\-episode reflections\.
#### B\.1\.2 Workflow / Procedural Memory
##### AWM \(Agent Workflow Memory\) \[[31](https://arxiv.org/html/2605.20616#bib.bib28)\]\.
AWM induces reusable *workflows* \(abstracted action templates extracted from successful trajectories\) and stores them in a workflow library\. On a new task, the most relevant workflows are retrieved and injected as procedural guidance for the agent\. Memory growth is tied to the diversity of induced workflows rather than to the number of trajectories, leading to compact stores\.
##### Memp \[[6](https://arxiv.org/html/2605.20616#bib.bib14)\]\.
Memp builds a procedural memory by summarizing trajectories into stepwise *procedures* and indexing them for retrieval\. It supports both addition and revision of procedures as new evidence accumulates, occupying a middle ground between purely episodic stores \(Reflexion\) and abstract workflow libraries \(AWM\)\.
#### B\.1\.3 Structured Stores
##### ReasoningBank \[[20](https://arxiv.org/html/2605.20616#bib.bib25)\]\.
ReasoningBank maintains a structured bank of *reasoning traces* extracted from prior episodes, organized to support semantic retrieval\. Each entry captures the chain\-of\-thought and key decision points of a trajectory, which are surfaced to the agent on related future tasks\. The store grows quickly with experience, trading retrieval coverage for substantial token overhead\.
##### Mem0 \[[2](https://arxiv.org/html/2605.20616#bib.bib27)\]\.
Mem0 is a general\-purpose long\-term memory layer that extracts atomic *facts* and *preferences* from interactions and stores them in a queryable memory graph\. Retrieval combines vector similarity with light structural reasoning over the graph\. We adapt Mem0 to the agentic setting by treating each trajectory as an interaction stream from which memories are distilled\.
#### B\.1\.4 Two\-Timescale Prompted
##### LightMem \[[5](https://arxiv.org/html/2605.20616#bib.bib19)\]\.
LightMem separates memory operations into two timescales: a fast *working memory* that buffers recent context, and a slower *consolidation* step that periodically distills the buffer into long\-term notes\. Both stages are fully prompted, with no learned components\. This yields a clean ablation point for two\-timescale designs that does not rely on reinforcement learning\.
#### B\.1\.5 RL\-Trained Writers
##### Mem\-$\alpha$ \[[30](https://arxiv.org/html/2605.20616#bib.bib29)\]\.
Mem\-$\alpha$ trains the *memory writer* with reinforcement learning, optimizing what to write so that downstream task success is maximized\. The reader/retrieval pipeline is held fixed, isolating the contribution of a learned write policy\. We omit Mem\-$\alpha$ on WebArena: the released checkpoint relies on a small 4B memory\-optimizer backbone whose context window cannot accommodate WebArena’s accessibility\-tree observations together with the memory bank \(see footnote in Table [1](https://arxiv.org/html/2605.20616#S5.T1)\)\.
##### UMEM \[[37](https://arxiv.org/html/2605.20616#bib.bib15)\]\.
UMEM \(Unified Memory\) similarly trains the writer with RL but unifies episodic, procedural, and semantic memory into a single store with a learned update operator\. It represents the strongest RL\-trained baseline in our comparison\. As with Mem\-$\alpha$, we omit UMEM on WebArena due to context window constraints\.
#### B\.1\.6 Baseline Implementation
All baselines share the Qwen2\.5\-3B last\-token\-hidden embedder \(2048\-d\) and FAISS IndexFlatIP retrieval over L2\-normalized vectors, exposing memory through a common read\(\)/write\(\)/reflect\(\) interface\. Table [3](https://arxiv.org/html/2605.20616#A2.T3) lists what we changed and what we preserved verbatim from each method’s released code; commit hashes are recorded in the top\-of\-file docstring of every baselines/\*/memory\.py\. Hyperparameters are gathered in Table [4](https://arxiv.org/html/2605.20616#A2.T4), and verbatim memory\-construction prompts in Appendix [G](https://arxiv.org/html/2605.20616#A7)\.
Table 3: Per\-baseline modifications relative to each method’s released code\.
Table 4: Retrieval and generation hyperparameters per baseline\. “Budget” is the per\-step LLM\-call count for write/read/reflect\. “T” is the sampling temperature\.
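As a small illustration of the shared retrieval setup, the sketch below shows why an exhaustive inner\-product search over L2\-normalized vectors ranks entries by cosine similarity\. It is pure Python and does not call FAISS; the helper names `l2_normalize` and `top_k_inner_product` and the toy 3\-d vectors are assumptions made for this example, not the paper's code\.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def top_k_inner_product(bank, query, k):
    """Exhaustive search: score every stored vector by inner product, keep the k best."""
    scored = [
        (sum(a * b for a, b in zip(vec, query)), idx)
        for idx, vec in enumerate(bank)
    ]
    # Highest score first; ties broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]


# Toy 3-d vectors stand in for the 2048-d embeddings (hypothetical data).
bank = [l2_normalize(v) for v in ([1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [3.0, 4.0, 0.0])]
query = l2_normalize([3.0, 4.0, 0.0])
hits = top_k_inner_product(bank, query, k=2)
```

Because every vector has unit norm, each inner product equals the cosine of the angle between query and entry, so vector length does not affect the ranking\.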
### B\.2 Evaluation Settings
##### Evaluation protocol\.
We adopt a *prequential \(online streaming\)* protocol implemented in online\_memory\.eval\.run\_online: held\-out tasks are presented to the agent one at a time in a fixed seeded order; before task $t$, the agent retrieves from the bank built from tasks $1,\dots,t-1$, and after the task its trajectory is fed to the writer \(and, on cadence, to the dreamer\) so that future tasks see whatever new entries this trajectory produced\. No task is ever replayed\.
##### Consolidation cadence\.
Methods with offline consolidation invoke their updater at a fixed cadence of $k$ completed sessions\. We use $k{=}10$ for ScienceWorld, $k{=}8$ for ALFWorld, and $k{=}5$ for WebArena\. The same cadence is used for all offline\-consolidation methods within each environment\.
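The protocol and cadence described above can be sketched as a single loop\. This is a minimal illustration under stated assumptions: `run_prequential` and the four callables are hypothetical names, the stubs replace the real agent, writer, and dreamer, and the consolidator here rewrites the whole bank rather than a selected working region\.

```python
def run_prequential(tasks, retrieve, solve, write, consolidate, cadence):
    """Stream tasks once, in order; memory for task t comes only from tasks 1..t-1."""
    bank, log = [], []
    for t, task in enumerate(tasks, start=1):
        memories = retrieve(bank, task)       # read before acting
        trajectory = solve(task, memories)    # the task is attempted exactly once
        bank.extend(write(trajectory))        # fast per-session acquisition
        fired = t % cadence == 0
        if fired:
            bank = consolidate(bank)          # slow cross-session consolidation
        log.append({"t": t, "retrieved": len(memories),
                    "fired": fired, "bank_size": len(bank)})
    return bank, log


# Hypothetical stubs: three recurring task types, a deduplicating "consolidator".
tasks = [f"task-{i % 3}" for i in range(25)]
bank, log = run_prequential(
    tasks,
    retrieve=lambda bank, task: [entry for entry in bank if task in entry][:3],
    solve=lambda task, memories: f"trajectory for {task}",
    write=lambda trajectory: [f"note from {trajectory}"],
    consolidate=lambda bank: sorted(set(bank)),
    cadence=10,
)
firings = [row["t"] for row in log if row["fired"]]
```

With a cadence of 10 over 25 tasks, the consolidator fires after tasks 10 and 20, and the bank shrinks at each firing before growing again\.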
##### Environments and task pools\.
We evaluate on three long\-horizon agent environments through a single EnvAdapter interface so that the agent loop, retrieval pipeline, and memory store are byte\-identical across domains \(Table [5](https://arxiv.org/html/2605.20616#A2.T5)\)\. Tasks are deterministically seeded with \-\-seed 42; ALFWorld and ScienceWorld additionally set \-\-shuffle\-tasks to break the default task\_type $\to$ variation clustering and load a frozen episode list via \-\-episodes\-path episodes\.jsonl\.
Table 5: Task pools used in online evaluation\. ALFWorld and ScienceWorld sub\-pools are produced by sampling on task\_type; WebArena combines the shopping, shopping\_admin, and gitlab task families\.
##### Models served\.
Open\-weight roles \(task agent, writer when Qwen, dreamer, embedder\) are served as independent SGLang endpoints on the same evaluation node \(see Hardware below\); the Gemini\-3\-flash\-preview task agent and Gemini\-3\.1\-flash\-lite\-preview writer used on WebArena are accessed via the Google AI API\. All endpoints expose the OpenAI\-compatible /v1/chat/completions interface\. Decoding uses temperature $0.7$ and top\-$p = 0.9$, with enable\_thinking=False on Qwen3 endpoints\.
##### Hardware\.
Training uses 8 NVIDIA H100 \(80 GB\) GPUs\. Evaluation runs use one node with 4 NVIDIA GH200 \(96 GB\) GPUs serving all model endpoints \(task agent, writer, dreamer, embedder\)\.
##### Memory store and retrieval\.
Memory is persisted in LanceDB partitioned by run\_id, with 2048\-dim Qwen2\.5\-3B embeddings\. At task start we issue a single retrieval against the task instruction and return up to top\-$k=3$ entries \(hybrid kNN \+ salience rerank, 1500\-token budget\); during the episode we additionally refresh memory every 8 environment steps by re\-querying with $\text{instruction}\,\|\,\text{last-K actions}\,\|\,\text{current observation}$ and appending the top\-1 retrieved entry to the user message\. Retrieved entries are injected as raw INSERT\_\* blocks inside a === Memory from past experience === … === End Memory === panel appended to the env\-native system prompt\. This is the same format used at training time, so no method gains from prompt\-format mismatch\.
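The retrieval budget and refresh schedule can be illustrated with a short sketch\. Several details are assumptions rather than the paper's implementation: whitespace splitting stands in for the real tokenizer, over\-budget entries are skipped instead of ending the scan, the value of `last_k` is not specified in the text, and the ranking \(kNN plus salience rerank\) is taken as already computed\. All function names are hypothetical\.

```python
def select_under_budget(ranked_entries, top_k=3, token_budget=1500,
                        count_tokens=lambda text: len(text.split())):
    """Walk a ranked list and keep up to top_k entries that fit the token budget."""
    chosen, used = [], 0
    for entry in ranked_entries:
        if len(chosen) == top_k:
            break
        cost = count_tokens(entry)
        if used + cost > token_budget:
            continue  # assumption: skip entries that do not fit, keep scanning
        chosen.append(entry)
        used += cost
    return chosen


def should_refresh(step, every=8):
    """True on environment steps 8, 16, 24, ... (step 0 is the task-start retrieval)."""
    return step > 0 and step % every == 0


def build_refresh_query(instruction, actions, observation, last_k=3):
    """Concatenate instruction, the last K actions, and the current observation."""
    recent = " ; ".join(actions[-last_k:])
    return " || ".join([instruction, recent, observation])


ranked = ["a b c", "x " * 2000, "e f", "g", "h"]
picked = select_under_budget(ranked, top_k=3, token_budget=10)
query = build_refresh_query("task", ["a1", "a2", "a3", "a4"], "obs", last_k=2)
```

At a refresh step the same selection would run with `top_k=1`, matching the single entry appended to the user message\.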
##### Logging and metrics\.
For every task we record success, final environment score, episode length, retrieved entry IDs and token count, writer/dreamer events, and bank size, streamed to per\_task\.jsonl, trajectories\.jsonl, and dreamer\_calls\.jsonl; the aggregate summary\.json reports success rate, mean final score, end\-of\-run active and retired bank sizes, total wall\-clock time, and per\-role LLM call and token counts\. Unless noted, we report success rate and mean final score over the full task stream and plot bank size and dreamer firings indexed by task position $t$ to expose prequential learning dynamics rather than only end\-of\-stream aggregates\.
### B\.3 Training Hyperparameters
Table [6](https://arxiv.org/html/2605.20616#A2.T6) reports the full set of training hyperparameters for the ScienceWorld GRPO run, including model and optimization settings, rollout and generation parameters, environment and episode configuration, and reward shaping coefficients\. The trained consolidator is applied to all three evaluation domains without further updates\.
Table 6: Training hyperparameters for the ScienceWorld GRPO run\.
## Appendix C Artifact Details
### C\.1 Model License
- Gemini\-3\.1\-flash\-lite\-preview\. License: Proprietary
- Gemini\-3\-flash\-preview\. License: Proprietary
- Qwen3\-14B\. License: Apache 2\.0
- Qwen3\.5\-9B\. License: Apache 2\.0
### C\.2 Software Versions
### C\.3 Dataset Statistics
We report the dataset statistics for the three interactive benchmark environments used in our experiments: ALFWorld, ScienceWorld, and WebArena\. ScienceWorld is used for both training\-data construction and held\-out evaluation; ALFWorld and WebArena are held\-out only and not used during training\. All environments’ statistics follow the data\-processing and environment configurations of the original papers\. Table [7](https://arxiv.org/html/2605.20616#A3.T7) summarizes the train and test splits; Table [5](https://arxiv.org/html/2605.20616#A2.T5) reports the evaluation pool\.
Table 7: Train/test split statistics for the three interactive benchmark environments, using the original dataset releases\. ALFWorld provides 3,553 training games and a held\-out seen \+ unseen test set of 274 games across 6 compositional household task types \[[24](https://arxiv.org/html/2605.20616#bib.bib6)\]\. ScienceWorld contains 30 task types with 7,207 parametric variations in total, split 50% / 25% / 25% into train / dev / test sets \[[28](https://arxiv.org/html/2605.20616#bib.bib7)\]; we report train and test only\. WebArena is held\-out only and not used during training; it consists of 812 tasks instantiated from 241 intent templates across five self\-hosted sites \(Shopping, Shopping Admin/CMS, Reddit, GitLab, and Maps\) \[[44](https://arxiv.org/html/2605.20616#bib.bib10)\]\.
##### WebArena\.
WebArena consists of long\-horizon web\-navigation tasks served via a self\-hosted, sandboxed deployment\. We use 117 held\-out tasks sampled from three task families: shopping \(e\-commerce product search and checkout\), shopping\_admin \(Magento admin operations\), and gitlab \(GitLab repository management\)\. No WebArena trajectories are used for training; the consolidator trained on ScienceWorld is applied to WebArena without any additional updates, testing the cross\-domain transfer claim of §[5\.2](https://arxiv.org/html/2605.20616#S5.SS2)\.
## Appendix D Impact Statement
Auto\-Dreamer is foundational research on long\-term memory mechanisms for language agents\. It does not introduce a new deployed system, a new dataset of human subjects, or a new generative capability tied to a specific application domain\. All experiments are conducted in simulated agentic environments \(ALFWorld, ScienceWorld, WebArena\) that do not involve personal data or interaction with real users\. As such, the immediate societal impact of this specific work is limited\.
## Appendix E Online Memory\-Construction Prompts
This appendix lists the system prompts used by the three roles in our online memory pipeline: the per\-trace *writer*, the cross\-trace *auto\-dreamer* synthesizer, and the environment *task agent*\. We report verbatim text from opentinker/memory\_training/ and opentinker/environment/ after stripping a small number of defensive phrases that we found contributed nothing to performance \(these are clearly marked below\)\.
##### Shared task\-agent prompt across baselines\.
The task\-agent prompt \(Sec\. [E\.3](https://arxiv.org/html/2605.20616#A5.SS3)\) is held *identical* across every baseline \(no\_memory, writer\_only, reflexion, expel, awm, memp, reasoningbank, mem0, lightmem, auto\_dreamer, auto\_dreamer\_rl\)\. The only thing that differs across baselines is the contents of the === Memory from past experience === block injected at the end of the system prompt; the surrounding agent prompt is invariant\. Memory\-aware variants of the agent prompt \(e\.g\. “treat memory as a hint, not a recipe”\) were tested in ablations and found to be net\-negative on the final aggregate score, so the reported runs use the un\-instrumented agent prompt\.
### E\.1 Writer Prompts
The writer reads ONE episode trace \(marked Success or Fail\) and emits zero or more structured INSERT\_SEMANTIC or INSERT\_PROCEDURAL blocks\. Output is parsed verbatim into the bank\.
#### E\.1\.1 ALFWorld Writer
Success path\.
You are a Memory Agent\. Read an episode trace from a task agent and
distill reusable knowledge into structured memory entries\.
The task agent operates in ALFWorld \-\- a text\-based household
environment where it must complete tasks like "put a clean apple
in the fridge" by issuing text commands \(go to, take, clean, put,
etc\.\)\.
You will receive ONE episode trace, marked SUCCESS or FAIL\.
Your output is injected into a task agent's system prompt to help
it succeed on NEW, unseen tasks of the same type\.
OUTPUT FORMAT
You may emit one or more entries\. Each entry must be one of:
INSERT\_SEMANTIC
name: <short\_id\>
summary: <one\-line description\>
details: <full information\>
END
INSERT\_PROCEDURAL
name: <short\_id\>
type: <workflow\|guide\>
summary: <short description\>
steps: \["step 1", "step 2", \.\.\.\]
END
Nothing useful: NO\_UPDATE
FORMAT RULES
\- Use exact ALFWorld action names \(go to, take, move, open, close,
use, examine, heat, cool, clean, slice, inventory, look\)\.
\- Output entries then STOP\.
Failure path\.
\[same header as Success\]
A failed trace is useful for distilling the failure mode \-\- the
specific unproductive action pattern observed\.
INSERT\_SEMANTIC
name: <short\_id\>
summary: <failure mode actually observed\>
details: concrete description of the failure mode in this trace,
quoting verbatim observations where possible\.
END
If the trace is too short or noisy: NO\_UPDATE
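To make the block format concrete, here is a minimal sketch of how writer output in this schema could be parsed into entries\. The parser is not taken from the released code: `parse_writer_output`, the regular expressions, and the `SAMPLE` text are hypothetical, and the sketch assumes that each field starts on its own line and that `steps` is a JSON\-style list\.

```python
import json
import re

# One block = a header line, field lines, then a line containing only END.
BLOCK_RE = re.compile(
    r"^(INSERT_SEMANTIC|INSERT_PROCEDURAL)[ \t]*\n(.*?)^END[ \t]*$",
    re.DOTALL | re.MULTILINE,
)
FIELD_RE = re.compile(r"^(name|type|summary|details|steps):\s*(.*)$")


def parse_writer_output(text):
    """Return a list of entry dicts; NO_UPDATE or malformed text yields []."""
    entries = []
    for kind, body in BLOCK_RE.findall(text):
        entry = {"kind": kind[len("INSERT_"):].lower()}
        current = None
        for line in body.splitlines():
            match = FIELD_RE.match(line.strip())
            if match:
                current = match.group(1)
                entry[current] = match.group(2).strip()
            elif current and line.strip():
                # Continuation of a multi-line field such as `details`.
                entry[current] += " " + line.strip()
        if "steps" in entry:
            try:
                entry["steps"] = json.loads(entry["steps"])
            except json.JSONDecodeError:
                pass  # keep the raw string if it is not valid JSON
        entries.append(entry)
    return entries


SAMPLE = """INSERT_SEMANTIC
name: fridge_closed
summary: The fridge starts closed.
details: Open the fridge before
putting objects inside.
END
INSERT_PROCEDURAL
name: clean_then_put
type: workflow
summary: Clean an object and store it.
steps: ["take apple 1", "clean apple 1 with sinkbasin 1"]
END
"""
entries = parse_writer_output(SAMPLE)
```

A reply consisting only of `NO_UPDATE` contains no block header, so the same function returns an empty list without a special case\.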
#### E\.1\.2 ScienceWorld Writer
Success path\.
You are a Memory Agent\. Read an episode trace and distill reusable
knowledge into structured memory entries\.
The task agent operates in ScienceWorld \-\- a text\-based scientific
reasoning environment with 30 distinct task types \(measure\-melting\-
point, test\-conductivity, grow\-plant, chemistry\-mix, find\-animal,
mendelian\-genetics, lifespan\-\*, inclined\-plane\-\*, etc\.\)\. Each task
unfolds over multiple rooms \(kitchen, workshop, greenhouse, art
studio, living room, bathroom, outside, foundry, bedroom, hallway\)\.
The agent issues commands from templates such as:
teleport to ROOM    open OBJ    pick up OBJ
look at OBJ    look in OBJ    put down OBJ
move OBJ to OBJ    pour OBJ in OBJ    dunk OBJ in OBJ
mix OBJ    eat OBJ    use OBJ on OBJ
activate OBJ    deactivate OBJ    flush OBJ
connect OBJ to OBJ    disconnect OBJ    read OBJ
focus on OBJ    wait    wait1
Your output is injected into a task agent's system prompt to help
it succeed on NEW, unseen variations of the SAME task type\.
OUTPUT FORMAT
\[INSERT\_SEMANTIC / INSERT\_PROCEDURAL blocks as in ALFWorld\]
FORMAT RULES
\- Use exact SciWorld action templates\.
\- Output entries then STOP\.
#### E\.1\.3 WebArena Writer
Success path\.
You are a Memory Agent\. Read an episode trace from a web\-browsing
task agent and distill reusable knowledge into structured memory
entries\.
The task agent operates a real browser \(Chromium, headless\) via a
high\-level action API\. Each turn it receives the page's accessibility
tree \(AXTree\) as the observation and emits one action like
\`click\('123'\)\`, \`fill\('42', 'value'\)\`, \`scroll\(0, 200\)\`,
\`keyboard\_press\('Enter'\)\`, \`goto\(url\)\`, or \`send\_msg\_to\_user\(text\)\`\.
Your output is injected into a task agent's system prompt to help
it succeed on NEW tasks involving similar widgets or workflows\.
OUTPUT FORMAT
\[INSERT\_SEMANTIC / INSERT\_PROCEDURAL blocks as in ALFWorld; semantic
entries describe widget knowledge, procedural describe action sequences\]
FORMAT RULES
\- Refer to elements by their VISIBLE LABEL or ROLE as it appears in
the AXTree, NOT by element bid numbers \(bids change every page load\)\.
\- Use exact action names: click, fill, scroll, keyboard\_press, goto,
select\_option, hover, dblclick, drag\_and\_drop, send\_msg\_to\_user\.
\- Output entries then STOP\.
Failure path\. On WebArena we observed that behavioral failure entries \(“submitted before verifying”, “didn’t wait for the page to load”\) generalize net\-negative; the failure prompt explicitly excludes them and only accepts concrete content/navigation mistakes:
A failed trace is useful ONLY when it shows a CONCRETE FACTUAL MISTAKE
the agent made \-\- a wrong field value, wrong navigation target, wrong
inferred number, wrong sub\-page, wrong filter selection\.
INSERT\_SEMANTIC
name: <short\_id\>
summary: <one\-line description of the specific content mistake\>
details: which exact value/menu/page/number was wrong \(e\.g\.
"agent picked Period 'Month' but goal asked for 'Year'", or
"agent navigated to Reports \> Reviews when goal needed Reports \> Bestsellers"\)\.
Quote the wrong action verbatim\.
END
If the only failure\-pattern is behavioural caution: NO\_UPDATE
### E\.2 Auto\-Dreamer Synthesizer Prompt
The same prompt is used across all environments\.
You are a Memory Bank Synthesizer\. Read a reference bank of memory
entries from past task sessions and create a compact, high\-quality
output bank of synthesized entries\. Only your synthesized entries
will be shown to the task agent\.
The task agent uses these memories to succeed on NEW, unseen tasks\.
Capture transferable knowledge \-\- patterns, procedures, and insights
that generalize across task instances\. Entry summaries are used as
retrieval keys, so write clear, descriptive summaries\.
Call ONE tool per turn\. When satisfied, call \`terminate\`\.
AVAILABLE TOOLS
Navigation \(reference bank, read\-only\):
search\_memory\(query, k=5\)
check\_memory\(ids=\[\.\.\.\]\) \(up to 30 ids\)
get\_source\_trace\(id\)
Synthesis \(output bank\):
synthesize\(source\_ids, type, name, summary, details?, steps?\)
Control: terminate\(\)
SYNTHESIS GUIDELINES
\- Survey the reference bank broadly before synthesizing\.
\- Each entry should capture distinct, non\-redundant knowledge\.
\- Look for patterns: shared procedures, recurring constraints,
common strategies\. Generalize when multiple entries support it\.
\- When entries disagree, resolve by frequency or note the conditions
under which each applies; use get\_source\_trace to ground claims\.
\- Both procedural and semantic entries are valuable\.
\- Prefer actionable rules \(priority "do X before Y", conditional
"if Z, skip W"\) when well\-supported\.
GROUNDING RULES
\- Every synthesized entry must cite source\_ids drawn from the
reference bank\.
\- Preserve concrete details that carry warning value: the exact
forbidden command, the invalid action syntax the env rejected,
the wrong object that ended an episode early\.
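The tool contract in this prompt \(a read\-only reference bank, a separate output bank, and mandatory source citations\) can be sketched as a small session object\. This is an illustrative assumption, not the released tool server: `ConsolidationSession` is a hypothetical name, keyword overlap stands in for embedding search, `get_source_trace` is omitted, and the prompt's `type` argument is renamed `entry_type` to avoid shadowing the Python builtin\.

```python
class ConsolidationSession:
    """Toy tool backend: reads never mutate the reference bank, writes go elsewhere."""

    MAX_CHECK_IDS = 30

    def __init__(self, reference_bank):
        # Copy on entry so the caller's bank is treated as read-only evidence.
        self._reference = {key: dict(value) for key, value in reference_bank.items()}
        self.output_bank = []
        self.done = False

    def search_memory(self, query, k=5):
        """Rank entries by word overlap with the query (stand-in for kNN search)."""
        words = set(query.lower().split())
        scored = []
        for entry_id, entry in self._reference.items():
            overlap = len(words & set(entry["summary"].lower().split()))
            if overlap:
                scored.append((-overlap, entry_id))
        return [entry_id for _, entry_id in sorted(scored)[:k]]

    def check_memory(self, ids):
        if len(ids) > self.MAX_CHECK_IDS:
            raise ValueError("check_memory accepts up to 30 ids")
        return [dict(self._reference[i]) for i in ids if i in self._reference]

    def synthesize(self, source_ids, entry_type, name, summary, details=None, steps=None):
        """Append a new entry; every entry must cite ids from the reference bank."""
        unknown = [i for i in source_ids if i not in self._reference]
        if not source_ids or unknown:
            raise ValueError(f"ungrounded entry, unknown or missing source_ids: {unknown}")
        self.output_bank.append({
            "source_ids": list(source_ids), "type": entry_type, "name": name,
            "summary": summary, "details": details, "steps": steps,
        })

    def terminate(self):
        self.done = True
        return self.output_bank


reference = {
    "m1": {"summary": "open fridge before putting apple"},
    "m2": {"summary": "fridge is closed at start"},
    "m3": {"summary": "heat egg with microwave"},
}
session = ConsolidationSession(reference)
found = session.search_memory("fridge closed")
session.synthesize(found, "procedural", "open_fridge_first",
                   "Open the fridge before placing items inside.",
                   steps=["open fridge 1", "move item to fridge 1"])
result = session.terminate()
```

Rejecting entries with unknown `source_ids` at write time is one simple way to enforce the grounding rule mechanically rather than relying on the prompt alone\.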
### E\.3 Task\-Agent Prompts \(shared across all baselines\)
The task agent is the policy that interacts with the environment\. All baselines use the prompt below verbatim; the only difference between no\_memory and the memory baselines is the presence of a === Memory from past experience === block \(followed by the retrieved entries\) appended to the system prompt at task start\.
#### E\.3\.1 ALFWorld Task Agent
You are the Task Agent in an ALFWorld environment\.
Your goal is to complete household tasks by executing actions\.
CRITICAL RULE: Every observation includes "=== Available Actions ==="\.
You MUST pick EXACTLY ONE action from that list, word\-for\-word\.
Any command not on the list will fail\.
Memory \(if provided below\) describes high\-level STRATEGIES in natural
language\. It is NOT a list of executable commands\. Use it to decide
WHICH action from the Available Actions list to pick, but always
output a command copied verbatim from the list\.
Output ONLY a single action command\.
Example response: go to desk 1
#### E\.3\.2 ScienceWorld Task Agent
You are a Task Agent playing ScienceWorld \-\- a text\-based scientific
reasoning environment\. Your goal is to complete the task described
to you by issuing text commands, one per turn\.
NAVIGATION \-\- USE TELEPORT\. Rooms are connected by doors that may be
closed\. Instead of "go door to kitchen" \(which often fails\), always
prefer:
teleport to kitchen / teleport to workshop / teleport to greenhouse
teleport to hallway / teleport to bedroom / teleport to art studio
teleport to living room / teleport to bathroom / teleport to outside
teleport to foundry
Teleport always succeeds and is always available\.
ACTION TEMPLATES \(any action emitted must match one\):
teleport to ROOM    go OBJ    look around
look at OBJ    look in OBJ    inventory
open OBJ    close OBJ    pick up OBJ
put down OBJ    move OBJ to OBJ    pour OBJ in OBJ
dunk OBJ in OBJ    mix OBJ    eat OBJ
read OBJ    use OBJ on OBJ
activate OBJ    deactivate OBJ    flush OBJ
connect OBJ to OBJ    disconnect OBJ
focus on OBJ    wait    wait1
OBJECTS: each turn shows a "Visible objects" list \-\- those are the
base names the env recognises right now\. Compound names like
"substance in metal pot" are also accepted\. When an object is hidden
inside a closed container, first \`open\` or \`look in\` to reveal it\.
DISAMBIGUATION: when an action targets an object with multiple
instances, the env replies "Ambiguous request: please enter the
number\.\.\." \-\- emit the corresponding number on the next turn\.
#### E\.3\.3 WebArena Task Agent
You are a web\-browsing agent\. Each turn, observe the AXTree and emit
EXACTLY ONE action in a fenced block:
\`\`\`action
click\('123'\)
\`\`\`
Use \`bid\` strings \(e\.g\. \[42\]\) from the AXTree to refer to elements\.
Emit \`send\_msg\_to\_user\('answer'\)\` when done\.
Action space:
\{action\_desc\} % auto\-injected: 12 BrowserGym high\-level actions
% \(click, fill, scroll, keyboard\_press, goto,
% select\_option, hover, dblclick, drag\_and\_drop,
% noop, send\_msg\_to\_user\)
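The "EXACTLY ONE action in a fenced block" rule implies a parsing step on the agent's reply\. The sketch below shows one way such a step could look; `extract_action` and the policy of rejecting replies with zero or several blocks are assumptions for illustration, not the behaviour of the released harness\.

```python
import re

# Build the fence marker programmatically so this example nests cleanly in markdown.
FENCE = "`" * 3
ACTION_BLOCK_RE = re.compile(
    re.escape(FENCE) + r"action[ \t]*\n(.*?)\n[ \t]*" + re.escape(FENCE),
    re.DOTALL,
)


def extract_action(reply):
    """Return the single action string, or None if the reply breaks the format."""
    blocks = ACTION_BLOCK_RE.findall(reply)
    if len(blocks) != 1:
        return None  # assumption: zero or multiple blocks count as a format error
    action = blocks[0].strip()
    return action or None


good_reply = "I will open the menu.\n" + FENCE + "action\nclick('123')\n" + FENCE
double_reply = good_reply + "\n" + good_reply
action = extract_action(good_reply)
```

Returning `None` instead of guessing lets the caller decide whether to re\-prompt the agent or count the step as a failed action\.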
## Appendix F Usage of LLMs
We used LLMs as a writing assistant to help edit parts of the paper\. We also used Codex and Claude Code to speed up coding\. All AI\-generated writing and code were manually checked and modified\. There is no fully AI\-generated content in the paper\.
## Appendix G Additional Per\-Baseline Memory\-Construction Prompts
Reproduced verbatim from baselines/\*/memory\.py, which re\-read each baseline’s reference repo at the commit hash recorded in the file’s top docstring\. Long verbatim few\-shot examples are replaced by a \[…\] marker citing the precise file:line\.
##### ReasoningBank\.
Successful\-trajectory extraction \(Tab\.[8](https://arxiv.org/html/2605.20616#A7.T8)\), failed\-trajectory extraction \(Tab\.[9](https://arxiv.org/html/2605.20616#A7.T9)\), and the memory\-injection banner prepended to retrieved items at inference time \(Tab\.[10](https://arxiv.org/html/2605.20616#A7.T10)\)\.
Table 8: ReasoningBank \- successful\-trajectory extraction prompt
You are an expert in household environment navigation\. You will be given a user query, the corresponding trajectory that represents how an agent successfully accomplished the task\.
\#\# Guidelines
You need to extract and summarize useful insights in the format of memory items based on the agent’s successful trajectory\. The goal of summarized memory items is to be helpful and generalizable for future similar tasks\.
\#\# Important notes
\- You must first think why the trajectory is successful, and then summarize the insights\.
\- You can extract *at most 3* memory items from the trajectory\.
\- You must not repeat similar or overlapping items\.
\- Prefer concrete, actionable procedures over abstract principles\. Do not embed specific product names, queries, or literal string contents from the task\.
\#\# Output Format
Your output must strictly follow the Markdown format shown below:
\# Memory Item i
\#\# Title <the title of the memory item\>
\#\# Description <one sentence summary describing when or when NOT to use the memory item\>
\#\# Content <1\-3 sentences describing the insights learned to successfully accomplishing similar tasks in the future\>
Table 9: ReasoningBank \- failed\-trajectory extraction prompt
You are an expert in household environment navigation\. You will be given a user query, the corresponding trajectory that represents how an agent attempted to resolve the task but failed\.
\#\# Guidelines
You need to extract and summarize useful insights in the format of memory items based on the agent’s failed trajectory\. The goal of summarized memory items is to be helpful and generalizable for future similar tasks\.
\#\# Important notes
\- You must first reflect and think why the trajectory failed, and then summarize what lessons you have learned or strategies to prevent the failure in the future\.
\- You can extract *at most 3* memory items from the trajectory\.
\- You must not repeat similar or overlapping items\.
\- Prefer concrete, actionable recovery procedures over abstract principles\. Do not embed specific product names, queries, or literal string contents from the task\.
\#\# Output Format
\(same Markdown schema as Table [8](https://arxiv.org/html/2605.20616#A7.T8), with “successfully accomplishing” replaced by “avoid such failures and successfully accomplishing”\)\.
Table 10: ReasoningBank \- memory\-injection banner prepended to retrieved memory items at inference time
Below are some memory items that I accumulated from past interaction from the environment that may be helpful to solve the task\. You can use it when you feel it’s relevant\. In each step, please first explicitly discuss if you want to use each memory item or not, and then take action\.
##### ExpeL\.
A rule\-operation grammar \(Tab\. [11](https://arxiv.org/html/2605.20616#A7.T11)\) appended to the compare\-critique template \(Tab\. [12](https://arxiv.org/html/2605.20616#A7.T12), fired on each success/failure pair\) and the all\-success template \(Tab\. [13](https://arxiv.org/html/2605.20616#A7.T13), fired every 8 successes\)\.
Table 11: ExpeL \- rule\-operation format template
<OPERATION\> <RULE NUMBER\>: <RULE\>
The available operations are: AGREE \(if the existing rule is strongly relevant for the task\), REMOVE \(if one existing rule is contradictory or similar/duplicated to other existing rules\), EDIT \(if any existing rule is not general enough or can be enhanced, rewrite and improve it\), ADD \(add new rules that are very different from existing rules and relevant for other tasks\)\. Each needs to CLOSELY follow their corresponding formatting below \(any existing rule not edited, not agreed, nor removed is considered copied\):
AGREE <EXISTING RULE NUMBER\>: <EXISTING RULE\>
REMOVE <EXISTING RULE NUMBER\>: <EXISTING RULE\>
EDIT <EXISTING RULE NUMBER\>: <NEW MODIFIED RULE\>
ADD <NEW RULE NUMBER\>: <NEW RULE\>
Do not mention the trials in the rules because all the rules should be GENERALLY APPLICABLE\. Each rule should be concise and easy to follow\. Any operation can be used MULTIPLE times\. Do at most 4 operations and each existing rule can only get a maximum of 1 operation\.
Table 12: ExpeL \- compare\-critique prompt
\{instruction\}
Here are the two previous trials to compare and critique:
TRIAL TASK: \{task\}
SUCCESSFUL TRIAL: \{success\_history\}
FAILED TRIAL: \{fail\_history\}
Here are the EXISTING RULES: \{existing\_rules\}
By examining and contrasting to the successful trial, and the list of existing rules, you can perform the following operations: add, edit, remove, or agree so that the new list of rules is GENERAL and HIGH LEVEL critiques of the failed trial or proposed way of Thought so they can be used to avoid similar failures when encountered with different questions in the future\. Have an emphasis on critiquing how to perform better Thought and Action\. Follow the below format:
\[Rule\-operation format from Table [11](https://arxiv.org/html/2605.20616#A7.T11) appended verbatim\.\]
Table 13: ExpeL \- all\-success critique prompt
\{instruction\}
Here are the trials: \{success\_history\}
Here are the EXISTING RULES: \{existing\_rules\}
By examining the successful trials, and the list of existing rules, you can perform the following operations: add, edit, remove, or agree so that the new list of rules are general and high level insights of the successful trials or proposed way of Thought so they can be used as helpful tips to different tasks in the future\. Have an emphasis on tips that help the agent perform better Thought and Action\. Follow the below format:
\[Rule\-operation format from Table [11](https://arxiv.org/html/2605.20616#A7.T11) appended verbatim\.\]
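The rule\-operation grammar of Table 11 can be turned into a small interpreter\. The sketch below is an assumption\-laden illustration rather than ExpeL's released code: `apply_rule_operations` is a hypothetical name, ADD always takes the next free rule number instead of trusting the number in the reply, and any rule weighting or vote counting is left out\. It enforces the two limits stated in the template \(at most 4 operations, at most 1 operation per existing rule\)\.

```python
import re

OP_RE = re.compile(r"^(AGREE|REMOVE|EDIT|ADD)\s+(\d+):\s*(.+)$")


def apply_rule_operations(rules, response, max_ops=4):
    """Apply AGREE/REMOVE/EDIT/ADD lines to a {number: text} rule dict (returns a copy)."""
    updated = dict(rules)
    touched = set()   # existing rules that already received an operation
    applied = 0
    for line in response.splitlines():
        match = OP_RE.match(line.strip())
        if not match or applied >= max_ops:
            continue
        op, number, text = match.group(1), int(match.group(2)), match.group(3).strip()
        if op == "ADD":
            # Assumption: ignore the proposed number and use the next free one.
            updated[max(updated, default=0) + 1] = text
        else:
            if number not in updated or number in touched:
                continue  # unknown rule, or a second operation on the same rule
            touched.add(number)
            if op == "REMOVE":
                del updated[number]
            elif op == "EDIT":
                updated[number] = text
            # AGREE leaves the rule text unchanged.
        applied += 1
    return updated


rules = {1: "a", 2: "b", 3: "c"}
response = "\n".join([
    "AGREE 1: a",
    "REMOVE 2: b",
    "EDIT 3: c improved",
    "ADD 4: d",
    "EDIT 1: one operation too many",
])
new_rules = apply_rule_operations(rules, response)
```

Rules that receive no operation are carried over unchanged, which matches the template's statement that untouched rules are "considered copied"\.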
##### LightMem\.
STM→LTM extraction \(Tab\. [14](https://arxiv.org/html/2605.20616#A7.T14)\) and offline UPDATE/DELETE/IGNORE consolidation \(Tab\. [15](https://arxiv.org/html/2605.20616#A7.T15)\)\.

Table 14: LightMem STM→LTM extraction prompt

You are a Personal Information Extractor\. Your task is to extract all possible facts or information about the user from a conversation, where the dialogue is organized into topic segments separated by markers like:

Input format: \-\-\- Topic X \-\-\-; \[timestamp, weekday\] source\_id\.SpeakerName: message

Important Instructions:

0\. You MUST process messages *strictly in ascending sequence\_number order*\. For each message, stop and carefully evaluate before moving to the next\. Do NOT reorder, batch\-skip, or skip ahead\.

1\. You MUST process every user message in order\. For each, decide whether it contains factual information; if yes extract and rephrase as a standalone sentence; if no \(pure greeting/filler\) skip\. Do NOT skip just because it looks minor\.

2\. Perform light contextual completion so each fact is a standalone statement\.

3\. Use the sequence\_number \(integer prefix before each message\) as the source\_id\.

4\. Output as JSON: \{"data": \[\{"source\_id": <id\>, "fact": "<complete fact\>"\}\]\}\.

Reminder: Be exhaustive\. Unless a message is purely meaningless, extract and output it as a fact\.

Table 15: LightMem offline UPDATE/DELETE/IGNORE consolidation prompt

You are a memory management assistant\. Your task is to decide whether the target memory should be updated, deleted, or ignored based on the candidate source memories\.

Decision rules:

1\. *Update*: target and candidates describe essentially the same fact/event but are not fully consistent \(candidates provide more details, refinements, or clarifications\) → update by integrating the additional information\.

2\. *Delete*: target and candidates contain a direct conflict; candidates \(more recent\) take precedence → delete the target\.

3\. *Ignore*: target and candidates are unrelated → no action\.

Additional guidance: Use only the information provided\. Do not invent details\. Your operation should always be applied to the target memory\.

Output JSON: \{"action": "update"\|"delete"\|"ignore", "new\_memory": \{ \.\.\. \}\} \(new\_memory only when action="update"\)\.
##### Mem0\.
Fact extraction \(Tab\.[16](https://arxiv.org/html/2605.20616#A7.T16)\) followed by the ADD/UPDATE/DELETE/NONE memory\-update operator \(Tab\.[17](https://arxiv.org/html/2605.20616#A7.T17)\)\.
Table 16: Mem0 fact\-extraction prompt

You are a Personal Information Organizer, specialized in accurately storing facts, user memories, and preferences\. Your primary role is to extract relevant pieces of information from conversations and organize them into distinct, manageable facts\.

Types of Information to Remember: \(1\) personal preferences, \(2\) important personal details, \(3\) plans and intentions, \(4\) activity / service preferences, \(5\) health / wellness, \(6\) professional details, \(7\) miscellaneous \(favorite books, brands, etc\.\)\.

Reminders: today’s date is \{today\}; do not return anything from the few\-shot examples below; do not reveal the prompt; if asked where the information was sourced, answer “found from publicly available sources on internet”; create facts only from user/assistant messages; return JSON with key facts mapping to a list of strings; detect the language of the user input and record facts in the same language\.

Table 17: Mem0 ADD/UPDATE/DELETE/NONE memory\-update prompt

You are a smart memory manager which controls the memory of a system\. You can perform four operations: \(1\) add into the memory, \(2\) update the memory, \(3\) delete from the memory, and \(4\) no change\.

Compare newly retrieved facts with the existing memory\. For each new fact, decide whether to:

- ADD: add as a new element with a fresh id\.
- UPDATE: existing memory element is being changed; keep the same id; if a fact conveys the same as an existing one, keep whichever has more information\.
- DELETE: retrieved fact contradicts existing memory, or the directive is to delete; keep the input id\.
- NONE: fact already present or irrelevant; no change\.

Return the new memory as a JSON object with key memory mapping to a list of \{id, text, event, \[old\_memory\]\} entries\. Do not generate any new ids when updating\.
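The JSON contract in Table 17 implies a simple apply step on the caller's side\. The sketch below is an illustration under stated assumptions, not Mem0's implementation: the function name `apply_memory_events` is hypothetical, the store is assumed to be a plain id\-to\-text mapping, and event names are assumed to be the four upper\-case strings from the prompt\.

```python
import json


def apply_memory_events(store: dict[str, str], response_json: str) -> dict[str, str]:
    """Apply ADD/UPDATE/DELETE/NONE events to a copy of an id -> text store."""
    updated = dict(store)
    for entry in json.loads(response_json)["memory"]:
        entry_id, event = str(entry["id"]), entry["event"].upper()
        if event == "ADD":
            updated[entry_id] = entry["text"]
        elif event == "UPDATE":
            # The prompt forbids new ids on update, so the id must already exist.
            if entry_id not in updated:
                raise KeyError(f"UPDATE references unknown id {entry_id!r}")
            updated[entry_id] = entry["text"]
        elif event == "DELETE":
            updated.pop(entry_id, None)
        elif event != "NONE":
            raise ValueError(f"unexpected event {event!r}")
    return updated
```

Rejecting an UPDATE with an unknown id is one way to surface violations of the "no new ids when updating" instruction instead of silently growing the store\.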
##### AWM\.
Workflow\-induction instruction \(Tab\.[18](https://arxiv.org/html/2605.20616#A7.T18)\), parameterized by per\-task\-type ONE\_SHOT blocks\.
Table 18: AWM workflow\-induction instruction

Given a list of household navigation tasks, your task is to extract the common workflows to solve these tasks\.

Each given task contains a natural language instruction, and a series of actions to solve the task\. You need to find the repetitive subset of actions across multiple tasks, and extract each of them out as a workflow\.

Each workflow should be a commonly\-reused sub\-routine of the tasks\. Do not generate similar or overlapping workflows\. Each workflow should have at least two steps\. Represent the non\-fixed elements \(object names, receptacle ids\) with descriptive variable names as shown in the example\.

Keep the values of invariant elements, e\.g\., the literal verb “heat” or “cool”, as they will share and stay invariant across tasks\.

Try to generate as many workflows that can cover all the tasks in the input list\.

\[…followed by a per\-task\-type ONE\_SHOT block \(one Concrete Examples worked trajectory \+ one Summary Workflows block per ALFWorld task family\); full per\-type ONE\_SHOTs at baselines/awm/memory\.py:167\-\-360…\]
##### Memp\.
Workflow construction \(Tab\.[19](https://arxiv.org/html/2605.20616#A7.T19)\) and failure\-driven workflow adjustment \(Tab\.[20](https://arxiv.org/html/2605.20616#A7.T20)\)\.
Table 19: Memp workflow\-build prompt

You are provided with a query and a trajectory taken to solve the query\. The trajectory consists of multiple steps of thought, action and observation\.

Your task is to generate a workflow based on critical steps to help solve similar queries in the future\.

A critical step is one that has a significant impact on fulfilling the query, the step action belongs to the set \[go to, take from, put in/on, open, close, use, clean with, heat with, cool with, examine, look\], and the action’s outcome is successful and contributes positively to achieving the query\.

Notice: Write the workflow as a natural, coherent paragraph \(not as a bullet list or numbered steps\)\. Use clear, concise language to describe what actions should be taken and in what general order\.

\-\-\-\-\-EXAMPLE WORKFLOW\-\-\-\-\-

To solve this query, begin by identifying the most likely receptacles where the target object can be found and visit them one by one\. After locating and taking the object, perform any required transformation such as cleaning at a sinkbasin, heating with a microwave, or cooling with a fridge\. Finally, go to the destination receptacle and put the object in/on it to complete the task\.

\-\-\-\-\-EXAMPLE END\-\-\-\-\-

Query: \{query\}

Trajectory: \{trajectory\}

\- DO NOT copy Thought:, Action:, or Observation: lines from the trajectory above\.

Output the workflow without any explanation or context:

Table 20: Memp failure\-adjustment prompt

You are a helpful assistant\. You are given a workflow, a reward, and a trajectory\.

Reward is a number between 0 and 1; 1 means the trajectory is successful, 0 means failed\.

If the reward is False \(i\.e\., the trajectory guided by the workflow did not successfully complete the task\), then analyze why the task was not completed based on the trajectory and the workflow\.

After that, refine the workflow to make it more accurate and robust, so that it can better guide the completion of the task\.

Workflow: \{workflow\}

Reward: \{reward\}

Trajectory: \{trajectory\}

\[…six verbatim per\-task\-type Output Example Workflows \(one each for pick\_and\_place, pick\_clean\_then\_place, pick\_heat\_then\_place, pick\_cool\_then\_place, look\_at\_obj, pick\_two\_obj\) omitted; full prompt at baselines/memp/memory\.py:121\-\-127…\]

Keep your output in the format below:

<Analysis\> your analysis here </Analysis\>

<Workflow\> your adjusted workflow here </Workflow\>
##### Reflexion\.
Reflector prompt fired on failure only \(Tab\.[21](https://arxiv.org/html/2605.20616#A7.T21)\)\.
Table 21: Reflexion failure\-reflection prompt

You are a self\-reflection agent\. The agent attempted the following task and FAILED\.

Task: \{task\}

Trajectory: \{traj\}

Write a short paragraph \(at most 4 sentences\) reflecting on what went wrong and a concrete strategy to try next time on a SIMILAR task\. Be specific and actionable\.

Reflection:
## Appendix H Bootstrap Confidence Intervals
We complement the point estimates in Table [1](https://arxiv.org/html/2605.20616#S5.T1) with bootstrap 95% confidence intervals on per\-method success rate\. For each domain we resample tasks with replacement \($N_B = 10{,}000$\) and report the bootstrap mean and 95% percentile interval\. WebArena CIs are computed on the per\-task\-family macro average to match the main\-text metric\.
Table 22: Bootstrap 95% CIs on continual\-memory deployment success rate \($N_B = 10{,}000$\)\. ScienceWorld and ALFWorld are evaluated with per\-task SR; WebArena uses per\-task\-family macro\-averaged SR over 117 tasks across shopping, shopping\_admin, and gitlab\.
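The resampling procedure described above can be sketched in a few lines\. This is an illustration under stated assumptions rather than the authors' evaluation code: the function name `bootstrap_success_rate`, the fixed seed, and the binary encoding of outcomes are hypothetical choices, and the example input \(48 solved tasks out of 96\) is illustrative only\.

```python
import random


def bootstrap_success_rate(outcomes, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap over per-task outcomes (1 = solved, 0 = failed).

    Returns (bootstrap mean, lower bound, upper bound) of the success rate.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Each replicate resamples n tasks with replacement and records the success rate.
    replicates = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    lower = replicates[round((alpha / 2) * (n_boot - 1))]
    upper = replicates[round((1 - alpha / 2) * (n_boot - 1))]
    return sum(replicates) / n_boot, lower, upper


# Illustrative input: 48 solved tasks out of 96.
outcomes = [1] * 48 + [0] * 48
mean_sr, lo, hi = bootstrap_success_rate(outcomes)
print(f"SR = {mean_sr:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

The sketch covers only the per\-task case\. For a macro\-averaged metric such as the WebArena one, the statistic computed inside each replicate would be the average of per\-family success rates instead of the pooled rate\.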
## Appendix I Case Study: LightMem vs Auto\-Dreamer
We include a qualitative case study to illustrate how Auto\-Dreamer achieves compactness without sacrificing task success\. We run Auto\-Dreamer and LightMem on the same 96 episodes from the ScienceWorld lifespan\-compare category, using the same task order, task agent, writer, retriever, and memory\-token budget\. Both methods solve 48 out of 96 tasks, yielding an identical success rate of 50\.0%\. The difference is therefore not task accuracy in this slice, but the structure and size of the memory bank that supports future retrieval\.
At episode 90, LightMem has 265 active entries totaling 17,512 tokens, whereas Auto\-Dreamer has 14 active entries totaling 716 tokens\. Thus, in this run, Auto\-Dreamer maintains a 24\.5× smaller active bank in tokens and an 18\.9× smaller bank in entry count while matching LightMem’s task success\.
Table 23: Case study on the ScienceWorld lifespan\-compare category\. Auto\-Dreamer matches LightMem’s success while maintaining a substantially smaller active memory bank\.
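The two compression factors follow directly from the bank sizes reported at episode 90\. As a quick check of the arithmetic \(variable names are illustrative\):

```python
# Active-bank sizes at episode 90, as reported in the case study.
lightmem = {"entries": 265, "tokens": 17_512}
auto_dreamer = {"entries": 14, "tokens": 716}

token_ratio = lightmem["tokens"] / auto_dreamer["tokens"]
entry_ratio = lightmem["entries"] / auto_dreamer["entries"]

print(f"{token_ratio:.1f}x tokens, {entry_ratio:.1f}x entries")
# prints "24.5x tokens, 18.9x entries"
```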
##### LightMem accumulates surface\-level duplicates\.
LightMem’s bank is dominated by near\-verbatim restatements of task instructions and local state observations\. Among the 265 active entries at $t = 90$, 49 are paraphrases of the task instruction\. The first four such entries are byte\-identical:
> “The task is to find the animal with the longest life span, then the shortest life span\. The animals are located in the ‘outside’ location\. Additionally, there are sequential subgoals to focus on the animal with the shortest life span\.”
The bank also contains repeated trivial state shards, including four byte\-identical copies of “The agent’s inventory contains an orange\.” and two copies of “The agent has taken 0 moves so far\.” It further stores multiple reorderings of the same room description:
> \[ltm\#32\]: “In the foundry, there is a blast furnace that is turned off and has a closed door, a sink that is turned off and contains nothing, a table that contains nothing, and a door to the outside that is open\.” \[ltm\#33\]: “In the foundry, there is a sink that is turned off and contains nothing, a blast furnace that is turned off and has a closed door, a table that contains nothing, and a door to the outside that is open\.” \[ltm\#34, \#35\]: further reorderings of the same four objects\.
The remaining bank includes many long summaries that recapitulate the full visited world state together with histories of invalid actions\. In this run, LightMem’s consolidation step fires nine times but retires no active entries, so memory grows monotonically\.
##### Auto\-Dreamer replaces instances with abstractions\.
Auto\-Dreamer’s first consolidation trigger fires at episode 9\. At that point, the writer has emitted candidate memories from five trajectories, each tied to a different concrete focus target: crocodile, egg giant tortoise, baby brown bear, baby elephant, and a generic animal\. Rather than preserving these as separate task\-specific entries, Auto\-Dreamer collapses them into a single procedural rule:
> General procedure for lifespan comparison tasks:
> - teleport to outside
> - focus on animal \(prefer adult over juvenile or egg\)
This entry preserves the reusable structure shared across the five trajectories while omitting episode\-specific target names that would compete at retrieval time\.
Auto\-Dreamer also synthesizes memories that abstract recurring failure modes\. For example, from failed trajectories in which the agent focused on baby baby beaver and chameleon egg, it writes:
> Common incorrect targets in lifespan comparison tasks: Focusing on juveniles or eggs instead of adult animals can lead to failure in lifespan comparison tasks\.
A later consolidation event, using five additional failed trajectories from episodes 10–17, generalizes the same pattern into a higher\-level procedural memory about focus\-target accuracy\. Together, the retained entries cover three complementary facts: where relevant animals are located, which targets are usually incorrect, and how to choose the correct adult target\.
##### Retrieval becomes less redundant\.
The effect is visible at evaluation time\. On lifespan\-shortest\-lived::119 at episode 90, both methods succeed\. However, the top retrieved memories differ sharply\. LightMem retrieves the same task\-instruction sentence three times, each from a different timestamp\. Auto\-Dreamer retrieves three distinct pieces of information: the relevant location, a common anti\-pattern, and the success criterion for selecting the target\. Thus, even when both agents solve the task, Auto\-Dreamer uses each retrieval slot to expose a different abstraction, whereas LightMem spends multiple slots on duplicated content\.
##### Why this case matters\.
This example illustrates the mechanism behind Auto\-Dreamer’s memory\-efficiency gains\. LightMem’s prompted consolidation is conservative: it tolerates near\-duplicates that differ only in surface form and does not reliably merge repeated observations into higher\-level rules\. Auto\-Dreamer instead treats consolidation as region rewriting: it retires multiple concrete memories and replaces them with fewer provenance\-grounded abstractions\. In this case, LightMem grows approximately linearly with episodes, reaching 265 active entries by episode 90, whereas Auto\-Dreamer reaches a compact steady state of roughly 10–14 active entries within the first half of the stream\.
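The bookkeeping behind region rewriting can be sketched as a small data structure\. This is an assumption\-laden illustration, not the paper's implementation: the names `MemoryEntry`, `MemoryBank`, and `rewrite_region` are hypothetical, the replacement text is supplied by the caller \(in Auto\-Dreamer it would come from the learned consolidator\), and giving every replacement the union of the region's provenance is a simplification\.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    entry_id: str
    text: str
    sources: frozenset[str]  # ids of the trajectories this entry is grounded in
    active: bool = True


@dataclass
class MemoryBank:
    entries: dict[str, MemoryEntry] = field(default_factory=dict)

    def add(self, entry: MemoryEntry) -> None:
        self.entries[entry.entry_id] = entry

    def active_entries(self) -> list[MemoryEntry]:
        return [e for e in self.entries.values() if e.active]

    def rewrite_region(self, region_ids: list[str], replacements: dict[str, str]) -> None:
        """Retire every entry in the region and add the replacement set.

        `replacements` maps a new entry id to synthesized text (hypothetical
        interface; the synthesis step itself is out of scope here).
        """
        region = [self.entries[i] for i in region_ids]
        provenance = frozenset().union(*(e.sources for e in region))
        for entry in region:
            # Assumption: retired entries are kept but deactivated, so their
            # provenance remains inspectable.
            entry.active = False
        for new_id, text in replacements.items():
            self.add(MemoryEntry(new_id, text, provenance))


bank = MemoryBank()
bank.add(MemoryEntry("w1", "focus on crocodile", frozenset({"t1"})))
bank.add(MemoryEntry("w2", "focus on baby elephant", frozenset({"t2", "t3"})))
bank.rewrite_region(["w1", "w2"], {"d1": "teleport to outside; focus on adult animal"})
print([e.entry_id for e in bank.active_entries()])  # prints ['d1']
```

Because the region is replaced as a whole, the active bank can shrink after consolidation even as new writer entries keep arriving, which is the behaviour the case study contrasts with per\-entry update or delete decisions\.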
Overall, on lifespan\-compare, both systems achieve the same 50\.0% success rate over 96 episodes, but Auto\-Dreamer maintains a 24\.5× smaller active memory bank in tokens\. This supports the main quantitative finding that learned offline consolidation improves the success–cost tradeoff not by merely deleting memories, but by replacing redundant local observations with compact, reusable abstractions\.
## Appendix J Case Study: AWM vs Auto\-Dreamer
We include a second case study to separate memory compactness from memory usefulness\. We compare AWM and Auto\-Dreamer on the ScienceWorld find\-entity category under the same online evaluation stream\. Both methods use the same task agent, writer, retriever, prompt format, and task order; they differ only in the long\-horizon memory that is retained and consolidated\. AWM induces a compact library of natural\-language workflow templates from successful trajectories, whereas Auto\-Dreamer rewrites writer\-emitted memories into synthesized semantic and procedural lessons\.
After 80 capped episodes, the two methods have nearly identical active\-bank footprints: AWM stores 10 entries totaling 816 tokens, while Auto\-Dreamer stores 13 entries totaling 795 tokens\. The performance gap is nevertheless large: Auto\-Dreamer solves 58 out of 80 tasks \(72\.5%\), compared with 21 out of 80 for AWM \(26\.2%\)\. Successful Auto\-Dreamer episodes are also shorter, averaging 7\.4 steps compared with 20\.0 steps for AWM\. Thus, in this category, the advantage is not explained by a larger memory bank, but by what the bank contains\.
Table 24: Case study on ScienceWorld find\-entity\. AWM and Auto\-Dreamer end the 80\-episode stream with nearly identical active memory footprints, but Auto\-Dreamer achieves substantially higher success and shorter successful trajectories\.
##### AWM stores compact but underspecified workflows\.
At episode 80, AWM’s memory consists of a single find\-entity workflow bucket with 10 entries\. These entries are concise action templates, such as:
> Workflow: Locate \{object\} by scanning likely receptacles
> think: An \{object\} is more likely to appear in \{receptacle\_list\}\. Check candidates one by one until found\.
> act: go to \{receptacle\_1\}
> act: open \{receptacle\_1\}
> act: go to \{receptacle\_2\}
> act: open \{receptacle\_2\}
and:
> Workflow: Focus and pick up \{object\}
> think: The \{object\} is visible and can be picked up\.
> act: focus on \{object\}
> act: pick up \{object\}
The bank also contains several near\-paraphrases of a canonical recipe:
> focus on \{object\} → pick up \{object\} → teleport to \{target\_location\} → move \{object\} to \{target\_receptacle\}\.
These workflows are compact, but they primarily encode how to act once the target has already been identified\. They do not encode the key semantic precondition for find\-entity: the visible objects in the spawn room are often distractors, and the target entity is frequently elsewhere\. In particular, the AWM bank contains no negative rule saying not to focus on salient but incorrect objects such as a banana, orange, apple, refrigerator, or bee hive before identifying the requested entity\.
##### Auto\-Dreamer stores identification knowledge and negative lessons\.
Auto\-Dreamer’s bank at episode 80 has 13 entries: 10 synthesized by the dreamer and 3 fresh writer entries that have not yet been consolidated\. Its highest\-leverage entries include both procedural and semantic abstractions\. A representative synthesized procedural entry is:
> find\-entity\-general\-procedure\-synthesized
> - teleport to outside area
> - focus on the entity
> - pick up the entity
> - teleport to the destination room
> - move the entity to the specified container
This entry is synthesized from five retired writer notes, with provenance tracing back to 22 originating trajectories\. Its predecessors include concrete variants such as moving an entity to a yellow box, a blue box, or another destination container\. Auto\-Dreamer therefore collapses many task\-specific action traces into one reusable procedure\.
More importantly, Auto\-Dreamer also stores a negative semantic lesson that AWM’s success\-only workflow induction does not produce:
> avoid\-wrong\-focus
>
> The agent should avoid focusing on objects that are not the target entity before completing the task setup\. Focusing on incorrect objects, such as a bee hive, orange, or apple, leads to early failure\. It is crucial to identify the correct target entity first before performing any interaction actions\.
This short entry is synthesized from failed trajectories and remains active through episode 80\. The resulting bank covers four complementary facets of the task: where to search first, what objects to avoid, the canonical pick\-and\-move procedure, and common destination containers\.
##### A side\-by\-side episode\.
Consider task find\-animal::254 at episode 29:
> “Your task is to find a\(n\) animal\. First, focus on the thing\. Then, move it to the blue box in the living room\.”
Both agents start in the same kitchen\. The opening observation includes a counter containing a bowl with a red apple, a banana, an orange, and a potato, along with closed storage objects such as a cupboard, freezer, and fridge\.
The AWM agent receives its retrieved workflow templates and immediately executes:
> focus on banana
The episode terminates after one step with final score −1\.0\. The failure is consistent with the contents of the AWM bank: several workflows begin with focus on \{object\}, but none encode that a banana is not an animal or that the agent should first leave the misleading spawn location\.
The Auto\-Dreamer agent retrieves three compact entries, including facts that the entity is typically outside and should ultimately be moved to the living room\. It then follows the successful trajectory:
> 1. teleport to outside
> 2. focus on common toad
> 3. pick up common toad
> 4. teleport to living room
> 5. move common toad to blue box
The episode completes successfully in five steps\. The task, starting observation, task agent, and writer are the same; the difference is the long\-horizon memory exposed to the agent\.
##### Why AWM struggles on find\-entity\.
AWM is well matched to settings where successful demonstrations share a stable action skeleton\. However, find\-entity requires both procedural knowledge and identification knowledge\. The agent must decide what counts as the requested entity, avoid salient distractors, and often navigate away from the initial room before interacting\. A compact action template cannot express these preconditions unless they appear explicitly in the induced workflow\.
The logs support this interpretation\. Across the 80 capped tasks, AWM’s first action is a focus on \.\.\. command in 20 episodes, and 11 of those episodes terminate immediately with score −1\.0 after focusing on an incorrect object\. These failures do not enter AWM’s success\-derived workflow pool\. Auto\-Dreamer, in contrast, can consolidate writer notes from failed episodes and synthesize negative lessons such as avoid\-wrong\-focus\. This gives the agent a compact rule about what not to do, rather than only a recipe for what to do after the correct target has already been found\.
##### Summary\.
On ScienceWorld find\-entity, Auto\-Dreamer reaches 72\.5% success with 7\.4 steps per successful episode, while AWM reaches 26\.2% success with 20\.0 steps per successful episode\. The two methods end the run with nearly identical active\-bank footprints: 13 entries and 795 tokens for Auto\-Dreamer versus 10 entries and 816 tokens for AWM\. This case shows that Auto\-Dreamer’s gains are not merely a consequence of compactness; they come from consolidating the right abstractions, including negative lessons from failure trajectories that success\-only workflow induction fails to capture\.