AgentSpec: Understanding Embodied Agent Scaffolds Through Controlled Composition

arXiv cs.CL 06/15/26, 04:00 AM Papers
Summary
Introduces AgentSpec, a modular specification framework for systematically composing and analyzing embodied LLM agent scaffolds, revealing that performance depends on scaffold compatibility and interaction effects rather than isolated module strength.
arXiv:2606.14674v1 Announce Type: new Abstract: LLM agents are increasingly built not as single model calls, but as scaffolded systems that combine reasoning, memory, reflection, action execution, and learning. While such scaffolds often improve performance, they are often embedded in tightly coupled pipelines, making it difficult to isolate component contributions, compare alternative designs, or understand how module interactions shape agent behavior. We introduce AgentSpec, a modular specification framework that represents embodied agents as typed compositions of reusable policy components with standardized interfaces. AgentSpec standardizes the interfaces among perception, memory, reasoning, reflection, action, and optional learning, enabling components to be swapped and recombined under controlled conditions. We instantiate this framework across DeliveryBench, ALFRED, MiniGrid, and RoboTHOR, and analyze reasoning, memory, reflection, and reinforcement-learning modules across model backbones. Our results show that agent performance is governed by scaffold compatibility and interaction effects rather than isolated module strength. In particular, structured multi-granularity memory improves long-horizon state tracking, reasoning and memory interact non-uniformly across environments, reflection trades off correction and cost, and RL-trained policies compose best when optimized with deployment-time scaffold structure. AgentSpec provides a controlled foundation for studying, comparing, and designing composable LLM agents. Our code, baselines and interactive playground are publicly available at https://agentspec-embodied.github.io.
Cached at: 06/15/26, 08:59 AM
# AgentSpec: Understanding Embodied Agent Scaffolds Through Controlled Composition
Source: [https://arxiv.org/html/2606.14674](https://arxiv.org/html/2606.14674)
Jixuan Chen¹, Jianzhi Shen², Haoqiang Kang¹, Zhi Hong¹, Qingyi Jiang¹, Soham Bose¹, Yiming Zhang¹, Leon Leng³, Amit Vyas¹, Lingjun Mao¹, Siru Ouyang⁴, Kun Zhou¹, Lianhui Qin¹

¹University of California, San Diego; ²Johns Hopkins University; ³University of Washington; ⁴University of Illinois Urbana\-Champaign

###### Abstract

LLM agents are increasingly built not as single model calls, but as scaffolded systems that combine reasoning, memory, reflection, action execution, and learning\. While such scaffolds often improve performance, they are often embedded in tightly coupled pipelines, making it difficult to isolate component contributions, compare alternative designs, or understand how module interactions shape agent behavior\. We introduce AgentSpec, a modular specification framework that represents embodied agents as typed compositions of reusable policy components with standardized interfaces\. AgentSpec standardizes the interfaces among perception, memory, reasoning, reflection, action, and optional learning, enabling components to be swapped and recombined under controlled conditions\. We instantiate this framework across DeliveryBench, ALFRED, MiniGrid, and RoboTHOR, and analyze reasoning, memory, reflection, and reinforcement\-learning modules across model backbones\. Our results show that agent performance is governed by scaffold compatibility and interaction effects rather than isolated module strength\. In particular, structured multi\-granularity memory improves long\-horizon state tracking, reasoning and memory interact non\-uniformly across environments, reflection trades off correction and cost, and RL\-trained policies compose best when optimized with deployment\-time scaffold structure\. AgentSpec provides a controlled foundation for studying, comparing, and designing composable LLM agents\. Our code, baselines and interactive playground are publicly available at [https://agentspec\-embodied\.github\.io](https://agentspec-embodied.github.io/)\.

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.14674v1/x1.png)

Figure 1: AgentSpec turns tightly coupled embodied\-agent pipelines into a controlled modular design space with fixed typed interfaces, enabling systematic module composition and revealing interaction effects between reasoning, memory, reflection, action, and learning\.

Recent advances in large language models \(LLMs\) have substantially improved end\-to\-end reasoning capability\. However, solving complex real\-world tasks as agents, especially long\-horizon decision\-making in embodied environments, requires more than stronger next\-step prediction \(Ahn et al\., [2022](https://arxiv.org/html/2606.14674#bib.bib1); Huang et al\., [2022](https://arxiv.org/html/2606.14674#bib.bib34); Wang et al\., [2023a](https://arxiv.org/html/2606.14674#bib.bib84)\)\. Success depends on aligning perception, memory, reasoning, and action across many rounds of interaction\. Recent agent frameworks such as OpenClaw \([https://github\.com/openclaw/openclaw](https://github.com/openclaw/openclaw)\) illustrate this shift: rather than relying on a single model invocation, they augment a base LLM \(e\.g\., GPT\-5\) with tool execution, state tracking, and persistent memory\. Their capability therefore comes not from the model alone, but from composing these components into a coherent decision\-making system\.

Yet despite their growing capability, most agent systems remain tightly coupled pipelines\. Recent modular agent frameworks and cognitive architectures, such as CoALA \(Sumers et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib79)\), AgentSquare \(Shang et al\., [2024](https://arxiv.org/html/2606.14674#bib.bib74)\), AgentGym \(Xi et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib97)\), Voyager \(Wang et al\., [2023a](https://arxiv.org/html/2606.14674#bib.bib84)\), and OpenClaw, expose reasoning, memory, tool use, and action execution as reusable building blocks\. However, they are typically designed as complete systems or optimized for high\-performing configurations, rather than controlled platforms for attributing component\-level and interaction\-level effects\. Consequently, when reasoning, memory, reflection, and reinforcement learning are intertwined, improvements remain difficult to isolate and generalize\. The field still lacks principled answers to basic design questions: which reasoning strategies help in which settings; when memory is useful and what form it should take; when reinforcement learning composes well with reasoning strategies; and when reflection improves decisions rather than merely increasing cost\.

We address this gap with AgentSpec, a modular framework that makes agent composition explicit\. It represents an agent as a Perception–Memory–Reasoning–Reflection–Action loop, with reinforcement learning as an optional module for further optimizing behavior\. Perception converts raw observations into a standardized state representation; memory retrieves relevant history and knowledge; reasoning proposes a decision; reflection critiques or revises it; and action executes it in the environment\. By standardizing interfaces, AgentSpec turns many existing agent designs \(Packer et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib69); Park et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib70); Li et al\., [2025b](https://arxiv.org/html/2606.14674#bib.bib51)\) into special cases within a shared design space, allowing components to be swapped, recombined, and studied without rebuilding the full system\. This enables stronger agents and clearer scientific analysis, but requires evaluation settings where module interactions are observable rather than hidden inside one\-shot outputs\.

We use embodied agents as a diagnostic setting for modular agent design because embodied tasks are closed\-loop: each action changes the agent’s future observations, available choices, and accumulated history\. Performance therefore depends not only on individual module quality, but also on compatibility within the full decision loop\. For instance, detailed trajectory memory may help long\-horizon state tracking, but distract a planning\-oriented reasoner if the retrieved context is too low\-level; conversely, strong reasoning may still fail when memory does not preserve the task state needed for later decisions\. These interaction effects are central to the questions AgentSpec is designed to study\.

We evaluate AgentSpec across four embodied benchmarks that stress complementary aspects of modular decision\-making: DeliveryBench \(Mao et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib62)\) emphasizes long\-horizon planning under resource and deadline constraints; ALFRED \(Shridhar et al\., [2020](https://arxiv.org/html/2606.14674#bib.bib77)\) requires compositional household manipulation and persistent task\-state tracking; MiniGrid \(Chevalier\-Boisvert et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib12)\) isolates symbolic navigation and partial observability; and RoboTHOR \(Deitke et al\., [2020](https://arxiv.org/html/2606.14674#bib.bib14)\) tests first\-person navigation in realistic 3D scenes\. Together, they vary horizon length, observation modality, realism, and control difficulty, allowing us to study when module combinations help, when they hurt, and which design principles transfer across settings\.

Our experiments reveal three general principles\. First, module compatibility matters as much as module strength: reasoning structures local decisions, while memory preserves task state across long horizons, but memory helps only when its representation matches the reasoning strategy\. Second, the best composition is environment\-dependent\. Shorter or more symbolic tasks rely more on per\-step reasoning, whereas long\-horizon embodied tasks are bottlenecked by state tracking and trajectory coherence\. Third, effectiveness must be evaluated together with efficiency: stronger performance does not simply come from more tokens or deeper deliberation, and lightweight but well\-matched compositions often achieve better performance–cost trade\-offs than heavier misaligned ones\.

Overall, these findings suggest that modular agent design should be treated as a structured and analyzable design space rather than a collection of interchangeable heuristics\. AgentSpec provides a controlled framework for composing reasoning, memory, reflection, and learning modules under shared interfaces, enabling systematic comparisons across backbones, tasks, and efficiency constraints\. Beyond improving benchmark performance, our results expose reusable design principles: modules should be selected based on task horizon, state\-tracking demands, representation compatibility, and inference cost\. This also highlights an important future direction: instead of attaching reasoning or memory only at inference time, modular components may need to be jointly optimized with the policy so learned agents remain compatible with their deployment\-time scaffolds\.

In summary, our contributions are threefold\. First, we introduce AgentSpec, a typed modular specification for embodied LLM agents that separates perception, memory, reasoning, reflection, action execution, and optional learning into interchangeable components with shared interfaces\. Second, we instantiate this specification across four embodied benchmarks and multiple model backbones, enabling controlled comparisons of module choices that are usually entangled inside complete agent pipelines\. Third, we use this controlled design space to identify reusable principles for scaffolded agent design, showing that memory is useful only when its representation matches the downstream reasoner, multi\-granularity memory is a robust default for long\-horizon tasks, reflection is most valuable when it repairs local execution errors, and RL\-trained policies should be optimized together with the scaffolds they will use at deployment time\.

## 2 Related Work

**LLM\-Based Agent Systems\.** Modern LLM agents are often built as multi\-step pipelines that integrate reasoning, memory, tool use, reflection, and action execution \(Park et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib70); Hong et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib27); Chen et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib10); Wu et al\., [2024](https://arxiv.org/html/2606.14674#bib.bib95); Li et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib45)\)\. Cognitive\-inspired frameworks such as CoALA \(Sumers et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib79)\) formalize agents as compositions of functional modules, while systems such as Voyager and AgentGym show that agents can accumulate skills or improve across environments \(Wang et al\., [2023a](https://arxiv.org/html/2606.14674#bib.bib84); Xi et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib97); Lin et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib53); Huang et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib35)\)\. However, most existing systems are proposed as complete end\-to\-end designs, with reasoning, memory, perception, and action components tightly coupled to task\-specific prompts, control logic, or environment interfaces, especially in long\-horizon embodied settings \(Deitke et al\., [2020](https://arxiv.org/html/2606.14674#bib.bib14); Mao et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib62)\)\. This makes it difficult to isolate components, replace them with alternatives, or systematically study how module interactions affect performance\. In contrast, AgentSpec treats agents as explicit compositions of reusable policy components with standardized interfaces, enabling controlled replacement, recombination, and analysis\.

**Agent Design Space\.** Prior work has explored a wide range of agent components, including reasoning strategies such as chain\-of\-thought prompting \(Kojima et al\., [2022](https://arxiv.org/html/2606.14674#bib.bib40); Wang et al\., [2022](https://arxiv.org/html/2606.14674#bib.bib89)\), search\-based planning \(Yao et al\., [2023a](https://arxiv.org/html/2606.14674#bib.bib110); Zhou et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib125)\), and self\-correction \(Madaan et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib61); Shinn et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib76); Kumar et al\., [2024](https://arxiv.org/html/2606.14674#bib.bib41)\), as well as memory mechanisms such as flat buffers \(Zhong et al\., [2024](https://arxiv.org/html/2606.14674#bib.bib124)\), tiered stores \(Packer et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib69); Chhikara et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib13)\), graph or hierarchical memories \(Li et al\., [2025a](https://arxiv.org/html/2606.14674#bib.bib47); Rasmussen et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib73); Anokhin et al\., [2024](https://arxiv.org/html/2606.14674#bib.bib2); Zhang et al\., [2025b](https://arxiv.org/html/2606.14674#bib.bib116)\), retrieval\-augmented memories \(Qian et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib72); Fang et al\., [2025a](https://arxiv.org/html/2606.14674#bib.bib16); Liu et al\., [2026](https://arxiv.org/html/2606.14674#bib.bib54)\), and procedural or self\-organizing memories \(Wang et al\., [2024](https://arxiv.org/html/2606.14674#bib.bib91); Zheng et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib123); Hu et al\., [2026a](https://arxiv.org/html/2606.14674#bib.bib30); Nan et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib65)\)\. Recent frameworks further automate architecture search \(Hu et al\., [2024](https://arxiv.org/html/2606.14674#bib.bib32); Zhang et al\., [2024](https://arxiv.org/html/2606.14674#bib.bib120); Li et al\., [2026b](https://arxiv.org/html/2606.14674#bib.bib50)\), and AgentSquare \(Shang et al\., [2024](https://arxiv.org/html/2606.14674#bib.bib74)\) standardizes modules for automatic recombination\. Yet these methods mainly aim to discover high\-performing configurations under a target metric, offering limited insight into why a configuration works, how much each module contributes, or when modules interact constructively or destructively\. Rather than only searching for the best agent, AgentSpec exposes the agent design space as a controlled platform for analyzing component\-level and interaction\-level effects across tasks and backbones\.

## 3 Modular Design

![Refer to caption](https://arxiv.org/html/2606.14674v1/x2.png)

Figure 2: Framework overview of AgentSpec\. AgentSpec decomposes embodied decision\-making into a typed Perception–Memory–Reasoning–Reflection–Action loop, where transition feedback updates memory and optional learning optimizes module policies or the controller\.

We instantiate AgentSpec as a Gym\-compatible agent wrapper organized around a modular Perception–Memory–Reasoning–Reflection–Action loop, as shown in Figure [2](https://arxiv.org/html/2606.14674#S3.F2)\. The key design choice is not simply modularization but interface control: every module receives and emits a typed intermediate object, so changing one component does not require rewriting the rest of the agent\. The interaction can be viewed as a partially observable sequential decision\-making problem, where the environment is modeled as $(\mathcal{S}, \mathcal{O}, \mathcal{A}, T, \rho)$ with latent states $\mathcal{S}$, observations $\mathcal{O}$, actions $\mathcal{A}$, transition dynamics $T$, and rewards $\rho$\. This design separates capabilities that are conceptually distinct but often entangled in embodied\-agent systems: interpreting heterogeneous observations, retaining task\-relevant history, reasoning over actions, revising decisions, and executing valid environment actions\.

At each time step $t$, the agent receives a task description $d$ and raw observation $o_t \in \mathcal{O}$\. The perception module abstracts them into a unified state representation $u_t = \mathcal{P}(d, o_t)$; memory retrieves relevant historical context $m_t = \mathcal{M}(h_{<t})$; reasoning produces an initial decision $r_t = \mathcal{R}(u_t, m_t)$; and reflection refines it into $\hat{r}_t = \mathcal{F}(r_t)$, which is converted into an executable action $a_t \in \mathcal{A}$\. The environment returns the next observation $o_{t+1}$, reward $\rho_t$, and termination signal $\textit{done}_t$; the transition is then written back to memory, with optional reinforcement\-learning updates\. By fixing interface\-level computation while varying module implementations, AgentSpec enables controlled studies of how agent components and their interactions affect performance\.
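The per-step computation described above can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the AgentSpec implementation: the names `State`, `Decision`, `Transition`, `Agent`, `Corridor`, and `run_episode` are hypothetical, the modules are toy lambdas, and the environment uses a simplified `reset`/`step` interface rather than the actual Gym API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class State:            # u_t: unified state representation (hypothetical type)
    task: str
    summary: str


@dataclass
class Decision:         # r_t or its refined version (hypothetical type)
    action: str
    rationale: str


@dataclass
class Transition:       # one entry of the history h_{<t}
    state: State
    action: str
    reward: float
    done: bool


@dataclass
class Agent:
    """Each module is a plain callable, so any one can be swapped independently."""
    perceive: Callable[[str, object], State]       # P(d, o_t) -> u_t
    retrieve: Callable[[List[Transition]], str]    # M(h_{<t}) -> m_t
    reason: Callable[[State, str], Decision]       # R(u_t, m_t) -> r_t
    reflect: Callable[[Decision], Decision]        # F(r_t) -> refined decision
    history: List[Transition] = field(default_factory=list)

    def step(self, task: str, obs: object) -> Tuple[State, str]:
        u = self.perceive(task, obs)
        m = self.retrieve(self.history)
        r = self.reason(u, m)
        r_hat = self.reflect(r)
        return u, r_hat.action

    def write_back(self, state: State, action: str, reward: float, done: bool) -> None:
        self.history.append(Transition(state, action, reward, done))


class Corridor:
    """Toy 1-D environment (an assumption for illustration): walk right to the goal."""
    def __init__(self, goal: int = 3) -> None:
        self.goal, self.pos = goal, 0

    def reset(self) -> int:
        self.pos = 0
        return self.pos

    def step(self, action: str) -> Tuple[int, float, bool]:
        self.pos += 1 if action == "right" else -1
        done = self.pos == self.goal
        return self.pos, (1.0 if done else 0.0), done


def run_episode(agent: Agent, env: Corridor, task: str, max_steps: int = 10) -> float:
    obs, total = env.reset(), 0.0
    for _ in range(max_steps):
        state, action = agent.step(task, obs)
        obs, reward, done = env.step(action)
        agent.write_back(state, action, reward, done)  # transition goes back to memory
        total += reward
        if done:
            break
    return total


agent = Agent(
    perceive=lambda d, o: State(task=d, summary=f"position={o}"),
    retrieve=lambda h: f"{len(h)} past steps",
    reason=lambda u, m: Decision("right", f"{u.summary}; {m}"),
    reflect=lambda r: r,  # identity reflection: accept the proposal unchanged
)
total = run_episode(agent, Corridor(goal=3), "reach the goal")
```

Because every module only sees the typed object produced by the previous one, replacing `reflect` or `retrieve` with a different callable leaves the rest of the loop untouched, which is the property the controlled comparisons rely on.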

##### Perception\.

The perception module $\mathcal{P}$ converts heterogeneous observations, such as symbolic states, sensor readings, RGB frames, and textual feedback, into a standardized representation for downstream modules\. It normalizes raw inputs into structured JSON\-like fields and concise textual summaries consumed by language\-model\-based components\. By decoupling environment\-specific observations from downstream control, perception allows the same memory, reasoning, and reflection modules to operate across environments while preserving task\-relevant structure\.

##### Memory\.

The memory module $\mathcal{M}$ stores and retrieves information beyond the current context, allowing the agent to accumulate experience and reuse task\-relevant knowledge\. In AgentSpec, memory includes both *episodic* and *semantic* forms: episodic memory records concrete trajectories, action sequences, failures, and successes as raw logs, summaries, or selectively retained experiences, while semantic memory stores reusable knowledge such as maps, domain constraints, strategies, and heuristics\.

AgentSpec supports and compares multiple memory paradigms, including retrieval\-based methods, persistent guidance such as dynamic cheatsheets \(Suzgun et al\., [2026](https://arxiv.org/html/2606.14674#bib.bib80)\) or writable notebooks, and broader agent\-centric memory engineering approaches such as Agent Context Engineering \(ACE\) \(Zhang et al\., [2025e](https://arxiv.org/html/2606.14674#bib.bib121)\)\. It also accommodates recent memory systems including CAM \(Li et al\., [2025a](https://arxiv.org/html/2606.14674#bib.bib47)\), Zep \(Rasmussen et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib73)\), MemGPT \(Packer et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib69)\), Mem0 \(Chhikara et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib13)\), SimpleMem \(Liu et al\., [2026](https://arxiv.org/html/2606.14674#bib.bib54)\), and OpenClaw context management, where external knowledge and semantic memory are represented as maintainable agent skills\.
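To make the episodic/semantic distinction concrete, the following Python sketch shows one possible shape for a memory module. It is an illustrative assumption, not any of the systems cited above: the class `SimpleStore`, its methods, and the word-overlap scoring are hypothetical stand-ins for whatever retrieval a real memory system uses.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimpleStore:
    """Hypothetical memory with an episodic log and a semantic knowledge list."""
    episodic: List[str] = field(default_factory=list)   # concrete trajectory records
    semantic: List[str] = field(default_factory=list)   # reusable knowledge

    def write_episode(self, record: str) -> None:
        self.episodic.append(record)

    def write_knowledge(self, fact: str) -> None:
        self.semantic.append(fact)

    def retrieve(self, query: str, k: int = 2) -> List[str]:
        """Return up to k entries sharing at least one word with the query."""
        words = set(query.lower().split())

        def overlap(entry: str) -> int:
            return len(words & set(entry.lower().split()))

        ranked = sorted(self.episodic + self.semantic, key=overlap, reverse=True)
        return [entry for entry in ranked[:k] if overlap(entry) > 0]


store = SimpleStore()
store.write_episode("step 3: picked up order at cafe")
store.write_episode("step 7: battery ran out near park")
store.write_knowledge("charging stations are located near the park")
hits = store.retrieve("where to charge near the park", k=2)
```

Here the semantic entry ranks first and the related failure episode second, while the unrelated pickup record is filtered out. A real module would replace the word-overlap score with embedding search, summaries, or a curated notebook behind the same `retrieve` interface.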

##### Reasoning\.

The reasoning module $\mathcal{R}$ maps the current state representation and retrieved memory to an action proposal with supporting rationale\. Under a shared interface, AgentSpec supports direct methods such as Chain\-of\-Thought \(CoT\) \(Wei et al\., [2022](https://arxiv.org/html/2606.14674#bib.bib92)\), interactive methods such as ReAct \(Yao et al\., [2022](https://arxiv.org/html/2606.14674#bib.bib111)\), and search\-based methods such as Reasoning via Planning \(RAP\) \(Hao et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib23)\) and Tree of Thoughts \(ToT\) \(Yao et al\., [2023a](https://arxiv.org/html/2606.14674#bib.bib110)\)\. This modularization allows us to study not only which reasoning strategy performs best, but also how reasoning interacts with memory, reflection, and computational cost\.

##### Reflection\.

The reflection module $\mathcal{F}$ critiques or revises intermediate decisions before execution and can reuse feedback from prior failures\. Under a shared interface, AgentSpec supports step\-level reflection such as Self\-Refine \(Madaan et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib61)\), trajectory\-level verbal feedback such as Reflexion \(Shinn et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib76)\), and retrospective trajectory analysis such as Retroformer \(Yao et al\., [2023b](https://arxiv.org/html/2606.14674#bib.bib112)\)\. This enables controlled study of when explicit revision improves decisions and when it mainly increases inference cost\.

##### Reinforcement Learning\.

AgentSpec supports reinforcement learning as an optional module for improving agent policies from environment feedback\. The framework is compatible with policy optimization methods such as GRPO \(Guo et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib21); Shao et al\., [2024](https://arxiv.org/html/2606.14674#bib.bib75)\) and exposes a unified interface for integrating learning with reasoning, memory, and action\. The RL module is task\-agnostic and can be applied across environments supported by AgentSpec, including DeliveryBench and AI2\-THOR, enabling analysis of both learning itself and its interaction with different reasoning and memory modules\.

## 4 Experiments

### 4\.1 Evaluation Benchmarks

We evaluate AgentSpec on four embodied\-agent benchmarks that require multi\-step interaction with an environment, covering complementary dimensions of embodied intelligence: DeliveryBench \(Mao et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib62)\) for long\-horizon planning under resource constraints, ALFRED \(Shridhar et al\., [2020](https://arxiv.org/html/2606.14674#bib.bib77)\) for compositional household instruction following, MiniGrid \(Chevalier\-Boisvert et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib12)\) for symbolic navigation under partial observability, and RoboTHOR \(Deitke et al\., [2020](https://arxiv.org/html/2606.14674#bib.bib14)\) for object navigation in photorealistic 3D scenes\. These benchmarks vary in observation modality, environment realism, task horizon, and required skills, allowing us to assess whether AgentSpec generalizes across diverse embodied settings with minimal environment\-specific adaptation\. More details are provided in Appendix [D](https://arxiv.org/html/2606.14674#A4)\.

DeliveryBench is a city\-scale delivery benchmark across 9 urban maps, testing long\-horizon planning under consumable resource constraints with structured and natural\-language observations\. Performance is measured by hourly profit\. ALFRED evaluates long\-horizon household instruction following in 3D environments, requiring object manipulation and state\-dependent reasoning across seven task types\. We report Success Rate \(SR\) and Success\-weighted Path Length \(SPL\)\. MiniGrid is a 2D gridworld benchmark with pixel observations, evaluating navigation, object interaction, and reasoning under partial observability across 10 tasks, using SR and an SPL\-style efficiency reward\. RoboTHOR evaluates first\-person object navigation in photorealistic indoor scenes, where agents navigate to a target object category within a limited step budget; we report SR and SPL\.
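For readers unfamiliar with the two navigation metrics, the sketch below computes SR and SPL in Python. It assumes the definition commonly used in the embodied navigation literature, SPL = (1/N) Σ S_i · l_i / max(p_i, l_i), where S_i is the success indicator, l_i the shortest-path length, and p_i the length of the path actually taken. The exact variant used in this work, including the "SPL-style efficiency reward" for MiniGrid, may differ; the function names and the example numbers are hypothetical.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes that ended in success."""
    return sum(successes) / len(successes)


def spl(successes: Sequence[bool],
        shortest: Sequence[float],
        taken: Sequence[float]) -> float:
    """Success weighted by path length, assuming every shortest-path length is > 0."""
    total = 0.0
    for ok, l, p in zip(successes, shortest, taken):
        if ok:
            # A successful episode scores 1.0 only if it followed a shortest path.
            total += l / max(p, l)
    return total / len(successes)


# Hypothetical episodes: optimal success, success with a 2x detour, failure.
successes = [True, True, False]
shortest = [10.0, 8.0, 5.0]
taken = [10.0, 16.0, 7.0]

sr_value = success_rate(successes)
spl_value = spl(successes, shortest, taken)
```

In this example SR is 2/3 while SPL is 0.5, because the second episode succeeds but takes a path twice as long as necessary. SPL can therefore separate configurations that reach the goal equally often but differ in how directly they get there.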

### 4\.2 Overall Performance Across Tasks

##### Experimental Setup\.

We conduct the main experiments on DeliveryBench under the 1\-hour setting, a long\-horizon and resource\-constrained testbed for evaluating how agent modules affect embodied decision\-making\. We compare configurations along three axes \(reasoning, memory, and reflection\) across both open\-source and closed\-source backbones\. For each backbone, we start from a lightweight base agent with simple memory and no additional reasoning or reflection, then swap in different module variants under a unified protocol\. We also evaluate representative configurations on MiniGrid, ALFRED, and RoboTHOR to test whether the trends transfer across environments with different horizons, observation modalities, and task structures\.

![Refer to caption](https://arxiv.org/html/2606.14674v1/figures/final_main_results_compact.png)

Figure 3: Main results across four interactive\-agent benchmarks\. Colored curves denote representative modular configurations, gray curves denote additional common configurations, and the dashed star\-marked curve shows the best configuration for each backbone\. DeliveryBench reports hourly profit; the other benchmarks report success rate\. Higher is better\. Full results are provided in Appendix [E](https://arxiv.org/html/2606.14674#A5)\.

##### Scaffolding narrows the backbone gap\.

As shown in Figure [3](https://arxiv.org/html/2606.14674#S4.F3)\(a\), stronger backbones generally achieve higher best\-case DeliveryBench performance, with Qwen\-27B and closed\-source models outperforming smaller Qwen variants\. However, the gap is not determined by scale alone\. Well\-matched modular configurations allow smaller or open\-source models to approach stronger models: Qwen\-27B with ReAct\+MemoryBank performs strongly, and Qwen\-9B also benefits from memory\- and reasoning\-augmented variants\. This suggests that external scaffolding can compensate for weaker native long\-horizon reasoning, especially when it provides structured state tracking or reusable task guidance\.

##### Memory helps only when it is structured\.

Figure [3](https://arxiv.org/html/2606.14674#S4.F3)\(a\) shows that memory can substantially improve DeliveryBench performance, but its benefit depends on how information is organized and retrieved\. Structured, action\-oriented memories such as DynamicCheatsheet and MemoryBank provide compact guidance useful for the next decision, while less selective memory may introduce stale or weakly relevant context\. Thus, effective memory is not simply longer context; it must preserve task\-relevant experience in a concise and controllable form\. As shown in Figure 4, on DeliveryBench with GPT\-5 mini, replacing Base memory with MemoryBank increases ReAct from 8\.54 to 30\.67 and Plan\-and\-Solve from 6\.18 to 26\.78, while CoT shows a weaker and less monotonic pattern\. This indicates that memory gains are mediated by the reasoner that consumes them\.

##### The best module depends on the environment\.

The cross\-benchmark results in Figure [3](https://arxiv.org/html/2606.14674#S4.F3)\(b–d\) show that module effectiveness varies across environments\. In MiniGrid, where tasks are shorter and more symbolic, reasoning\-oriented configurations often match the best envelope, while memory gains are less consistent\. In ALFRED, both memory and reasoning matter because agents must maintain coherence over multi\-step household instructions\. In RoboTHOR, success remains lower and the best configuration varies across backbones, suggesting additional bottlenecks from perception, navigation, and long\-horizon recovery\. Overall, modules should be selected based on the dominant failure mode: reasoning improves local decision structure, memory supports long\-horizon state tracking, and reflection helps when errors can be corrected through explicit revision\.

### 4\.3 How Modules Interact

| Reasoning | Base | SMem | DC | CDB | MB |
| --- | --- | --- | --- | --- | --- |
| R | 8\.54 | 15\.16 | 16\.46 | 9\.35 | 30\.67 |
| CoT | 12\.48 | 7\.69 | 13\.19 | 12\.73 | 13\.37 |
| P&S | 6\.18 | 17\.43 | 21\.91 | 11\.03 | 26\.78 |
| MAD | 22\.26 | 24\.72 | 16\.63 | 18\.47 | 29\.46 |

Figure 4: Modular combination performance \(mean hourly profit over three runs\) on DeliveryBench using GPT\-5 mini\. Rows are reasoning methods and columns are memory modules\.

Abbreviations used throughout the paper: R=ReAct, CDB=ChatDB, SMem=SimpleMem, MB=MemoryBank, DC=DynamicCheatsheet, OC=OpenClaw, P&S=Plan\-and\-Solve, SR=SelfRefine, Rfx=Reflexion\.

| Combination | w/o refine | w/ refine |
| --- | --- | --- |
| P&S\+Base | 6\.2 | 18\.6 |
| CoT\+Base | 12\.5 | 19 |
| R\+Base | 8\.5 | 28 |
| CoT\+SMem | 7\.7 | 24\.2 |
| MAD\+MB | 29\.5 | 35\.4 |

Figure 5: Mean hourly profit of Self\-Refine on selected modular combinations \(DeliveryBench, GPT\-5 mini\)\.

##### Experimental Setup\.

We further analyze module interactions on DeliveryBench\. Since module effects in long\-horizon embodied tasks are rarely independent, we compare reasoning, memory, and reflection combinations under a unified setting to identify helpful pairings and harmful mismatches\.

##### Planning benefits from abstract memory\.

As shown in Figure 4, ReAct and Plan\-and\-Solve gain only modestly from low\-level memories such as Base and ChatDB, which mainly return raw observations or trajectory fragments\. They improve substantially with abstract memories such as SimpleMem, DynamicCheatsheet, and MemoryBank, suggesting that planning\-oriented reasoning benefits more from distilled strategies than exact historical states\. Raw observations can support local control, but summarized memory better matches long\-horizon plan\-then\-act reasoning \(see Appendix [E\.1\.2](https://arxiv.org/html/2606.14674#A5.SS1.SSS2) and Appendix [E\.1\.3](https://arxiv.org/html/2606.14674#A5.SS1.SSS3)\)\.

##### Multi\-granularity memory is the safest default\.

MemoryBank performs strongest across reasoning methods because it combines raw trajectories, experience summaries, and higher\-level environmental insights\. This lets each reasoning strategy use memory at the granularity it needs, making the module robust to different downstream reasoning styles \(see Appendix [E\.1\.4](https://arxiv.org/html/2606.14674#A5.SS1.SSS4) and Appendix [E\.1\.5](https://arxiv.org/html/2606.14674#A5.SS1.SSS5)\)\.
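The interaction pattern can be checked directly from the Figure 4 values (DeliveryBench, GPT-5 mini). The Python sketch below is a reading aid rather than part of the paper's analysis: the row and column assignment is inferred from the flattened figure and cross-checked against the ReAct and Plan-and-Solve numbers quoted in Section 4.2, and the helper names `best_memory` and `gain_over_base` are hypothetical.

```python
# Mean hourly profit from Figure 4; columns follow MEMORIES, rows are reasoners.
MEMORIES = ["Base", "SMem", "DC", "CDB", "MB"]
PROFIT = {
    "R":   [8.54, 15.16, 16.46, 9.35, 30.67],
    "CoT": [12.48, 7.69, 13.19, 12.73, 13.37],
    "P&S": [6.18, 17.43, 21.91, 11.03, 26.78],
    "MAD": [22.26, 24.72, 16.63, 18.47, 29.46],
}


def best_memory(reasoner: str) -> str:
    """Memory module with the highest profit for the given reasoner."""
    row = PROFIT[reasoner]
    return MEMORIES[row.index(max(row))]


def gain_over_base(reasoner: str, memory: str) -> float:
    """Profit change when the Base memory is replaced by `memory`."""
    row = PROFIT[reasoner]
    return round(row[MEMORIES.index(memory)] - row[0], 2)


best = {r: best_memory(r) for r in PROFIT}
mb_gain = {r: gain_over_base(r, "MB") for r in PROFIT}

for reasoner in PROFIT:
    print(f"{reasoner:>3}: best={best[reasoner]}, MB gain over Base={mb_gain[reasoner]}")
```

MemoryBank is the best column in every row, but the size of its gain over Base varies widely: roughly 22 for ReAct and 21 for Plan-and-Solve, about 7 for MAD, and under 1 for CoT. The same memory module therefore has very different value depending on the reasoner that consumes it, which is the interaction effect the section describes.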

##### Multi\-agent reasoning tolerates weaker memory\.

Across memory settings, multi\-agent reasoning remains effective even with low\-level memory, showing lower sensitivity to memory quality than single\-pass reasoning\. Its robustness likely comes from error correction and complementary perspectives, which help maintain decision quality when retrieved memory is coarse or partially relevant \(see Appendix [E\.1\.6](https://arxiv.org/html/2606.14674#A5.SS1.SSS6)\)\.

##### Reflection is a general\-purpose correction layer\.

As Figure [5](https://arxiv.org/html/2606.14674#S4.F5) shows, reflection yields large gains for weak reasoning–memory pairs and smaller but consistent gains for strong ones\. This suggests that reflection is broadly compatible: by re\-evaluating candidate actions against state feedback and recent failures, it reduces local execution errors before they accumulate over long horizons \(see Appendix [E\.1\.7](https://arxiv.org/html/2606.14674#A5.SS1.SSS7)\)\.
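The re\-evaluation step described above can be pictured with a minimal sketch\. It is an assumption\-laden toy, not the reflection module used in the paper: the function `reflect_and_select`, the `is_valid` callback standing in for state feedback, and the action strings are all hypothetical\.

```python
from typing import Callable, Iterable, Optional


def reflect_and_select(
    candidates: Iterable[str],
    is_valid: Callable[[str], bool],
    recent_failures: set[str],
) -> Optional[str]:
    """Return the first candidate that did not fail recently and passes the state check."""
    for action in candidates:
        if action in recent_failures:
            continue  # skip actions that already failed in recent steps
        if is_valid(action):
            return action
    return None  # no acceptable candidate; the caller may replan


# Toy usage with invented action strings.
valid_actions = {"inspect_bag", "move_to(store)"}
chosen = reflect_and_select(
    candidates=["place(item, bag)", "inspect_bag", "move_to(store)"],
    is_valid=lambda a: a in valid_actions,
    recent_failures={"place(item, bag)"},
)
```

The point of the sketch is only that the filter sits between proposal and execution, so it can be attached to any reasoning and memory pair without changing either\.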

## 5 Analysis

### 5\.1 Efficiency–Performance Trade\-off

![Refer to caption](https://arxiv.org/html/2606.14674v1/x3.png) Figure 6: Pareto frontier of Qwen3\.5\-9B across different module combinations, showing the trade\-off between \(a\) Profit vs\. Token Usage and \(b\) Profit vs\. Thinking Time \(log\-scale axis\)\.

##### Motivation\.

Performance alone does not fully characterize embodied agents, since stronger module combinations may introduce substantial inference overhead over long\-horizon interactions\. We therefore evaluate each configuration by both effectiveness and efficiency, using total token usage and average thinking time per step as the main efficiency metrics\.

##### More computation is not always better\.

As shown in Figure [6](https://arxiv.org/html/2606.14674#S5.F6)a, configurations with similar token budgets can achieve very different profits, while some higher\-cost methods offer limited additional gains\. Several ReAct\-based variants lie on the Pareto frontier, suggesting that strong performance can often come from better module composition rather than simply larger inference budgets\. Thus, token efficiency depends not only on reasoning strength, but also on whether reasoning and memory are well matched\.

##### Latency depends on alignment, not just deliberation\.

A similar pattern appears in Figure [6](https://arxiv.org/html/2606.14674#S5.F6)b\. Higher\-profit configurations do not always require proportionally longer thinking time: some achieve strong returns with moderate latency, whereas others spend more time without comparable gains\. This shows that useful computation is task\- and module\-aligned computation, not merely more deliberation\.
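For readers who want to reproduce this kind of analysis on their own runs, a Pareto frontier over \(cost, profit\) pairs can be computed as in the sketch below\. The function name `pareto_frontier`, the configuration names, and all numbers are invented for illustration; they are not values from Figure 6\. Cost may be token usage or thinking time\.

```python
def pareto_frontier(configs: list[tuple[str, float, float]]) -> list[str]:
    """Return names of configurations not dominated in (cost, profit).

    Lower cost and higher profit are better. Among exact ties in both
    values, only the first is kept.
    """
    frontier: list[str] = []
    best_profit = float("-inf")
    # Sort by ascending cost; for equal cost, put the higher profit first.
    for name, cost, profit in sorted(configs, key=lambda c: (c[1], -c[2])):
        if profit > best_profit:
            # Nothing cheaper (or equally cheap) has reached this profit.
            frontier.append(name)
            best_profit = profit
    return frontier


# Invented (name, cost, profit) triples; not measurements from the paper.
example_configs = [
    ("config_a", 100.0, 2.0),
    ("config_b", 150.0, 5.0),
    ("config_c", 200.0, 4.0),
    ("config_d", 400.0, 5.5),
    ("config_e", 120.0, 1.0),
]
frontier_names = pareto_frontier(example_configs)
```

In this toy data `config_c` costs more than `config_b` yet earns less, so it is dominated, which mirrors the observation that extra computation does not guarantee extra profit\.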

### 5\.2 Case Study and Error Analysis

![Refer to caption](https://arxiv.org/html/2606.14674v1/figures/error_analysis.png) Figure 7: Overall failure taxonomy across benchmarks\.

![Refer to caption](https://arxiv.org/html/2606.14674v1/x4.png) Figure 8: An ALFRED failure case where the agent fails to align recent history with the current observation, leading to a repeated\-action loop\.

![Refer to caption](https://arxiv.org/html/2606.14674v1/x5.png) Figure 9: A DeliveryBench recovery case where Self\-Refine turns repeated bagging failures into an inspect\-and\-repair process\.

##### Many failures come from losing the task state\.

Figure [7](https://arxiv.org/html/2606.14674#S5.F7) shows that failures across benchmarks often involve history misunderstanding: the agent fails to track the current state from past interactions, mistakes completed progress for the current objective, and falls into repetitive loops\. Beyond this shared pattern, RoboTHOR and ALFRED failures concentrate more on visual grounding, while MiniGrid and DeliveryBench errors more often reflect poor adaptation to environment rules, action constraints, and long\-horizon strategy, as detailed in Table [5](https://arxiv.org/html/2606.14674#A5.T5)\.

##### Memory tracks progress; reflection repairs execution\.

Figure [8](https://arxiv.org/html/2606.14674#S5.F8) illustrates a representative ALFRED failure caused by state confusion\. Although one pillow has already been placed, the agent treats the visible pillow as the next target and repeatedly alternates between picking and placing actions, forming a loop\. This suggests that lightweight episodic memory may not reliably distinguish completed progress from visually similar current observations\. DeliveryBench exposes a different challenge: its action interface requires precise parameterization, and agents often fail by issuing invalid or over\-compressed commands\. Figure [9](https://arxiv.org/html/2606.14674#S5.F9) shows that Self\-Refine mitigates this by redirecting the agent to inspect the bag state, recover missing information, and resume with valid actions\. Together, these cases show that memory is crucial for long\-horizon progress tracking, while reflection helps diagnose invalid assumptions and repair failed action plans\.

### 5\.3 Learning with Modular Scaffolds

| Method | Non\-RL | GRPO | SUPO |
| --- | --- | --- | --- |
| Base | \-3\.07 | 5\.80 | 5\.48 |
| MAD\+Base | \-3\.20 | 7\.87 | 6\.56 |
| ReAct\+DynamicCheatsheet | \-2\.89 | 5\.02 | 8\.27 |
| ReAct\+MemoryBank | 2\.90 | 4\.03 | 7\.07 |
| ReAct\+OpenClaw | 3\.36 | 4\.79 | 6\.57 |
| ReAct\+Base | 0\.00 | 5\.62 | 5\.83 |

Table 1: Effect of inference\-time modules on different policy backbones\. SUPO shows stronger compatibility with modular agent components, likely because its explicit trajectory\-summary training exposes the policy to structured history during RL\.

##### Experimental Setup\.

We instantiate the learning module with GRPO \(Shao et al\.,[2024](https://arxiv.org/html/2606.14674#bib.bib75)\) on Qwen3\-4B\-Instruct\-2502 and train on DeliveryBench using *earning delta* as the reward $\rho_t$\. Training uses up to 40 interaction turns per episode, 4 rollouts, and a batch size of 32\. SUPO \(Lu et al\.,[2025](https://arxiv.org/html/2606.14674#bib.bib60)\) follows the same environment, reward, and rollout setup, but augments policy learning with trajectory summarization: every 8 actions, the model summarizes the interaction history and conditions subsequent decisions on the generated summary\. We then evaluate each learned policy under the same downstream inference\-time modules\. This design lets us separate two questions: whether RL improves the underlying policy in isolation, and whether the learned policy remains compatible with modular agent scaffolds such as reasoning and memory\.
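The summarization schedule described above can be sketched as a rollout loop\. This is a simplified illustration under stated assumptions, not the SUPO implementation: `run_episode` and the three callbacks are hypothetical names, the stubs replace the model and environment, and dropping raw steps after each summary is an assumption of this sketch\. Only the 40\-turn cap and the 8\-action interval come from the setup\.

```python
from typing import Callable

Step = tuple[str, str]  # (action, observation)


def run_episode(
    policy: Callable[[str, list[Step]], str],
    env_step: Callable[[str], tuple[str, bool]],
    summarize: Callable[[str, list[Step]], str],
    max_turns: int = 40,      # interaction-turn cap from the setup
    summary_every: int = 8,   # summarization interval from the setup
) -> tuple[str, list[Step]]:
    """Roll out one episode, periodically folding recent steps into a summary."""
    summary = ""
    recent: list[Step] = []
    for turn in range(1, max_turns + 1):
        # The policy conditions on the running summary plus unsummarized steps.
        action = policy(summary, recent)
        observation, done = env_step(action)
        recent.append((action, observation))
        if done:
            break
        if turn % summary_every == 0:
            summary = summarize(summary, recent)
            recent = []  # assumption: raw steps are dropped once summarized
    return summary, recent


# Stub components so the sketch runs without a model or an environment.
summary_calls: list[int] = []


def stub_policy(summary: str, recent: list[Step]) -> str:
    return "wait"


def stub_env_step(action: str) -> tuple[str, bool]:
    return "nothing happened", False


def stub_summarize(summary: str, recent: list[Step]) -> str:
    summary_calls.append(len(recent))  # record how many steps were folded in
    return f"{len(summary_calls)} summaries so far"


final_summary, leftover = run_episode(stub_policy, stub_env_step, stub_summarize)
```

With the stubs, a full 40\-turn episode triggers five summaries of eight steps each, so the policy never sees more than eight raw steps at once\. A policy trained this way has already learned to act on compressed history, which is the property the later results attribute to SUPO\.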

##### RL improves the bare policy, but does not automatically solve agent composition\.

Table [1](https://arxiv.org/html/2606.14674#S5.T1) shows that RL substantially improves the unscaffolded or lightly scaffolded policy\. For the Base configuration, performance increases from \-3\.07 without RL to 5\.80 with GRPO and 5\.48 with SUPO\. Similar gains appear for ReAct\+Base, where the non\-RL policy obtains 0\.00, while GRPO and SUPO reach 5\.62 and 5\.83, respectively\. This indicates that reward optimization teaches the model useful DeliveryBench\-specific behaviors, such as respecting action constraints, selecting profitable orders, and avoiding locally invalid decisions\. However, these improvements mainly reflect a stronger underlying decision policy\. They do not imply that RL\-trained policies can automatically exploit richer agent frameworks once reasoning or memory modules are attached at inference time\.

##### Post\-hoc scaffolding can be misaligned with standard RL\.

The GRPO results reveal a mismatch between policy learning and inference\-time scaffolding\. Although GRPO improves the Base policy, adding structured memory on top of the GRPO\-trained model does not produce further gains\. For example, ReAct\+DynamicCheatsheet, ReAct\+MemoryBank, and ReAct\+OpenClaw obtain 5\.02, 4\.03, and 4\.79, all below the simpler GRPO Base \(5\.80\) and ReAct\+Base \(5\.62\) configurations\. This suggests that a policy optimized under the original prompt–observation interface may learn to rely on its training\-time input format and action habits\. When external memory later changes the context distribution, the learned policy may not know how to use the additional structured information effectively\. In this sense, stronger RL on the “bare” LLM does not necessarily translate into stronger RL for the full agent\.

##### Summary\-based learning improves scaffold compatibility\.

In contrast, SUPO composes more favorably with memory\-centric scaffolds\. Under the same inference\-time modules, SUPO outperforms GRPO on ReAct\+DynamicCheatsheet \(8\.27 vs\. 5\.02\), ReAct\+MemoryBank \(7\.07 vs\. 4\.03\), and ReAct\+OpenClaw \(6\.57 vs\. 4\.79\)\. The best overall result in Table [1](https://arxiv.org/html/2606.14674#S5.T1) is not the strongest bare RL policy, but SUPO combined with DynamicCheatsheet\. This pattern suggests that trajectory\-summary training exposes the policy to a compressed, structured form of history during optimization, making it more receptive to structured memory at test time\. Rather than treating memory as an external add\-on, SUPO partially aligns the learned policy with the kind of context that modular agents provide\.
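The comparison can be checked with a few lines of arithmetic over the Table 1 values\. The numbers below are copied from Table 1; the variable names are only for this sketch\.

```python
# (Non-RL, GRPO, SUPO) scores per configuration, copied from Table 1.
table1 = {
    "Base": (-3.07, 5.80, 5.48),
    "MAD+Base": (-3.20, 7.87, 6.56),
    "ReAct+DynamicCheatsheet": (-2.89, 5.02, 8.27),
    "ReAct+MemoryBank": (2.90, 4.03, 7.07),
    "ReAct+OpenClaw": (3.36, 4.79, 6.57),
    "ReAct+Base": (0.00, 5.62, 5.83),
}

# SUPO minus GRPO for each configuration (positive means SUPO is ahead).
supo_minus_grpo = {
    method: round(supo - grpo, 2) for method, (_, grpo, supo) in table1.items()
}

# Best single cell across all methods and training regimes.
regimes = ("Non-RL", "GRPO", "SUPO")
best_score, best_method, best_regime = max(
    (score, method, regime)
    for method, scores in table1.items()
    for regime, score in zip(regimes, scores)
)

for method, gap in supo_minus_grpo.items():
    print(f"{method:<26} SUPO - GRPO = {gap:+.2f}")
print(f"best: {best_method} with {best_regime} ({best_score:.2f})")
```

The gaps are largest for the memory\-centric scaffolds \(\+3\.25, \+3\.04, and \+1\.78\), close to zero for ReAct\+Base \(\+0\.21\), and negative for Base \(\-0\.32\) and MAD\+Base \(\-1\.31\), so the SUPO advantage is concentrated where structured memory is attached at inference time\.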

##### Policies should be optimized with their deployment\-time scaffolds\.

These results highlight a distinction between *RL for LLM policies* and *RL for agent frameworks*\. Training a bare policy can improve local action selection, but modular agents also require interface compatibility: the policy must learn how to interpret retrieved memory, summarized history, reasoning traces, and reflection feedback\. If these signals appear only at inference time, they may act as distribution shifts rather than useful scaffolds\. More broadly, reasoning, memory, and reflection should be incorporated into policy optimization instead of being attached only after training\. SUPO provides a lightweight step in this direction by training with summary\-based context, but the larger implication is that future agent RL should optimize the policy and its modular scaffold jointly\.

## 6 Conclusion

We introduced AgentSpec, a modular specification framework for studying LLM\-based embodied agents as typed compositions of perception, memory, reasoning, reflection, action execution, and optional learning components\. AgentSpec exposes the scaffold as a controlled design space in which modules can be swapped, recombined, and analyzed under shared interfaces rather than treating an agent scaffold as a fixed end\-to\-end pipeline\. Across four embodied benchmarks and multiple model backbones, our experiments show that agent performance is shaped by compatibility among modules, not only by the isolated strength of individual components\. Structured multi\-granularity memory improves long\-horizon state tracking, reasoning and memory interact in environment\-dependent ways, reflection is most useful when it repairs local execution errors, and RL\-trained policies compose best when their training\-time context resembles the scaffolded context used at deployment time\. These findings suggest that future LLM agent research should optimize not only the base policy, but also the interfaces and scaffold structures through which the policy perceives, remembers, reasons, and acts\.

## References

- Ahn et al\. \(2022\)Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, and 1 others\. 2022\.Do as i can, not as i say: Grounding language in robotic affordances\.*arXiv preprint arXiv:2204\.01691*\.
- Anokhin et al\. \(2024\)Petr Anokhin, Nikita Semenov, Artyom Sorokin, Dmitry Evseev, Andrey Kravchenko, Mikhail Burtsev, and Evgeny Burnaev\. 2024\.Arigraph: Learning knowledge graph world models with episodic memory for llm agents\.*arXiv preprint arXiv:2407\.04363*\.
- Aytes et al\. \(2025\)Simon A Aytes, Jinheon Baek, and Sung Ju Hwang\. 2025\.Sketch\-of\-thought: Efficient llm reasoning with adaptive cognitive\-inspired sketching\.In*Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pages 24307–24331\.
- Banerjee et al\. \(2025\)Debangshu Banerjee, Tarun Suresh, Shubham Ugare, Sasa Misailovic, and Gagandeep Singh\. 2025\.Crane: Reasoning with constrained llm generation\.*arXiv preprint arXiv:2502\.09061*\.
- Bini et al\. \(2025\)Massimo Bini, Ondrej Bohdal, Umberto Michieli, Zeynep Akata, Mete Ozay, and Taha Ceritli\. 2025\.Memlora: Distilling expert adapters for on\-device memory systems\.*arXiv preprint arXiv:2512\.04763*\.
- Cao et al\. \(2025a\)Jiaqi Cao, Jiarui Wang, Rubin Wei, Qipeng Guo, Kai Chen, Bowen Zhou, and Zhouhan Lin\. 2025a\.Memory decoder: A pretrained, plug\-and\-play memory for large language models\.*arXiv preprint arXiv:2508\.09874*\.
- Cao et al\. \(2025b\)Zouying Cao, Jiaji Deng, Li Yu, Weikang Zhou, Zhaoyang Liu, Bolin Ding, and Hai Zhao\. 2025b\.Remember me, refine me: A dynamic procedural memory framework for experience\-driven agent evolution\.*arXiv preprint arXiv:2512\.10696*\.
- Chatterjee and Agarwal \(2025\)Maitreyi Chatterjee and Devansh Agarwal\. 2025\.Semantic anchoring in agentic memory: Leveraging linguistic structures for persistent conversational context\.*arXiv preprint arXiv:2508\.12630*\.
- Chen et al\. \(2025a\)Weishu Chen, Jinyi Tang, Zhouhui Hou, Shihao Han, Mingjie Zhan, Zhiyuan Huang, Delong Liu, Jiawei Guo, Zhicheng Zhao, and Fei Su\. 2025a\.Moom: Maintenance, organization and optimization of memory in ultra\-long role\-playing dialogues\.*arXiv preprint arXiv:2509\.11860*\.
- Chen et al\. \(2023\)Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi\-Min Chan, Heyang Yu, Yaxi Lu, Yi\-Hsin Hung, Chen Qian, and 1 others\. 2023\.Agentverse: Facilitating multi\-agent collaboration and exploring emergent behaviors\.In*The Twelfth International Conference on Learning Representations*\.
- Chen et al\. \(2025b\)Zhaorun Chen, Zhuokai Zhao, Kai Zhang, Bo Liu, Qi Qi, Yifan Wu, Tarun Kalluri, Sara Cao, Yuanhao Xiong, Haibo Tong, and 1 others\. 2025b\.Scaling agent learning via experience synthesis\.*arXiv preprint arXiv:2511\.03773*\.
- Chevalier\-Boisvert et al\. \(2023\)Maxime Chevalier\-Boisvert, Bolun Dai, Mark Towers, Rodrigo Perez\-Vicente, Lucas Willems, Salem Lahlou, Suman Pal, Pablo Samuel Castro, and J K Terry\. 2023\.Minigrid & miniworld: Modular & customizable reinforcement learning environments for goal\-oriented tasks\.*Advances in Neural Information Processing Systems*, 36:73383–73394\.
- Chhikara et al\. \(2025\)Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav\. 2025\.Mem0: Building production\-ready ai agents with scalable long\-term memory\.*arXiv preprint arXiv:2504\.19413*\.
- Deitke et al\. \(2020\)Matt Deitke, Winson Han, Alvaro Herrasti, Aniruddha Kembhavi, Eric Kolve, Roozbeh Mottaghi, Jordi Salvador, Dustin Schwenk, Eli VanderBilt, Matthew Wallingford, and 1 others\. 2020\.Robothor: An open simulation\-to\-real embodied ai platform\.In*Proceedings of the IEEE/CVF conference on computer vision and pattern recognition*, pages 3164–3174\.
- Du et al\. \(2024\)Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch\. 2024\.Improving factuality and reasoning in language models through multiagent debate\.In*Forty\-first international conference on machine learning*\.
- Fang et al\. \(2025a\)Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, and 1 others\. 2025a\.Lightmem: Lightweight and efficient memory\-augmented generation\.*arXiv preprint arXiv:2510\.18866*\.
- Fang et al\. \(2025b\)Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, and Ningyu Zhang\. 2025b\.Memp: Exploring agent procedural memory\.*arXiv preprint arXiv:2508\.06433*\.
- Forouzandeh et al\. \(2025\)Saman Forouzandeh, Wei Peng, Parham Moradi, Xinghuo Yu, and Mahdi Jalili\. 2025\.Learning hierarchical procedural memory for llm agents through bayesian selection and contrastive refinement\.*arXiv preprint arXiv:2512\.18950*\.
- Gao et al\. \(2024a\)Kuofeng Gao, Huanqia Cai, Qingyao Shuai, Dihong Gong, and Zhifeng Li\. 2024a\.Embedding self\-correction as an inherent ability in large language models for enhanced mathematical reasoning\.*arXiv preprint arXiv:2410\.10735*\.
- Gao et al\. \(2024b\)Zitian Gao, Boye Niu, Xuzheng He, Haotian Xu, Hongzhang Liu, Aiwei Liu, Xuming Hu, and Lijie Wen\. 2024b\.Interpretable contrastive monte carlo tree search reasoning\.*arXiv preprint arXiv:2410\.01707*\.
- Guo et al\. \(2025\)Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, and 1 others\. 2025\.Deepseek\-r1: Incentivizing reasoning capability in llms via reinforcement learning\.*arXiv preprint arXiv:2501\.12948*\.
- Han et al\. \(2025\)Dongge Han, Camille Couturier, Daniel Madrigal Diaz, Xuchao Zhang, Victor Rühle, and Saravan Rajmohan\. 2025\.Legomem: Modular procedural memory for multi\-agent llm systems for workflow automation\.*arXiv preprint arXiv:2510\.04851*\.
- Hao et al\. \(2023\)Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, and Zhiting Hu\. 2023\.Reasoning with language model is planning with world model\.In*Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 8154–8173\.
- Hao et al\. \(2024\)Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian\. 2024\.Training large language models to reason in a continuous latent space\.*arXiv preprint arXiv:2412\.06769*\.
- Havrilla et al\. \(2024\)Alex Havrilla, Sharath Raparthy, Christoforus Nalmpantis, Jane Dwivedi\-Yu, Maksym Zhuravinskyi, Eric Hambro, and Roberta Raileanu\. 2024\.Glore: When, where, and how to improve llm reasoning via global and local refinements\.*arXiv preprint arXiv:2402\.10963*\.
- Ho et al\. \(2025\)Matthew Ho, Chen Si, Zhaoxiang Feng, Fangxu Yu, Yichi Yang, Zhijian Liu, Zhiting Hu, and Lianhui Qin\. 2025\.Arcmemo: Abstract reasoning composition with lifelong llm memory\.*arXiv preprint arXiv:2509\.04439*\.
- Hong et al\. \(2023\)Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, and 1 others\. 2023\.Metagpt: Meta programming for a multi\-agent collaborative framework\.In*The twelfth international conference on learning representations*\.
- Hong et al\. \(2026\)Yining Hong, Huang Huang, Manling Li, Li Fei\-Fei, Jiajun Wu, and Yejin Choi\. 2026\.Learning from trials and errors: Reflective test\-time planning for embodied llms\.*arXiv preprint arXiv:2602\.21198*\.
- Hu et al\. \(2023\)Chenxu Hu, Jie Fu, Chenzhuang Du, Simian Luo, Junbo Zhao, and Hang Zhao\. 2023\.Chatdb: Augmenting llms with databases as their symbolic memory\.*arXiv preprint arXiv:2306\.03901*\.
- Hu et al\. \(2026a\)Chuanrui Hu, Xingze Gao, Zuyi Zhou, Dannong Xu, Yi Bai, Xintong Li, Hui Zhang, Tong Li, Chong Zhang, Lidong Bing, and 1 others\. 2026a\.Evermemos: A self\-organizing memory operating system for structured long\-horizon reasoning\.*arXiv preprint arXiv:2601\.02163*\.
- Hu et al\. \(2025\)Mengkang Hu, Tianxing Chen, Qiguang Chen, Yao Mu, Wenqi Shao, and Ping Luo\. 2025\.Hiagent: Hierarchical working memory management for solving long\-horizon agent tasks with large language model\.In*Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 32779–32798\.
- Hu et al\. \(2024\)Shengran Hu, Cong Lu, and Jeff Clune\. 2024\.Automated design of agentic systems\.*arXiv preprint arXiv:2408\.08435*\.
- Hu et al\. \(2026b\)Yuyang Hu, Jiongnan Liu, Jiejun Tan, Yutao Zhu, and Zhicheng Dou\. 2026b\.Memory matters more: Event\-centric memory as a logic map for agent searching and reasoning\.*arXiv preprint arXiv:2601\.04726*\.
- Huang et al\. \(2022\)Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, and 1 others\. 2022\.Inner monologue: Embodied reasoning through planning with language models\.*arXiv preprint arXiv:2207\.05608*\.
- Huang et al\. \(2025\)Xu Huang, Junwu Chen, Yuxing Fei, Zhuohan Li, Philippe Schwaller, and Gerbrand Ceder\. 2025\.Cascade: Cumulative agentic skill creation through autonomous development and evolution\.*arXiv preprint arXiv:2512\.23880*\.
- Jiang et al\. \(2026\)Dongming Jiang, Yi Li, Guanpeng Li, and Bingzhe Li\. 2026\.Magma: A multi\-graph based agentic memory architecture for ai agents\.*arXiv preprint arXiv:2601\.03236*\.
- Kang et al\. \(2025\)Haoqiang Kang, Enna Sachdeva, Piyush Gupta, Sangjae Bae, and Kwonjoon Lee\. 2025\.Gflowvlm: Enhancing multi\-step reasoning in vision\-language models with generative flow networks\.In*Proceedings of the Computer Vision and Pattern Recognition Conference*, pages 3815–3825\.
- Kang et al\. \(2026\)Haoqiang Kang, Yizhe Zhang, Nikki Lijing Kuang, Yi\-An Ma, and Lianhui Qin\. 2026\.Beyond mode elicitation: Diversity\-preserving reinforcement learning via latent diffusion reasoner\.*arXiv preprint arXiv:2602\.01705*\.
- Kazemi et al\. \(2023\)Mehran Kazemi, Najoung Kim, Deepti Bhatia, Xin Xu, and Deepak Ramachandran\. 2023\.Lambada: Backward chaining for automated reasoning in natural language\.In*Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 6547–6568\.
- Kojima et al\. \(2022\)Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa\. 2022\.Large language models are zero\-shot reasoners\.*Advances in neural information processing systems*, 35:22199–22213\.
- Kumar et al\. \(2024\)Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co\-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, and 1 others\. 2024\.Training language models to self\-correct via reinforcement learning\.*arXiv preprint arXiv:2409\.12917*\.
- Lanchantin et al\. \(2023\)Jack Lanchantin, Shubham Toshniwal, Jason Weston, Sainbayar Sukhbaatar, and 1 others\. 2023\.Learning to reason and memorize with self\-notes\.*Advances in Neural Information Processing Systems*, 36:11891–11911\.
- LangChain AI \(2024\)LangChain AI\. 2024\.LangMem: A toolkit for agent memory management\.[https://langchain\-ai\.github\.io/langmem/](https://langchain-ai.github.io/langmem/)\.
- Latimer et al\. \(2025\)Chris Latimer, Nicoló Boschi, Andrew Neeser, Chris Bartholomew, Gaurav Srivastava, Xuan Wang, and Naren Ramakrishnan\. 2025\.Hindsight is 20/20: Building agent memory that retains, recalls, and reflects\.*arXiv preprint arXiv:2512\.12818*\.
- Li et al\. \(2023\)Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem\. 2023\.Camel: Communicative agents for "mind" exploration of large language model society\.*Advances in neural information processing systems*, 36:51991–52008\.
- Li et al\. \(2024a\)Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Li E Li, Ruohan Zhang, and 1 others\. 2024a\.Embodied agent interface: Benchmarking llms for embodied decision making\.*Advances in Neural Information Processing Systems*, 37:100428–100534\.
- Li et al\. \(2025a\)Rui Li, Zeyu Zhang, Xiaohe Bo, Zihang Tian, Xu Chen, Quanyu Dai, Zhenhua Dong, and Ruiming Tang\. 2025a\.Cam: A constructivist view of agentic memory for llm\-based reading comprehension\.*arXiv preprint arXiv:2510\.05520*\.
- Li et al\. \(2024b\)Shilong Li, Yancheng He, Hangyu Guo, Xingyuan Bu, Ge Bai, Jie Liu, Jiaheng Liu, Xingwei Qu, Yangguang Li, Wanli Ouyang, and 1 others\. 2024b\.Graphreader: Building graph\-based agent to enhance long\-context abilities of large language models\.In*Findings of the Association for Computational Linguistics: EMNLP 2024*, pages 12758–12786\.
- Li et al\. \(2026a\)Shuozhe Li, Vaishnav Tadiparthi, Kwonjoon Lee, Nakul Agarwal, Hossein Nourkhiz Mahjoub, Ehsan Moradi Pari, Lizhang Chen, Amy Zhang, and Liu Leqi\. 2026a\.Learning robust reasoning through guided adversarial self\-play\.*arXiv preprint arXiv:2602\.00173*\.
- Li et al\. \(2026b\)Yu Li, Lehui Li, Zhihao Wu, Qingmin Liao, Jianye Hao, Kun Shao, and Fengli Xu\. 2026b\.Agentswift: Efficient llm agent design via value\-guided hierarchical search\.In*Proceedings of the AAAI Conference on Artificial Intelligence*, volume 40, pages 31843–31851\.
- Li et al\. \(2025b\)Yuan Li, Lichao Sun, and Yixuan Zhang\. 2025b\.Metaagents: large language model based agents for decision\-making on teaming\.*Proceedings of the ACM on Human\-Computer Interaction*, 9\(2\):1–27\.
- Lightman et al\. \(2023\)Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe\. 2023\.Let’s verify step by step\.In*The twelfth international conference on learning representations*\.
- Lin et al\. \(2025\)Jiaye Lin, Yifu Guo, Yuzhen Han, and 1 others\. 2025\.SE\-Agent: Self\-evolution trajectory optimization in multi\-step reasoning with LLM\-based agents\.*arXiv preprint arXiv:2508\.02085*\.
- Liu et al\. \(2026\)Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, and Huaxiu Yao\. 2026\.Simplemem: Efficient lifelong memory for LLM agents\.*arXiv preprint arXiv:2601\.02553*\.
- Liu et al\. \(2025a\)Junming Liu, Yifei Sun, Weihua Cheng, Haodong Lei, Yirong Chen, Licheng Wen, Xuemeng Yang, Daocheng Fu, Pinlong Cai, Nianchen Deng, and 1 others\. 2025a\.Memverse: Multimodal memory for lifelong learning agents\.*arXiv preprint arXiv:2512\.03627*\.
- Liu et al\. \(2023\)Lei Liu, Xiaoyan Yang, Yue Shen, Binbin Hu, Zhiqiang Zhang, Jinjie Gu, and Guannan Zhang\. 2023\.Think\-in\-memory: Recalling and post\-thinking enable llms with long\-term memory\.*arXiv preprint arXiv:2311\.08719*\.
- Liu et al\. \(2022\)Ruibo Liu, Jason Wei, Shixiang Shane Gu, Te\-Yen Wu, Soroush Vosoughi, Claire Cui, Denny Zhou, and Andrew M Dai\. 2022\.Mind’s eye: Grounded language model reasoning through simulation\.*arXiv preprint arXiv:2210\.05359*\.
- Liu et al\. \(2024\)Weijie Liu, Zecheng Tang, Juntao Li, Kehai Chen, and Min Zhang\. 2024\.Memlong: Memory\-augmented retrieval for long text modeling\.*arXiv preprint arXiv:2408\.16967*\.
- Liu et al\. \(2025b\)Yitao Liu, Chenglei Si, Karthik R Narasimhan, and Shunyu Yao\. 2025b\.Contextual experience replay for self\-improvement of language agents\.In*Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 14179–14198\.
- Lu et al\. \(2025\)Miao Lu, Weiwei Sun, Weihua Du, Zhan Ling, Xuesong Yao, Kang Liu, and Jiecao Chen\. 2025\.Scaling llm multi\-turn rl with end\-to\-end summarization\-based context management\.*arXiv preprint arXiv:2510\.06727*\.
- Madaan et al\. \(2023\)Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, and 1 others\. 2023\.Self\-refine: Iterative refinement with self\-feedback\.*Advances in Neural Information Processing Systems*, 36\.
- Mao et al\. \(2025\)Lingjun Mao, Jiawei Ren, Kun Zhou, Jixuan Chen, Ziqiao Ma, and Lianhui Qin\. 2025\.Deliverybench: Can agents earn profit in real world?*arXiv preprint arXiv:2512\.19234*\.
- Modarressi et al\. \(2023\)Ali Modarressi, Ayyoob Imani, Mohsen Fayyaz, and Hinrich Schütze\. 2023\.Ret\-llm: Towards a general read\-write memory for large language models\.*arXiv preprint arXiv:2305\.14322*\.
- Motwani et al\. \(2024\)Sumeet Ramesh Motwani, Chandler Smith, Rocktim Jyoti Das, Rafael Rafailov, Ivan Laptev, Philip HS Torr, Fabio Pizzati, Ronald Clark, and Christian Schroeder de Witt\. 2024\.Malt: Improving reasoning with multi\-agent llm training\.*arXiv preprint arXiv:2412\.01928*\.
- Nan et al\. \(2025\)Jiayan Nan, Wenquan Ma, Wenlong Wu, and Yize Chen\. 2025\.Nemori: Self\-organizing agent memory inspired by cognitive science\.*arXiv preprint arXiv:2508\.03341*\.
- Ning et al\. \(2023\)Xuefei Ning, Zinan Lin, Zixuan Zhou, Zifu Wang, Huazhong Yang, and Yu Wang\. 2023\.Skeleton\-of\-thought: Prompting llms for efficient parallel generation\.*arXiv preprint arXiv:2307\.15337*\.
- OpenClaw \(2026\)OpenClaw\. 2026\.Openclaw\.[https://github\.com/openclaw/openclaw](https://github.com/openclaw/openclaw)\.
- Ouyang et al\. \(2025\)Siru Ouyang, Jun Yan, I Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T Le, Samira Daruki, Xiangru Tang, and 1 others\. 2025\.Reasoningbank: Scaling agent self\-evolving with reasoning memory\.*arXiv preprint arXiv:2509\.25140*\.
- Packer et al\. \(2023\)Charles Packer, Vivian Fang, Shishir G Patil, Kevin Lin, Sarah Wooders, and Joseph E Gonzalez\. 2023\.Memgpt: towards llms as operating systems\.*arXiv preprint arXiv:2310\.08560*\.
- Park et al\. \(2023\)Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein\. 2023\.Generative agents: Interactive simulacra of human behavior\.In*Proceedings of the 36th annual acm symposium on user interface software and technology*, pages 1–22\.
- Paul et al\. \(2024\)Debjit Paul, Mete Ismayilzada, Maxime Peyrard, Beatriz Borges, Antoine Bosselut, Robert West, and Boi Faltings\. 2024\.Refiner: Reasoning feedback on intermediate representations\.In*Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 1100–1126\.
- Qian et al\. \(2025\)Hongjin Qian, Zheng Liu, Peitian Zhang, Kelong Mao, Defu Lian, Zhicheng Dou, and Tiejun Huang\. 2025\.Memorag: Boosting long context processing with global memory\-enhanced retrieval augmentation\.In*Proceedings of the ACM on Web Conference 2025*, pages 2366–2377\.
- Rasmussen et al\. \(2025\)Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef\. 2025\.Zep: A temporal knowledge graph architecture for agent memory\.*arXiv preprint arXiv:2501\.13956*\.
- Shang et al\. \(2024\)Yu Shang, Yu Li, Keyu Zhao, Likai Ma, Jiahe Liu, Fengli Xu, and Yong Li\. 2024\.Agentsquare: Automatic llm agent search in modular design space\.*arXiv preprint arXiv:2410\.06153*\.
- Shao et al\. \(2024\)Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y\. K\. Li, Y\. Wu, and Daya Guo\. 2024\.[Deepseekmath: Pushing the limits of mathematical reasoning in open language models](https://arxiv.org/abs/2402.03300)\.
- Shinn et al\. \(2023\)Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao\. 2023\.Reflexion: Language agents with verbal reinforcement learning\.*Advances in Neural Information Processing Systems*, 36\.
- Shridhar et al\. \(2020\) Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox\. 2020\. Alfred: A benchmark for interpreting grounded instructions for everyday tasks\. In *Proceedings of the IEEE/CVF conference on computer vision and pattern recognition*, pages 10740–10749\.
- Sprague et al\. \(2024\) Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, and Greg Durrett\. 2024\. To cot or not to cot? chain\-of\-thought helps mainly on math and symbolic reasoning\. *arXiv preprint arXiv:2409\.12183*\.
- Sumers et al\. \(2023\) Theodore Sumers, Shunyu Yao, Karthik R Narasimhan, and Thomas L Griffiths\. 2023\. Cognitive architectures for language agents\. *Transactions on Machine Learning Research*\.
- Suzgun et al\. \(2026\) Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, and James Zou\. 2026\. Dynamic cheatsheet: Test\-time learning with adaptive memory\. In *Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 7080–7106\.
- Tang et al\. \(2025\) Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, and 1 others\. 2025\. Agent kb: Leveraging cross\-domain experience for agentic problem solving\. *arXiv preprint arXiv:2507\.06229*\.
- Tian et al\. \(2025\) Ao Tian, Yunfeng Lu, Xinxin Fan, Changhao Wang, Lanzhi Zhou, Yeyao Zhang, and Yanfang Liu\. 2025\. Rgmem: Renormalization group\-based memory evolution for language agent user profile\. *arXiv preprint arXiv:2510\.16392*\.
- Wang et al\. \(2025a\) Bing Wang, Xinnian Liang, Jian Yang, Hui Huang, Zhenhe Wu, ShuangZhi Wu, Zejun Ma, and Zhoujun Li\. 2025a\. Scm: Enhancing large language model with self\-controlled memory framework\. In *International Conference on Database Systems for Advanced Applications*, pages 188–203\. Springer\.
- Wang et al\. \(2023a\) Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar\. 2023a\. Voyager: An open\-ended embodied agent with large language models\. *arXiv preprint arXiv:2305\.16291*\.
- Wang et al\. \(2025b\) Jiaan Wang, Fandong Meng, Yunlong Liang, and Jie Zhou\. 2025b\. Drt: Deep reasoning translation via long chain\-of\-thought\. In *Findings of the Association for Computational Linguistics: ACL 2025*, pages 6770–6782\.
- Wang et al\. \(2026\) Juyuan Wang, Rongchen Zhao, Wei Wei, Yufeng Wang, Mo Yu, Jie Zhou, Jin Xu, and Liyan Xu\. 2026\. Comorag: A cognitive\-inspired memory\-organized rag for stateful long narrative reasoning\. In *Proceedings of the AAAI Conference on Artificial Intelligence*, volume 40, pages 33557–33565\.
- Wang et al\. \(2023b\) Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka\-Wei Lee, and Ee\-Peng Lim\. 2023b\. Plan\-and\-solve prompting: Improving zero\-shot chain\-of\-thought reasoning by large language models\. *arXiv preprint arXiv:2305\.04091*\.
- Wang et al\. \(2025c\) Xiaoqiang Wang, Suyuchen Wang, Yun Zhu, and Bang Liu\. 2025c\. R3mem: Bridging memory retention and retrieval via reversible compression\. In *Findings of the Association for Computational Linguistics: ACL 2025*, pages 4541–4557\.
- Wang et al\. \(2022\) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou\. 2022\. Self\-consistency improves chain of thought reasoning in language models\. *arXiv preprint arXiv:2203\.11171*\.
- Wang and Chen \(2025\) Yu Wang and Xi Chen\. 2025\. Mirix: Multi\-agent memory system for llm\-based agents\. *arXiv preprint arXiv:2507\.07957*\.
- Wang et al\. \(2024\) Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig\. 2024\. Agent workflow memory\. *arXiv preprint arXiv:2409\.07429*\.
- Wei et al\. \(2022\) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou\. 2022\. Chain\-of\-thought prompting elicits reasoning in large language models\. *Advances in Neural Information Processing Systems*, 35:24824–24837\.
- Wei et al\. \(2025\) Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H Chi, and 1 others\. 2025\. Evo\-memory: Benchmarking llm agent test\-time learning with self\-evolving memory\. *arXiv preprint arXiv:2511\.20857*\.
- Welleck et al\. \(2022\) Sean Welleck, Ximing Lu, Peter West, Faeze Brahman, Tianxiao Shen, Daniel Khashabi, and Yejin Choi\. 2022\. Generating sequences by learning to self\-correct\. *arXiv preprint arXiv:2211\.00053*\.
- Wu et al\. \(2024\) Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, and 1 others\. 2024\. Autogen: Enabling next\-gen llm applications via multi\-agent conversations\. In *First conference on language modeling*\.
- Wu et al\. \(2025\) Rong Wu, Xiaoman Wang, Jianbiao Mei, Pinlong Cai, Daocheng Fu, Cheng Yang, Licheng Wen, Xuemeng Yang, Yufan Shen, Yuxin Wang, and 1 others\. 2025\. Evolver: Self\-evolving llm agents through an experience\-driven lifecycle\. *arXiv preprint arXiv:2510\.16079*\.
- Xi et al\. \(2025\) Zhiheng Xi, Yiwen Ding, Wenxiang Chen, Boyang Hong, Honglin Guo, Junzhe Wang, Xin Guo, Dingwen Yang, Chenyang Liao, Wei He, and 1 others\. 2025\. Agentgym: Evaluating and training large language model\-based agents across diverse environments\. In *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 27914–27961\.
- Xiang et al\. \(2025\) Violet Xiang, Charlie Snell, Kanishk Gandhi, Alon Albalak, Anikait Singh, Chase Blagden, Duy Phung, Rafael Rafailov, Nathan Lile, Dakota Mahan, and 1 others\. 2025\. Towards system 2 reasoning in llms: Learning how to think with meta chain\-of\-thought\. *arXiv preprint arXiv:2501\.04682*\.
- Xiao et al\. \(2024\) Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, and Maosong Sun\. 2024\. Infllm: Training\-free long\-context extrapolation for llms with an efficient context memory\. *Advances in neural information processing systems*, 37:119638–119661\.
- Xiao et al\. \(2025\) Yunzhong Xiao, Yangmin Li, Hewei Wang, Yunlong Tang, and Zora Zhiruo Wang\. 2025\. Toolmem: Enhancing multimodal agents with learnable tool capability memory\. *arXiv preprint arXiv:2510\.06664*\.
- Xiong et al\. \(2025a\) Siheng Xiong, Ali Payani, Yu’an Yang, and Faramarz Fekri\. 2025a\. Deliberate reasoning in language models as structure\-aware planning with an accurate world model\. In *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 31900–31931\.
- Xiong et al\. \(2025b\) Wei Xiong, Hanning Zhang, Chenlu Ye, Lichang Chen, Nan Jiang, and Tong Zhang\. 2025b\. Self\-rewarding correction for mathematical reasoning\. *arXiv preprint arXiv:2502\.19613*\.
- Xu et al\. \(2025\) Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang\. 2025\. A\-mem: Agentic memory for llm agents\. *arXiv preprint arXiv:2502\.12110*\.
- Yan et al\. \(2025\) Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Jinhe Bi, Kristian Kersting, Jeff Z Pan, and 1 others\. 2025\. Memory\-r1: Enhancing large language model agents to manage and utilize memories via reinforcement learning\. *arXiv preprint arXiv:2508\.19828*\.
- Yang et al\. \(2024a\) Hongkang Yang, Zehao Lin, Wenjin Wang, Hao Wu, Zhiyu Li, Bo Tang, Wenqiang Wei, Jinbo Wang, Zeyun Tang, Shichao Song, and 1 others\. 2024a\. Memory3: Language modeling with explicit memory\. *arXiv preprint arXiv:2407\.01178*\.
- Yang et al\. \(2024b\) Ling Yang, Zhaochen Yu, Tianjun Zhang, Shiyi Cao, Minkai Xu, Wentao Zhang, Joseph E\. Gonzalez, and Bin Cui\. 2024b\. Buffer of thoughts: Thought\-augmented reasoning with large language models\. In *Advances in Neural Information Processing Systems*, volume 37\.
- Yang et al\. \(2025a\) Sen Yang, Yafu Li, Wai Lam, and Yu Cheng\. 2025a\. Multi\-llm collaborative search for complex problem solving\. *arXiv preprint arXiv:2502\.18873*\.
- Yang et al\. \(2026\) Wei Yang, Shixuan Li, Heng Ping, Peiyu Zhang, Paul Bogdan, and Jesse Thomason\. 2026\. Auditing multi\-agent llm reasoning trees outperforms majority vote and llm\-as\-judge\. *arXiv preprint arXiv:2602\.09341*\.
- Yang et al\. \(2025b\) Wei Yang, Jiacheng Pang, Shixuan Li, Paul Bogdan, Stephen Tu, and Jesse Thomason\. 2025b\. Maestro: Learning to collaborate via conditional listwise policy optimization for multi\-agent llms\. *arXiv preprint arXiv:2511\.06134*\.
- Yao et al\. \(2023a\) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan\. 2023a\. Tree of thoughts: Deliberate problem solving with large language models\. *Advances in Neural Information Processing Systems*, 36\.
- Yao et al\. \(2022\) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao\. 2022\. React: Synergizing reasoning and acting in language models\. In *The eleventh international conference on learning representations*\.
- Yao et al\. \(2023b\) Weiran Yao, Shelby Heinecke, Juan Carlos Niebles, Zhiwei Liu, Yihao Feng, Le Xue, Rithesh Murthy, Zeyuan Chen, Jianguo Zhang, Devansh Arpit, and 1 others\. 2023b\. Retroformer: Retrospective large language agents with policy gradient optimization\. *arXiv preprint arXiv:2308\.02151*\.
- Yeo et al\. \(2025\) Woongyeong Yeo, Kangsan Kim, Jaehong Yoon, and Sung Ju Hwang\. 2025\. Worldmm: Dynamic multimodal memory agent for long video reasoning\. *arXiv preprint arXiv:2512\.02425*\.
- Yu et al\. \(2024\) Fangxu Yu, Lai Jiang, Haoqiang Kang, Shibo Hao, and Lianhui Qin\. 2024\. Flow of reasoning: Training llms for divergent problem solving with minimal examples\.
- Zhang et al\. \(2025a\) Gaoke Zhang, Bo Wang, Yunlong Ma, Dongming Zhao, and Zifei Yu\. 2025a\. A multi\-memory segment system for generating high\-quality long\-term memory content in agents\. *arXiv preprint arXiv:2508\.15294*\.
- Zhang et al\. \(2025b\) Guibin Zhang, Muxin Fu, Guancheng Wan, Miao Yu, Kun Wang, and Shuicheng Yan\. 2025b\. G\-memory: Tracing hierarchical memory for multi\-agent systems\. *arXiv preprint arXiv:2506\.07398*\.
- Zhang et al\. \(2025c\) Guibin Zhang, Haotian Ren, Chong Zhan, Zhenhong Zhou, Junhao Wang, He Zhu, Wangchunshu Zhou, and Shuicheng Yan\. 2025c\. Memevolve: Meta\-evolution of agent memory systems\. *arXiv preprint arXiv:2512\.18746*\.
- Zhang et al\. \(2026\) Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, and Wenya Wang\. 2026\. Memskill: Learning and evolving memory skills for self\-evolving agents\. *arXiv preprint arXiv:2602\.02474*\.
- Zhang et al\. \(2025d\) Hongxin Zhang, Zheyuan Zhang, Zeyuan Wang, Zunzhe Zhang, Lixing Fang, Qinhong Zhou, and Chuang Gan\. 2025d\. Ella: Embodied social agents with lifelong memory\. *arXiv preprint arXiv:2506\.24019*\.
- Zhang et al\. \(2024\) Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, and 1 others\. 2024\. Aflow: Automating agentic workflow generation\. *arXiv preprint arXiv:2410\.10762*\.
- Zhang et al\. \(2025e\) Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, and 1 others\. 2025e\. Agentic context engineering: Evolving contexts for self\-improving language models\. *arXiv preprint arXiv:2510\.04618*\.
- Zhao et al\. \(2024\) Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong\-Jin Liu, and Gao Huang\. 2024\. Expel: Llm agents are experiential learners\. In *Proceedings of the AAAI Conference on Artificial Intelligence*, volume 38, pages 19632–19642\.
- Zheng et al\. \(2023\) Longtao Zheng, Rundong Wang, Xinrun Wang, and Bo An\. 2023\. Synapse: Trajectory\-as\-exemplar prompting with memory for computer control\. *arXiv preprint arXiv:2306\.07863*\.
- Zhong et al\. \(2024\) Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang\. 2024\. Memorybank: Enhancing large language models with long\-term memory\. In *Proceedings of the AAAI conference on artificial intelligence*, volume 38, pages 19724–19731\.
- Zhou et al\. \(2023\) Andy Zhou, Kai Yan, Michal Shlapentokh\-Rothman, Haohan Wang, and Yu\-Xiong Wang\. 2023\. Language agent tree search unifies reasoning, acting, and planning in language models\. *arXiv preprint arXiv:2310\.04406*\.
- Zhou et al\. \(2024\) Pei Zhou, Jay Pujara, Xiang Ren, Xinyun Chen, Heng\-Tze Cheng, Quoc V Le, Denny Zhou, Swaroop Mishra, Huaixiu S Zheng, and 1 others\. 2024\. Self\-discover: Large language models self\-compose reasoning structures\. *Advances in Neural Information Processing Systems*, 37:126032–126058\.

## Appendix

## Appendix A Limitations

This work has several limitations\. Although we evaluate multiple benchmarks, backbones, and module combinations, our study does not cover the full space of agentic tasks, and its embodied and decision\-making focus may not directly transfer to domains such as web navigation, software engineering, or open\-ended human–AI interaction\. We also study representative implementations rather than exhaustively optimizing each module family; reasoning, memory, and reflection can depend on prompt design, retrieval and update policies, context budgets, and environment\-specific interfaces\. Due to computational constraints, we cannot scale every test set or run repeated trials for all model–module–environment combinations, so some individual results may be affected by sampling variation\. Finally, while we analyze cost and qualitative failures, more fine\-grained studies of token efficiency, latency, memory growth, retrieval quality, and trajectory\-level error propagation are needed\. We therefore emphasize consistent cross\-setting trends and leave broader task coverage and deeper diagnostic analyses to future work\.

## Appendix B Extended Related Work

LLM Reasoning\. Beyond the prompting and search\-based reasoning methods cited in the main text, a wide range of additional reasoning strategies have been developed\. Tool\-augmented, logic\-aided, and backward\-chaining methods ground reasoning in external computation or formal inference \(Kazemi et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib39)\), while constrained generation enforces structural validity during decoding \(Banerjee et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib4)\)\. Self\-correction is a particularly active area: iterative critique\-and\-revision \(Welleck et al\., [2022](https://arxiv.org/html/2606.14674#bib.bib94); Havrilla et al\., [2024](https://arxiv.org/html/2606.14674#bib.bib25); Gao et al\., [2024a](https://arxiv.org/html/2606.14674#bib.bib19)\), self\-rewarding correction \(Xiong et al\., [2025b](https://arxiv.org/html/2606.14674#bib.bib102)\), cooperative reasoning across models \(Yang et al\., [2025a](https://arxiv.org/html/2606.14674#bib.bib107)\), and adversarial self\-play \(Li et al\., [2026a](https://arxiv.org/html/2606.14674#bib.bib49)\) all enable models to refine their own outputs, complemented by process\-level verification \(Lightman et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib52)\)\. Multi\-agent approaches leverage inter\-agent debate and collaborative training \(Du et al\., [2024](https://arxiv.org/html/2606.14674#bib.bib15); Motwani et al\., [2024](https://arxiv.org/html/2606.14674#bib.bib64); Yang et al\., [2026](https://arxiv.org/html/2606.14674#bib.bib108), [2025b](https://arxiv.org/html/2606.14674#bib.bib109)\)\. Meta\-level methods allow models to compose their own reasoning structures \(Zhou et al\., [2024](https://arxiv.org/html/2606.14674#bib.bib126)\), treat reasoning as structure\-aware planning with world models \(Xiong et al\., [2025a](https://arxiv.org/html/2606.14674#bib.bib101); Xiang et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib98)\), incorporate contrastive objectives into Monte Carlo tree search \(Gao et al\., [2024b](https://arxiv.org/html/2606.14674#bib.bib20)\), or adapt reasoning depth to task complexity via cognitive\-inspired sketching or efficiency analysis \(Sprague et al\., [2024](https://arxiv.org/html/2606.14674#bib.bib78); Ning et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib66); Aytes et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib3)\)\. Latent reasoning methods bypass discrete token generation entirely, operating in continuous latent space \(Hao et al\., [2024](https://arxiv.org/html/2606.14674#bib.bib24)\) or via latent diffusion for iterative refinement and diverse solution exploration \(Kang et al\., [2026](https://arxiv.org/html/2606.14674#bib.bib38)\), while GFlowNet\-based training models reasoning as flow on a DAG to promote diverse generation \(Yu et al\., [2024](https://arxiv.org/html/2606.14674#bib.bib114); Kang et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib37)\)\. Reasoning has also been extended to embodied action selection \(Li et al\., [2024a](https://arxiv.org/html/2606.14674#bib.bib46)\), grounded simulation \(Liu et al\., [2022](https://arxiv.org/html/2606.14674#bib.bib57)\), deep reasoning for translation \(Wang et al\., [2025b](https://arxiv.org/html/2606.14674#bib.bib85)\), and interleaved reasoning\-memorization via self\-notes \(Lanchantin et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib42)\)\. In AgentSpec, all of these are interchangeable instantiations of a single reasoning interface, enabling controlled comparison within identical agent pipelines\.

Memory for LLM Agents\. Beyond the memory categories surveyed in the main text, database\-backed approaches store dialogue history for SQL\-style retrieval \(Hu et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib29); Modarressi et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib63)\), and long\-context memory extensions augment or replace the context window itself \(Liu et al\., [2024](https://arxiv.org/html/2606.14674#bib.bib58); Xiao et al\., [2024](https://arxiv.org/html/2606.14674#bib.bib99); Yang et al\., [2024a](https://arxiv.org/html/2606.14674#bib.bib105)\)\. Hierarchical and graph\-based methods organize experience into multi\-level semantic structures, temporal knowledge graphs, or event\-centric logic maps \(Li et al\., [2024b](https://arxiv.org/html/2606.14674#bib.bib48); Jiang et al\., [2026](https://arxiv.org/html/2606.14674#bib.bib36); Hu et al\., [2026b](https://arxiv.org/html/2606.14674#bib.bib33); Anokhin et al\., [2024](https://arxiv.org/html/2606.14674#bib.bib2)\)\. Procedural memory systems store reusable action sequences, trajectory exemplars, or cross\-domain skills for transfer \(Fang et al\., [2025b](https://arxiv.org/html/2606.14674#bib.bib17); Han et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib22); Tang et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib81); Xiao et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib100); Forouzandeh et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib18); Zheng et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib123); Wang et al\., [2024](https://arxiv.org/html/2606.14674#bib.bib91); Liu et al\., [2025b](https://arxiv.org/html/2606.14674#bib.bib59); Zhang et al\., [2026](https://arxiv.org/html/2606.14674#bib.bib118)\), while self\-organizing and evolving designs adaptively restructure memory through cognitive\-inspired consolidation, RL\-based construction, or meta\-evolution \(Zhang et al\., [2025c](https://arxiv.org/html/2606.14674#bib.bib117); Tian et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib82); Qian et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib72); Cao et al\., [2025b](https://arxiv.org/html/2606.14674#bib.bib7); Wei et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib93); Yan et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib104)\)\. Plug\-and\-play and parameter\-efficient memory modules enable domain adaptation without retraining the base model \(Cao et al\., [2025a](https://arxiv.org/html/2606.14674#bib.bib6); Bini et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib5); Packer et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib69)\)\. Episodic and social memory grounds agents in persistent interaction histories and simulated human behavior \(Park et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib70); Li et al\., [2025b](https://arxiv.org/html/2606.14674#bib.bib51); Zhong et al\., [2024](https://arxiv.org/html/2606.14674#bib.bib124); Zhang et al\., [2025d](https://arxiv.org/html/2606.14674#bib.bib119)\)\. Memory maintenance, segment\-based organization, and semantic anchoring address staleness, coherence, and linguistic grounding over extended interactions \(Chen et al\., [2025a](https://arxiv.org/html/2606.14674#bib.bib9); Zhang et al\., [2025a](https://arxiv.org/html/2606.14674#bib.bib115); Chatterjee and Agarwal, [2025](https://arxiv.org/html/2606.14674#bib.bib8); Wang et al\., [2026](https://arxiv.org/html/2606.14674#bib.bib86), [2025c](https://arxiv.org/html/2606.14674#bib.bib88)\)\. Multimodal and video\-oriented memory extends these ideas to visual and lifelong learning streams \(Yeo et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib113); Latimer et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib44); Liu et al\., [2025a](https://arxiv.org/html/2606.14674#bib.bib55)\)\. A growing line of work further recognizes that memory and reasoning are mutually dependent: recall\-and\-post\-thinking mechanisms augment reasoning with long\-term memory \(Liu et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib56)\), reusable thought templates and evolving knowledge bases bridge past reasoning into current decisions \(Yang et al\., [2024b](https://arxiv.org/html/2606.14674#bib.bib106); Suzgun et al\., [2026](https://arxiv.org/html/2606.14674#bib.bib80); Qian et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib72)\), hierarchical working memory and structured context manage cognitive load during multi\-step reasoning \(Hu et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib31); Wang et al\., [2025a](https://arxiv.org/html/2606.14674#bib.bib83)\), and reasoning\-memory co\-evolution allows agents to jointly improve both capabilities \(Ouyang et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib68); Ho et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib26)\)\. In AgentSpec, these diverse memory designs are exposed through a shared module interface, enabling systematic evaluation of how storage, retrieval, and update choices, and their interaction with reasoning, shape agent performance\.

Reflection and Self\-Improvement\. Reflection mechanisms range from step\-level critique\-and\-revision \(Paul et al\., [2024](https://arxiv.org/html/2606.14674#bib.bib71)\) to trajectory\-level self\-critique that stores verbal reflections from failed episodes \(Yao et al\., [2023b](https://arxiv.org/html/2606.14674#bib.bib112)\), reflection through trial\-and\-error\-driven test\-time planning \(Hong et al\., [2026](https://arxiv.org/html/2606.14674#bib.bib28)\), and experience replay approaches that extract and internalize transferable insights from past trajectories \(Zhao et al\., [2024](https://arxiv.org/html/2606.14674#bib.bib122)\)\. Experience\-driven lifelong learning enables agents to self\-evolve through accumulated interaction \(Wu et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib96); Chen et al\., [2025b](https://arxiv.org/html/2606.14674#bib.bib11)\)\. In AgentSpec, reflection is a dedicated module decoupled from reasoning and memory, enabling controlled ablation of its marginal contribution and interaction effects with other components\.

## Appendix C Module Implementations

This section details the module implementations instantiated in AgentSpec\. For each module family, we list the representative methods included in our study, summarize their operational characteristics, and provide the corresponding references\. All implementations follow the common interfaces introduced in the main paper, enabling controlled comparisons across reasoning, memory, and reflection mechanisms\.

### C\.1 Reasoning Methods

Table [2](https://arxiv.org/html/2606.14674#A3.T2) reports the reasoning strategies included in our implementation suite\. The selected methods span linear decomposition, explicit planning, search\-based exploration, consistency\-based aggregation, and multi\-agent deliberation, covering the main design choices for structuring intermediate inference in LLM agents\.

| Method | Description | Reference |
| --- | --- | --- |
| ReAct | Interleaves reasoning and acting over multiple Thought–Action–Observation cycles, grounding decisions in environmental feedback\. | \(Yao et al\., [2022](https://arxiv.org/html/2606.14674#bib.bib111)\) |
| CoT | Generates explicit intermediate reasoning steps before selecting an action, decomposing complex decisions into a chain of thoughts\. | \(Wei et al\., [2022](https://arxiv.org/html/2606.14674#bib.bib92)\) |
| Plan\-and\-Solve | Produces a high\-level plan before execution, then follows the plan step by step to complete the task\. | \(Wang et al\., [2023b](https://arxiv.org/html/2606.14674#bib.bib87)\) |
| Tree of Thoughts | Expands reasoning into a tree structure, exploring multiple thought branches and evaluating candidates via lookahead search\. | \(Yao et al\., [2023a](https://arxiv.org/html/2606.14674#bib.bib110)\) |
| LATS | Combines language model reasoning with Monte Carlo Tree Search, using execution feedback to guide trajectory exploration\. | \(Zhou et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib125)\) |
| RAP | Treats reasoning as planning via MCTS, building a world model to simulate and evaluate action sequences before committing\. | \(Hao et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib23)\) |
| Self\-Consistency | Samples multiple independent reasoning paths and aggregates their outputs to select the most consistent answer\. | \(Wang et al\., [2022](https://arxiv.org/html/2606.14674#bib.bib89)\) |
| MAD | Introduces critique and disagreement among several language agents through structured debate rounds before reaching a final decision\. | \(Du et al\., [2024](https://arxiv.org/html/2606.14674#bib.bib15)\) |

Table 2: Reasoning module implementations supported in AgentSpec\.

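
To make the consistency-based aggregation row of Table 2 concrete, the following is a minimal Python sketch of majority voting over sampled actions. The `sample_action` callable, the function name, and the first-occurrence tie-breaking rule are assumptions introduced for illustration; they are not taken from the AgentSpec codebase.

```python
from collections import Counter
from typing import Callable


def self_consistent_action(
    sample_action: Callable[[str], str],  # hypothetical: one sampled reasoning path -> action
    prompt: str,
    num_samples: int = 5,
) -> str:
    """Sample several reasoning paths and return the most frequent action."""
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    votes = [sample_action(prompt) for _ in range(num_samples)]
    counts = Counter(votes)
    best = max(counts.values())
    # Break ties by first occurrence so the result is deterministic.
    for action in votes:
        if counts[action] == best:
            return action
    raise AssertionError("unreachable: votes is non-empty")
```

In an embodied setting the vote is taken over candidate actions rather than final answers, so each extra sample costs one additional model call per step.
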
### C\.2 Memory Methods

Table [3](https://arxiv.org/html/2606.14674#A3.T3) summarizes the memory mechanisms implemented in AgentSpec\. These methods cover short\-context buffers, vector\- and database\-backed retrieval, hierarchical and graph\-structured stores, procedural memory, and adaptive long\-term memory systems, providing a broad basis for evaluating how storage, retrieval, and update policies affect agent behavior\.

| Method | Description | Reference |
| --- | --- | --- |
| Base | Maintains a rolling buffer of recent transitions appended directly to the prompt context\. | n/a |
| A\-Mem | Maintains memory notes with semantic metadata, LLM\-powered content analysis, and relationship management via ChromaDB hybrid search\. | \(Xu et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib103)\) |
| ACE | Agentic Context Engine applying iterative reflection, LLM\-guided curation, and deduplication to maintain a compact, non\-redundant memory store\. | \(Zhang et al\., [2025e](https://arxiv.org/html/2606.14674#bib.bib121)\) |
| Buffer of Thought | Maintains a reusable buffer of high\-level thought templates distilled from past reasoning traces, enabling retrieval\-augmented thought reuse at inference time\. | \(Yang et al\., [2024b](https://arxiv.org/html/2606.14674#bib.bib106)\) |
| CAM | Organizes experience into a hierarchical semantic structure using incremental overlapping clustering with LLM\-guided pruning and coherent summarization\. | \(Li et al\., [2025a](https://arxiv.org/html/2606.14674#bib.bib47)\) |
| ChatDB | Stores dialogue history in a structured database, enabling SQL\-style retrieval of past interactions\. | \(Hu et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib29)\) |
| DC | Dynamic Cheatsheet that evolves a concise knowledge base via LLM curation and vector\-store retrieval at test time\. | \(Suzgun et al\., [2026](https://arxiv.org/html/2606.14674#bib.bib80)\) |
| GMemory | Three\-tier hierarchical graph memory spanning an interaction graph for trajectory condensation, a query graph for task retrieval, and an insight graph for high\-level insight management\. | \(Zhang et al\., [2025b](https://arxiv.org/html/2606.14674#bib.bib116)\) |
| LangMem | LangChain\-based memory infrastructure supporting episodic, semantic, and procedural memory types with structured storage and retrieval\. | \(LangChain AI, [2024](https://arxiv.org/html/2606.14674#bib.bib43)\) |
| MemGPT | Manages memory through a paged context window with explicit in\-context and archival storage layers, enabling unbounded long\-term memory\. | \(Packer et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib69)\) |
| LightMem | Manages memory through a multi\-stage pipeline involving normalization, pre\-compression, topic segmentation, LLM\-based extraction, embedding, storage in Qdrant, and retrieval\. | \(Fang et al\., [2025a](https://arxiv.org/html/2606.14674#bib.bib16)\) |
| Mem0 | Adaptive personal memory with self\-updating storage integrating short\-term and long\-term retrieval across user sessions\. | \(Chhikara et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib13)\) |
| MemoryBank | Long\-term memory bank that stores and retrieves past experiences using similarity\-based lookup to inform future decisions\. | \(Zhong et al\., [2024](https://arxiv.org/html/2606.14674#bib.bib124)\) |
| MIRIX | Multi\-agent memory system using event\-time storage and context\-aware retrieval via a dedicated Mirix client\. | \(Wang and Chen, [2025](https://arxiv.org/html/2606.14674#bib.bib90)\) |
| OpenClaw | Context management module that maintains procedural memory comprising learned policies and lessons, and augments agent prompts with relevant skill documents at each step\. | \(OpenClaw, [2026](https://arxiv.org/html/2606.14674#bib.bib67)\) |
| Generative Agent Memory | Scores memory entries by combining recency, semantic relevance, and reward utility into a unified ranking function\. | \(Park et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib70)\) |
| SimpleMem | Employs a three\-stage pipeline: semantic compression, vector storage, and hybrid retrieval across semantic, keyword, and structured indexes\. | \(Liu et al\., [2026](https://arxiv.org/html/2606.14674#bib.bib54)\) |
| Zep | Models memory as a temporal knowledge graph spanning episodes, entities, and higher\-level communities\. | \(Rasmussen et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib73)\) |

Table 3: Memory module implementations supported in AgentSpec\.

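
As one illustration of the retrieval policies in Table 3, the sketch below ranks memory entries by a weighted sum of recency, semantic relevance, and reward utility, in the spirit of the Generative Agent Memory row. The exponential decay, the cosine similarity, the equal default weights, and all names (`MemoryEntry`, `rank_memories`) are assumptions made for this example; the table does not specify the actual scoring function.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class MemoryEntry:
    text: str
    embedding: Sequence[float]
    step: int      # environment step at which the entry was written
    reward: float  # utility signal, assumed here to lie in [0, 1]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_memories(
    entries: Sequence[MemoryEntry],
    query_embedding: Sequence[float],
    current_step: int,
    decay: float = 0.99,                                  # assumed per-step recency decay
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),  # (recency, relevance, reward)
    top_k: int = 3,
) -> List[MemoryEntry]:
    """Return the top_k entries under a weighted recency/relevance/reward score."""
    w_recency, w_relevance, w_reward = weights

    def score(entry: MemoryEntry) -> float:
        recency = decay ** max(current_step - entry.step, 0)
        relevance = cosine(entry.embedding, query_embedding)
        return w_recency * recency + w_relevance * relevance + w_reward * entry.reward

    return sorted(entries, key=score, reverse=True)[:top_k]
```

Setting one weight to zero removes that signal from the ranking, which is a simple way to ablate its contribution under an otherwise identical pipeline.
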
### C\.3 Reflection Methods

Table [4](https://arxiv.org/html/2606.14674#A3.T4) presents the reflection mechanisms included in our framework\. These methods operationalize self\-improvement at different temporal scales, from step\-level critique and revision to episode\-level retrospective analysis and reuse of lessons across subsequent trials\.

| Method | Description | Reference |
| --- | --- | --- |
| Self\-Refine | Performs iterative critique\-and\-revision at the step level: a critic evaluates the current output and proposes improvements until the answer stabilizes or no further improvement is detected\. | \(Madaan et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib61)\) |
| Reflexion | Stores verbal reflections from failed episodes and reuses them in subsequent trials, enabling trajectory\-level self\-correction across episodes\. | \(Shinn et al\., [2023](https://arxiv.org/html/2606.14674#bib.bib76)\) |
| Retroformer | Retrospectively analyzes full trajectories after episode completion to distill lessons that improve later planning and decision\-making\. | \(Yao et al\., [2023b](https://arxiv.org/html/2606.14674#bib.bib112)\) |

Table 4: Reflection module implementations supported in AgentSpec\.
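
The episode-level pattern in Table 4 (store verbal reflections from failed episodes, then reuse them in later trials) can be sketched as follows. The class name `ReflectionBuffer`, the `reflect` callable standing in for an LLM call, and the bounded buffer size are assumptions for illustration, not the framework's actual reflection interface.

```python
from collections import deque
from typing import Callable, Deque


class ReflectionBuffer:
    """Keeps a bounded list of verbal lessons distilled from failed episodes."""

    def __init__(self, reflect: Callable[[str], str], max_reflections: int = 3) -> None:
        self._reflect = reflect  # hypothetical: trajectory text -> verbal lesson
        self._notes: Deque[str] = deque(maxlen=max_reflections)

    def end_episode(self, trajectory: str, success: bool) -> None:
        # Only failed episodes produce a reflection in this sketch.
        if not success:
            self._notes.append(self._reflect(trajectory))

    def augment(self, prompt: str) -> str:
        """Prepend nothing on the first trial; append stored lessons afterwards."""
        if not self._notes:
            return prompt
        lessons = "\n".join(f"- {note}" for note in self._notes)
        return f"{prompt}\n\nLessons from earlier failed trials:\n{lessons}"
```

Bounding the buffer keeps the added prompt length fixed, which matters when reflection cost is compared against its correction benefit.
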

## Appendix D Detailed Evaluation Settings

### D\.1 DeliveryBench

#### D\.1\.1 DeliveryBench

DeliveryBench \(Mao et al\., [2025](https://arxiv.org/html/2606.14674#bib.bib62)\) is a city\-scale embodied benchmark in which the agent acts as an autonomous food courier in procedurally generated 3D cities\. Unlike short\-horizon embodied tasks, DeliveryBench emphasizes long\-horizon decision making under realistic operational constraints\. The agent must continuously choose profitable orders, travel between restaurants and customers, execute pickup and drop\-off operations, and manage limited resources such as time, energy, and transportation cost\. The benchmark therefore evaluates whether an agent can sustain constraint\-aware planning over extended interactive trajectories rather than merely selecting locally plausible actions\.

##### Task Setting\.

The original DeliveryBench benchmark is built around a profit\-earning courier task in urban environments, with diverse road layouts, functional locations, transportation options, and realistic resource dynamics\. At each step, the agent must reason over its current operational state, currently active orders, spatial context, and delivery constraints, then choose the next executable delivery action\. In our framework, DeliveryBench is instantiated as a long\-horizon single\-agent environment under the same high\-level objective of maximizing delivery profit\. For the main study in this paper, we evaluate agent modules under the 1\-hour setting, which provides a controlled testbed for comparing reasoning, memory, and reflection designs in a realistic resource\-constrained scenario\.

##### Evaluation Protocol\.

Within AgentSpec, DeliveryBench is executed through the unified runner rather than through a benchmark\-specific standalone pipeline\. Episodes are evaluated over long\-horizon interactions and terminate when benchmark lifecycle limits are reached or when the framework step budget is exhausted\. This setup preserves the underlying DeliveryBench objective while allowing all agent variants to be compared under the same top\-level execution interface\. We use this shared framework to isolate the contribution of modular components without changing the surrounding evaluation pipeline\.
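
The two termination conditions described above can be sketched as a small runner loop. This is an illustrative sketch only: the `Environment` protocol, the `run_episode` function, and the `ToyShift` stand-in environment are hypothetical names, and the real unified runner may expose a different interface.

```python
from typing import Callable, Protocol, Tuple


class Environment(Protocol):
    """Minimal environment interface assumed for this sketch."""

    def reset(self) -> str: ...

    def step(self, action: str) -> Tuple[str, bool]: ...


def run_episode(
    env: Environment, policy: Callable[[str], str], step_budget: int
) -> Tuple[int, str]:
    """Run one episode and report (steps taken, termination reason)."""
    observation = env.reset()
    for steps_taken in range(1, step_budget + 1):
        observation, benchmark_done = env.step(policy(observation))
        if benchmark_done:
            # The benchmark's own lifecycle limit ended the episode.
            return steps_taken, "benchmark_limit"
    # The framework-side step budget ran out first.
    return max(step_budget, 0), "step_budget"


class ToyShift:
    """Toy stand-in environment: the shift ends after a fixed number of steps."""

    def __init__(self, shift_length: int) -> None:
        self.shift_length = shift_length
        self.elapsed = 0

    def reset(self) -> str:
        self.elapsed = 0
        return "t=0"

    def step(self, action: str) -> Tuple[str, bool]:
        self.elapsed += 1
        return f"t={self.elapsed}", self.elapsed >= self.shift_length
```

Recording the termination reason alongside the step count makes it possible to tell apart agents that finished the benchmark lifecycle from agents that were cut off by the framework budget.
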

##### Metrics\.

Following the benchmark goal, the main metric reported in the paper is *hourly profit*, which directly measures how effectively an agent converts long\-horizon decisions into net delivery earnings\. In the main DeliveryBench table, we report the mean hourly profit over three independent runs for each agent configuration\. In addition to this headline metric, the framework\-side evaluator also retains a normalized *score* derived from hourly profit and can record finer\-grained delivery diagnostics when available, such as order\-quality, time\-efficiency, on\-time, and resource\-related indicators\. In the main text, however, we use hourly profit as the primary metric because it best captures the benchmark’s end objective and provides the most interpretable comparison across module combinations\.
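
The headline aggregation is a plain mean over independent runs. The sketch below computes it, together with a sample standard deviation that is added here purely as an illustrative diagnostic and is not a reported metric. The function name and any input values are hypothetical, and the normalized score is omitted because its formula is not given in this appendix.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def summarize_hourly_profit(run_profits: Sequence[float]) -> Dict[str, float]:
    """Aggregate per-run hourly profit for one agent configuration."""
    if not run_profits:
        raise ValueError("at least one run is required")
    # Sample standard deviation is undefined for a single run, so report 0.0.
    spread = stdev(run_profits) if len(run_profits) > 1 else 0.0
    return {"mean": mean(run_profits), "std": spread, "n": float(len(run_profits))}
```

With only three runs per configuration the spread is a rough indicator at best, which is consistent with the sampling-variation caveat in the limitations.
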

##### Unified Agent Input\.

To support cross\-benchmark modularity, we expose DeliveryBench to the agent through a unified input interface rather than passing the original benchmark prompt format directly to each reasoning method\. Concretely, the perception layer organizes the current observation into four conceptual components: *Image*, *Instruction*, *State Info*, and *Action Schema*\. The *Instruction* component encodes the benchmark\-level courier objective and task context\. The *State Info* component repackages benchmark\-native textual context into a structured agent\-facing representation, including the current operational status, active orders, map\-related context, recent actions, and recent execution feedback\. When visual rendering is available, the *Image* component provides environment images through the same shared visual interface used by other embodied benchmarks\. The *Action Schema* component normalizes benchmark actions into a standardized framework\-facing action list with textual descriptions, so downstream reasoning modules interact with DeliveryBench through the same abstract interface used for other environments\.
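As a rough illustration of the four-part observation described above, the following Python sketch models it as a small dataclass. Every name here (`UnifiedObservation`, `ActionSpec`, `render_text`, and the field names) is hypothetical and chosen for illustration only; none of them is taken from the AgentSpec codebase.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionSpec:
    """One entry of the standardized action list (hypothetical schema)."""
    name: str
    description: str


@dataclass
class UnifiedObservation:
    """Hypothetical container for the four conceptual input components."""
    instruction: str                  # benchmark-level objective and task context
    state_info: dict[str, str]        # structured, agent-facing state fields
    action_schema: list[ActionSpec]   # normalized actions with descriptions
    image: Optional[bytes] = None     # set only when visual rendering is available

    def render_text(self) -> str:
        """Flatten the textual components into one prompt string."""
        parts = ["## Instruction", self.instruction, "## State Info"]
        parts += [f"{key}: {value}" for key, value in self.state_info.items()]
        parts.append("## Action Schema")
        parts += [f"{a.name}: {a.description}" for a in self.action_schema]
        return "\n".join(parts)


obs = UnifiedObservation(
    instruction="Maximize delivery profit within the time limit.",
    state_info={"active_orders": "none", "recent_actions": "VIEW_ORDERS()"},
    action_schema=[ActionSpec("VIEW_ORDERS", "List currently available orders.")],
)
prompt = obs.render_text()
```

Keeping the image optional mirrors the statement that visual input is provided only when rendering is available, while the textual components are always present.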

##### Input/Memory Interface\.

This unified representation is designed to stay compatible with the original DeliveryBench philosophy, which combines persistent task instructions with dynamic state and map context, while still making benchmark\-specific content consumable by shared framework modules\. Importantly, in our framework, external memory and cross\-step reasoning history are not treated as unconditional parts of the base DeliveryBench observation\. Retrieved memory is injected only when a memory module is enabled, and explicit reasoning history is included only when the corresponding history mechanism is turned on\. This separation is important for fair modular comparison: it ensures that memory\-augmented agents receive additional context because of their selected module design, rather than because the benchmark input itself differs across methods\.
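The separation between the base observation and optional context can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `assemble_context`, the section headers, and the use of `None` to model a disabled module are all hypothetical, not the framework's actual interface.

```python
from typing import Optional


def assemble_context(
    base_observation: str,
    retrieved_memory: Optional[str] = None,
    reasoning_history: Optional[list[str]] = None,
) -> str:
    """Build the agent context; optional blocks appear only when supplied.

    Hypothetical helper: passing None models a disabled module, so the
    base observation stays identical across agent variants.
    """
    blocks = [base_observation]
    if retrieved_memory is not None:       # memory module enabled
        blocks.append("## Retrieved Memory\n" + retrieved_memory)
    if reasoning_history is not None:      # history mechanism turned on
        blocks.append("## Reasoning History\n" + "\n".join(reasoning_history))
    return "\n\n".join(blocks)


base = "## State Info\nbattery: 60%"
plain = assemble_context(base)
with_memory = assemble_context(base, retrieved_memory="Charge before long trips.")
```

With no modules enabled the output equals the base observation exactly, which is the property the fair-comparison argument relies on.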

##### Action Representation\.

Finally, although DeliveryBench defines its own native action API, AGENTSPEC re\-expresses these actions through a unified action abstraction\. This preserves benchmark semantics at the execution level while allowing the same reasoning, memory, and reflection modules to operate across DeliveryBench, ALFRED, MiniGrid, and RoboTHOR under a common interface\. As a result, differences in performance can be attributed more directly to module design rather than to environment\-specific prompt engineering or ad hoc wrapper logic\.
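One way to picture such an action abstraction is a benchmark-agnostic action object plus one small adapter per environment. The sketch below is an assumption-laden illustration: `UnifiedAction`, `to_deliverybench`, and `to_robothor` are hypothetical names, and the rendered call strings only imitate the action syntax quoted elsewhere in this appendix.

```python
from dataclasses import dataclass, field


@dataclass
class UnifiedAction:
    """Benchmark-agnostic action chosen by the agent (hypothetical type)."""
    name: str
    args: dict[str, object] = field(default_factory=dict)


def to_deliverybench(action: UnifiedAction) -> str:
    """Render a unified action as a DeliveryBench-style call string."""
    rendered = ", ".join(f"{key}={value!r}" for key, value in action.args.items())
    return f"{action.name}({rendered})"


def to_robothor(action: UnifiedAction) -> str:
    """RoboTHOR actions are bare names from a fixed discrete set."""
    allowed = {"MoveAhead", "RotateLeft", "RotateRight", "LookUp", "LookDown", "Stop"}
    if action.name not in allowed:
        raise ValueError(f"unknown RoboTHOR action: {action.name}")
    return action.name


charge_call = to_deliverybench(UnifiedAction("CHARGE", {"target_pct": 60}))
stop_call = to_robothor(UnifiedAction("Stop"))
```

Under this design the reasoning, memory, and reflection modules only ever see `UnifiedAction`, and environment-specific syntax is confined to the adapters.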

### D\.2 MiniGrid

MiniGrid Chevalier\-Boisvert et al\. \([2023](https://arxiv.org/html/2606.14674#bib.bib12)\) is a symbolic grid\-world benchmark for evaluating basic embodied decision\-making\. Unlike realistic 3D environments, MiniGrid uses compact grid layouts, discrete object types, and a small action space, making it useful for testing whether an agent can follow instructions, navigate, interact with objects, and solve simple compositional tasks under partial observability\.

We construct an evaluation suite in MiniGrid by selecting ten tasks with diverse structures and difficulty levels, covering navigation, object interaction, and compositional reasoning\. The selected tasks are as follows: \(i\) empty\_6x6\_seed42, \(ii\) goto\_object\_6x6\_n2\_seed7, \(iii\) goto\_door\_5x5\_seed123, \(iv\) fetch\_5x5\_n2\_seed99, \(v\) simple\_crossing\_s9n1\_seed42, \(vi\) lava\_crossing\_s9n1\_seed123, \(vii\) multiroom\_n2s4\_seed42, \(viii\) lava\_gap\_s5\_seed42, \(ix\) four\_rooms\_seed42, and \(x\) simple\_crossing\_s9n2\_seed123\.

The agent receives first\-person RGB observations together with a textual environment prompt as input\. For all reasoning, memory, and reflection methods, we adopt the default configurations from our codebase without additional tuning\. Each episode is executed with a maximum step budget of 100\.

### D\.3 ALFRED

ALFRED Shridhar et al\. \([2020](https://arxiv.org/html/2606.14674#bib.bib77)\) is a simulated household benchmark for long\-horizon embodied task completion\. It requires an agent to execute natural\-language instructions of everyday activities in realistic indoor scenes\.

We construct a lightweight evaluation suite from the ALFRED dataset by selecting seven representative tasks, each covering a distinct task type to ensure diversity in object manipulation, multi\-object interaction, state\-dependent reasoning, and long\-horizon planning\.

The selected tasks are: \(i\) place a bat on the bed, \(ii\) put two pillows on the sofa, \(iii\) put a heated apple in the fridge, \(iv\) cool bread and place it on the countertop, \(v\) clean a mug and place it on the coffee machine, \(vi\) examine a pencil under a desk lamp, and \(vii\) place a box containing keys on a chair\.

These tasks collectively cover major ALFRED categories, including pick\-and\-place, multi\-object manipulation, heating, cooling, cleaning, lighting\-based inspection, and movable receptacles\.

The agent receives first\-person RGB observations along with a structured list of visible objects within a 1\.5 m egocentric range to aid perception and grounding\. The maximum step budget is 200\. All methods adopt default configurations without additional tuning\.

#### D\.3\.1 RoboTHOR

RoboTHOR is a 3D indoor ObjectNav benchmark in which an agent navigates from a first\-person viewpoint to a specified target object category\. Each episode provides an initial agent pose and a target category, and the agent must navigate using a discrete action space:

> \{MoveAhead, RotateLeft, RotateRight, LookUp, LookDown, Stop\}\.

Following standard ObjectNav evaluation, the task uses stop\-based success: the agent must explicitly issue Stop, and the target object must be visible within the configured success threshold\. We report success rate \(SR\) and SPL, where SPL additionally accounts for path efficiency\.
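For reference, SPL is commonly defined (Anderson et al., 2018) as the mean over episodes of S_i · l_i / max(p_i, l_i), where S_i is the binary success indicator, l_i the shortest-path length, and p_i the length of the path the agent actually took. The sketch below computes SR and SPL under that standard definition. The function name and the tuple layout are illustrative assumptions, not the paper's evaluation code, and shortest-path lengths are assumed to be positive.

```python
def success_rate_and_spl(episodes):
    """Compute (SR, SPL) from (success, shortest_len, agent_len) tuples.

    Assumes the standard SPL definition: mean of S * l / max(p, l),
    with failed episodes contributing zero.
    """
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    sr = sum(1 for success, _, _ in episodes if success) / n
    spl = sum(
        shortest / max(agent, shortest)
        for success, shortest, agent in episodes
        if success
    ) / n
    return sr, spl


# Two successes (one optimal, one twice as long as needed) and one failure.
sr, spl = success_rate_and_spl(
    [(True, 4.0, 4.0), (True, 3.0, 6.0), (False, 5.0, 9.0)]
)
```

In this toy example SR is 2/3 while SPL is 0.5, showing how an inefficient but successful episode lowers SPL without affecting SR.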

For our evaluation, we construct a compact but diverse RoboTHOR test suite by manually selecting 10 episodes from the validation set\. The selected episodes cover different target object categories and navigation difficulties, ranging from targets located in the same room as the agent to targets requiring cross\-room navigation and broader exploration\. This design allows us to evaluate both local navigation behavior and longer\-horizon exploration under a controlled number of episodes\. We set the maximum episode length to 200 steps, the camera field of view to 90°, and the rotation angle for RotateLeft and RotateRight to 90°\.

## Appendix E Detailed Experiment Analysis

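Table 5: Distribution of failure categories across benchmarks\. Values are percentages within each benchmark\.

| Benchmark | Parse / Unstr. / Memory | Understanding / Action Interface / Halluc. | Understanding / Action Interface / Param. | Understanding / State Confusion / Missing Hist. | Understanding / State Confusion / Hist. Misund. | Understanding / Knowledge Deficiency / Rule Unfam. | Understanding / Knowledge Deficiency / Numerical | Understanding / Knowledge Deficiency / Strategy | Visual Grounding / Object Grounding / Misclass. | Visual Grounding / Object Grounding / Halluc. | Visual Grounding / Spatial Grounding / Stuck | Visual Grounding / Spatial Grounding / Distance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RoboTHOR | 0.0 | 0.0 | 0.0 | 12.5 | 6.2 | 0.0 | 0.0 | 0.0 | 9.4 | 12.5 | 31.2 | 28.1 |
| ALFRED | 2.1 | 0.0 | 10.4 | 10.4 | 43.8 | 0.0 | 0.0 | 10.4 | 0.0 | 0.0 | 14.6 | 8.3 |
| DeliveryBench | 7.6 | 3.0 | 22.7 | 0.0 | 10.6 | 7.6 | 21.2 | 27.3 | 0.0 | 0.0 | 0.0 | 0.0 |
| MiniGrid | 11.1 | 0.0 | 0.0 | 13.3 | 28.9 | 33.3 | 0.0 | 13.3 | 0.0 | 0.0 | 0.0 | 0.0 |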

### E\.1 DeliveryBench

#### E\.1\.1 Complete Experimental Results

This subsection provides the detailed experimental results that support the DeliveryBench analysis in Appendix [E\.1](https://arxiv.org/html/2606.14674#A5.SS1)\. Tables [6](https://arxiv.org/html/2606.14674#A5.T6)–[12](https://arxiv.org/html/2606.14674#A5.T12) report the full configuration\-level results for each evaluated backbone, including the effects of reasoning, memory, and reflection modules on hourly profit, token usage, thinking time, steps, and cost\. Table [13](https://arxiv.org/html/2606.14674#A5.T13) further summarizes the multi\-scale Qwen comparison, making it possible to inspect how the same agentic components behave across model sizes\.

| Reasoning | Memory | Reflection | Mean Hourly Profit | Input Tokens per Step | Output Tokens per Step | Total Thinking Time | Steps | Total Cost (US$) |
|---|---|---|---|---|---|---|---|---|
| None | Base | None | 21.54 | 41840.49 | 2016.56 | 9140.89 | 184 | 13.33 |
| ReAct | Base | None | 30.39 | 40758.07 | 1634.20 | 7882.73 | 193 | 12.99 |
| ReAct | ChatDB | None | 40.92 | 37594.08 | 1380.78 | 8609.88 | 249 | 15.14 |
| ReAct | DC | None | 53.17 | 21554.04 | 5524.59 | 17349.89 | 160 | 13.15 |
| ReAct | SimpleMem | None | 33.98 | 9716.87 | 2622.27 | 10904.65 | 249 | 9.55 |
| ReAct | MemoryBank | None | 44.47 | 12778.68 | 2035.78 | 20008.85 | 607 | 22.05 |
| ReAct | OpenClaw | None | 55.19 | 16016.61 | 3347.32 | 10487.72 | 300 | 16.05 |
| CoT | Base | None | 34.74 | 38832.02 | 1848.47 | 8868.42 | 193 | 12.94 |
| Plan&Solve | Base | None | 41.29 | 38140.68 | 2249.70 | 11676.04 | 234 | 16.42 |
| MAD | Base | None | 31.12 | 86443.24 | 15298.05 | 28846.13 | 129 | 33.54 |
| ReAct | Base | Self-Refine | 32.74 | 68964.44 | 2791.49 | 12699.32 | 201 | 22.94 |
| ReAct | Base | Reflexion | 40.54 | 41173.95 | 1769.14 | 8817.86 | 176 | 12.16 |

Table 6: Complete main results on DeliveryBench using GPT\-5 \(Default Mode\) as backbone\.

| Reasoning | Memory | Reflection | Mean Hourly Profit | Input Tokens per Step | Output Tokens per Step | Total Thinking Time | Steps | Total Cost (US$) |
|---|---|---|---|---|---|---|---|---|
| None | Base | None | 13.73 | 50899.64 | 17.73 | 3347.50 | 92 | 2.35 |
| ReAct | Base | None | 31.53 | 71241.82 | 176.66 | 3619.69 | 115 | 4.16 |
| ReAct | ChatDB | None | 10.96 | 55040.80 | 138.29 | 2618.27 | 93 | 2.60 |
| ReAct | DC | None | 28.62 | 41466.25 | 1340.80 | 5329.04 | 159 | 3.94 |
| ReAct | SimpleMem | None | 30.38 | 21916.47 | 459.33 | 5200.82 | 132 | 1.63 |
| ReAct | MemoryBank | None | 22.75 | 50992.29 | 296.89 | 4101.69 | 98 | 2.59 |
| ReAct | OpenClaw | None | 22.84 | 17496.32 | 915.42 | 3361.19 | 150 | 1.72 |
| CoT | Base | None | 30.35 | 47747.24 | 173.15 | 3178.05 | 126 | 3.07 |
| Plan&Solve | Base | None | 19.52 | 58301.31 | 287.05 | 2674.98 | 89 | 2.67 |
| MAD | Base | None | 29.32 | 348472.75 | 2648.42 | 4546.61 | 95 | 17.31 |
| ReAct | Base | Self-Refine | 15.87 | 147535.58 | 699.06 | 3223.03 | 89 | 6.75 |
| ReAct | Base | Reflexion | 32.93 | 51040.20 | 137.45 | 1139.87 | 172 | 4.46 |

Table 7: Complete main results on DeliveryBench using Gemini\-3\-flash \(Default Mode\) as backbone\.

| Reasoning | Memory | Reflection | Mean Hourly Profit | Input Tokens per Step | Output Tokens per Step | Total Thinking Time | Steps | Total Cost (US$) |
|---|---|---|---|---|---|---|---|---|
| None | Base | None | 13.83 | 41664.22 | 2039.69 | 6831.88 | 160 | 3.36 |
| ReAct | Base | None | 26.97 | 33584.03 | 1168.77 | 4256.08 | 202 | 3.20 |
| ReAct | ChatDB | None | 17.63 | 42070.07 | 1406.14 | 2955.90 | 98 | 1.93 |
| ReAct | DC | None | 36.83 | 31378.00 | 4596.04 | 14664.56 | 175 | 4.02 |
| ReAct | SimpleMem | None | 20.25 | 15405.08 | 2670.03 | 9979.89 | 137 | 1.68 |
| ReAct | MemoryBank | None | 26.69 | 40713.80 | 2703.88 | 6617.40 | 173 | 3.84 |
| ReAct | OpenClaw | None | 19.56 | 13292.27 | 3571.23 | 5559.87 | 126 | 1.71 |
| CoT | Base | None | 20.70 | 49864.86 | 2025.82 | 3450.75 | 93 | 2.25 |
| Plan&Solve | Base | None | 20.76 | 54603.72 | 3400.94 | 5382.23 | 98 | 2.87 |
| MAD | Base | None | 28.76 | 106888.80 | 17538.46 | 28492.09 | 105 | 8.69 |
| ReAct | Base | Self-Refine | 26.44 | 110300.93 | 5965.66 | 11597.13 | 115 | 6.55 |
| ReAct | Base | Reflexion | 33.58 | 47187.86 | 1800.58 | 3994.62 | 164 | 3.71 |

Table 8: Complete main results on DeliveryBench using Qwen3\.5\-397B \(Thinking Mode\) as backbone\.

| Reasoning | Memory | Reflection | Mean Hourly Profit | Input Tokens per Step | Output Tokens per Step | Total Thinking Time | Steps | Total Cost (US$) |
|---|---|---|---|---|---|---|---|---|
| None | Base | None | 7.13 | 31554.64 | 2677.12 | 18044.63 | 194 | 0.38 |
| ReAct | Base | None | 12.64 | 28232.46 | 483.99 | 2086.66 | 164 | 0.24 |
| ReAct | ChatDB | None | 19.84 | 56136.51 | 1506.71 | 3117.69 | 86 | 0.26 |
| ReAct | DC | None | 14.95 | 29991.75 | 5971.91 | 10073.67 | 108 | 0.04 |
| ReAct | SimpleMem | None | 9.59 | 9081.73 | 2007.32 | 4198.98 | 84 | 0.03 |
| ReAct | MemoryBank | None | 7.49 | 18605.72 | 1479.27 | 3985.16 | 169 | 0.18 |
| ReAct | OpenClaw | None | 18.22 | 16473.94 | 10950.63 | 8985.50 | 110 | 0.05 |
| CoT | Base | None | 12.62 | 50844.34 | 1851.30 | 4012.57 | 92 | 0.26 |
| Plan&Solve | Base | None | 20.91 | 66762.94 | 5704.94 | 13543.98 | 86 | 0.36 |
| MAD | Base | None | 23.98 | 209013.17 | 37683.06 | 29573.14 | 48 | 0.77 |
| ReAct | Base | Self-Refine | 18.36 | 125377.87 | 8326.56 | 17113.55 | 109 | 0.32 |
| ReAct | Base | Reflexion | 14.69 | 45238.34 | 847.98 | 2729.8 | 89 | 0.21 |

Table 9: Complete main results on DeliveryBench using Qwen3\.5\-9B \(Thinking Mode\) as backbone\.

| Reasoning | Memory | Reflection | Mean Hourly Profit | Input Tokens per Step | Output Tokens per Step | Total Thinking Time | Steps | Total Cost (US$) |
|---|---|---|---|---|---|---|---|---|
| None | Base | None | 7.13 | 40281.28 | 709.14 | 2569.25 | 147 | 1.68 |
| ReAct | Base | None | 12.64 | 40734.34 | 765.07 | 6316.57 | 95 | 1.11 |
| ReAct | ChatDB | None | 19.84 | 41584.49 | 863.24 | 6432.34 | 90 | 1.09 |
| ReAct | DC | None | 14.95 | 35238.95 | 5335.59 | 8067.92 | 105 | 2.04 |
| ReAct | SimpleMem | None | 9.59 | 15408.48 | 1966.28 | 7463.24 | 144 | 1.13 |
| ReAct | MemoryBank | None | 7.49 | 17436.64 | 1436.35 | 2415.27 | 152 | 1.10 |
| ReAct | OpenClaw | None | 18.22 | 19832.97 | 2294.61 | 3847.01 | 145 | 1.38 |
| CoT | Base | None | 12.62 | 40559.10 | 1147.88 | 6929.19 | 93 | 1.15 |
| Plan&Solve | Base | None | 20.91 | 39309.43 | 1448.07 | 5978.33 | 70 | 0.89 |
| MAD | Base | None | 23.98 | 86828.71 | 8970.89 | 22950.70 | 188 | 7.46 |
| ReAct | Base | Self-Refine | 18.36 | 80186.41 | 2021.08 | 5999.25 | 139 | 3.34 |
| ReAct | Base | Reflexion | 14.69 | 40292.11 | 1089.82 | 2310.70 | 111 | 1.36 |

Table 10: Complete main results on DeliveryBench using GPT\-5 mini \(Default Mode\) as backbone\.

| Reasoning | Memory | Reflection | Mean Hourly Profit | Input Tokens per Step | Output Tokens per Step | Total Thinking Time | Steps | Total Cost (US$) |
|---|---|---|---|---|---|---|---|---|
| None | Base | None | 8.27 | 48322.76 | 391.29 | 2492.18 | 91 | 1.80 |
| ReAct | Base | None | 9.49 | 45450.88 | 615.67 | 3636.11 | 113 | 2.17 |
| ReAct | ChatDB | None | 8.21 | 47557.91 | 241.84 | 4862.25 | 274 | 5.24 |
| ReAct | DC | None | 7.00 | 39119.26 | 2227.61 | 8516.99 | 155 | 3.17 |
| ReAct | SimpleMem | None | 11.32 | 17963.88 | 767.85 | 3693.18 | 81 | 0.71 |
| ReAct | MemoryBank | None | 4.07 | 19871.15 | 522.38 | 3588.51 | 94 | 0.84 |
| ReAct | OpenClaw | None | 8.27 | 17619.51 | 699.93 | 3441.35 | 123 | 1.05 |
| CoT | Base | None | 22.98 | 46730.31 | 329.82 | 3309.03 | 111 | 2.11 |
| Plan&Solve | Base | None | 10.43 | 45745.94 | 665.19 | 4740.31 | 176 | 3.41 |
| MAD | Base | None | 25.57 | 102908.56 | 2901.98 | 7452.84 | 94 | 4.41 |
| ReAct | Base | Self-Refine | 20.21 | 106277.74 | 821.23 | 7213.43 | 220 | 9.54 |
| ReAct | Base | Reflexion | 27.88 | 45527.11 | 210.46 | 1118.7 | 126 | 2.30 |

Table 11: Complete main results on DeliveryBench using Qwen3\.5\-397B \(Non\-Thinking Mode\) as backbone\.

| Reasoning | Memory | Reflection | Mean Hourly Profit | Input Tokens per Step | Output Tokens per Step | Total Thinking Time | Steps | Total Cost (US$) |
|---|---|---|---|---|---|---|---|---|
| None | Base | None | -4.12 | 57345.13 | 3884.95 | 8759.12 | 230 | 0.79 |
| ReAct | Base | None | 10.30 | 52362.63 | 4393.13 | 6840.38 | 156 | 0.51 |
| ReAct | ChatDB | None | 3.31 | 54706.84 | 3895.02 | 6565.23 | 167 | 0.55 |
| ReAct | DC | None | 5.73 | 39407.46 | 9637.76 | 15257.33 | 199 | 0.68 |
| ReAct | SimpleMem | None | -0.89 | 26623.58 | 15100.31 | 13800.34 | 123 | 0.44 |
| ReAct | MemoryBank | None | 8.93 | 28000.59 | 4492.18 | 18146.65 | 483 | 1.00 |
| ReAct | OpenClaw | None | 3.26 | 21791.08 | 10267.34 | 8938.65 | 154 | 0.40 |
| CoT | Base | None | 5.67 | 53236.66 | 4021.51 | 5536.89 | 136 | 0.44 |
| Plan&Solve | Base | None | 3.56 | 54375.97 | 4434.70 | 8459.03 | 191 | 0.65 |
| MAD | Base | None | 17.45 | 120098.96 | 34562.97 | 28549.81 | 117 | 1.31 |
| ReAct | Base | Self-Refine | -0.09 | 130286.06 | 14778.62 | 21065.27 | 173 | 1.51 |
| ReAct | Base | Reflexion | 14.68 | 49368.82 | 3132.93 | 4240.44 | 152 | 0.45 |

Table 12: Complete main results on DeliveryBench using Qwen3\.5\-9B \(Non\-Thinking Mode\) as backbone\.

| Reasoning | Memory | Reflection | Qwen3.5-27B Mean Hourly Profit | Qwen3.5-9B Mean Hourly Profit | Qwen3.5-2B Mean Hourly Profit | Qwen3.5-0.8B Mean Hourly Profit |
|---|---|---|---|---|---|---|
| None | Base | None | 28.89 | 7.13 | -8.54 | -2.89 |
| ReAct | Base | None | 18.57 | 12.64 | -9.91 | -4.55 |
| ReAct | ChatDB | None | 28.86 | 19.84 | -9.47 | -7.47 |
| ReAct | DC | None | 17.44 | 14.95 | -9.61 | -11.18 |
| ReAct | SimpleMem | None | 21.45 | 9.59 | -6.86 | -8.97 |
| ReAct | MemoryBank | None | 34.70 | 7.49 | -9.81 | -9.09 |
| ReAct | OpenClaw | None | 31.12 | 18.22 | -8.72 | -9.99 |
| CoT | Base | None | 25.62 | 12.62 | -7.63 | -9.65 |
| CoT | ChatDB | None | 21.32 | 11.20 | – | 0 |
| CoT | DC | None | 33.64 | 17.17 | – | -9.64 |
| CoT | SimpleMem | None | 26.48 | 3.16 | -8.61 | -2.05 |
| CoT | MemoryBank | None | 41.71 | 19.48 | -7.47 | -4.15 |
| Plan&Solve | Base | None | 19.42 | 20.91 | -4.65 | -7.01 |
| Plan&Solve | ChatDB | None | 19.20 | 11.32 | – | -8.14 |
| Plan&Solve | DC | None | 21.78 | 9.43 | – | -8.92 |
| Plan&Solve | SimpleMem | None | 24.10 | 8.55 | -7.15 | -4.89 |
| Plan&Solve | MemoryBank | None | 34.73 | 15.69 | -5.97 | -6.88 |
| MAD | Base | None | 26.38 | 23.98 | -9.26 | -9.89 |
| MAD | ChatDB | None | 31.16 | 29.80 | – | -9.52 |
| MAD | DC | None | 27.64 | 30.01 | – | -9.15 |
| MAD | SimpleMem | None | 25.91 | 21.67 | – | -9.41 |
| MAD | MemoryBank | None | 33.99 | 26.94 | -4.35 | -9.18 |
| ReAct | Base | Self-Refine | 25.05 | 18.36 | -8.04 | 0.00 |
| CoT | Base | Self-Refine | 23.71 | 11.95 | – | 0 |
| CoT | SimpleMem | Self-Refine | 30.07 | 13.43 | – | 0 |
| Plan&Solve | Base | Self-Refine | 22.02 | 11.19 | – | -6.77 |
| MAD | MemoryBank | Self-Refine | 35.29 | 12.18 | – | -2.5 |
| ReAct | Base | Reflexion | 27.41 | 14.69 | -20.06 | -4.54 |
| ReAct | DC | Reflexion | 31.04 | 17.28 | -9.92 | -9.40 |
| ReAct | MemoryBank | Reflexion | 29.02 | 12.65 | -8.66 | -9.99 |
| CoT | Base | Reflexion | 25.70 | 11.09 | – | -7.1 |
| CoT | SimpleMem | Reflexion | 35.75 | 17.76 | – | -7.32 |
| Plan&Solve | Base | Reflexion | 18.15 | 7.78 | – | -9.13 |
| MAD | MemoryBank | Reflexion | 39.45 | 31.04 | – | -9.98 |

Table 13: Complete main results on DeliveryBench for Qwen3\.5\-27B, Qwen3\.5\-9B, Qwen3\.5\-2B, and Qwen3\.5\-0\.8B\. All entries report mean hourly profit only\. Some method combinations were not evaluated due to time and resource constraints, and their missing results are marked as –\.

#### E\.1\.2 Case Study: Planning\-Based Reasoning Prefers Abstracted Memory

Thesis\. Among all strategy–memory combinations, Plan\-and\-Solve shows the largest memory\-induced performance gap: average profit rises from $6\.18 with simple to $17\.43 with simplemem \(Δ = \+$11\.25\)\. We trace this mechanism through step 15 in 0000\_medium\_city\_22roads\_seed42, where the two memory formats drive qualitatively different reasoning under comparable pressure\.

#### Case A: Raw\-History Memory Causes Context Pollution \(Plan\-and\-Solve \+ Simple\)

Source run: gpt\-5\-mini\_plan\_and\_solve\_simple\_none\_1; Environment: 0000\_medium\_city\_22roads\_seed42; Key step: 15

##### Task & State\.

Agent 5 is riding an e\-scooter at \(212\.00m, \-362\.16m\) with 73% energy and 60% battery \(range: 1,500 m\)\. It holds one active order: Order \#0 \(pickup at \(\-522\.59m, 221\.25m\), drop\-off at \(223\.09m, \-421\.78m\), payout $11\.29, status Ready for pickup, and only 4 min remaining\)\. The agent is 65\.7 m from the drop\-off but 1,323\.0 m from the pickup\. It accepted this order at step 12 \(when the time limit was 17 min\), then spent intervening steps charging and relocating instead of heading to the restaurant\.

##### Memory\.

Raw memory excerpt from Memory at step 15:

> \{’step’:’11’, ’state’:’\#\# DeliveryBench Instructions \.\.\. \#\# map\_snapshot \.\.\. \[50\+ POIs\] \.\.\. \#\# recent\_actions \.\.\.’, ’action’:’VIEW\_ORDERS\(\)’, \.\.\.\} \{’step’:’12’, ’state’:’\#\# DeliveryBench Instructions \.\.\. \[full prompt repeated\] \.\.\.’, ’action’:’ACCEPT\_ORDER\(0\)’, \.\.\.\} \{’step’:’13’, ’state’:’\#\# DeliveryBench Instructions \.\.\. \[full prompt repeated again\] \.\.\.’, ’action’:’MOVE\(\-522\.59m, 221\.25m\)’, \.\.\.\}

Extracted takeaway\. The memory is not sparse; it is over\-complete\. Each entry replays large prompt scaffolding \(instructions, map, and state context\), so the issue is duplicated context rather than missing task signals\.

##### Analysis\.

In the model’s “Solve” section, it states:

> “Riding to the pickup \(1\.323 km\) takes ~3\.7 minutes on the scooter, which is the fastest feasible path—so I should depart immediately to make pickup and delivery within the time limit\.”

This acknowledges the 3\.7 min ride to pickup but omits the downstream pipeline: PICKUP, PLACE\_FOOD\_IN\_BAG, and a further 1,388\.7 m delivery leg \(≈ 3\.9 min at 6 m/s\)\. The true minimum completion time is ≥ 7\.6 min; with only 4 min remaining the order is physically undeliverable\. Nevertheless, the agent issues MOVE\(\-522\.59m, 221\.25m\) and commits to a doomed trajectory\.
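The feasibility check that the model skipped is a few lines of arithmetic. The sketch below reproduces it from the distances and speed quoted above, assuming constant-speed travel and zero time for the PICKUP and PLACE_FOOD_IN_BAG actions; the helper name `ride_minutes` is illustrative. Unrounded, the two legs sum to about 7.5 min; the 7.6 min figure in the text is the sum of the rounded legs (3.7 + 3.9). Either way the bound is well above the 4 min remaining.

```python
SCOOTER_SPEED_M_PER_S = 6.0  # speed quoted in the text


def ride_minutes(distance_m: float, speed: float = SCOOTER_SPEED_M_PER_S) -> float:
    """Travel time in minutes for a ride at constant speed."""
    return distance_m / speed / 60.0


to_pickup = ride_minutes(1323.0)      # agent -> restaurant
to_dropoff = ride_minutes(1388.7)     # restaurant -> customer
min_total = to_pickup + to_dropoff    # lower bound: ignores pickup/bagging time
feasible = min_total <= 4.0           # 4 min remaining on Order #0
```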

#### Case B: Narrative Memory Enables Resource\-Aware Planning \(Plan\-and\-Solve \+ Simplemem\)

Source run: gpt\-5\-mini\_plan\_and\_solve\_simplemem\_none\_1; Environment: 0000\_medium\_city\_22roads\_seed42; Key step: 15

##### Task & State\.

Agent 11 is on an e\-scooter at \(\-48\.93m, 423\.87m\) with 88% energy but only 2% battery \(range: 47\.2 m\)\. It carries one picked\-up order: Order \#1 \(drop\-off at \(224\.64m, \-496\.02m\), payout $15\.89, status Picked up, waiting for delivery, with 16 min remaining\)\. The drop\-off is 1,454\.7 m away, far beyond the current battery range\.

##### Memory\.

Raw memory excerpt from Memory at step 15 \(5 narrative entries, condensed\):

> - Procedural rule: after PICKUP, use PLACE\_FOOD\_IN\_BAG; temperature constraints may require heat/ice packs\.
> - Procedural rule: hand\_to\_customer may require STEP\_FORWARD/TURN\_AROUND; fallback to door drop may incur penalties\.
> - Timing hint: Order \#1 near pickup, prep ETA about 1 minute\.
> - Historical pool snapshot: prior candidate orders and item\-temperature details\.
> - Acceptance record: ACCEPT\_ORDER\(1\) with pickup/drop\-off coordinates and prep\-status note\.

Extracted takeaway\. Compared with Case A, this memory is compact and procedural: no raw JSON replay, no duplicated full prompt, and no full\-map restatement\.

##### Analysis\.

The model constructs a quantitative three\-step plan:

1. Move 15\.1 m to charging\_station 9 at \(\-38\.79m, 412\.00m\)\.
2. Charge from 2% to 60%: needed battery = 58%, time = 58/7\.5 ≈ 7\.73 min, cost = $0\.05 × 58 = $2\.90\.
3. Ride 1,454\.7 m to drop\-off at 6 m/s \(≈ 4\.0 min\)\.

Total estimated time is about 11\.7 min, well within the 16\-minute deadline\. The agent executes the first step with MOVE\(\-38\.79m, 412\.00m\)\.
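The three-step plan above can be re-derived from the quoted charging rate (7.5% per minute), charging price ($0.05 per 1%), and scooter speed (6 m/s). The sketch below is an illustrative recomputation, not the agent's actual reasoning trace; the function name and the assumption that the detour is ridden at the same speed are ours. It gives roughly 11.8 min when the short detour is included, slightly above the rounded 11.7 min quoted in the text, and still comfortably under the 16-minute deadline.

```python
CHARGE_RATE_PCT_PER_MIN = 7.5   # from the plan: 58 / 7.5 ≈ 7.73 min
CHARGE_COST_PER_PCT = 0.05      # from the plan: $0.05 × 58 = $2.90
SPEED_M_PER_S = 6.0             # scooter speed quoted in the plan


def charge_then_ride(start_pct, target_pct, detour_m, ride_m):
    """Return (total_minutes, charge_cost_usd) for detour -> charge -> ride."""
    pct_needed = target_pct - start_pct
    charge_min = pct_needed / CHARGE_RATE_PCT_PER_MIN
    travel_min = (detour_m + ride_m) / SPEED_M_PER_S / 60.0
    return charge_min + travel_min, pct_needed * CHARGE_COST_PER_PCT


total_min, cost = charge_then_ride(2, 60, detour_m=15.1, ride_m=1454.7)
within_deadline = total_min <= 16.0   # Order #1 has 16 min remaining
```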

#### Insight

The contrast between Case A and Case B reveals why planning\-based reasoning benefits disproportionately from abstracted memory\.

- In Case A, simple memory re\-injects full prompt blocks at recall time, consuming context budget and crowding out multi\-step feasibility checks\.
- In Case B, simplemem compresses history into procedural summaries, preserving budget for battery, time, and deadline arithmetic\.

The key insight is that summarized memory provides reusable policy\-level guidance that better matches plan\-then\-act reasoning\. Planning\-oriented strategies like Plan\-and\-Solve benefit less from exact historical replay and more from distilled experience about what to do in recurring situations\. The performance gap \($6\.18 → $17\.43\) is consistent with this mechanism: summarized memory does not make Plan\-and\-Solve inherently smarter, but it prevents context overload from suppressing its planning capacity\.

#### E\.1\.3 Case Study: DC Memory Refines Batching into Urgency\-Aware Triage

Source run: gpt\-5\_react\_dc\_none\_1; Environment: 0001\_medium\_city\_22roads\_seed123; Key steps: 4–10

##### Task & State\.

This case captures a co\-located multi\-order state where a generic batching strategy becomes locally insufficient\. At step 4, the agent is handling two active orders, Order \#19 and Order \#25, from the same restaurant\. However, the two orders are asymmetric: Order \#25 is closer to readiness and has the tighter remaining time budget, while Order \#19 is still farther from execution\. The key decision is therefore not simply whether to batch, but whether to keep waiting for a cleaner joint pickup or to split execution and prioritize the more urgent ready\-side order\.

##### Observation\.

Across steps 4–7, the environment progressively makes this asymmetry actionable\. At steps 4–6, Order \#25 is still being prepared for about 1 min with only 2 min left, whereas Order \#19 remains about 3 min from readiness with 4 min left\. By step 7, Order \#25 becomes Ready for pickup with only 1 min remaining, while Order \#19 still shows Food is still being prepared \(2 min\)\. This turns the state into a concrete asymmetric\-readiness decision point\.

##### Memory\.

Raw memory excerpt from Memory \(Recent Steps\) at step 7:

> When multiple accepted orders share a pickup location, include only the ready order IDs in PICKUP and leave not\-ready ones for later\. When multiple active orders share a pickup location, coordinate to collect them together when both are \(nearly\) ready, then sequence drop\-offs by tighter deadlines and proximity\. If co\-located orders have mismatched prep ETAs and the sooner\-ready order has a tight time limit, then pick up and deliver the ready order first rather than waiting for the slower one\.

Extracted takeaway\. The memory does not simply restate recent observations\. Instead, it retains a general co\-located batching heuristic while refining it into a sharper exception rule for asymmetric readiness and urgency\.
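Read as code, the refined rule in the excerpt amounts to two small predicates: pass only ready order IDs to PICKUP, and split the batch when a ready order is urgent while another is still in preparation. The sketch below is one hypothetical encoding. The `ActiveOrder` type, the function names, and the 2-minute urgency threshold are assumptions made for illustration, and the remaining time used for Order #19 is a placeholder because the text does not state it at step 7.

```python
from dataclasses import dataclass


@dataclass
class ActiveOrder:
    order_id: int
    prep_eta_min: int     # 0 means ready for pickup
    time_left_min: int


def pickup_ids(orders: list[ActiveOrder]) -> list[int]:
    """IDs to pass to PICKUP now: ready orders only, never unready ones."""
    return [o.order_id for o in orders if o.prep_eta_min == 0]


def should_split(orders: list[ActiveOrder], tight_min: int = 2) -> bool:
    """True when a ready order is urgent while another is still in prep."""
    ready = [o for o in orders if o.prep_eta_min == 0]
    waiting = [o for o in orders if o.prep_eta_min > 0]
    return bool(waiting) and any(o.time_left_min <= tight_min for o in ready)


# Step-7 state: Order #25 is ready with 1 min left; Order #19 is ~2 min from
# ready (its time_left value of 3 is an illustrative placeholder).
step7 = [ActiveOrder(25, 0, 1), ActiveOrder(19, 2, 3)]
ids_now = pickup_ids(step7)
split_now = should_split(step7)
```

For the step-7 state this selects only Order #25, matching the `PICKUP(orders=[25])` call reported in the run.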

##### Action & Result\.

The agent follows this refinement closely\. It first switches to the e\-scooter at step 4, moves to the pickup door at step 5, waits through the short remaining preparation window at step 6, and then executes PICKUP\(orders=\[25\]\) at step 7, explicitly excluding the still\-unready Order \#19\. It then completes the standard pipeline with PLACE\_FOOD\_IN\_BAG, MOVE, and DROP\_OFF\. The outcome is not a full rescue: task\_report\.json shows that Order \#25 is still slightly overdue \(deadline\_slack\_s = \-20\.71\), while Order \#19 is substantially later \(deadline\_slack\_s = \-118\.82\)\. Still, the agent clearly prioritizes the more urgent ready order rather than waiting for a cleaner two\-order batch\.

##### Analysis\.

What makes this case interesting is that the memory refinement unfolds together with the state progression rather than appearing all at once\. In steps 4–5, the agent is still in a positioning phase: it switches transport and moves to the pickup door while the memory remains dominated by general heuristics such as waiting briefly for short preparation times and batching co\-located orders when both are nearly ready\. At step 6, once the asymmetry becomes sufficiently clear, the memory introduces a more specific exception policy for mismatched co\-located orders\. By step 7, this policy is further sharpened into an executable action\-level rule that directly constrains the PICKUP argument to ready order IDs only\. The interesting point is therefore not that the agent simply “remembered” to take Order \#25 first, but that the memory transformed an initially implicit urgency judgment into a progressively more operational policy representation\. This makes the subsequent action sequence \(SWITCH → MOVE → WAIT → PICKUP\(\[25\]\)\) look less like a one\-off choice and more like a policy being concretized online\.

##### Insight\.

This case suggests a plausible mechanism behind DC’s strong performance: its memory may help not merely by storing recent events, but by converting general operational heuristics into sharper exception and action\-level policies as local states become more concrete\. In long\-horizon delivery settings, many difficult decisions are not global replanning problems but small local conflicts of readiness, urgency, and execution timing\. The evidence here suggests that DC can compress such evolving local structure into reusable procedural guidance\. While this single case does not by itself explain DC’s aggregate advantage, it illustrates how memory can function as an online policy\-refinement mechanism rather than as passive recall alone\.

#### E\.1\.4 Case Study: MemoryBank Converts Past Battery Depletion into Proactive Charging Policy

Source run: gpt\-5\-mini\_react\_memorybank\_none\_3; Environment: 0001\_medium\_city\_22roads\_seed123; Key steps: 9–11

##### Task & State\.

At step 9, the agent has just completed its first delivery \(Order \#17\) and returned to an idle state with no active orders\. The critical detail is the e\-scooter battery: it has dropped to 12% with only 310\.2 m of remaining range\. The agent’s personal energy is healthy \(91%\), and it has earned $112\.28 so far with roughly 53 minutes remaining\. The decision it faces is whether to immediately accept a new order or first address its depleted battery\.

##### Observation\.

The environment provides the agent’s position at \(\-421\.96m, 20\.20m\) on 8th road \(left\), with the nearest charging station \(charging\_station 15\) only 46\.0 m away\. The available order list \(shown at step 10 after VIEW\_ORDERS\) includes several orders whose pickup\-to\-dropoff distances far exceed the scooter’s remaining 310 m range\. The observation alone contains enough raw data for a cautious agent to infer the battery risk, but does not explicitly advise the agent to charge before accepting work\.

##### Memory\.

Raw memory excerpt from Memory \(Recent Steps\) at step 9:

> \[Trajectory Summary\] Scooter battery dropped from 50

> \[Environment Insight\] Scooter range drains substantially with travel: range 1250m → 310m over this route \(~940m used\)\. Final battery reached ~12%\. Avoid: accepting long\-distance orders without a comfortable scooter\-range buffer \(don’t finish deliveries with battery ≤ 15%\)\. Next\-episode policy \(3 actions\): \- Before accepting, estimate total travel \(move\-to\-pickup \+ pickup → dropoff\) and require scooter\_range\_left ≥ estimated\_distance \+ 300m buffer\. \- If order status shows prep\_time ≤ time\_left, WAIT at pickup rather than moving away\. \- Follow explicit delivery notes \(e\.g\., use "call" when requested\) to minimize delivery friction\.

Extracted takeaway\. The memory does not merely record what happened in the previous trajectory\. It distills the experience of finishing a delivery at dangerously low battery into two concrete operational rules: \(1\) a 15% battery floor below which new orders should not be accepted, and \(2\) a 300 m range buffer requirement over the estimated travel distance\. These are not present in the system prompt or environment instructions; they are learned policies derived from the first episode’s resource consumption pattern\.
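The two rules can be combined into a single acceptance guard, sketched below. This is a hypothetical encoding rather than anything produced by MemoryBank itself: the function name and signature are assumptions, the inclusive treatment of the 15% floor is one reading of the rule, and the order distances in the usage lines are made up for illustration.

```python
RANGE_BUFFER_M = 300.0     # range buffer from the memory entry
BATTERY_FLOOR_PCT = 15.0   # battery floor from the memory entry


def can_accept(range_left_m, battery_pct, to_pickup_m, pickup_to_dropoff_m):
    """Apply the two memory-derived rules before accepting an order."""
    estimated = to_pickup_m + pickup_to_dropoff_m
    has_buffer = range_left_m >= estimated + RANGE_BUFFER_M
    above_floor = battery_pct >= BATTERY_FLOOR_PCT
    return has_buffer and above_floor


# Step-9 state (12% battery, 310.2 m range) with an illustrative short order.
before_charge = can_accept(310.2, 12, to_pickup_m=100.0, pickup_to_dropoff_m=200.0)
# After charging to ~60% (~1500 m range) with an illustrative mid-range order.
after_charge = can_accept(1500.0, 60, to_pickup_m=200.0, pickup_to_dropoff_m=800.0)
```

At the step-9 state the guard rejects even a short order, which is consistent with the agent's decision to recharge before accepting new work.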

##### Action & Result\.

The agent follows the memory\-derived policy closely across the subsequent steps:

- Step 9: The agent recognizes its low battery and decides to first check available orders before committing: VIEW\_ORDERS\(\)\. Its thought explicitly notes: “*My e\-scooter battery is low \(12%, ~310 m range\) … I’ll choose ones within range or head to a nearby charging station before accepting long trips\.*”
- Step 10: After reviewing the order list, the agent concludes that recharging must come first: “*To safely accept and complete profitable nearby orders … I need to recharge first\. The nearest charging station is charging\_station 15 at \(\-412\.00m, 61\.24m\) \(~46m away\)\.*” It executes MOVE\(\-412\.00m, 61\.24m, pace="normal"\)\.
- Step 11: Upon arriving at the charging station \(battery now 10%, range ~259 m\), the agent charges to a deliberate target: “*I’ll charge the scooter to a comfortable level \(60%\) to ensure I can take mid\-range orders … without risking battery depletion\. Charging now also avoids wasting time going to pickup only to have to detour for charging later\.*” It executes CHARGE\(target\_pct=60\)\.

The outcome is that the agent resumes order acceptance with a fully operational scooter \(~60% battery, ~1500 m range\), enabling it to handle subsequent deliveries without mobility constraints\.

##### Analysis\.

What makes this case informative is the contrast between the agent’s behavior with and without memory\. In the first episode \(step 1, no memory\), the agent begins with 50% battery and simply executes VIEW\_ORDERS\(\) without any battery\-awareness reasoning\. It then accepts Order \#17 and rides 769\.8 m to the dropoff, consuming nearly all remaining range and arriving at 12% battery, a state that would have been catastrophic had the delivery been slightly longer\.

In the second episode, the same 12% battery state triggers a qualitatively different response\. The agent does not blindly accept the next available order\. Instead, it explicitly references the low battery threshold, evaluates whether nearby orders are within range, and proactively moves to a charging station before accepting new work\. The 60% charging target is also notable: rather than fully charging \(which would waste time at $0\.05 per 1% and 7\.5% per minute\), the agent selects a target that provides sufficient range for mid\-distance orders while minimizing idle time, a cost\-benefit tradeoff that aligns with the memory’s “estimated\_distance \+ 300m buffer” heuristic\.

The key causal link is the \[Environment Insight\] entry, which converts a single episode’s resource\-depletion experience into a reusable policy\. Without this memory, the agent would need to independently re\-derive the battery management strategy from the raw observation \(battery percentage, charging station distances, order distances\)\. With it, the policy is pre\-computed and directly available in the prompt, reducing the cognitive burden on the LLM and making the correct action sequence more likely\.

##### Insight\.

This case illustrates MemoryBank’s core mechanism: converting trajectory\-level outcomes into episode\-transferable operational policies\. The memory system does not simply replay past observations; it generates structured Environment Insight entries that encode quantitative thresholds \(battery ≥ 15%\), buffer requirements \(range ≥ distance \+ 300 m\), and procedural guidance \(charge before accepting\)\. These insights function as learned heuristics that shift the agent from reactive execution \(accepting orders until resources are exhausted\) to proactive resource management \(maintaining sufficient capacity before committing to new tasks\)\. In long\-horizon delivery settings where resource depletion is a common failure mode, this kind of cross\-episode policy transfer can meaningfully improve operational robustness, even when the underlying model is a smaller\-capacity backbone\.

#### E\.1\.5 Case Study: Dynamic Memory Turns Waiting Time into Actionable Maintenance

Source run: agent\_react\_openclaw\_gemini\-3\-flash; Episode: 0000\_medium\_city\_22roads\_seed42; Key steps: 4–9

##### Task & State\.

At step 4, the agent has already accepted Order \#7 and Order \#8, both from Restaurant 4\. It has reached the pickup location, but neither order is ready yet: Order \#7 still requires about 4 minutes, and Order \#8 about 7 minutes\. Meanwhile, the e\-scooter battery is only 44%\. This creates a non\-trivial decision point: the agent is not yet able to pick up the food, but it must decide whether to simply wait or use the preparation window for a useful intermediate action\.

##### Observation\.

The environment explicitly exposes the two pending orders, their remaining preparation times, and the fact that they share the same pickup location\. It also provides nearby POI information, including a charging station within short travel distance\. In other words, the state already contains all the raw ingredients for a good decision, but the observation alone does not specify how waiting time should be operationally exploited\.

##### Memory\.

Below we show a short timeline of the raw memory excerpts that are most relevant to this case\.

> Step 4
>
> - “Order Batching: Prioritize accepting multiple orders from the same pickup location or nearby locations \(e\.g\., restaurant 4, order \#7 and \#8\) to minimize travel time and maximize earnings per route\.”
> - “Preparation Wait Time: When batching orders from the same restaurant, be aware of differing food preparation times \(e\.g\., 4 min vs 7 min\); use the wait time to plan the optimal delivery sequence based on drop\-off locations\.”
>
> Step 5
>
> - “Wait Time Optimization: If food preparation time exceeds 3 minutes at a restaurant, use the window to visit nearby POIs \(stores for battery packs/energy drinks or charging stations\) if they are within a 1–2 minute round\-trip distance\.”
> - “Wait Time Visibility: Food prep status \(e\.g\., 4 min vs 7 min\) is visible in the active\_orders status; checking this immediately upon arrival allows for better scheduling of mid\-route maintenance tasks\.”
>
> Step 6
>
> - “Charging during Wait Times: If food prep time is \>3 minutes and a charging station is within 100m, prioritize moving to the charging station to top up the e\-scooter rather than idling at the restaurant\.”
>
> Step 8
>
> - “Overdue Priority: Prioritize orders in ‘OVERTIME’ status for immediate pickup and delivery to minimize further late penalties\.”
> - “Pickup Readiness: Before traveling to a restaurant, check if orders are ‘Ready for pickup’ in the active\_orders status to avoid idling upon arrival\.”
>
> Step 9
>
> - “Pickup Readiness: …If an order is already ready while the agent is nearby, proceed to pickup immediately\.”

These excerpts show that OpenClaw does not only preserve static task knowledge; it also maintains and reuses increasingly action\-specific guidance as the trajectory unfolds\.
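One way to read the Step 5, Step 6, and Step 9 excerpts is as a guard on the waiting state\. The sketch below is an illustrative assumption rather than OpenClaw’s code: the function name, the returned action labels, and the choice to compare the shortest remaining preparation time against the threshold are all hypothetical; only the 3\-minute and 100 m thresholds come from the excerpts\.

```python
from typing import Optional, Sequence

PREP_THRESHOLD_MIN = 3.0   # "exceeds 3 minutes" in the Step 5 and Step 6 excerpts
STATION_RADIUS_M = 100.0   # "within 100m" in the Step 6 excerpt


def wait_window_action(prep_minutes_remaining: Sequence[float],
                       station_distance_m: Optional[float]) -> str:
    """Pick an action label for the waiting state (labels are hypothetical)."""
    if not prep_minutes_remaining:
        return "WAIT"
    # Step 9 excerpt: if any order is already ready, pick up immediately.
    if any(m <= 0 for m in prep_minutes_remaining):
        return "PICKUP"
    # Assumption: the shortest remaining prep time bounds the usable window.
    window = min(prep_minutes_remaining)
    station_close = (station_distance_m is not None
                     and station_distance_m <= STATION_RADIUS_M)
    if window > PREP_THRESHOLD_MIN and station_close:
        return "MOVE_TO_CHARGING_STATION"
    return "WAIT"


# Prep times of 4 and 7 minutes are from the case; the 80 m distance is invented.
print(wait_window_action([4, 7], 80.0))
```

With the preparation times from this case and a station assumed to be 80 m away \(the actual distance is not given\), the rule selects the charging detour\.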

##### Action & Result\.

The action sequence follows the retrieved memory closely\. At step 5, the agent moves to the nearby charging station\. At step 6, it executes CHARGE\(target\_pct=100\)\. At step 7, it switches back to the e\-scooter, and at step 8, it returns to Restaurant 4\. By step 9, both orders are ready and already marked OVERTIME, so the agent immediately executes PICKUP\(orders=\[7,8\]\)\. In effect, the agent converts otherwise idle preparation time into battery maintenance, then resumes the batch with a much stronger mobility state \(97% battery at pickup\)\.

##### Analysis\.

What makes this case interesting is that the benefit does not come from a single isolated memory rule\. Instead, the memory evolves along the trajectory and progressively sharpens the decision process\. The earlier excerpts frame the situation as a *batched pickup with asymmetric preparation times*; the middle excerpts re\-interpret that waiting period as an *opportunity for maintenance*; and the later excerpts re\-focus the agent on *overdue pickup urgency* once the food is ready\. In other words, the memory is not merely reminding the agent of generic facts\. It is dynamically reorganizing past experience into a step\-relevant operational policy\.

This matters because the useful intermediate action in this case—charging—does not immediately complete the delivery objective\. Its value is delayed and multi\-step: the agent sacrifices a short local detour in order to improve downstream execution capacity\. Without an external memory that explicitly encodes this kind of reusable operational pattern, the model would need to infer the same strategy from scratch from a noisy observation stream\. OpenClaw reduces that burden by surfacing a compact, actionable policy at exactly the moment when the waiting window appears\.

##### Insight\.

This case suggests that OpenClaw performs well not simply because it stores more text, but because it maintains a persistent, dynamically updated, and action\-oriented external memory store\. The retrieved content is not purely descriptive; it is structured as reusable operational guidance that can be directly mapped onto concrete delivery actions\. In DeliveryBench, where strong performance depends on coordinating preparation delays, battery management, and urgency across multiple steps, such memory can turn an otherwise ambiguous waiting state into a well\-structured execution plan\.

#### E\.1\.6 Case Study: MAD’s Debate Structure Enables Error Correction and Solution Emergence

Thesis\. MAD maintains strong performance across memory modules \(minimum profit: 16\.63\)\. Its debate structure provides two capabilities that single\-chain reasoning often lacks: built\-in error correction and solution emergence\. We illustrate each with a concrete case\.

#### Case A: Built\-in Error Correction

Source run: gpt\-5\-mini\_mad\_simplemem\_none\_1; Environment: 0001\_medium\_city\_22roads\_seed123; Key step: 25

##### Task & State\.

The agent has no active orders\. It is walking at \(−374\.84 m, −82\.56 m\) with 80% energy\. The e\-scooter is parked with 0% battery\. Current earnings are $106\.66\. A recent\_error indicates that the previous attempt to accept orders \#11 and \#13 failed \(“not found or already accepted by others”\)\.

##### Observation\.

The recent\_actions log shows that View available orders was already executed earlier in the action history\. The rules explicitly state: “DO NOT use VIEW\_ORDERS if the context already includes available order details or your last action is view orders\.”
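The quoted rule amounts to a precondition on VIEW\_ORDERS\. A minimal sketch of such a check follows; the function name and the string forms of the action labels are assumptions made for illustration, not the benchmark’s actual interface\.

```python
from typing import Optional

# Hypothetical spellings under which a "view orders" action might be logged.
VIEW_ORDER_LABELS = {"view_orders", "view available orders", "view orders"}


def view_orders_allowed(context_has_order_details: bool,
                        last_action: Optional[str]) -> bool:
    """Apply the quoted rule: no VIEW_ORDERS if order details are already in
    context, or if the last action was itself a view-orders call."""
    if context_has_order_details:
        return False
    if last_action is not None and last_action.strip().lower() in VIEW_ORDER_LABELS:
        return False
    return True


# In Case A the memory still holds the order pool, so the call is redundant.
print(view_orders_allowed(context_has_order_details=True, last_action="ACCEPT_ORDER"))
```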

##### Memory\.

Raw memory excerpt from Memory at step 25:

> - Relevant available orders \(latest pool, key tradeoffs\):
>   - \#8: $8\.51, 18 min \-\-\- mixed hot/cold \(Curry 60°C, IceCreamCake \-12°C, Milkshake 5°C\) → needs both heat & ice\.
>   - \#0: $8\.06, 9 min \-\-\- all hot \(three 60°C items\) → needs heat, tighter time\.
>   - \#11: $7\.98, 11 min \-\-\- mixed temps \(hot \+ frozen \+ cold\) → mixed packs\.
>   - \#13: $6\.04, 8 min \-\-\- mostly cold items\.
> - Decision priorities: secure in\-hand items into bag now, then accept orders matching available temperature resources and realistic time constraints\.

##### Debate Trace\.

- Round 0 \(independent analysis\)\. Agent 0 proposes VIEW\_ORDERS\(\), reasoning that no active orders means it should refresh the order board\. Agent 1 also proposes VIEW\_ORDERS\(\), citing the need to refresh the list after the failed acceptance of \#11 and \#13\. Agent 2, however, notices that recent\_actions already contains a View available orders entry and that the memory retains the order pool; it proposes ACCEPT\_ORDER\(8\) \($8\.51, 18\-minute deadline\)\.
- Round 1\. Agent 0 reads Agent 2’s argument, recognizes that repeating VIEW\_ORDERS\(\) violates the rules and wastes a decision step, and switches to ACCEPT\_ORDER\(8\)\. Agent 1 similarly acknowledges the redundancy, noting that recent\_error only indicates that \#11 and \#13 failed \(not \#8\), and also switches to ACCEPT\_ORDER\(8\)\. Agent 2 maintains ACCEPT\_ORDER\(8\)\.
- Round 2\. All three agents converge on ACCEPT\_ORDER\(8\)\.

##### Action & Result\.

The executed action is ACCEPT\_ORDER\(8\)\. The agent secures the highest\-value available order without wasting a step on a redundant VIEW\_ORDERS\(\) call\.

##### Analysis\.

Two out of three agents initially proposed a redundant action that would have wasted an entire decision step\. In a single\-chain reasoning strategy such as CoT, the model has exactly one chance to make this judgment, so a single oversight directly costs a step\. The debate structure provides a second \(and third\) opportunity: Agent 2’s correct reading of recent\_actions is surfaced to the other agents in Round 1, and the cross\-examination process corrects the majority error before any action is executed\. Notably, the simplemem memory module, despite its narrative format, preserved the order pool with payouts and deadlines \(“\#8: $8\.51, 18 min”\), giving Agent 2 sufficient information to propose a concrete alternative rather than merely objecting to the redundancy\.

#### Case B: Solution Emergence

Source run: gpt\-5\-mini\_mad\_simple\_none\_1; Environment: 0001\_medium\_city\_22roads\_seed123; Key step: 30

##### Task & State\.

The agent is on an e\-scooter at \(−474\.37 m, 224\.04 m\) with 74% energy and 12% battery\. It is carrying Order \#29 \(Donut, Curry; $4\.08\), which is already 1 minute overtime\. The agent has arrived at the drop\-off location \(building 127, 22nd road\)\. The ephemeral\_context presents the delivery method choice: leave\_at\_door, knock, call, or hand\_to\_customer\.

##### Observation\.

The recent\_actions show the agent has completed the full pipeline: MOVE to pickup → WAIT 7 min → PICKUP → PLACE\_FOOD\_IN\_BAG → MOVE to drop\-off \(∼427\.3 m\)\. The order is overtime, so the decision is purely about which delivery method balances speed against service quality\.

##### Memory\.

Raw memory excerpt from Memory at step 30:

> - \#29: picked up and items placed in bag \(Donut \-\> A; Curry \-\> B\) \-\-\- currently carrying\.
> - You are at the drop\-off \(building 127\) and order is OVERTIME \(deliver ASAP\)\.
> - Transport: on e\-scooter, energy 74%, low battery \(12%\) \-\-\- but already at drop\-off \(7 m\)\.
> - Must call DROP\_OFF\(oid=29, method=\.\.\.\) now\. Note: “hand\_to\_customer” requires STEP\_FORWARD/TURN\_AROUND/search; leaving at door may incur penalties\.

##### Debate Trace\.

- Round 0 \(independent analysis\)\. Agent 0 proposes leave\_at\_door, prioritizing speed under overtime pressure\. Agent 1 proposes hand\_to\_customer, prioritizing service quality and complaint avoidance\. Agent 2 also proposes hand\_to\_customer\.
- Round 1\. Agent 0, seeing that two other agents chose hand\_to\_customer, is persuaded and switches to hand\_to\_customer\. Agent 1, after reading Agent 0’s time\-pressure argument for leave\_at\_door, synthesizes both positions and first proposes knock, a method absent from all Round 0 proposals\. Its reasoning: knock is faster than hand\_to\_customer \(no STEP\_FORWARD/TURN\_AROUND search required\) but provides better service than leave\_at\_door \(the customer has an opportunity to receive the order in person\)\. Agent 2, conversely, is swayed by the speed argument and switches to leave\_at\_door\.
- Round 2\. Agent 0 accepts the compromise logic and switches from hand\_to\_customer to knock\. Agent 1 maintains knock\. Agent 2 also accepts the compromise and switches from leave\_at\_door to knock\. Final vote: 3–0 for knock\.

##### Action & Result\.

The executed action is DROP\_OFF\(oid=29, method="knock"\)\. The agent delivers the overtime order using a method that was not proposed by any individual agent in the initial round\.

##### Analysis\.

The key observation is that knock was not in the initial solution set\. It emerged in Round 1 as Agent 1 synthesized the competing priorities of speed \(leave\_at\_door\) and service quality \(hand\_to\_customer\)\. This is qualitatively different from simple majority voting or persuasion: the debate produced a novel compromise that sits between the initial proposals on the speed–quality trade\-off, being faster than hand\_to\_customer and offering better service than leave\_at\_door\. A single\-chain strategy would likely have committed to one of the two extremes at the first inference pass\. The Round 1 dynamics are also noteworthy: the three agents effectively rotated positions \(Agent 0 moved toward service quality, Agent 2 moved toward speed\), while Agent 1 found the middle ground that ultimately convinced everyone in Round 2\.
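The round\-by\-round proposals in the debate trace can be replayed to make the emergence claim concrete\. The sketch below transcribes the three rounds from this case and checks which actions appear only after Round 0; the helper names are hypothetical, and the simple majority tally is a stand\-in for whatever aggregation MAD actually uses\.

```python
from collections import Counter
from typing import List, Set, Tuple

# Proposals of Agent 0, Agent 1, Agent 2 per round, transcribed from the trace.
ROUNDS = [
    ["leave_at_door", "hand_to_customer", "hand_to_customer"],  # Round 0
    ["hand_to_customer", "knock", "leave_at_door"],             # Round 1
    ["knock", "knock", "knock"],                                # Round 2
]


def majority(proposals: List[str]) -> Tuple[str, int]:
    """Return (action, votes) for the most common proposal in one round."""
    return Counter(proposals).most_common(1)[0]


def emergent_actions(rounds: List[List[str]]) -> Set[str]:
    """Actions proposed in later rounds that no agent proposed in Round 0."""
    initial = set(rounds[0])
    later = {action for proposals in rounds[1:] for action in proposals}
    return later - initial


round0_choice, round0_votes = majority(ROUNDS[0])
final_choice, final_votes = majority(ROUNDS[-1])

print("Round 0 majority:", round0_choice, round0_votes)
print("Final decision:", final_choice, final_votes)
print("Emergent actions:", emergent_actions(ROUNDS))
```

A plain vote over Round 0 would have selected hand\_to\_customer; the final decision differs because the set of candidate actions itself changed during the debate\.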

#### Summary

These two cases illustrate complementary mechanisms behind MAD’s robustness across memory modules\. Error correction operates as a *filter*: multiple reasoning paths cross\-examine each other and catch individual mistakes before execution\. Solution emergence operates as a *generator*: the debate synthesizes novel actions that no single agent proposed initially\. Together, they make MAD less sensitive to the quality or format of memory input: even when the memory module provides incomplete information \(simplemem omitting resource states\) or excessive raw detail \(simple injecting unprocessed observations\), three independent reasoning paths can each extract different useful signals and reconcile them through debate\. The cost is a substantial increase in token consumption, but the result is a performance floor that exceeds the ceiling of most other strategy–memory combinations\.

#### E\.1\.7 Case Study: Self\-Refine Converts Repeated Bagging Failures into a Structured Recovery Process

Source run: gpt\-5\-mini\-react\-simple\-self\_refine\.zip/agent\_1; Environment: 0000\_medium\_city\_22roads\_seed42; Key steps: 42–46

##### Task & State\.

The agent has already picked up Orders \#4 and \#5, but delivery cannot proceed because the food has not yet been successfully placed into the insulated bag\. This is therefore a post\-pickup recovery state rather than a normal planning state\. The bag already contains a heat pack in compartment A and an ice pack in compartment B, and the agent is acting after repeated failed bagging attempts\.

##### Observation\.

The prompt explicitly shows that both orders are *picked up, waiting for delivery*, while the food remains unplaced\. The environment also reports another failed bagging attempt:

> place\_food\_in\_bag failed: invalid compartment: A, 15,16 \-\> B

At the same time, the bag hint lists the unresolved assignment problem: Order \#4 contains Curry, and Order \#5 contains Pizza, PokeBowl, and Milkshake, with compartment A heated and compartment B cooled\.

##### Reflection\.

Step 42 \(initial correction, still fragile\)\.

> The agent correctly diagnosed the previous failure \(formatting/parsing issue\) and adjusted the bag\_cmd to list each order separately\. The placement respects compartment temperature and capacity\. But the chosen action are incorrect\.

Step 43 \(turning point: inspect before retry\)\.

> VIEW\_BAG\(\) is the right next action\. It will reveal the actual valid compartment IDs/slot info so the agent can craft a valid single combined bag\_cmd and avoid repeating the same failing PLACE\_FOOD\_IN\_BAG call\. \.\.\. inspect bag to resolve the error, then issue a corrected PLACE\_FOOD\_IN\_BAG\.

Step 44 \(decomposed recovery\)\.

> Placing order 4 item 1 \(Curry\) into A is the right immediate step\. It avoids the earlier multi\-order/formatting errors by issuing a simple single\-order command\.

##### Action & Result\.

The agent first executes VIEW\_BAG\(\) instead of retrying another complex bagging command, then decomposes the repair into two simpler actions:

> Step 44: PLACE\_FOOD\_IN\_BAG\(bag\_cmd="order 4: 1 \-\> A"\)
>
> Step 45: PLACE\_FOOD\_IN\_BAG\(bag\_cmd="order 5: 1 \-\> A; 2,3 \-\> B"\)

By step 46, both orders are successfully bagged and the agent resumes delivery toward the nearer dropoff\.

##### Analysis\.

What makes this case informative is that the benefit of self\-refine does not come from a single perfect correction\. In step 42, the reflection already identifies the previous failure as a formatting issue and approves a revised bagging command\. However, the subsequent environment state in step 43 shows that this repair is still insufficient, since the action fails again with another invalid\-compartment error\. This makes the case more revealing than a clean one\-shot success: the model is not simply solving the problem immediately, but improving through iterative recovery\.

The real turning point appears in step 43, where self\-refine changes the recovery strategy itself\. Instead of continuing to optimize another all\-in\-one bagging command, the reflection explicitly recommends an inspection step, VIEW\_BAG\(\), before retrying\. This matters because the failure is no longer treated as only a textual formatting issue; it is reinterpreted as a state\-alignment problem between the agent’s command and the simulator’s actual bag representation\. In other words, self\-refine helps the model realize that the next useful action is not to “try harder,” but to look again\.

Step 44 then shows a second layer of improvement\. After inspection, the agent does not return to a large combined command\. Instead, it adopts a decomposed repair policy, placing one order first and postponing the second\. This decomposition is mechanically valuable: it reduces parser fragility, lowers action complexity, and creates an intermediate success state that makes the next action easier\. The success at steps 44–45 therefore comes not from more elaborate verbal reasoning alone, but from a better control strategy for recovery: inspect the real state, simplify the action, and restore forward progress incrementally\.
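The decomposition step can be illustrated by generating one bagging command per order instead of a single combined command\. This is a sketch under stated assumptions: the command grammar is inferred only from the two commands shown in this case, the simulator’s real parser is not shown in the text, and per\_order\_commands is a hypothetical helper\.

```python
from typing import Dict, List

# order id -> compartment -> item indices within that order
Assignment = Dict[int, Dict[str, List[int]]]


def per_order_commands(assignments: Assignment) -> List[str]:
    """Build one bag_cmd string per order (grammar inferred from the case)."""
    commands = []
    for order_id in sorted(assignments):
        parts = []
        for compartment, items in sorted(assignments[order_id].items()):
            item_list = ",".join(str(i) for i in items)
            parts.append(f"{item_list} -> {compartment}")
        commands.append(f"order {order_id}: " + "; ".join(parts))
    return commands


# Placement used at steps 44 and 45 of the case.
plan = {4: {"A": [1]}, 5: {"A": [1], "B": [2, 3]}}

for cmd in per_order_commands(plan):
    print(f'PLACE_FOOD_IN_BAG(bag_cmd="{cmd}")')
```

Issuing the commands one at a time means that a failure can be attributed to a single order, which is the property the analysis above credits for the recovery\.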

##### Insight\.

This case suggests a broader explanation for why small models with self\-refine can begin to approach stronger backbones in DeliveryBench\. The gain is not necessarily that self\-refine gives them uniformly better high\-level planning\. Rather, it provides an explicit error\-correction layer that helps convert repeated local failures into structured recovery procedures\. When tasks involve simulator\-specific action syntax, hidden state details, or brittle execution interfaces, such a correction layer can substantially reduce wasted steps and prevent the agent from getting stuck in unproductive retries\.

More generally, the case shows that self\-refine is especially useful when the model’s first attempt is locally plausible but operationally fragile\. In such situations, the main advantage is not producing a more sophisticated initial plan, but detecting that the current strategy is unstable and replacing it with a more reliable one\. Here, that replacement takes the form of an inspect\-then\-decompose policy: first recover the true state, then split a brittle composite action into simpler, more robust sub\-actions\. This kind of repair mechanism is a realistic and practically important route by which self\-refine can narrow the performance gap between smaller and larger models\.

### E\.2 MiniGrid

#### E\.2\.1 Complete Experimental Results

Table [14](https://arxiv.org/html/2606.14674#A5.T14) summarizes the complete ablation results on the MiniGrid task suite with GPT\-5\-mini\. We report task\-level outcomes across different combinations of reasoning strategies, memory modules, and reflection settings\. Table [15](https://arxiv.org/html/2606.14674#A5.T15) further reports success rates across model scales and backbones\.

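Table 14: Ablation results on the MiniGrid task suite\. We evaluate different combinations of reasoning modules, memory components, and reflection settings with GPT\-5\-mini as base model\. Each row corresponds to one method configuration\. A green checkmark indicates successful task completion, while a red cross indicates failure\. Values in parentheses denote the steps used\.

| Method | Empty | Fetch | FourRooms | GoToDoor | GoToObj | LavaCross | LavaGap | MultiRoom | SimpleCross1 | SimpleCross2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CoT \+ ChatDB | ✓\(17\) | ✗ | ✗ | ✗ | ✓\(17\) | ✗ | ✗ | ✗ | ✗ | ✗ |
| CoT \+ MemoryBank | ✓\(19\) | ✗ | ✗ | ✗ | ✓\(5\) | ✗ | ✓\(7\) | ✗ | ✗ | ✗ |
| CoT \+ OpenClaw | ✓\(15\) | ✗ | ✗ | ✓\(67\) | ✓\(5\) | ✗ | ✓\(35\) | ✗ | ✗ | ✗ |
| CoT \+ Simple | ✓\(31\) | ✓\(7\) | ✗ | ✗ | ✓\(9\) | ✗ | ✓\(5\) | ✓\(54\) | ✗ | ✗ |
| CoT \+ SimpleMem | ✓\(42\) | ✗ | ✗ | ✗ | ✓\(73\) | ✗ | ✓\(9\) | ✗ | ✗ | ✗ |
| Direct \+ Simple | ✓\(94\) | ✗ | ✗ | ✗ | ✗ | ✗ | ✓\(5\) | ✓\(47\) | ✗ | ✗ |
| MAD \+ Simple | ✓\(85\) | ✓\(7\) | ✗ | ✗ | ✓\(73\) | ✗ | ✓\(5\) | ✓\(97\) | ✓\(100\) | ✗ |
| MAD \+ Simple \+ Self\-Refine | ✓\(7\) | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓\(8\) | ✓\(28\) | ✗ |
| Plan&Solve \+ Simple | ✓\(19\) | ✗ | ✗ | ✗ | ✓\(20\) | ✗ | ✓\(58\) | ✓\(8\) | ✗ | ✗ |
| Plan&Solve \+ SimpleMem | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓\(14\) | ✗ | ✗ | ✗ |
| ReAct \+ ChatDB | ✓\(25\) | ✗ | ✗ | ✗ | ✓\(23\) | ✗ | ✗ | ✗ | ✗ | ✗ |
| ReAct \+ DC | ✓\(7\) | ✓\(85\) | ✗ | ✗ | ✓\(15\) | ✗ | ✓\(5\) | ✓\(10\) | ✗ | ✗ |
| ReAct \+ DC \+ Self\-Refine | ✓\(7\) | ✗ | ✗ | ✗ | ✗ | ✗ | ✓\(9\) | ✗ | ✗ | ✗ |
| ReAct \+ MemoryBank | ✓\(20\) | ✗ | ✗ | ✗ | ✓\(5\) | ✗ | ✓\(5\) | ✓\(17\) | ✗ | ✗ |
| ReAct \+ OpenClaw | ✓\(49\) | ✗ | ✗ | ✗ | ✓\(49\) | ✗ | ✓\(29\) | ✗ | ✗ | ✗ |
| ReAct \+ Simple | ✓\(9\) | ✗ | ✗ | ✗ | ✓\(19\) | ✗ | ✓\(5\) | ✓\(22\) | ✗ | ✗ |
| ReAct \+ Simple \+ Self\-Refine | ✓\(7\) | ✗ | ✗ | ✗ | ✓\(9\) | ✗ | ✓\(7\) | ✓\(10\) | ✗ | ✗ |
| ReAct \+ SimpleMem | ✓\(17\) | ✗ | ✗ | ✗ | ✓\(11\) | ✗ | ✓\(5\) | ✗ | ✓\(22\) | ✗ |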

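| Reasoning | Memory | Reflection | Qwen3\.5\-27B SR | Qwen3\.5\-9B SR | Qwen3\.5\-2B SR | Qwen3\.5\-0\.8B SR | GPT\-5 mini SR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| None | Base | None | 0\.4 | 0 | 0\.1 | 0\.1 | 0\.3 |
| ReAct | Base | None | 0\.5 | 0\.4 | 0\.4 | 0\.4 | 0\.4 |
| ReAct | ChatDB | None | 0\.6 | 0\.5 | 0\.2 | 0\.1 | 0\.2 |
| ReAct | DC | None | 0\.5 | 0\.5 | 0\.2 | 0\.2 | 0\.5 |
| ReAct | SimpleMem | None | 0\.5 | 0\.4 | 0\.3 | 0\.2 | 0\.4 |
| ReAct | MemoryBank | None | 0\.4 | 0\.3 | 0\.3 | 0\.3 | 0\.4 |
| ReAct | OpenClaw | None | 0\.3 | 0\.3 | 0\.3 | 0\.3 | 0\.3 |
| CoT | Base | None | 0\.5 | 0\.4 | 0\.4 | 0\.2 | 0\.5 |
| CoT | ChatDB | None | 0\.6 | 0\.4 | 0\.4 | 0\.2 | – |
| CoT | DC | None | 0\.6 | 0\.4 | – | 0\.3 | – |
| CoT | SimpleMem | None | 0\.5 | 0\.2 | 0\.4 | 0\.1 | – |
| CoT | MemoryBank | None | 0\.2 | 0\.2 | 0\.4 | 0\.2 | – |
| Plan&Solve | Base | None | 0\.5 | 0\.2 | 0\.3 | 0\.1 | 0\.4 |
| Plan&Solve | ChatDB | None | 0\.5 | 0\.3 | 0 | 0\.1 | – |
| Plan&Solve | DC | None | 0\.3 | 0\.3 | – | 0 | – |
| Plan&Solve | SimpleMem | None | 0\.5 | 0\.2 | 0\.1 | 0 | – |
| Plan&Solve | MemoryBank | None | 0\.1 | 0\.1 | 0\.1 | 0\.1 | – |
| MAD | Base | None | 0\.5 | 0\.1 | 0\.6 | 0\.1 | 0\.6 |
| MAD | ChatDB | None | 0\.4 | 0\.3 | 0\.4 | 0\.1 | – |
| MAD | DC | None | 0\.5 | 0\.3 | – | 0\.1 | – |
| MAD | SimpleMem | None | 0\.4 | 0\.3 | 0\.4 | 0\.1 | – |
| MAD | MemoryBank | None | 0\.3 | 0\.3 | 0\.4 | 0\.3 | – |
| ReAct | Base | Self\-Refine | 0\.5 | 0\.6 | – | 0\.1 | 0\.4 |
| CoT | Base | Self\-Refine | 0\.3 | 0\.3 | – | 0\.3 | – |
| CoT | SimpleMem | Self\-Refine | 0\.4 | 0\.3 | – | 0\.3 | – |
| Plan&Solve | Base | Self\-Refine | 0\.6 | 0\.1 | – | 0 | – |
| MAD | MemoryBank | Self\-Refine | 0\.4 | 0\.3 | – | 0\.2 | – |
| ReAct | Base | Reflexion | 0\.5 | 0\.3 | – | 0\.2 | 0\.1 |
| ReAct | DC | Reflexion | 0\.5 | 0\.4 | – | 0\.2 | – |
| ReAct | MemoryBank | Reflexion | 0\.3 | 0\.1 | 0\.5 | 0\.3 | – |
| CoT | Base | Reflexion | 0\.4 | 0\.6 | – | 0\.2 | – |
| CoT | SimpleMem | Reflexion | 0\.5 | 0\.4 | – | 0\.4 | – |
| Plan&Solve | Base | Reflexion | 0\.5 | 0\.2 | – | 0 | – |
| MAD | MemoryBank | Reflexion | 0\.6 | 0\.3 | – | 0\.2 | – |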

Table 15: Complete main results on MiniGrid for Qwen3\.5\-27B, Qwen3\.5\-9B, Qwen3\.5\-2B, Qwen3\.5\-0\.8B, and GPT\-5 mini\. All entries report success rate \(SR\) only\. Some method combinations were not evaluated due to time and resource constraints, and their missing results are marked as –\.

#### E\.2\.2 Case A

![Refer to caption](https://arxiv.org/html/2606.14674v1/x6.png)

Figure 10: MiniGrid: ReAct\+Simple\+None \(Failure\)\.

In Figure [10](https://arxiv.org/html/2606.14674#A5.F10), the agent correctly infers the goal direction and turns right to align with it, but fails to account for the nearby lava barrier\. While the action is locally plausible, it does not lead to a valid path\. By prioritizing goal alignment over environmental constraints, the agent produces an ineffective decision, reflecting short\-horizon planning and insufficient risk evaluation\.

![Refer to caption](https://arxiv.org/html/2606.14674v1/x7.png)

Figure 11: MiniGrid: MAD\+Simple\+None \(Success\)\.

In Figure [11](https://arxiv.org/html/2606.14674#A5.F11), the agents identify a safe passage through the lava and move forward into the empty cell, enabling correct progression\. MAD employs a three\-agent, three\-round debate process that encourages a conservative, lava\-avoidance strategy\. At each step, the agents converge on a safe action with consistent agreement, effectively filtering out risky movements\.

#### E\.2\.3 Case B

![Refer to caption](https://arxiv.org/html/2606.14674v1/x8.png)

Figure 12: MiniGrid: ReAct\+Simple\+None \(Failure\)\.

In Figure [12](https://arxiv.org/html/2606.14674#A5.F12), the agent observes a purple ball directly in front and incorrectly treats it as an obstacle that blocks forward movement\. It decides to pick up the ball in order to clear the path, despite the mission being to retrieve the purple key\. This action is driven by local affordances rather than task relevance, resulting in a redundant and improper decision\.

![Refer to caption](https://arxiv.org/html/2606.14674v1/x9.png)

Figure 13: MiniGrid: CoT\+Simple\+None \(Success\)\.

In Figure [13](https://arxiv.org/html/2606.14674#A5.F13), the agent recognizes that the purple ball is not the target object and instead focuses on locating the purple key\. It chooses to turn and explore alternative directions rather than interacting with the irrelevant object\. This leads to a goal\-consistent action, correctly avoiding distraction from non\-essential elements in the environment\.

#### E\.2\.4 Case C

![Refer to caption](https://arxiv.org/html/2606.14674v1/x10.png)

Figure 14: MiniGrid: ReAct\+Simple\+None \(Failure\)\.

In Figure [14](https://arxiv.org/html/2606.14674#A5.F14), the agent observes a wall directly ahead and decides to turn right to search for an opening\. However, this behavior leads to repeated local exploration along the same row without entering new regions\. The agent becomes trapped in a local oscillation pattern, failing to effectively explore alternative directions and ultimately exceeding the step limit\.

![Refer to caption](https://arxiv.org/html/2606.14674v1/x11.png)

Figure 15: MiniGrid: MAD\+Simple\+None \(Success\)\.

In Figure [15](https://arxiv.org/html/2606.14674#A5.F15), the agents infer that they are near the left boundary and choose to turn left to move toward the map interior\. Despite partial disagreement during the debate, the selected action enables continued exploration into new regions\. This avoids local oscillation and allows the agent to eventually reach the goal, albeit with some inefficiency\.

### E\.3 ALFRED

#### E\.3\.1 Complete Experimental Results

Table [16](https://arxiv.org/html/2606.14674#A5.T16) summarizes the complete results on the ALFRED task suite\. We report performance across different combinations of reasoning strategies, memory modules, and reflection settings\.

Table 16: Ablation results on the ALFRED task suite\. We evaluate different combinations of reasoning modules \(CoT, Direct, MAD, ReAct, and Plan\-and\-Solve\), memory components \(ChatDB, DC, Memory Bank, Simple Memory, or none\), and reflection settings with gpt\-5\-mini as backbone model\. Each row corresponds to one method configuration\. For readability, method names are written without underscores\. Unless a method name explicitly includes self\-refine, the reflection module is set to none\. A green checkmark indicates successful task completion, while a red cross indicates failure\. Values in parentheses denote subgoal completion rates\.

| Method | 1Pick | 2Pick2 | 3Heat | 4Cool | 5Clean | 6Light | 7Mobile |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CoT \+ ChatDB | ✓ | ✗\(0/2\) | ✗\(0/3\) | ✓ | ✗\(0/3\) | ✗\(0/2\) | ✗\(1/3\) |
| CoT \+ DC | ✓ | ✗\(1/2\) | ✗\(1/3\) | ✗\(1/3\) | ✗\(0/3\) | ✗\(0/2\) | ✓ |
| CoT \+ SimpleMem | ✗\(0/1\) | ✗\(1/2\) | ✗\(0/3\) | ✗\(1/3\) | ✗\(0/3\) | ✗\(0/2\) | ✗\(0/3\) |
| CoT \+ Simple | ✓ | ✗\(0/2\) | ✗\(1/3\) | ✗\(1/3\) | ✗\(0/3\) | ✗\(1/2\) | ✗\(0/3\) |
| Direct \+ Simple | ✗\(0/1\) | ✗\(0/2\) | ✗\(0/3\) | ✗\(1/3\) | ✗\(0/3\) | ✗\(0/2\) | ✗\(0/3\) |
| MAD \+ Simple | ✗\(0/1\) | ✗\(0/2\) | ✗\(0/3\) | ✗\(1/3\) | ✗\(1/3\) | ✗\(1/2\) | ✗\(0/3\) |
| Plan&Solve \+ Simple | ✗\(0/1\) | ✗\(1/2\) | ✗\(1/3\) | ✗\(0/3\) | ✗\(0/3\) | ✗\(0/2\) | ✗\(0/3\) |
| ReAct \+ ChatDB | ✗\(0/1\) | ✗\(0/2\) | ✓ | ✓ | ✗\(0/3\) | ✗\(1/2\) | ✗\(0/3\) |
| ReAct \+ DC | ✓ | ✗\(0/2\) | ✗\(0/3\) | ✗\(1/3\) | ✗\(0/3\) | ✗\(1/2\) | ✗\(0/3\) |
| ReAct \+ Memory Bank | ✓ | ✗\(0/2\) | ✗\(1/3\) | ✓ | ✗\(0/3\) | ✗\(0/2\) | ✗\(0/3\) |
| ReAct \+ SimpleMem | ✓ | ✗\(0/2\) | ✗\(0/3\) | ✗\(1/3\) | ✗\(0/3\) | ✗\(0/2\) | ✗\(0/3\) |
| ReAct \+ Simple | ✓ | ✗\(1/2\) | ✗\(1/3\) | ✗\(1/3\) | ✗\(0/3\) | ✗\(1/2\) | ✗\(0/3\) |
| ReAct \+ Simple \+ Self\-Refine | ✓ | ✗\(1/2\) | ✗\(1/3\) | ✗\(1/3\) | ✗\(0/3\) | ✗\(1/2\) | ✗\(0/3\) |
| Success Rate | 10/17 | 0/17 | 1/17 | 3/17 | 0/17 | 0/17 | 1/17 |

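| Reasoning | Memory | Reflection | Qwen3\.5\-27B SR | Qwen3\.5\-9B SR | Qwen3\.5\-2B SR | Qwen3\.5\-0\.8B SR | GPT\-5 mini SR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| None | Base | None | 0 | 0 | 0 | 0 | 0 |
| ReAct | Base | None | 0\.4 | 0\.5 | 0\.2 | 0 | 0\.4 |
| ReAct | ChatDB | None | 0\.5 | 0\.5 | 0\.1 | 0 | 0\.4 |
| ReAct | DC | None | 0\.5 | 0\.2 | 0\.2 | 0 | 0\.4 |
| ReAct | SimpleMem | None | 0\.6 | 0\.4 | 0\.1 | 0 | 0\.2 |
| ReAct | MemoryBank | None | 0\.3 | 0\.3 | 0\.2 | 0 | 0\.1 |
| ReAct | OpenClaw | None | 0\.3 | 0\.2 | 0\.1 | 0 | 0\.1 |
| CoT | Base | None | 0\.3 | 0\.4 | 0\.1 | 0 | 0\.2 |
| CoT | ChatDB | None | 0\.6 | 0\.3 | 0\.1 | 0 | – |
| CoT | DC | None | 0\.4 | 0\.4 | 0 | 0 | – |
| CoT | SimpleMem | None | 0\.5 | 0\.3 | 0\.1 | 0 | – |
| CoT | MemoryBank | None | 0\.4 | 0\.3 | 0\.1 | 0 | – |
| Plan&Solve | Base | None | 0 | 0\.1 | 0 | 0 | 0\.2 |
| Plan&Solve | ChatDB | None | 0\.1 | 0\.3 | 0 | 0 | – |
| Plan&Solve | DC | None | 0\.1 | 0 | 0 | 0 | – |
| Plan&Solve | SimpleMem | None | 0\.2 | 0 | 0 | 0 | – |
| Plan&Solve | MemoryBank | None | 0\.3 | 0 | 0 | 0 | – |
| MAD | Base | None | 0\.4 | 0\.3 | 0\.2 | 0 | 0 |
| MAD | ChatDB | None | 0\.1 | 0\.3 | 0\.1 | 0 | – |
| MAD | DC | None | 0\.1 | 0\.3 | 0\.1 | 0\.1 | – |
| MAD | SimpleMem | None | 0 | 0 | 0 | 0\.1 | – |
| MAD | MemoryBank | None | 0 | 0 | 0 | 0 | – |
| ReAct | Base | Self\-Refine | 0\.1 | 0\.3 | 0\.1 | 0 | 0\.5 |
| CoT | Base | Self\-Refine | 0\.2 | 0\.4 | – | – | – |
| CoT | SimpleMem | Self\-Refine | 0\.1 | 0\.3 | – | – | – |
| Plan&Solve | Base | Self\-Refine | 0\.1 | 0\.1 | – | – | – |
| MAD | MemoryBank | Self\-Refine | 0\.1 | 0\.4 | – | – | – |
| ReAct | Base | Reflexion | 0\.5 | 0\.3 | 0\.1 | 0 | 0\.3 |
| ReAct | DC | Reflexion | 0\.2 | 0\.3 | 0\.2 | 0 | – |
| ReAct | MemoryBank | Reflexion | 0\.4 | 0\.1 | 0 | 0 | – |
| CoT | Base | Reflexion | 0\.4 | 0\.2 | – | – | – |
| CoT | SimpleMem | Reflexion | 0\.7 | 0\.4 | – | – | – |
| Plan&Solve | Base | Reflexion | 0 | 0\.1 | – | – | – |
| MAD | MemoryBank | Reflexion | 0\.2 | – | – | – | – |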

Table 17: Complete main results on ALFRED for Qwen3\.5\-27B, Qwen3\.5\-9B, Qwen3\.5\-2B, Qwen3\.5\-0\.8B, and GPT\-5 mini\. All entries report success rate \(SR\) only\. Some method combinations were not evaluated due to time and resource constraints, and their missing results are marked as –\.

#### E\.3\.2 Case A

![Refer to caption](https://arxiv.org/html/2606.14674v1/x12.png)

Figure 16: ALFRED: CoT\+DC\+None \(Failure\)\.

In Figure [16](https://arxiv.org/html/2606.14674#A5.F16), the agent relies on rule\-based memory but fails to update it according to state changes\. After placing the bread into the fridge, it loses track of the object state and incorrectly assumes the task is complete\. As a result, the agent prematurely signals completion without retrieving the cooled bread, reflecting a failure to track task progress\.

![Refer to caption](https://arxiv.org/html/2606.14674v1/x13.png)

Figure 17: ALFRED: CoT\+ChatDB\+None \(Success\)\.

In Figure [17](https://arxiv.org/html/2606.14674#A5.F17), the agent maintains structured memory that explicitly records state transitions across steps\. It correctly tracks that the bread has been cooled and is currently inside the fridge, and infers the remaining objective\. Based on this, it retrieves the bread and proceeds toward the countertop, leading to a consistent and goal\-aligned action sequence\.

#### E\.3\.3 Case B

![Refer to caption](https://arxiv.org/html/2606.14674v1/x14.png)

Figure 18: ALFRED: ReAct\+DC\+None \(Failure\)\.

In Figure [18](https://arxiv.org/html/2606.14674#A5.F18), the agent focuses on nearby drawers and decides to move forward to inspect them, assuming they are likely to contain the target object\. However, this results in a local search policy that repeatedly prioritizes nearby containers without exploring the broader scene\. The agent fails to expand its search space, leading to inefficient exploration and an inability to locate the key object\.

![Refer to caption](https://arxiv.org/html/2606.14674v1/x15.png)

Figure 19: ALFRED: CoT\+DC\+None \(Success\)\.

In Figure [19](https://arxiv.org/html/2606.14674#A5.F19), the agent explicitly adopts a global search strategy by rotating in place to scan the environment\. This allows it to reveal a wider view of the scene and identify relevant objects beyond the immediate vicinity\. By expanding its observation space before acting, the agent avoids local search bias and proceeds with a more effective exploration strategy\.

### E\.4 RoboTHOR

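| Reasoning | Memory | Reflection | Qwen3\.5\-27B SR | Qwen3\.5\-9B SR | Qwen3\.5\-2B SR | Qwen3\.5\-0\.8B SR | GPT\-5 mini SR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| None | Base | None | 0\.1 | 0 | 0 | 0 | 0\.1 |
| ReAct | Base | None | 0\.1 | 0 | 0 | 0 | 0\.2 |
| ReAct | ChatDB | None | 0\.2 | 0 | 0 | 0\.1 | 0\.2 |
| ReAct | DC | None | 0\.1 | 0\.1 | 0 | 0\.1 | 0\.1 |
| ReAct | SimpleMem | None | 0\.1 | 0\.1 | 0 | 0 | 0\.1 |
| ReAct | MemoryBank | None | 0\.2 | 0\.2 | 0\.1 | 0 | 0\.3 |
| ReAct | OpenClaw | None | 0\.2 | 0\.2 | 0 | 0 | 0\.1 |
| CoT | Base | None | 0 | 0 | 0 | 0 | 0 |
| CoT | ChatDB | None | 0\.2 | 0 | – | 0\.1 | – |
| CoT | DC | None | 0 | 0 | – | 0\.1 | – |
| CoT | SimpleMem | None | 0\.1 | 0\.1 | 0 | 0\.1 | – |
| CoT | MemoryBank | None | 0\.2 | 0\.1 | 0\.1 | 0\.1 | – |
| Plan&Solve | Base | None | 0\.1 | 0 | 0 | 0 | 0\.1 |
| Plan&Solve | ChatDB | None | 0 | 0\.2 | – | 0 | – |
| Plan&Solve | DC | None | 0 | 0\.1 | – | 0\.1 | – |
| Plan&Solve | SimpleMem | None | 0 | 0\.2 | 0 | 0\.1 | – |
| Plan&Solve | MemoryBank | None | 0\.2 | 0\.2 | 0 | 0 | – |
| MAD | Base | None | 0\.2 | 0\.1 | 0 | 0 | 0\.1 |
| MAD | ChatDB | None | 0\.2 | 0\.2 | – | 0 | – |
| MAD | DC | None | 0 | 0 | – | 0\.1 | – |
| MAD | SimpleMem | None | 0 | 0\.1 | – | 0\.2 | – |
| MAD | MemoryBank | None | 0\.1 | – | – | 0\.1 | – |
| ReAct | Base | Self\-Refine | 0\.1 | 0\.1 | 0 | 0\.1 | 0\.2 |
| CoT | Base | Self\-Refine | 0\.1 | 0 | – | 0 | – |
| CoT | SimpleMem | Self\-Refine | 0 | 0\.1 | – | 0\.1 | – |
| Plan&Solve | Base | Self\-Refine | 0\.1 | 0\.1 | – | 0 | – |
| MAD | MemoryBank | Self\-Refine | 0\.1 | 0\.1 | – | 0\.1 | – |
| ReAct | Base | Reflexion | 0\.1 | 0\.2 | 0\.1 | 0 | 0\.4 |
| ReAct | DC | Reflexion | 0\.1 | 0\.2 | 0 | 0\.1 | – |
| ReAct | MemoryBank | Reflexion | 0\.2 | 0\.3 | 0\.1 | 0 | – |
| CoT | Base | Reflexion | 0\.1 | 0\.1 | – | 0\.1 | – |
| CoT | SimpleMem | Reflexion | 0 | 0\.1 | – | 0 | – |
| Plan&Solve | Base | Reflexion | 0 | 0\.1 | – | 0 | – |
| MAD | MemoryBank | Reflexion | 0\.2 | 0\.1 | – | 0\.1 | – |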

Table 18: Complete main results on RoboTHOR\. All entries report success rate \(SR\) only\. Some method combinations were not evaluated due to time and resource constraints; their missing results are marked as –\.

#### E\.4\.1 Case 1: MemoryBank helps break the loop

##### Memory

Failure Episode \(Television Task\): Episode 2, Steps 1–5\. Agent position: $x=2.50, y=0.90, z=-1.50$ \(static\)\. Target: Television \(success at $\leq 1.50$ m\)\.

- Step 1: MoveAhead succeeded \(agent pose moved from $x=3.00$ to $x=2.50$\)\.
- Steps 2–5: Four consecutive RotateRight actions\. Agent heading changes: yaw $270^{\circ} \to 315^{\circ} \to 0^{\circ} \to 44.6^{\circ}$\.
- Reasoning trace: Steps 2–5 all report “I don’t see television in current view,” followed by “I’ll rotate right to scan a new direction\.” No MoveAhead follows any rotation; no new spatial exploration occurs\.

Success Episode \(Basketball Task\): Episode 1, Steps 186–187\. Agent position: $x=8.00, y=0.90, z=-2.50$\. Target: BasketBall\.

- Step 186: MoveAhead succeeded\. Agent moves within goal threshold \($\leq 1.50$ m\)\.
- Step 187: Stop executed\. Episode terminated with success=True\.
- Environment diagnostics: front\_goal\_no\_forward=0, front\_key\_no\_pick\_up=0, front\_door\_no\_toggle=0 \(path fully clear\)\.

##### Insight

The root cause is rotation\-exploration conflation: the agent treats heading changes \(yaw\-only\) identically to spatial moves \(changes in $(x,z)$\), ignoring that rotations at a static position yield no new observables\. The television task exemplifies this: four consecutive rotations at position $(2.50, 0.90, -1.50)$ generate identical diagnostics, yet the agent re\-evaluates each as information\-generating\.
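The conflation described above can be checked mechanically on a logged trajectory\. The following is a minimal sketch, not part of AgentSpec: the `(action, x, z)` log format and the function name `longest_static_rotation_run` are assumptions made for illustration, and the example data reuses only the positions and actions reported for the television episode\.

```python
from typing import List, Tuple

# Hypothetical log format: (action name, x after the action, z after the action).
Step = Tuple[str, float, float]


def longest_static_rotation_run(steps: List[Step], tol: float = 1e-6) -> int:
    """Length of the longest run of consecutive Rotate* actions during which
    the (x, z) position stays the same as after the previous step."""
    longest = 0
    run = 0
    prev_pos = None
    for action, x, z in steps:
        static = (
            prev_pos is not None
            and abs(x - prev_pos[0]) <= tol
            and abs(z - prev_pos[1]) <= tol
        )
        if action.startswith("Rotate") and static:
            run += 1
            longest = max(longest, run)
        else:
            # Any move (or the first logged step) breaks the run.
            run = 0
        prev_pos = (x, z)
    return longest


# Television episode, Steps 1-5: one MoveAhead, then four RotateRight in place.
television = [("MoveAhead", 2.50, -1.50)] + [("RotateRight", 2.50, -1.50)] * 4
print(longest_static_rotation_run(television))  # 4
```

A run length of 3 or more could serve as a simple flag for rotate\-in\-place loops; that threshold is an illustrative choice rather than one taken from the experiments\.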

Three correctives: \(1\) No rotation without spatial progress: if the position is unchanged and the target is unseen, execute MoveAhead instead of rotating again\. \(2\) Headed commitment: commit to one orientation, then execute 2–3 MoveAhead actions before re\-scanning\. \(3\) Distance trigger: if distance $\leq$ goal\_radius, execute Stop immediately\. These mechanisms prevent the rotate\-in\-place dead loop and enforce the spatial\-movement discipline demonstrated in the successful basketball episode\.
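One way to read the three correctives is as a thin filter placed between the policy's proposed action and the environment\. The sketch below is an assumption\-laden illustration, not the authors' implementation: `Observation`, `ExplorationGuard`, and the `commit_moves` parameter are hypothetical names, and "position unchanged" is approximated by "the previous executed action was a rotation"\. MoveAhead attempts are counted whether or not they succeed, so a blocked agent is still allowed to turn after `commit_moves` attempts; that escape hatch is an added assumption\.

```python
from dataclasses import dataclass
from typing import Optional

ROTATIONS = {"RotateLeft", "RotateRight"}


@dataclass
class Observation:
    target_visible: bool
    goal_distance: Optional[float] = None  # metres; None if the target is not located


class ExplorationGuard:
    """Hypothetical action filter applying the three correctives."""

    def __init__(self, goal_radius: float = 1.50, commit_moves: int = 2):
        self.goal_radius = goal_radius
        self.commit_moves = commit_moves
        self.last_action: Optional[str] = None
        # Start "fully committed" so that an initial scan is allowed.
        self.moves_since_rotation = commit_moves

    def filter(self, proposed: str, obs: Observation) -> str:
        action = proposed
        if obs.goal_distance is not None and obs.goal_distance <= self.goal_radius:
            # (3) Distance trigger: stop as soon as the goal is within range.
            action = "Stop"
        elif proposed in ROTATIONS and not obs.target_visible:
            rotated_in_place = self.last_action in ROTATIONS           # (1)
            still_committed = self.moves_since_rotation < self.commit_moves  # (2)
            if rotated_in_place or still_committed:
                action = "MoveAhead"
        self._record(action)
        return action

    def _record(self, action: str) -> None:
        if action in ROTATIONS:
            self.moves_since_rotation = 0
        elif action == "MoveAhead":
            self.moves_since_rotation += 1
        self.last_action = action


# Replaying the television episode's proposals through the guard.
guard = ExplorationGuard()
unseen = Observation(target_visible=False)
proposals = ["MoveAhead"] + ["RotateRight"] * 4
print([guard.filter(p, unseen) for p in proposals])
```

With these settings, the second and third consecutive rotations are replaced by MoveAhead, and rotation is allowed again only after two forward moves\.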

## Appendix FWeb Interface for Interactive Module Configuration

To improve usability and lower the barrier to experimentation, AgentSpec provides a lightweight web\-based interface for interactively configuring agent pipelines\. The interface exposes the major components of the framework, including the environment, perception adapter, reasoning method, memory backend, and reflection module\. Users can choose different implementations for each component through dropdown menus and directly generate runnable Python setup code from the selected configuration\.

Figure [20](https://arxiv.org/html/2606.14674#A6.F20) shows two example configurations of the interface\. The first example illustrates a MiniGrid setup with Tree of Thoughts reasoning, A\-MEM \(associative memory\), and Self\-Refine reflection\. The second example shows a DeliveryBench setup with RAP \(Reasoning via Planning\) reasoning, Dynamic Cheatsheet \(DC\) memory, and Self\-Refine reflection\. These examples highlight the modular design of AgentSpec: the same interface can be used to compose different agent systems across environments and experimental settings without manually modifying the underlying framework code\.
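To make the idea of "selection in, setup code out" concrete, here is a small sketch of how a dropdown selection could be turned into a Python snippet\. It does not reproduce the AgentSpec API or the code the playground actually emits: `ScaffoldConfig`, `render_setup_code`, and the emitted `scaffold` dictionary are hypothetical, and only the module names come from the two configurations described above\. The perception adapter is left unset because the text does not state which one those examples use\.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ScaffoldConfig:
    """Hypothetical record of one dropdown selection per component."""
    environment: str
    reasoning: str
    memory: str
    reflection: str
    perception: Optional[str] = None  # not stated for the Figure 20 examples


def render_setup_code(cfg: ScaffoldConfig) -> str:
    """Render the selection as a small, executable Python snippet."""
    lines = ["# Scaffold selection (illustrative format only)", "scaffold = {"]
    for key, value in asdict(cfg).items():
        lines.append(f"    {key!r}: {value!r},")
    lines.append("}")
    return "\n".join(lines)


minigrid = ScaffoldConfig("MiniGrid", "Tree of Thoughts", "A-MEM", "Self-Refine")
deliverybench = ScaffoldConfig(
    "DeliveryBench", "RAP", "Dynamic Cheatsheet", "Self-Refine"
)

print(render_setup_code(minigrid))
print(render_setup_code(deliverybench))
```

Keeping the selection in a plain, immutable record means the same rendering function serves every environment, which mirrors the swap\-one\-module\-at\-a\-time usage the interface is meant to support\.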

This interface is intended as a practical entry point for both research and prototyping\. It allows users to quickly instantiate different combinations of modules, inspect how design choices map to executable configurations, and reproduce experimental settings more conveniently\.

![Refer to caption](https://arxiv.org/html/2606.14674v1/figures/playground_minigrid.png)

![Refer to caption](https://arxiv.org/html/2606.14674v1/figures/playground_deliverybench.png)

Figure 20: Web interface of AgentSpec for interactive module configuration\. Users can select implementations for environment, perception, reasoning, memory, and reflection, and then generate runnable Python setup code\.