ExpGraph: Model-Agnostic Experience Learning with Graph-Structured Memory for LLM Agents

arXiv cs.CL 06/01/26, 04:00 AM Papers
Summary
ExpGraph is a model-agnostic framework that enables LLM agents to reuse past experiences via a self-evolving graph of skills and failures, improving task performance by 12–21% without retraining the executor.
arXiv:2605.30712v1 Announce Type: new Abstract: Large language model (LLM) agents have shown strong capabilities in reasoning, tool use, and multi-step interaction, but they often solve tasks from scratch and fail to reuse successful strategies or failure lessons from prior experience. Fine-tuning on collected experience can improve reuse, but it is inflexible when stronger or more suitable executors emerge. We propose ExpGraph, a model-agnostic experience learning framework that enables frozen and replaceable LLM executors to improve through external experience reuse without parameter updates. ExpGraph summarizes historical trajectories into reusable skills and failure lessons, organizes them as nodes in a self-evolving experience graph, and retrieves useful experiences through graph diffusion and utility-aware ranking. A lightweight retrieval copilot is trained with reinforcement learning using feedback that compares executor performance with and without retrieved experiences, while the graph is updated online from downstream task outcomes. We evaluate ExpGraph on ExpSuite, covering question answering, mathematical reasoning, code generation, and multi-step agentic environments including ALFWorld and AppWorld. ExpGraph improves over the strongest baseline by 12.2% and 4.7% on static tasks with smaller and larger executors, and by 21.4% and 12.7% in agentic environments, while reducing average interaction steps by 12.7% and 21.6%. Ablations show that graph-structured experience, utility-aware ranking, and adaptive retrieval jointly enable effective experience reuse across diverse tasks and executor models.
Original Article
View Cached Full Text
Cached at: 06/01/26, 09:28 AM
# ExpGraph: Model-Agnostic Experience Learning with Graph-Structured Memory for LLM Agents
Source: [https://arxiv.org/html/2605.30712](https://arxiv.org/html/2605.30712)
Tao Feng1, Chongrui Ye1, Tianyang Luo1, Jingjun Xu1, Xueqiang Xu1, Haozhen Zhang2Zhigang Hua3, Yan Xie3, Shuang Yang3, Ge Liu1, Jiaxuan You11University of Illinois Urbana\-Champaign2Nanyang Technological University3Meta Monetization AI

###### Abstract

Large language model \(LLM\) agents have shown strong capabilities in reasoning, tool use, and multi\-step environment interaction, yet they often solve each task from scratch and fail to systematically reuse successful strategies or failure lessons accumulated from prior interactions\. A common solution is to fine\-tune the executor on collected experience, but this becomes increasingly inflexible as LLMs evolve rapidly: when stronger or more suitable executors emerge, executor\-specific training may need to be repeated\. To address this limitation, we proposeExpGraph, a model\-agnostic experience learning framework that enables frozen and replaceable LLM executors to improve through external experience reuse without modifying their parameters\.ExpGraphsummarizes historical trajectories into reusable skills and failure lessons, organizes them as nodes in a self\-evolving experience graph, and connects related experiences to support retrieval beyond flat nearest\-neighbor matching\. For each task, a lightweight retrieval copilot adaptively controls graph diffusion and utility\-aware ranking, retrieving experiences that are both task\-relevant and historically useful for the frozen executor\. The copilot is optimized with reinforcement learning using utility\-grounded feedback that compares executor performance with and without retrieved experiences, while the experience graph is updated online from downstream task outcomes\. We evaluateExpGraphon ExpSuite, covering single\-turn question answering, mathematical reasoning, code generation, and multi\-step agentic environments including ALFWorld and AppWorld\. Across static tasks,ExpGraphimproves over the strongest baseline by12\.2%12\.2\\%with the smaller executor and4\.7%4\.7\\%with the larger executor\. In agentic environments, the gains further increase to21\.4%21\.4\\%and12\.7%12\.7\\%, respectively, while reducing average interaction steps by12\.7%12\.7\\%and21\.6%21\.6\\%compared with the most efficient baseline\. Ablation studies show that graph\-structured experience, utility\-aware ranking, and adaptive retrieval jointly enable effective experience reuse across diverse tasks and executor models, providing a practical and executor\-agnostic path for LLM agents to learn from experience without retraining the underlying executor\. Our code forExpGraphwill be released at[https://github\.com/ulab\-uiuc/ExpGraph](https://github.com/ulab-uiuc/ExpGraph)\.

## 1Introduction

LLM agents have demonstrated strong capabilities in complex tasks requiring reasoning, tool use, and multi\-step environment interaction\(Yaoet al\.,[2022](https://arxiv.org/html/2605.30712#bib.bib96); Shinnet al\.,[2023](https://arxiv.org/html/2605.30712#bib.bib48)\)\. A key bottleneck, however, is that most agents still operate as*one\-shot executors*: each task is solved largely from scratch, and successful strategies, failure lessons, and transferable insights accumulated from prior interactions are discarded rather than systematically reused\. A natural remedy is to fine\-tune the executor on its own experience, but this solution becomes increasingly inflexible as LLM capabilities evolve rapidly\. Whenever a stronger or more suitable executor is released, executor\-specific training may need to be repeated, making experience learning tightly coupled to a particular model instance\. This limitation is especially problematic because many capable LLMs are either too large, expensive to update, or inaccessible for parameter\-level modification\. These observations motivate a more flexible research question:how can an LLM agent learn from accumulated experience while keeping the executor itself frozen and replaceable?

Existing experience learning methods as summarized in Table[1](https://arxiv.org/html/2605.30712#S1.T1), provide partial solutions to this problem, including textual experience distillation\(Zhaoet al\.,[2024](https://arxiv.org/html/2605.30712#bib.bib49)\), memory organization\(Chhikaraet al\.,[2025](https://arxiv.org/html/2605.30712#bib.bib92)\), utility\-aware experience selection\(Zhanget al\.,[2026](https://arxiv.org/html/2605.30712#bib.bib94)\), and adaptive retrieval or search policies\(Jianget al\.,[2025b](https://arxiv.org/html/2605.30712#bib.bib95)\)\. However, these capabilities are usually developed in isolation, and reusable experience is often treated as isolated records or locally matched candidates\. Building a unified experience learning framework is therefore non\-trivial\.First, surface relevance is not the same as experience utility\.An experience that appears similar to the current task may provide little downstream benefit, while a useful experience may encode a transferable strategy, shared sub\-goal, or failure pattern that is not among the nearest neighbors of the task embedding\.Second, useful experience is often relational rather than isolated\.Past trajectories may be connected through common strategies, environmental constraints, or recurring mistakes, so treating them as independent text entries misses important relations among experiences\.Third, retrieval must adapt to both the task and the executor\.Some tasks benefit from broad exploration over related experience neighborhoods, while others require focused selection of high\-utility experiences\. Meanwhile, different executors vary in reasoning, planning, and instruction\-following ability, so an experience learning system should improve the executor through external context without assuming that the executor itself can be retrained\.

Table 1:Comparison with representative experience learning methods\.ExpGraphis the only framework that jointly supports graph\-structured experience, graph diffusion, utility\-aware ranking, and adaptive retrieval for effective experience reuse\.MethodGraph\-Structured ExperienceGraph DiffusionUtility\-Aware RankingAdaptive RetrievalExpeL\(Zhaoet al\.,[2024](https://arxiv.org/html/2605.30712#bib.bib49)\)✗✗✗✗Mem0\(Chhikaraet al\.,[2025](https://arxiv.org/html/2605.30712#bib.bib92)\)✓✗✗✗MemRL\(Zhanget al\.,[2026](https://arxiv.org/html/2605.30712#bib.bib94)\)✗✗✓✗S3\(Jianget al\.,[2025b](https://arxiv.org/html/2605.30712#bib.bib95)\)✗✗✗✓ExpGraph✓✓✓✓

To address these challenges, we proposeExpGraph, a model\-agnostic experience learning framework that improves frozen LLM executors through a self\-evolving relational memory of reusable experience and a trainable retrieval copilot\. Rather than modifying the executor,ExpGraphtreats it as a replaceable task solver and learns how to provide useful experience through the input context\. This design decouples experience learning from executor training: when stronger or different LLMs become available, the same external experience system can be reused or adapted without retraining the executor itself\. Specifically,ExpGraphsummarizes historical trajectories into compact experience units, including skills distilled from successful trajectories and lessons distilled from failures\. These units are organized as nodes in an experience graph, where edges connect semantically or strategically related experiences\. The graph allows retrieval to move beyond flat nearest\-neighbor matching by expanding from initially matched experiences to related ones that may share transferable strategies, sub\-goals, or failure patterns with the current task\. On top of this relational memory,ExpGraphuses a lightweight retrieval copilot to predict task\-adaptive retrieval controls, determining both how broadly retrieval explores the graph and how strongly final ranking balances semantic relevance against historical utility\. To learn which experiences are actually useful,ExpGraphuses utility\-grounded feedback from downstream task performance\. The executor is evaluated both with and without retrieved experiences, allowing the retrieval system to estimate whether selected experiences truly improve the executor rather than merely matching the task semantically\. This feedback optimizes the retrieval copilot and updates experience\-node utility statistics, gradually favoring experiences that are not only relevant but empirically beneficial\. Throughout this process, the executor LLM is never updated, enablingExpGraphto support different frozen executors across model scales, capabilities, and deployment settings\.

We evaluateExpGraphon ExpSuite, covering both single\-turn static tasks \(question answering, mathematical reasoning, and code generation\) and multi\-step agentic environments \(ALFWorld and AppWorld\)\. Across static tasks,ExpGraphimproves over the strongest baseline by12\.2%12\.2\\%with the smaller executor and4\.7%4\.7\\%with the larger executor\. The advantage becomes more pronounced in agentic environments, whereExpGraphimproves the weighted average score over the strongest baseline by21\.4%21\.4\\%and12\.7%12\.7\\%with the smaller and larger executors, respectively\. Meanwhile,ExpGraphalso improves decision efficiency, reducing average interaction steps by12\.7%12\.7\\%and21\.6%21\.6\\%compared with the most efficient competing baseline\. These results suggest that modeling relations among experiences and learning utility\-aware adaptive retrieval are especially valuable when tasks require long\-horizon decision\-making and when the executor must improve through external experience rather than parameter updates\. Ablation studies further confirm that experience relations, graph diffusion, utility\-aware ranking, and adaptive retrieval copilot training each contribute to the overall gains\.

## 2Preliminaries

### 2\.1Model\-Agnostic LLM Agents

We consider an LLM agent that solves a taskx∈𝒳x\\in\\mathcal\{X\}by producing an outputyy, which can be an action, answer, or code sequence, and receives a task scores=S\(x,y\)∈ℝs=S\(x,y\)\\in\\mathbb\{R\}from the environment or evaluator\. The agent is built around an executor LLMπexec\\pi\_\{\\mathrm\{exec\}\}, which maps the task input to an output, i\.e\.,y=πexec\(x\)y=\\pi\_\{\\mathrm\{exec\}\}\(x\)\. We adopt a*model\-agnostic*view of the executor: the experience learning mechanism should not depend on the internal architecture, parameters, gradients, logits, or training procedure ofπexec\\pi\_\{\\mathrm\{exec\}\}\. Instead,πexec\\pi\_\{\\mathrm\{exec\}\}is treated as a frozen and replaceable task solver, which can be closed\-source, costly to update, specialized for a domain, or newly released\. The improvement mechanism interacts with the executor only through its input\-output behavior and task\-level feedback\. This setting decouples experience learning from executor training, allowing the same external experience system to improve different frozen executors through input context rather than parameter updates\.

### 2\.2Experience\-Augmented Learning

To improve the executor through its input, we equip the agent with an external experience systemℳ\\mathcal\{M\}that stores reusable knowledge distilled from historical trajectories\. A trajectory is denoted asτ=\(x,ξ,y,s\)\\tau=\(x,\\xi,y,s\), wherexxis the task input,ξ\\xidenotes the intermediate execution process,yyis the final response or action sequence, ands∈ℝs\\in\\mathbb\{R\}is the task score\. Specifically,ξ\\xicorresponds to the agent interaction trace in agentic environments, and to the intermediate thinking process in question\-answering or reasoning tasks\. Each trajectory is summarized into a compact natural\-language experience unite=Summarize\(τ\)e=\\mathrm\{Summarize\}\(\\tau\)and stored inℳ\\mathcal\{M\}\. At execution time, the agent retrieves a subset of experiencesE⊆ℳE\\subseteq\\mathcal\{M\}and injects them into the executor’s input context, yieldingy=πexec\(x,E\)y=\\pi\_\{\\mathrm\{exec\}\}\(x,E\)\. The task scores=S\(x,y\)s=S\(x,y\)then depends indirectly on the choice ofEE\.

Sinceπexec\\pi\_\{\\mathrm\{exec\}\}is fixed, the learnable component is the*retrieval policy*πret\\pi\_\{\\mathrm\{ret\}\}, which selectsEEfromℳ\\mathcal\{M\}given the current task\. The learning objective is to maximize expected task performance over a task distribution𝒟\\mathcal\{D\}:

maxπret𝔼x∼𝒟\[S\(x,πexec\(x,E\)\)\],E∼πret\(⋅∣x,ℳ\)\.\\max\_\{\\pi\_\{\\mathrm\{ret\}\}\}\\;\\mathbb\{E\}\_\{x\\sim\\mathcal\{D\}\}\\\!\\left\[\\,S\\\!\\left\(x,\\,\\pi\_\{\\mathrm\{exec\}\}\\\!\\left\(x,\\,E\\right\)\\right\)\\,\\right\],\\quad E\\sim\\pi\_\{\\mathrm\{ret\}\}\(\\cdot\\mid x,\\mathcal\{M\}\)\.\(1\)This formulation decouples execution from experience learning:πexec\\pi\_\{\\mathrm\{exec\}\}solves the task, whileπret\\pi\_\{\\mathrm\{ret\}\}decides which experiences should be provided as context\.

## 3ExpGraph: Utility\-Guided Experience Graph Retrieval

### 3\.1Overview

![Refer to caption](https://arxiv.org/html/2605.30712v1/x1.png)Figure 1:Overview ofExpGraph\.ExpGraphenables a frozen and replaceable executor LLM to improve through a self\-evolving experience graph and a trainable retrieval copilot\. For each incoming taskxtx\_\{t\}, the task is embedded ashxth\_\{x\_\{t\}\}and passed to the retrieval copilotπrett\\pi\_\{\\mathrm\{ret\}\}^\{t\}, which predicts two adaptive controls:RtR\_\{t\}for graph diffusion depth andWtW\_\{t\}for the similarity–utility trade\-off\. Retrieval is performed on the current experience graphGtG^\{t\}through three steps:\(a\) Semantic Seeding, which selects a seed setS0S\_\{0\}by cosine similarity;\(b\) Graph Diffusion, which expands from the seeds using personalized PageRank controlled byρ\\rho; and\(c\) Utility\-Aware Ranking, which combines semantic relevance and utility confidence to select the top\-KKexperiencesEtE\_\{t\}\. The frozen executorπexec\\pi\_\{\\mathrm\{exec\}\}is then evaluated with and without retrieved experiences, producingswithts\_\{\\mathrm\{with\}\}^\{t\}andswithoutts\_\{\\mathrm\{without\}\}^\{t\}\. Their difference, together with the absolute task score, forms the rewardrt=\(switht−swithoutt\)\+ηswithtr\_\{t\}=\(s\_\{\\mathrm\{with\}\}^\{t\}\-s\_\{\\mathrm\{without\}\}^\{t\}\)\+\\eta s\_\{\\mathrm\{with\}\}^\{t\}\. This reward drives a co\-evolution process: it updates the retrieval copilot via PPO and updates the experience graph by refining visited\-node utilities, adding new experience nodes, connecting them to similar neighbors, and pruning low\-quality nodes when necessary\. Only the retrieval copilot and experience graph evolve; the executor LLM remains frozen throughout\.ExpGraphis a model\-agnostic experience learning framework for LLM agents\. Given a taskxx,ExpGraphimproves a frozen executor LLMπexec\\pi\_\{\\mathrm\{exec\}\}by retrieving useful external experiences without modifying the executor\. The key idea is to decouple execution from experience learning: rather than updating the executor, a retrieval copilot learns which experiences to provide as context, making the framework compatible with arbitrary executor instances\. As shown in Figure[1](https://arxiv.org/html/2605.30712#S3.F1),ExpGraphoperates as a closed loop over three components\. First, historical trajectories are compressed into experience units and organized as a graph\-structured experience system\. Second, a retrieval copilot predicts task\-adaptive controls to navigate the graph via semantic seeding, graph diffusion, and utility\-aware ranking\. Third, downstream feedback updates both the copilot and graph statistics, biasing future retrieval toward experiences that are empirically useful rather than merely semantically related\.

### 3\.2Experience Graph Construction

Trajectory\-to\-experience conversion\. Let a historical trajectory be denoted asτ=\(x,ξ,y,s\)\\tau=\(x,\\xi,y,s\), wherexxis the task input,ξ\\xidenotes the intermediate execution process,yyis the final response or action sequence, ands∈ℝs\\in\\mathbb\{R\}is the task score returned by the environment or evaluator\. Specifically,ξ\\xicorresponds to the agent interaction trace in agentic environments, and to the intermediate thinking process in question\-answering or reasoning tasks\.ExpGraphconverts each trajectory into a compact natural\-language experience unit:

e=Summarize\(τ\)\.e=\\mathrm\{Summarize\}\(\\tau\)\.\(2\)The summarizer does not aim to preserve the full trajectory\. Instead, it extracts reusable knowledge from the trajectory\. High\-scoring trajectories are distilled into*skills*, such as successful reasoning patterns, planning strategies, or task\-specific heuristics\. Low\-scoring trajectories are distilled into*lessons*, such as failure modes, invalid actions, or constraints to avoid\.

Each experience unit becomes a nodev=\(ev,hv,uv,nv\)∈Vv=\(e\_\{v\},h\_\{v\},u\_\{v\},n\_\{v\}\)\\in Vin the experience graph, whereeve\_\{v\}is the textual experience,hvh\_\{v\}is its embedding,uvu\_\{v\}is its estimated utility, andnvn\_\{v\}is its retrieval count\. The utilityuvu\_\{v\}and countnvn\_\{v\}are updated online from downstream feedback, allowing the graph to gradually distinguish useful experiences from merely relevant ones\.

Graph construction\. Given a set of experience nodesVV,ExpGraphconstructs a sparse undirected graphG=\(V,E\)G=\(V,E\)\. When inserting a new nodeviv\_\{i\},ExpGraphconnects it to semantically similar existing nodes:

\(vi,vj\)∈E⟺vj∈𝒩K\(vi\)andcos⁡\(hvi,hvj\)≥θ,\(v\_\{i\},v\_\{j\}\)\\in E\\quad\\Longleftrightarrow\\quad v\_\{j\}\\in\\mathcal\{N\}\_\{K\}\(v\_\{i\}\)\\ \\text\{ and \}\\ \\cos\(h\_\{v\_\{i\}\},\\,h\_\{v\_\{j\}\}\)\\geq\\theta,\(3\)where𝒩K\(vi\)\\mathcal\{N\}\_\{K\}\(v\_\{i\}\)denotes the top\-KKnearest neighbors ofviv\_\{i\}under cosine similarity, andθ\\thetais a similarity threshold\. This construction keeps the graph sparse while preserving local neighborhoods of related experiences\.

### 3\.3Utility\-Guided Graph Retrieval

Retrieval copilot\.ExpGraphtrains a lightweight retrieval copilotπret\\pi\_\{\\mathrm\{ret\}\}to control how the graph is searched\. The copilot does not generate the final answer or action\. Instead, given a taskxx, it outputs two discrete control variables:

\(R,W\)∼πret\(⋅∣x\),\(R,\\,W\)\\;\\sim\\;\\pi\_\{\\mathrm\{ret\}\}\(\\cdot\\mid x\),\(4\)whereR∈\{0,…,100\}R\\in\\\{0,\\ldots,100\\\}controls the breadth of graph exploration, andW∈\{0,…,100\}W\\in\\\{0,\\ldots,100\\\}controls the trade\-off between semantic relevance and historical utility\. We rescale them asρ=R/100\\rho=R/100andλ=W/100\\lambda=W/100, whereρ,λ∈\[0,1\]\\rho,\\lambda\\in\[0,1\]\. Intuitively,ρ\\rhocontrols how closely retrieval stays to the initial semantic seeds: a largerρ\\rhoinduces a higher restart probability, confining diffusion to the immediate neighborhood of the seeds, while a smallerρ\\rhoallows probability mass to propagate further across the graph;λ\\lambdadetermines how strongly the final ranking favors historically useful experiences over purely similar ones\.

Semantic seeding\. Given taskxxwith embeddinghxh\_\{x\},ExpGraphfirst retrieves the top\-mmnodes by cosine similarity:

S0=TopMv∈V⁡cos⁡\(hx,hv\)\.S\_\{0\}=\\operatorname\{TopM\}\_\{v\\in V\}\\;\\cos\(h\_\{x\},\\,h\_\{v\}\)\.\(5\)The seed setS0S\_\{0\}provides an initial set of task\-relevant experiences\. However, these seeds are not directly used as the final retrieval result, because high semantic similarity does not necessarily imply high downstream utility\.

Graph diffusion\. Starting fromS0S\_\{0\},ExpGraphexpands retrieval over the experience graph using personalized PageRank\. The restart distribution is defined as

q\(v\)=\{1/\|S0\|,v∈S0,0,v∉S0,q\(v\)=\\begin\{cases\}1/\|S\_\{0\}\|,&v\\in S\_\{0\},\\\\ 0,&v\\notin S\_\{0\},\\end\{cases\}\(6\)and the diffusion iterates as

pt\+1=α\(ρ\)q\+\(1−α\(ρ\)\)Anorm⊤pt,p\_\{t\+1\}\\;=\\;\\alpha\(\\rho\)\\,q\\;\+\\;\\bigl\(1\-\\alpha\(\\rho\)\\bigr\)\\,A\_\{\\mathrm\{norm\}\}^\{\\top\}\\,p\_\{t\},\(7\)whereAnormA\_\{\\mathrm\{norm\}\}is the row\-normalized adjacency matrix andα\(ρ\)∈\(0,1\)\\alpha\(\\rho\)\\in\(0,1\)is a restart probability that increases monotonically withρ\\rho\. A largerρ\\rhoyields a higher restart probability, which confines probability mass closer to the semantic seeds and reduces the scope of diffusion; a smallerρ\\rhoallows the distribution to spread further across the graph\. After convergence, the top\-ranked nodes form the candidate setC=TopLv∈V⁡p\(v\)C=\\operatorname\{TopL\}\_\{v\\in V\}p\(v\)\.

Utility\-aware ranking\. For each candidatev∈Cv\\in C,ExpGraphestimates its utility with an upper\-confidence score:

bv=uv\+clog⁡\(N\+1\)max⁡\(nv,1\),b\_\{v\}\\;=\\;u\_\{v\}\\;\+\\;c\\,\\sqrt\{\\frac\{\\log\(N\+1\)\}\{\\max\(n\_\{v\},\\,1\)\}\},\(8\)whereNNis the total number of retrieval events andc\>0c\>0is an exploration coefficient\. This score encourages the retriever to exploit experiences with high estimated utility while still exploring under\-tested nodes\.

The final retrieval score combines semantic relevance and utility confidence:

score\(v∣x\)=\(1−λ\)sim^\(x,v\)\+λb^v,\\mathrm\{score\}\(v\\mid x\)\\;=\\;\(1\-\\lambda\)\\,\\widehat\{\\mathrm\{sim\}\}\(x,v\)\\;\+\\;\\lambda\\,\\widehat\{b\}\_\{v\},\(9\)wheresim^\(x,v\)\\widehat\{\\mathrm\{sim\}\}\(x,v\)andb^v\\widehat\{b\}\_\{v\}are normalized withinCC\. The retrieved experience set is

ER,W\(x\)=TopKv∈C⁡score\(v∣x\),E\_\{R,W\}\(x\)=\\operatorname\{TopK\}\_\{v\\in C\}\\;\\mathrm\{score\}\(v\\mid x\),\(10\)and its textual contents are concatenated with the task input and passed to the frozen executor:

ywith=πexec\(x,ER,W\(x\)\)\.y\_\{\\mathrm\{with\}\}=\\pi\_\{\\mathrm\{exec\}\}\\bigl\(x,\\,E\_\{R,W\}\(x\)\\bigr\)\.\(11\)

### 3\.4Learning from Utility Feedback

Utility\-grounded reward\. To learn which retrieved experiences are actually useful,ExpGraphevaluates the executor under two conditions\. With retrieved experiences:

swith=S\(x,πexec\(x,ER,W\(x\)\)\),s\_\{\\mathrm\{with\}\}=S\\bigl\(x,\\,\\pi\_\{\\mathrm\{exec\}\}\(x,\\,E\_\{R,W\}\(x\)\)\\bigr\),\(12\)and without any retrieved experience:

swithout=S\(x,πexec\(x,∅\)\),s\_\{\\mathrm\{without\}\}=S\\bigl\(x,\\,\\pi\_\{\\mathrm\{exec\}\}\(x,\\,\\emptyset\)\\bigr\),\(13\)whereS\(⋅\)S\(\\cdot\)denotes the task\-specific evaluator\. The retrieval reward is

r\(x,ER,W\)=swith\(x,ER,W\)−swithout\(x\)\+ηswith\(x,ER,W\)\.r\(x,E\_\{R,W\}\)=s\_\{\\mathrm\{with\}\}\(x,E\_\{R,W\}\)\-s\_\{\\mathrm\{without\}\}\(x\)\+\\eta\\,s\_\{\\mathrm\{with\}\}\(x,E\_\{R,W\}\)\.\(14\)The first term measures the marginal gain of retrieval over the no\-experience baseline, and the second term encourages high absolute task quality, preventing the policy from favoring retrievals that only improve over a weak baseline but still lead to poor task performance\. The coefficientη≥0\\eta\\geq 0controls the strength of this quality regularization\.

Copilot policy optimization\. The retrieval copilot is optimized with PPO\(Schulmanet al\.,[2017](https://arxiv.org/html/2605.30712#bib.bib97)\)to maximize the expected utility\-grounded reward:

maxπret⁡𝔼x∼𝒟𝔼\(R,W\)∼πret\(⋅∣x\)\[r\(x,ER,W\)\]\.\\max\_\{\\pi\_\{\\mathrm\{ret\}\}\}\\;\\mathbb\{E\}\_\{x\\sim\\mathcal\{D\}\}\\,\\mathbb\{E\}\_\{\(R,W\)\\sim\\pi\_\{\\mathrm\{ret\}\}\(\\cdot\\mid x\)\}\\\!\\left\[\\,r\(x,\\,E\_\{R,W\}\)\\,\\right\]\.\(15\)Because the reward comes directly from downstream task performance, the copilot learns to retrieve experiences that improve the frozen executor rather than experiences that only match heuristic similarity scores\. The executor LLMπexec\\pi\_\{\\mathrm\{exec\}\}is never updated\.

Online graph evolution\. The same reward also updates the experience graph\. For each retrieved nodev∈ER,W\(x\)v\\in E\_\{R,W\}\(x\),ExpGraphupdates its retrieval count and utility estimate:

nv←nv\+1,n\_\{v\}\\leftarrow n\_\{v\}\+1,\(16\)uv←\(1−β\)uv\+βr\(x,ER,W\),u\_\{v\}\\leftarrow\(1\-\\beta\)\\,u\_\{v\}\+\\beta\\,r\(x,\\,E\_\{R,W\}\),\(17\)whereβ∈\(0,1\]\\beta\\in\(0,1\]is the update rate\. In addition, the completed trajectoryτ′=\(x,ξ′,ywith,swith\)\\tau^\{\\prime\}=\(x,\\xi^\{\\prime\},y\_\{\\mathrm\{with\}\},s\_\{\\mathrm\{with\}\}\)is summarized into a new candidate experience node, filtered for near\-duplicates, and inserted into the graph using Eq\.[3](https://arxiv.org/html/2605.30712#S3.E3)\. When the graph exceeds its capacity budget, nodes with low utility and low retrieval frequency are evicted\. In this way, the graph continuously absorbs new skills and lessons while suppressing experiences that are rarely useful \(see full training algorithm in Appendix[C](https://arxiv.org/html/2605.30712#A3)\)\.

## 4Experiments

Table 2:Performance comparison on ExpSuite\-Static with 10 diverse tasks\.Results are grouped by task category: Question Answering, Mathematical Reasoning, and Code Generation\.Boldandunderlinedenote the best and second\-best results\.ModelMethodQuestion AnsweringReasoningCodingARC\-CCommonsenseQAGPQAMMLUOBQAGSM8KGSM\-SymbolicMATHHumanEval\+MBPP\+Avg\.Llama\-3\.2\-3B\-Instruct\(Small LLM\)No Memory51\.5254\.3818\.3342\.8954\.2169\.5660\.4437\.6443\.5938\.7551\.91Retrieval\-Centric Experience Learning BaselinesReasoningBank71\.7261\.5721\.6754\.0072\.6770\.4456\.4445\.5851\.2857\.5060\.65ExpeL44\.7057\.7516\.6746\.4443\.0577\.1180\.4423\.3620\.5122\.5051\.69LightMem71\.4664\.0426\.6758\.0072\.8944\.8942\.0049\.2148\.7242\.5056\.18Mem067\.9359\.5528\.3356\.4465\.8352\.0039\.7828\.1241\.0360\.0052\.15AWM72\.4760\.4515\.0053\.3369\.2555\.3338\.8945\.8053\.8560\.0055\.51MemRL67\.4260\.0023\.3340\.2267\.2083\.7868\.0055\.3346\.1557\.5062\.00LLM\-Centric Experience Learning BaselinesIRCoT62\.6359\.7815\.0038\.0063\.3377\.7872\.6729\.4851\.2850\.0056\.58Search\-o166\.6760\.2228\.3350\.0062\.4178\.4473\.3338\.5530\.7728\.7559\.57S365\.0059\.0024\.0044\.0064\.0082\.0074\.0053\.0048\.0055\.0061\.93ExpGraph74\.0063\.2027\.0060\.0074\.0085\.0082\.0057\.0056\.0062\.0069\.57Llama\-3\.1\-8B\-Instruct\(Large LLM\)No Memory70\.2063\.1523\.3353\.1169\.7070\.8962\.6739\.6843\.5958\.7560\.25Retrieval\-Centric Experience Learning BaselinesReasoningBank82\.3270\.3428\.3363\.7878\.3680\.6774\.2247\.6258\.9762\.5069\.75ExpeL82\.5875\.7321\.6766\.6782\.6991\.1188\.8945\.8035\.9066\.2574\.43LightMem81\.5772\.1323\.3360\.8980\.8781\.5672\.0049\.2151\.2868\.7569\.85Mem083\.3370\.1120\.0067\.1180\.6490\.6786\.0047\.8564\.1077\.5073\.94AWM81\.5773\.9318\.3360\.2278\.5981\.7876\.4447\.1764\.1076\.2570\.31MemRL84\.3473\.7123\.3367\.5684\.0593\.3380\.0056\.4635\.9065\.0075\.20LLM\-Centric Experience Learning BaselinesIRCoT83\.0875\.2830\.0067\.1182\.2388\.6784\.0051\.9348\.7271\.2574\.68Search\-o181\.3171\.6928\.3364\.4477\.9089\.3384\.6748\.0746\.1563\.7572\.43S382\.0073\.5027\.0065\.5081\.5091\.0086\.0053\.0059\.0072\.0074\.89ExpGraph86\.0075\.0028\.3369\.0086\.0095\.0090\.0058\.0066\.0079\.0078\.75

Table 3:Performance comparison on ExpSuite\-Agentic across ALFWorld and AppWorld tasks\.We report success rate \(SR\) for ALFWorld, pass rate \(PR\) for AppWorld, and the weighted average score across all test tasks\.Boldandunderlinedenote the best and second\-best results\.Qwen3\-32B \(Small LLM\)Gemini\-3\.1\-Flash\-Lite \(Large LLM\)ALFWorldAppWorldAvg\.ALFWorldAppWorldAvg\.ALF\-SeenALF\-UnseenTest\-NTest\-CScore\#Steps↓\\downarrowALF\-SeenALF\-UnseenTest\-NTest\-CScore\#Steps↓\\downarrowMethodSR\#Steps↓\\downarrowSR\#Steps↓\\downarrowPR\#Steps↓\\downarrowPR\#Steps↓\\downarrowSR\#Steps↓\\downarrowSR\#Steps↓\\downarrowPR\#Steps↓\\downarrowPR\#Steps↓\\downarrowNo\-Memory0\.19341\.20\.35842\.50\.30130\.80\.21631\.50\.25134\.70\.62727\.90\.64926\.30\.34233\.00\.23236\.60\.38332\.9Prompt\-based Agentic BaselinesReAct0\.31436\.70\.47038\.60\.30032\.20\.21831\.90\.28933\.80\.65724\.10\.63425\.10\.32724\.80\.32116\.10\.42620\.5Reflexion0\.23640\.60\.40338\.70\.31631\.30\.21932\.70\.26934\.60\.63123\.30\.60423\.70\.48221\.90\.31115\.60\.44219\.4Retrieval\-Centric Experience Learning BaselinesReasoningBank0\.37935\.30\.26142\.60\.29425\.60\.22224\.30\.26829\.20\.67229\.10\.81324\.10\.45323\.10\.41225\.50\.52525\.4ExpeL0\.54327\.50\.29141\.40\.26735\.10\.28225\.60\.32330\.20\.71621\.40\.63428\.70\.43621\.10\.29928\.30\.44625\.8LightMem0\.49229\.60\.25039\.20\.28534\.20\.23737\.50\.29035\.80\.70127\.80\.70927\.80\.34231\.10\.29734\.20\.43631\.6Mem00\.30037\.30\.18742\.80\.20833\.40\.15735\.20\.19536\.40\.58230\.60\.63422\.40\.32731\.50\.22934\.80\.36931\.5AWM0\.37935\.30\.32840\.10\.28028\.80\.22833\.00\.27833\.70\.73124\.70\.80618\.80\.50720\.90\.31525\.80\.49723\.6MemRL0\.25740\.30\.20249\.60\.45218\.60\.28924\.70\.30229\.90\.61928\.90\.71626\.10\.40513\.50\.31716\.70\.44619\.5LLM\-Centric Experience Learning BaselinesIRCoT0\.33637\.90\.27640\.60\.52811\.40\.36117\.80\.37623\.40\.75427\.00\.87319\.60\.46918\.70\.34925\.40\.52023\.4Search\-o10\.32138\.00\.11247\.30\.45416\.50\.26514\.20\.28723\.70\.76122\.40\.79423\.50\.52320\.70\.39824\.60\.54323\.3S30\.58623\.50\.42522\.50\.5549\.70\.35015\.00\.44016\.50\.77121\.80\.80621\.50\.53018\.50\.40822\.00\.55321\.1ExpGraph0\.70020\.00\.75418\.00\.5718\.80\.39313\.60\.53414\.40\.85017\.60\.88117\.00\.57112\.80\.48414\.80\.62315\.2

We evaluateExpGraphon ExpSuite, comprising ExpSuite\-Static for single\-turn reasoning and generation tasks and ExpSuite\-Agentic for multi\-step interactive decision\-making\. Across both settings, the executor LLM is kept frozen, and all methods are evaluated without manually curated few\-shot examples\. We compare against prompting baselines, retrieval\-centric experience learning methods, and LLM\-centric experience learning methods\. Dataset statistics, implementation details, and extended analysis are provided in Appendix[E](https://arxiv.org/html/2605.30712#A5),[B](https://arxiv.org/html/2605.30712#A2), and[G](https://arxiv.org/html/2605.30712#A7)\.

Task description\. ExpSuite consists of two settings\.\(i\) ExpSuite\-Staticincludes ten benchmarks across three categories: question answering \(ARC\-C\(Clarket al\.,[2018](https://arxiv.org/html/2605.30712#bib.bib80)\), CommonsenseQA\(Talmoret al\.,[2019](https://arxiv.org/html/2605.30712#bib.bib78)\), GPQA\(Reinet al\.,[2024](https://arxiv.org/html/2605.30712#bib.bib51)\), MMLU\(Hendryckset al\.,[2021a](https://arxiv.org/html/2605.30712#bib.bib77)\), OBQA\(Mihaylovet al\.,[2018](https://arxiv.org/html/2605.30712#bib.bib79)\)\), mathematical reasoning \(GSM8K\(Cobbeet al\.,[2021](https://arxiv.org/html/2605.30712#bib.bib73)\), GSM\-Symbolic\(Mirzadehet al\.,[2025](https://arxiv.org/html/2605.30712#bib.bib74)\), MATH\(Hendryckset al\.,[2021b](https://arxiv.org/html/2605.30712#bib.bib75)\)\), and code generation \(HumanEval\+\(Liuet al\.,[2023](https://arxiv.org/html/2605.30712#bib.bib100)\), MBPP\+\(Liuet al\.,[2023](https://arxiv.org/html/2605.30712#bib.bib100)\)\)\. We report accuracy, exact match, and Pass@1\(Chen,[2021](https://arxiv.org/html/2605.30712#bib.bib58)\)respectively\.\(ii\) ExpSuite\-Agenticincludes ALFWorld\(Shridharet al\.,[2020](https://arxiv.org/html/2605.30712#bib.bib20)\)\(Seen/Unseen splits; SR and \#Steps\) and AppWorld\(Trivediet al\.,[2024](https://arxiv.org/html/2605.30712#bib.bib90)\)\(Test\-Normal/Test\-Challenge splits; PR and \#Steps\)\. For executor models, we use Llama\-3\.2\-3B\-Instruct \(Small LLM\) and Llama\-3\.1\-8B\-Instruct \(Large LLM\)\(Grattafioriet al\.,[2024](https://arxiv.org/html/2605.30712#bib.bib102)\)for ExpSuite\-Static, and Qwen3\-32B \(Small LLM\)\(Yanget al\.,[2025](https://arxiv.org/html/2605.30712#bib.bib103)\)and Gemini\-3\.1\-Flash\-Lite \(Large LLM\)\(DeepMind,[2026](https://arxiv.org/html/2605.30712#bib.bib104)\)for ExpSuite\-Agentic\.

Baselines\. We compareExpGraphwith four groups of baselines\.\(i\) No\-Memorydirectly uses the frozen executor without external experiences\.\(ii\) Retrieval\-centricbaselines maintain external experience records, including ReasoningBank\(Ouyanget al\.,[2025](https://arxiv.org/html/2605.30712#bib.bib44)\), ExpeL\(Zhaoet al\.,[2024](https://arxiv.org/html/2605.30712#bib.bib49)\), LightMem\(Fanget al\.,[2025](https://arxiv.org/html/2605.30712#bib.bib91)\), Mem0\(Chhikaraet al\.,[2025](https://arxiv.org/html/2605.30712#bib.bib92)\), AWM\(Wanget al\.,[2024](https://arxiv.org/html/2605.30712#bib.bib93)\), and MemRL\(Zhanget al\.,[2026](https://arxiv.org/html/2605.30712#bib.bib94)\)\.\(iii\) LLM\-centricbaselines use an LLM\-based policy to retrieve or exploit external information, including IRCoT\(Trivediet al\.,[2023](https://arxiv.org/html/2605.30712#bib.bib42)\), Search\-o1\(Liet al\.,[2025](https://arxiv.org/html/2605.30712#bib.bib43)\), and S3\(Jianget al\.,[2025b](https://arxiv.org/html/2605.30712#bib.bib95)\)\.\(iv\) Prompt\-based agenticbaselines \(ExpSuite\-Agentic only\) include ReAct\(Yaoet al\.,[2022](https://arxiv.org/html/2605.30712#bib.bib96)\)and Reflexion\(Shinnet al\.,[2023](https://arxiv.org/html/2605.30712#bib.bib48)\)\. For fair comparison, all experience\-based baselines are given the same number of experiences asExpGraph, so that differences reflect experience utilization rather than data quantity\.

### 4\.1ExpGraphOutperforms General Prompt\-based Baselines and Experience Learning Methods

We evaluateExpGraphon ExpSuite, covering both single\-turn static tasks and multi\-step agentic environments\. Results are reported in Table[2](https://arxiv.org/html/2605.30712#S4.T2)and Table[3](https://arxiv.org/html/2605.30712#S4.T3)\. We have the following observations\.

ExpGraphAchieves the Best Overall Performance Across Static and Agentic Settings\.ExpGraphachieves the best average performance across all evaluated settings and executor models\. On ExpSuite\-Static,ExpGraphimproves the average score over the strongest baseline by 12\.2% with the small executor and 4\.7% with the large executor, with consistent gains across question answering, mathematical reasoning, and code generation\. On ExpSuite\-Agentic,ExpGraphimproves the weighted average score over the strongest baseline by 21\.4% and 12\.7%, respectively, while reducing average steps over the most efficient baseline by 12\.7% and 21\.6%\.

Experience Reuse Is Especially Beneficial for Weaker Executors and Agentic Tasks\. The relative gains reveal two trends\. First, smaller executors benefit more fromExpGraph: the relative gains are 12\.2% vs\. 4\.7% on ExpSuite\-Static and 21\.4% vs\. 12\.7% on ExpSuite\-Agentic, suggesting that retrieved experiences provide greater value when the executor has weaker built\-in reasoning or planning ability\. Second, gains are consistently larger in agentic settings than in static ones, indicating that experience reuse becomes more valuable as tasks require longer\-horizon decision\-making, where prior trajectories can directly guide actions and reduce unnecessary exploration\.

![Refer to caption](https://arxiv.org/html/2605.30712v1/x2.png)\(a\)
![Refer to caption](https://arxiv.org/html/2605.30712v1/x3.png)\(b\)
![Refer to caption](https://arxiv.org/html/2605.30712v1/x4.png)\(c\)

Figure 2:Zero\-shot transfer across different executor shifts\.\(a\)Small\-to\-large transfer: transferring the learned experience graph and retrieval copilot from a smaller executor to a larger executor\. \(b\)Large\-to\-small transfer: transferring experience components from a larger executor to a smaller executor\. \(c\)Non\-reasoning\-to\-reasoning transfer: transferring experience components across executors with different reasoning capabilities\.
### 4\.2ExpGraphExhibits Superior Generalization Capabilities Across Different LLM Executors

We evaluate executor transfer under three settings\.Small\-to\-large transferuses the small executor as source and the large executor as target, testing whether experiences from cheaper models benefit stronger frozen LLMs\.Large\-to\-small transferreverses this direction, testing whether high\-quality experiences from stronger models remain useful for weaker executors\.Non\-reasoning\-to\-reasoning transferuses non\-reasoning executors as source and reasoning\-capable executors as target, testing whether experiences transfer across reasoning capability gaps\. Specifically, for ExpSuite\-Static, we transfer from Llama\-3\.1\-8B\-Instruct to DeepSeek\-R1\-Distill\-Llama\-8B\(Guoet al\.,[2025](https://arxiv.org/html/2605.30712#bib.bib65)\); for ExpSuite\-Agentic, we transfer from Gemini\-3\.1\-Flash\-Lite to Claude\-Sonnet\-4\(Anthropic,[2025](https://arxiv.org/html/2605.30712#bib.bib101)\)\. For each setting, we compare Graph\-only Transfer, Copilot\-only Transfer, Graph\+Copilot Transfer, and target\-specificExpGraph\.

ExpGraphEnables Small\-to\-Large Transfer with Minimal Performance Loss\. Experience components learned from smaller executors transfer effectively to larger executors\. As shown in Figure[2](https://arxiv.org/html/2605.30712#S4.F2)\(a\), Graph\+Copilot Transfer performs closest to target\-specificExpGraphacross all datasets\. While Graph\-only Transfer reuses task experiences and Copilot\-only Transfer preserves learned retrieval behavior, transferring both components together is consistently more effective, suggesting that strong transfer requires preserving both the experience graph and the retrieval policy jointly\.

ExpGraphShows Large\-to\-Small Transfer Is Harder but Still Effective\. Transferring from stronger executors to weaker executors is more challenging, but still provides clear benefits\. As shown in Figure[2](https://arxiv.org/html/2605.30712#S4.F2)\(b\), all transfer variants perform below target\-specificExpGraph, and the overall radar area is smaller than in the small\-to\-large setting\. This is expected, as experiences from stronger executors may contain richer reasoning patterns and action complexity that weaker executors cannot fully exploit\. Even so, Graph\+Copilot Transfer remains the strongest zero\-shot variant across most domains, indicating that weaker executors can still benefit from high\-quality transferred experience\.

ExpGraphTransfers Strongly from Non\-Reasoning to Reasoning Executors\. Experience components learned from non\-reasoning executors generalize well to reasoning\-capable executors\. As shown in Figure[2](https://arxiv.org/html/2605.30712#S4.F2)\(c\), Graph\+Copilot Transfer approaches target\-specificExpGraphand shows particularly strong performance on ALFWorld and AppWorld\. This suggests that reasoning\-capable executors can better interpret and adapt transferred experiences\. The experience graph encodes reusable procedural knowledge, which reasoning executors integrate under new states and constraints, indicating that transferred components are not executor\-specific artifacts but capture experience structure that generalizes across reasoning capabilities\.

![Refer to caption](https://arxiv.org/html/2605.30712v1/x5.png)Figure 3:ExpGraphrequires graph structure, utility feedback, and similarity\-aware memory management to achieve robust gains\.We compareExpGraphwith four ablation variants across five evaluation domains: QA, Reasoning, Coding, ALFWorld, and AppWorld\.
### 4\.3Ablation Studies ValidateExpGraph’s Key Components

To understand the contribution of each component inExpGraph, we conduct ablation studies across five evaluation domains: QA, Reasoning, Coding, ALFWorld, and AppWorld\. These variants examine the effects of similarity\-aware experience management, graph\-structured experience organization, graph diffusion, and utility\-aware ranking\. Results are reported in Figure[3](https://arxiv.org/html/2605.30712#S4.F3)\.

- •w/o Similarity Filtering: Removes similarity\-based filtering during experience insertion, allowing redundant experiences to be repeatedly retained\.
- •Flat Experience: Replaces the experience graph with a flat experience pool and retrieves top\-KKexperiences purely by semantic similarity\.
- •w/o Graph Diffusion: Keeps the graph structure but disables graph expansion, retrieving only from the initial semantic seed nodes\.
- •w/o Utility Ranking: Ranks candidate experiences only by semantic similarity, without using historical utility statistics\.

As shown in Figure[3](https://arxiv.org/html/2605.30712#S4.F3), removing any component consistently degrades performance, confirming thatExpGraphbenefits from the joint design of experience management, graph structure, graph diffusion, and utility feedback\. Among all variants,Flat Experienceshows the largest drop across both executors, indicating that isolated experience entries are insufficient for effective reuse and that graph structure is essential for connecting related skills, failure lessons, and transferable strategies beyond nearest\-neighbor retrieval\.w/o Graph Diffusionperforms worse especially on ALFWorld and AppWorld, suggesting that useful experiences are often structurally related rather than directly retrieved as semantic seeds\. RemovingUtility Rankinghurts performance across settings, showing that semantic relevance alone cannot reliably identify experiences that improve downstream execution\. Finally,w/o Similarity Filteringcauses a smaller but consistent decline, indicating that redundancy control helps maintain a high\-quality experience graph\.

## 5Additional Related Work

Reinforcement learning has become an important mechanism for improving LLMs and LLM agents, from PPO\-based alignment\(Ouyanget al\.,[2022](https://arxiv.org/html/2605.30712#bib.bib30); Liuet al\.,[2024](https://arxiv.org/html/2605.30712#bib.bib31)\)to DPO\(Rafailovet al\.,[2023](https://arxiv.org/html/2605.30712#bib.bib32)\), process reward models\(Lightmanet al\.,[2023](https://arxiv.org/html/2605.30712#bib.bib33)\), self\-play\(Chenet al\.,[2024](https://arxiv.org/html/2605.30712#bib.bib34)\), self\-correction\(Kumaret al\.,[2024](https://arxiv.org/html/2605.30712#bib.bib35)\), and recent reasoning\-oriented variants such as GRPO\(Shaoet al\.,[2024](https://arxiv.org/html/2605.30712#bib.bib36)\), Dr\.GRPO\(Liuet al\.,[2025](https://arxiv.org/html/2605.30712#bib.bib37)\), GSPO\(Zhenget al\.,[2025](https://arxiv.org/html/2605.30712#bib.bib38)\), and Clip\-Cov\(Cuiet al\.,[2025](https://arxiv.org/html/2605.30712#bib.bib39)\)\. Beyond directly optimizing LLM policies, experience\-learning systems store and reuse prior interactions: ReAct\(Yaoet al\.,[2022](https://arxiv.org/html/2605.30712#bib.bib96)\)and Reflexion\(Shinnet al\.,[2023](https://arxiv.org/html/2605.30712#bib.bib48)\)use reasoning traces and verbal feedback, while ExpeL\(Zhaoet al\.,[2024](https://arxiv.org/html/2605.30712#bib.bib49)\), ReasoningBank\(Ouyanget al\.,[2025](https://arxiv.org/html/2605.30712#bib.bib44)\), LightMem\(Fanget al\.,[2025](https://arxiv.org/html/2605.30712#bib.bib91)\), Mem0\(Chhikaraet al\.,[2025](https://arxiv.org/html/2605.30712#bib.bib92)\), and AWM\(Wanget al\.,[2024](https://arxiv.org/html/2605.30712#bib.bib93)\)construct external memories from past trajectories; IRCoT\(Trivediet al\.,[2023](https://arxiv.org/html/2605.30712#bib.bib42)\), Search\-o1\(Liet al\.,[2025](https://arxiv.org/html/2605.30712#bib.bib43)\), S3\(Jianget al\.,[2025b](https://arxiv.org/html/2605.30712#bib.bib95)\), and MemRL\(Zhanget al\.,[2026](https://arxiv.org/html/2605.30712#bib.bib94)\)further introduce learned retrieval or episodic\-memory policies\. However, these methods either update the LLM policy in a model\-dependent way or store experiences as flat memories retrieved by semantic similarity, struggling to distinguish merely relevant experiences from those that truly improve downstream performance\.ExpGraphaddresses this by keeping the executor frozen, organizing past successes and failures into a graph\-structured memory, and training an external retrieval copilot with utility\-grounded feedback—enabling executor\-agnostic experience learning without modifying the underlying LLM agent\.

## 6Conclusion

We presentExpGraph, a model\-agnostic experience learning framework that improves frozen LLM executors through external experience reuse\.ExpGraphorganizes past trajectories into a self\-evolving experience graph, uses utility\-aware ranking to select performance\-improving experiences, and trains a retrieval copilot to adapt selection across tasks and executors\. Experiments on ExpSuite demonstrate improvements in both task performance and decision efficiency across static and agentic settings, establishing graph\-structured experience learning as a flexible path for enabling LLM agents to learn from experience without retraining the executor\.

## References

- \[1\]\(2025\-05\)System card: Claude Opus 4 & Claude Sonnet 4\.Technical reportAnthropic\.External Links:[Link](https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686f4f3b2ff47.pdf)Cited by:[§4\.2](https://arxiv.org/html/2605.30712#S4.SS2.p1.1)\.
- \[2\]M\. Chen\(2021\)Evaluating large language models trained on code\.arXiv preprint arXiv:2107\.03374\.Cited by:[§4](https://arxiv.org/html/2605.30712#S4.p2.1)\.
- \[3\]Z\. Chen, Y\. Deng, H\. Yuan, K\. Ji, and Q\. Gu\(2024\)Self\-play fine\-tuning converts weak language models to strong language models\.arXiv preprint arXiv:2401\.01335\.Cited by:[§5](https://arxiv.org/html/2605.30712#S5.p1.1)\.
- \[4\]P\. Chhikara, D\. Khant, S\. Aryan, T\. Singh, and D\. Yadav\(2025\)Mem0: building production\-ready ai agents with scalable long\-term memory\.arXiv preprint arXiv:2504\.19413\.Cited by:[§F\.2](https://arxiv.org/html/2605.30712#A6.SS2.p5.1),[Table 1](https://arxiv.org/html/2605.30712#S1.T1.6.1.3.1),[§1](https://arxiv.org/html/2605.30712#S1.p2.1),[§4](https://arxiv.org/html/2605.30712#S4.p3.1),[§5](https://arxiv.org/html/2605.30712#S5.p1.1)\.
- \[5\]P\. Clark, I\. Cowhey, O\. Etzioni, T\. Khot, A\. Sabharwal, C\. Schoenick, and O\. Tafjord\(2018\)Think you have solved question answering? try arc, the ai2 reasoning challenge\.arXiv:1803\.05457v1\.Cited by:[Appendix A](https://arxiv.org/html/2605.30712#A1.SS0.SSS0.Px1.p1.1),[§D\.1\.1](https://arxiv.org/html/2605.30712#A4.SS1.SSS1.p1.1),[§4](https://arxiv.org/html/2605.30712#S4.p2.1)\.
- \[6\]K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman\(2021\)Training verifiers to solve math word problems\.CoRR\.External Links:2110\.14168Cited by:[Appendix A](https://arxiv.org/html/2605.30712#A1.SS0.SSS0.Px1.p1.1),[§D\.1\.2](https://arxiv.org/html/2605.30712#A4.SS1.SSS2.p1.1),[§4](https://arxiv.org/html/2605.30712#S4.p2.1)\.
- \[7\]G\. Cui, Y\. Zhang, J\. Chen, L\. Yuan, Z\. Wang, Y\. Zuo, H\. Li, Y\. Fan, H\. Chen, W\. Chen,et al\.\(2025\)The entropy mechanism of reinforcement learning for reasoning language models\.arXiv preprint arXiv:2505\.22617\.Cited by:[§5](https://arxiv.org/html/2605.30712#S5.p1.1)\.
- \[8\]G\. DeepMind\(2026\-03\)Gemini 3\.1 Flash\-Lite model card\.Technical reportGoogle DeepMind\.External Links:[Link](https://deepmind.google/models/model-cards/gemini-3-1-flash-lite/)Cited by:[§4](https://arxiv.org/html/2605.30712#S4.p2.1)\.
- \[9\]J\. Fang, X\. Deng, H\. Xu, Z\. Jiang, Y\. Tang, Z\. Xu, S\. Deng, Y\. Yao, M\. Wang, S\. Qiao,et al\.\(2025\)Lightmem: lightweight and efficient memory\-augmented generation\.arXiv preprint arXiv:2510\.18866\.Cited by:[§F\.2](https://arxiv.org/html/2605.30712#A6.SS2.p4.2),[§4](https://arxiv.org/html/2605.30712#S4.p3.1),[§5](https://arxiv.org/html/2605.30712#S5.p1.1)\.
- \[10\]A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan,et al\.\(2024\)The llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.Cited by:[§4](https://arxiv.org/html/2605.30712#S4.p2.1)\.
- \[11\]D\. Guo, D\. Yang, H\. Zhang, J\. Song, R\. Zhang, R\. Xu, Q\. Zhu, S\. Ma, P\. Wang, X\. Bi,et al\.\(2025\)Deepseek\-r1: incentivizing reasoning capability in llms via reinforcement learning\.arXiv preprint arXiv:2501\.12948\.Cited by:[§4\.2](https://arxiv.org/html/2605.30712#S4.SS2.p1.1)\.
- \[12\]P\. Han, R\. Kocielnik, P\. Song, R\. Debnath, D\. Mobbs, A\. Anandkumar, and R\. M\. Alvarez\(2025\)The personality illusion: revealing dissociation between self\-reports & behavior in llms\.External Links:2509\.03730,[Link](https://arxiv.org/abs/2509.03730)Cited by:[Appendix A](https://arxiv.org/html/2605.30712#A1.SS0.SSS0.Px2.p1.1)\.
- \[13\]P\. Han, X\. Xu, K\. Xuan, P\. Song, S\. Ouyang, R\. Tian, Y\. Jiang, C\. Qian, P\. Jiang, J\. Sun,et al\.\(2026\)Steer2Adapt: dynamically composing steering vectors elicits efficient adaptation of llms\.arXiv preprint arXiv:2602\.07276\.Cited by:[Appendix A](https://arxiv.org/html/2605.30712#A1.SS0.SSS0.Px2.p1.1)\.
- \[14\]H\. He, W\. Yao, K\. Ma, W\. Yu, Y\. Dai, H\. Zhang, Z\. Lan, and D\. Yu\(2024\)Webvoyager: building an end\-to\-end web agent with large multimodal models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 6864–6890\.Cited by:[Appendix A](https://arxiv.org/html/2605.30712#A1.SS0.SSS0.Px1.p1.1)\.
- \[15\]D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt\(2021\)Measuring massive multitask language understanding\.In9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3\-7, 2021,Cited by:[Appendix A](https://arxiv.org/html/2605.30712#A1.SS0.SSS0.Px1.p1.1),[§D\.1\.1](https://arxiv.org/html/2605.30712#A4.SS1.SSS1.p4.1),[§4](https://arxiv.org/html/2605.30712#S4.p2.1)\.
- \[16\]D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt\(2021\)Measuring mathematical problem solving with the math dataset\.NeurIPS\.Cited by:[Appendix A](https://arxiv.org/html/2605.30712#A1.SS0.SSS0.Px1.p1.1),[§D\.1\.2](https://arxiv.org/html/2605.30712#A4.SS1.SSS2.p3.1),[§4](https://arxiv.org/html/2605.30712#S4.p2.1)\.
- \[17\]P\. Jansen, M\. Côté, T\. Khot, E\. Bransom, B\. Dalvi Mishra, B\. P\. Majumder, O\. Tafjord, and P\. Clark\(2024\)Discoveryworld: a virtual environment for developing and evaluating automated scientific discovery agents\.Advances in Neural Information Processing Systems37,pp\. 10088–10116\.Cited by:[Appendix A](https://arxiv.org/html/2605.30712#A1.SS0.SSS0.Px1.p1.1)\.
- \[18\]P\. Jiang, J\. Lin, Z\. Shi, Z\. Wang, L\. He, Y\. Wu, M\. Zhong, P\. Song, Q\. Zhang, H\. Wang,et al\.\(2025\)Adaptation of agentic ai\.arXiv preprint arXiv:2512\.16301\.Cited by:[Appendix A](https://arxiv.org/html/2605.30712#A1.SS0.SSS0.Px1.p1.1)\.
- \[19\]P\. Jiang, X\. Xu, J\. Lin, J\. Xiao, Z\. Wang, J\. Sun, and J\. Han\(2025\)S3: you don’t need that much data to train a search agent via rl\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 21610–21628\.Cited by:[Table 1](https://arxiv.org/html/2605.30712#S1.T1.6.1.5.1),[§1](https://arxiv.org/html/2605.30712#S1.p2.1),[§4](https://arxiv.org/html/2605.30712#S4.p3.1),[§5](https://arxiv.org/html/2605.30712#S5.p1.1)\.
- \[20\]M\. Khalifa, R\. Agarwal, L\. Logeswaran, J\. Kim, H\. Peng, M\. Lee, H\. Lee, and L\. Wang\(2025\)Process reward models that think\.arXiv preprint arXiv:2504\.16828\.Cited by:[Appendix A](https://arxiv.org/html/2605.30712#A1.SS0.SSS0.Px2.p1.1)\.
- \[21\]A\. Kumar, V\. Zhuang, R\. Agarwal, Y\. Su, J\. D\. Co\-Reyes, A\. Singh, K\. Baumli, S\. Iqbal, C\. Bishop, R\. Roelofs,et al\.\(2024\)Training language models to self\-correct via reinforcement learning\.arXiv preprint arXiv:2409\.12917\.Cited by:[§5](https://arxiv.org/html/2605.30712#S5.p1.1)\.
- \[22\]W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. Gonzalez, H\. Zhang, and I\. Stoica\(2023\)Efficient memory management for large language model serving with pagedattention\.InProceedings of the 29th symposium on operating systems principles,pp\. 611–626\.Cited by:[Appendix B](https://arxiv.org/html/2605.30712#A2.SS0.SSS0.Px8.p1.1)\.
- \[23\]X\. Li, G\. Dong, J\. Jin, Y\. Zhang, Y\. Zhou, Y\. Zhu, P\. Zhang, and Z\. Dou\(2025\)Search\-o1: agentic search\-enhanced large reasoning models\.arXiv preprint arXiv:2501\.05366\.Cited by:[§4](https://arxiv.org/html/2605.30712#S4.p3.1),[§5](https://arxiv.org/html/2605.30712#S5.p1.1)\.
- \[24\]H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe\(2023\)Let’s verify step by step\.InThe Twelfth International Conference on Learning Representations,Cited by:[§5](https://arxiv.org/html/2605.30712#S5.p1.1)\.
- \[25\]A\. Liu, B\. Feng, B\. Xue, B\. Wang, B\. Wu, C\. Lu, C\. Zhao, C\. Deng, C\. Zhang, C\. Ruan,et al\.\(2024\)Deepseek\-v3 technical report\.arXiv preprint arXiv:2412\.19437\.Cited by:[§5](https://arxiv.org/html/2605.30712#S5.p1.1)\.
- \[26\]J\. Liu, C\. S\. Xia, Y\. Wang, and L\. Zhang\(2023\)Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation\.Advances in neural information processing systems36,pp\. 21558–21572\.Cited by:[Appendix A](https://arxiv.org/html/2605.30712#A1.SS0.SSS0.Px1.p1.1),[§D\.1\.3](https://arxiv.org/html/2605.30712#A4.SS1.SSS3.p1.1),[§D\.1\.3](https://arxiv.org/html/2605.30712#A4.SS1.SSS3.p2.1),[§4](https://arxiv.org/html/2605.30712#S4.p2.1)\.
- \[27\]Z\. Liu, C\. Chen, W\. Li, P\. Qi, T\. Pang, C\. Du, W\. S\. Lee, and M\. Lin\(2025\)Understanding r1\-zero\-like training: a critical perspective\.arXiv preprint arXiv:2503\.20783\.Cited by:[§5](https://arxiv.org/html/2605.30712#S5.p1.1)\.
- \[28\]I\. Loshchilov and F\. Hutter\(2017\)Decoupled weight decay regularization\.arXiv preprint arXiv:1711\.05101\.Cited by:[Appendix B](https://arxiv.org/html/2605.30712#A2.SS0.SSS0.Px1.p1.12)\.
- \[29\]K\. Lu and T\. M\. Lab\(2025\)On\-policy distillation\.Thinking Machines Lab: Connectionism\.Note:https://thinkingmachines\.ai/blog/on\-policy\-distillationExternal Links:[Document](https://dx.doi.org/10.64434/tml.20251026)Cited by:[Appendix A](https://arxiv.org/html/2605.30712#A1.SS0.SSS0.Px2.p1.1)\.
- \[30\]Q\. Ma, H\. Zhou, T\. Liu, J\. Yuan, P\. Liu, Y\. You, and H\. Yang\(2023\)Let’s reward step by step: step\-level reward model as the navigators for reasoning\.arXiv preprint arXiv:2310\.10080\.Cited by:[Appendix A](https://arxiv.org/html/2605.30712#A1.SS0.SSS0.Px2.p1.1)\.
- \[31\]T\. Mihaylov, P\. Clark, T\. Khot, and A\. Sabharwal\(2018\)Can a suit of armor conduct electricity? A new dataset for open book question answering\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 \- November 4, 2018,pp\. 2381–2391\.Cited by:[Appendix A](https://arxiv.org/html/2605.30712#A1.SS0.SSS0.Px1.p1.1),[§D\.1\.1](https://arxiv.org/html/2605.30712#A4.SS1.SSS1.p5.1),[§4](https://arxiv.org/html/2605.30712#S4.p2.1)\.
- \[32\]I\. Mirzadeh, K\. Alizadeh, H\. Shahrokhi, O\. Tuzel, S\. Bengio, and M\. Farajtabar\(2025\)GSM\-symbolic: understanding the limitations of mathematical reasoning in large language models\.InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025,Cited by:[Appendix A](https://arxiv.org/html/2605.30712#A1.SS0.SSS0.Px1.p1.1),[§D\.1\.2](https://arxiv.org/html/2605.30712#A4.SS1.SSS2.p2.1),[§4](https://arxiv.org/html/2605.30712#S4.p2.1)\.
- \[33\]L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray,et al\.\(2022\)Training language models to follow instructions with human feedback\.Advances in neural information processing systems35,pp\. 27730–27744\.Cited by:[§5](https://arxiv.org/html/2605.30712#S5.p1.1)\.
- \[34\]S\. Ouyang, J\. Yan, I\. Hsu, Y\. Chen, K\. Jiang, Z\. Wang, R\. Han, L\. T\. Le, S\. Daruki, X\. Tang,et al\.\(2025\)Reasoningbank: scaling agent self\-evolving with reasoning memory\.arXiv preprint arXiv:2509\.25140\.Cited by:[§4](https://arxiv.org/html/2605.30712#S4.p3.1),[§5](https://arxiv.org/html/2605.30712#S5.p1.1)\.
- \[35\]Y\. Pan, D\. Kong, S\. Zhou, C\. Cui, Y\. Leng, B\. Jiang, H\. Liu, Y\. Shang, S\. Zhou, T\. Wu,et al\.\(2024\)Webcanvas: benchmarking web agents in online environments\.arXiv preprint arXiv:2406\.12373\.Cited by:[Appendix A](https://arxiv.org/html/2605.30712#A1.SS0.SSS0.Px1.p1.1)\.
- \[36\]J\. Park, B\. Min, K\. Son, J\. Y\. Song, X\. Ma, and J\. Kim\(2026\)Choicemates: supporting unfamiliar online decision\-making with multi\-agent conversational interactions\.InProceedings of the 31st International Conference on Intelligent User Interfaces,pp\. 1526–1550\.Cited by:[Appendix A](https://arxiv.org/html/2605.30712#A1.SS0.SSS0.Px1.p1.1)\.
- \[37\]R\. Rafailov, A\. Sharma, E\. Mitchell, C\. D\. Manning, S\. Ermon, and C\. Finn\(2023\)Direct preference optimization: your language model is secretly a reward model\.Advances in neural information processing systems36,pp\. 53728–53741\.Cited by:[§5](https://arxiv.org/html/2605.30712#S5.p1.1)\.
- \[38\]D\. Rein, B\. L\. Hou, A\. C\. Stickland, J\. Petty, R\. Y\. Pang, J\. Dirani, J\. Michael, and S\. R\. Bowman\(2024\)Gpqa: a graduate\-level google\-proof q&a benchmark\.InFirst Conference on Language Modeling,Cited by:[Appendix A](https://arxiv.org/html/2605.30712#A1.SS0.SSS0.Px1.p1.1),[§D\.1\.1](https://arxiv.org/html/2605.30712#A4.SS1.SSS1.p3.1),[§4](https://arxiv.org/html/2605.30712#S4.p2.1)\.
- \[39\]J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov\(2017\)Proximal policy optimization algorithms\.arXiv preprint arXiv:1707\.06347\.Cited by:[Appendix B](https://arxiv.org/html/2605.30712#A2.SS0.SSS0.Px1.p1.12),[§3\.4](https://arxiv.org/html/2605.30712#S3.SS4.p2.2)\.
- \[40\]Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu,et al\.\(2024\)Deepseekmath: pushing the limits of mathematical reasoning in open language models\.arXiv preprint arXiv:2402\.03300\.Cited by:[§5](https://arxiv.org/html/2605.30712#S5.p1.1)\.
- \[41\]N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao\(2023\)Reflexion: language agents with verbal reinforcement learning\.Advances in Neural Information Processing Systems36,pp\. 8634–8652\.Cited by:[§1](https://arxiv.org/html/2605.30712#S1.p1.1),[§4](https://arxiv.org/html/2605.30712#S4.p3.1),[§5](https://arxiv.org/html/2605.30712#S5.p1.1)\.
- \[42\]M\. Shridhar, X\. Yuan, M\. Côté, Y\. Bisk, A\. Trischler, and M\. Hausknecht\(2020\)Alfworld: aligning text and embodied environments for interactive learning\.arXiv preprint arXiv:2010\.03768\.Cited by:[Appendix A](https://arxiv.org/html/2605.30712#A1.SS0.SSS0.Px1.p1.1),[§D\.2\.1](https://arxiv.org/html/2605.30712#A4.SS2.SSS1.p1.1),[§D\.2\.1](https://arxiv.org/html/2605.30712#A4.SS2.SSS1.p2.1),[§4](https://arxiv.org/html/2605.30712#S4.p2.1)\.
- \[43\]A\. Talmor, J\. Herzig, N\. Lourie, and J\. Berant\(2019\)CommonsenseQA: A question answering challenge targeting commonsense knowledge\.InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL\-HLT 2019, Minneapolis, MN, USA, June 2\-7, 2019, Volume 1 \(Long and Short Papers\),pp\. 4149–4158\.Cited by:[Appendix A](https://arxiv.org/html/2605.30712#A1.SS0.SSS0.Px1.p1.1),[§D\.1\.1](https://arxiv.org/html/2605.30712#A4.SS1.SSS1.p2.1),[§4](https://arxiv.org/html/2605.30712#S4.p2.1)\.
- \[44\]X\. Tang, T\. Hu, M\. Ye, Y\. Shao, X\. Yin, S\. Ouyang, W\. Zhou, P\. Lu, Z\. Zhang, Y\. Zhao,et al\.\(2025\)Chemagent: self\-updating library in large language models improves chemical reasoning\.arXiv preprint arXiv:2501\.06590\.Cited by:[Appendix A](https://arxiv.org/html/2605.30712#A1.SS0.SSS0.Px1.p1.1)\.
- \[45\]H\. Trivedi, N\. Balasubramanian, T\. Khot, and A\. Sabharwal\(2023\)Interleaving retrieval with chain\-of\-thought reasoning for knowledge\-intensive multi\-step questions\.InProceedings of the 61st annual meeting of the association for computational linguistics \(volume 1: long papers\),pp\. 10014–10037\.Cited by:[§4](https://arxiv.org/html/2605.30712#S4.p3.1),[§5](https://arxiv.org/html/2605.30712#S5.p1.1)\.
- \[46\]H\. Trivedi, T\. Khot, M\. Hartmann, R\. Manku, V\. Dong, E\. Li, S\. Gupta, A\. Sabharwal, and N\. Balasubramanian\(2024\)Appworld: a controllable world of apps and people for benchmarking interactive coding agents\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 16022–16076\.Cited by:[Appendix A](https://arxiv.org/html/2605.30712#A1.SS0.SSS0.Px1.p1.1),[§D\.2\.2](https://arxiv.org/html/2605.30712#A4.SS2.SSS2.p1.1),[§D\.2\.2](https://arxiv.org/html/2605.30712#A4.SS2.SSS2.p2.1),[§4](https://arxiv.org/html/2605.30712#S4.p2.1)\.
- \[47\]M\. Turpin, J\. Michael, E\. Perez, and S\. Bowman\(2023\)Language models don’t always say what they think: unfaithful explanations in chain\-of\-thought prompting\.Advances in Neural Information Processing Systems36,pp\. 74952–74965\.Cited by:[Appendix A](https://arxiv.org/html/2605.30712#A1.SS0.SSS0.Px2.p1.1)\.
- \[48\]Z\. Z\. Wang, J\. Mao, D\. Fried, and G\. Neubig\(2024\)Agent workflow memory\.arXiv preprint arXiv:2409\.07429\.Cited by:[§F\.2](https://arxiv.org/html/2605.30712#A6.SS2.p6.1),[§4](https://arxiv.org/html/2605.30712#S4.p3.1),[§5](https://arxiv.org/html/2605.30712#S5.p1.1)\.
- \[49\]J\. Wei, Z\. Sun, S\. Papay, S\. McKinney, J\. Han, I\. Fulford, H\. W\. Chung, A\. T\. Passos, W\. Fedus, and A\. Glaese\(2025\)BrowseComp: a simple yet challenging benchmark for browsing agents\.External Links:2504\.12516,[Link](https://arxiv.org/abs/2504.12516)Cited by:[Appendix A](https://arxiv.org/html/2605.30712#A1.SS0.SSS0.Px1.p1.1)\.
- \[50\]X\. Xu, J\. Xiao, J\. Barry, M\. El\-karef, J\. Zou, P\. Jiang, Y\. Zhang, M\. J\. Giammona, G\. Mel, and J\. Han\(2026\)Zero\-shot open\-schema entity structure discovery\.InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 7547–7561\.Cited by:[Appendix A](https://arxiv.org/html/2605.30712#A1.SS0.SSS0.Px1.p1.1)\.
- \[51\]A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[§4](https://arxiv.org/html/2605.30712#S4.p2.1)\.
- \[52\]S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao\(2022\)React: synergizing reasoning and acting in language models\.arXiv preprint arXiv:2210\.03629\.Cited by:[§1](https://arxiv.org/html/2605.30712#S1.p1.1),[§4](https://arxiv.org/html/2605.30712#S4.p3.1),[§5](https://arxiv.org/html/2605.30712#S5.p1.1)\.
- \[53\]T\. Ye, L\. Dong, X\. Wu, S\. Huang, and F\. Wei\(2026\)On\-policy context distillation for language models\.External Links:2602\.12275,[Link](https://arxiv.org/abs/2602.12275)Cited by:[Appendix A](https://arxiv.org/html/2605.30712#A1.SS0.SSS0.Px2.p1.1)\.
- \[54\]M\. Yuksekgonul, F\. Bianchi, J\. Boen, S\. Liu, Z\. Huang, C\. Guestrin, and J\. Zou\(2024\)Textgrad: automatic" differentiation" via text\.arXiv preprint arXiv:2406\.07496\.Cited by:[Appendix A](https://arxiv.org/html/2605.30712#A1.SS0.SSS0.Px1.p1.1)\.
- \[55\]S\. Zhang, J\. Wang, R\. Zhou, J\. Liao, Y\. Feng, Z\. Li, Y\. Zheng, W\. Zhang, Y\. Wen, Z\. Li,et al\.\(2026\)Memrl: self\-evolving agents via runtime reinforcement learning on episodic memory\.arXiv preprint arXiv:2601\.03192\.Cited by:[Table 1](https://arxiv.org/html/2605.30712#S1.T1.6.1.4.1),[§1](https://arxiv.org/html/2605.30712#S1.p2.1),[§4](https://arxiv.org/html/2605.30712#S4.p3.1),[§5](https://arxiv.org/html/2605.30712#S5.p1.1)\.
- \[56\]A\. Zhao, D\. Huang, Q\. Xu, M\. Lin, Y\. Liu, and G\. Huang\(2024\)Expel: llm agents are experiential learners\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.38,pp\. 19632–19642\.Cited by:[§F\.2](https://arxiv.org/html/2605.30712#A6.SS2.p3.1),[Table 1](https://arxiv.org/html/2605.30712#S1.T1.6.1.2.1),[§1](https://arxiv.org/html/2605.30712#S1.p2.1),[§4](https://arxiv.org/html/2605.30712#S4.p3.1),[§5](https://arxiv.org/html/2605.30712#S5.p1.1)\.
- \[57\]S\. Zhao, Z\. Xie, M\. Liu, J\. Huang, G\. Pang, F\. Chen, and A\. Grover\(2026\)Self\-distilled reasoner: on\-policy self\-distillation for large language models\.arXiv preprint arXiv:2601\.18734\.Cited by:[Appendix A](https://arxiv.org/html/2605.30712#A1.SS0.SSS0.Px2.p1.1)\.
- \[58\]C\. Zheng, S\. Liu, M\. Li, X\. Chen, B\. Yu, C\. Gao, K\. Dang, Y\. Liu, R\. Men, A\. Yang,et al\.\(2025\)Group sequence policy optimization\.arXiv preprint arXiv:2507\.18071\.Cited by:[§5](https://arxiv.org/html/2605.30712#S5.p1.1)\.

Contents of Appendix

## Appendix ALimitations, Future Work, and Broader Impact

##### Limitations\.

AlthoughExpGraphprovides a model\-agnostic way to improve frozen LLM agents through graph\-structured experience reuse, our study has several limitations\. First, our experiments focus on a representative set of static reasoning benchmarks, including question answering\[[5](https://arxiv.org/html/2605.30712#bib.bib80),[43](https://arxiv.org/html/2605.30712#bib.bib78),[38](https://arxiv.org/html/2605.30712#bib.bib51),[15](https://arxiv.org/html/2605.30712#bib.bib77),[31](https://arxiv.org/html/2605.30712#bib.bib79)\], mathematical reasoning\[[6](https://arxiv.org/html/2605.30712#bib.bib73),[32](https://arxiv.org/html/2605.30712#bib.bib74),[16](https://arxiv.org/html/2605.30712#bib.bib75)\], code generation\[[26](https://arxiv.org/html/2605.30712#bib.bib100)\], and agentic environments\[[42](https://arxiv.org/html/2605.30712#bib.bib20),[46](https://arxiv.org/html/2605.30712#bib.bib90)\]\. While these tasks cover diverse forms of experience reuse, future work could further evaluateExpGraphon longer\-horizon real\-world applications, such as web browsing\[[14](https://arxiv.org/html/2605.30712#bib.bib4),[35](https://arxiv.org/html/2605.30712#bib.bib3),[49](https://arxiv.org/html/2605.30712#bib.bib2)\], scientific discovery\[[17](https://arxiv.org/html/2605.30712#bib.bib17),[54](https://arxiv.org/html/2605.30712#bib.bib18),[44](https://arxiv.org/html/2605.30712#bib.bib21),[50](https://arxiv.org/html/2605.30712#bib.bib16)\], and collaborative multi\-agent workflows\[[18](https://arxiv.org/html/2605.30712#bib.bib15),[36](https://arxiv.org/html/2605.30712#bib.bib14)\]\. Second, the experience graph is constructed using embedding\-based semantic similarity with fixed hyperparameters, including neighbor size, similarity threshold, graph capacity, and utility update rate\. Although these choices work well empirically, more adaptive graph construction and pruning strategies may further improve robustness across domains with different memory distributions\.

##### Future Work\.

Several directions remain open for future investigation\. First, the experience graph is currently constructed using embedding\-based semantic similarity with fixed hyperparameters; replacing these with adaptive graph construction and pruning strategies could improve robustness across diverse domains\. Second, the retrieval copilot is trained with scalar utility\-grounded rewards from the downstream task outcomes, richer feedback signals, such as process\-level credit from intermediate task milestones\[[30](https://arxiv.org/html/2605.30712#bib.bib5),[20](https://arxiv.org/html/2605.30712#bib.bib8)\], could sharpen the learning signal further\. Third, the agent currently leverages retrieved experiences through prompt\-based conditioning\. However, prompt\-level instructions are known to translate inconsistently into the model’s downstream behavior\[[12](https://arxiv.org/html/2605.30712#bib.bib11),[13](https://arxiv.org/html/2605.30712#bib.bib10),[47](https://arxiv.org/html/2605.30712#bib.bib9)\], leaving room for mechanisms that more tightly couple experience utilization with the policy itself, such as fine\-tuning on experience\-grounded trajectories\[[29](https://arxiv.org/html/2605.30712#bib.bib6)\], distilling experience\-conditioned reasoning into the model’s parameters\[[53](https://arxiv.org/html/2605.30712#bib.bib13),[57](https://arxiv.org/html/2605.30712#bib.bib12)\]\.

##### Broader Impact\.

ExpGraphenables LLM agents to accumulate and transfer task knowledge through an external graph\-structured memory, without modifying model parameters\. This reduces the need for expensive retraining and lowers deployment costs in resource\-constrained settings\. The executor\-agnostic design further ensures that safety\-aligned, closed\-source models can be adopted without the alignment risks associated with fine\-tuning\. We encourage future deployments ofExpGraphto be accompanied by robust evaluation protocols, human\-in\-the\-loop checkpoints, and auditable retrieval logs\.

## Appendix BImplementation Details

We implementExpGraphwith two main components: a lightweight retrieval copilotπret\\pi\_\{\\mathrm\{ret\}\}and a self\-evolving experience graphG=\(V,E\)G=\(V,E\)\. The executor LLMπexec\\pi\_\{\\mathrm\{exec\}\}is kept frozen throughout all experiments, and only the retrieval copilot and graph memory are updated during training\.

##### Retrieval copilot\.

The retrieval copilotπret\\pi\_\{\\mathrm\{ret\}\}is instantiated with Qwen2\.5\-3B\-Instruct111[https://huggingface\.co/Qwen/Qwen2\.5\-3B\-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct)\. Given a taskxx, the copilot outputs two discrete control variables\(R,W\)∼πret\(⋅∣x\)\(R,W\)\\sim\\pi\_\{\\mathrm\{ret\}\}\(\\cdot\\mid x\), whereR∈\{0,…,100\}R\\in\\\{0,\\ldots,100\\\}controls the breadth of graph diffusion andW∈\{0,…,100\}W\\in\\\{0,\\ldots,100\\\}controls the trade\-off between semantic relevance and historical utility, as described in Eq\.[4](https://arxiv.org/html/2605.30712#S3.E4)\. We rescale them asρ=R/100\\rho=R/100andλ=W/100\\lambda=W/100\. The copilot is optimized with Proximal Policy Optimization \(PPO\)\[[39](https://arxiv.org/html/2605.30712#bib.bib97)\]using the utility\-grounded reward in Eq\.[14](https://arxiv.org/html/2605.30712#S3.E14), withη=1\\eta=1\. We use the verl distributed RL framework with Fully Sharded Data Parallel \(FSDP\) for parameter and gradient offloading\. The actor learning rate is5×10−65\\times 10^\{\-6\}and the critic learning rate is1×10−51\\times 10^\{\-5\}, both using AdamW\[[28](https://arxiv.org/html/2605.30712#bib.bib70)\]with cosine learning\-rate decay and no warmup\. The KL penalty coefficient is set to0\.010\.01, and gradient checkpointing is enabled to reduce memory usage\. The maximum prompt length is 4096 tokens and the maximum response length is 500 tokens\. During training, we sample copilot outputs with temperature1\.01\.0; during evaluation, we use greedy decoding\. We train with a batch size of 16 for up to 500 gradient steps across 2 epochs\. Checkpoints are saved every 20 steps and validation is performed every 30 steps\.

##### Experience graph\.

The experience graph stores each experience as a nodev=\(ev,hv,uv,nv\)v=\(e\_\{v\},h\_\{v\},u\_\{v\},n\_\{v\}\), whereeve\_\{v\}is the natural\-language experience,hvh\_\{v\}is its text embedding,uvu\_\{v\}is its estimated utility, andnvn\_\{v\}is its retrieval count\. We use Contriever222[https://huggingface\.co/facebook/contriever](https://huggingface.co/facebook/contriever)as the embedding model for both task inputs and experience nodes\. When a new node is inserted into the graph, it is connected to its top\-Knn=5K\_\{\\mathrm\{nn\}\}=5nearest neighbors under cosine similarity if the similarity is above the thresholdθ=0\.3\\theta=0\.3, following Eq\.[3](https://arxiv.org/html/2605.30712#S3.E3)\. The graph capacity is capped at\|V\|max=2,000\|V\|\_\{\\max\}=2\{,\}000nodes\. When the capacity is exceeded, we evict nodes with low utility and low retrieval frequency\. For semantic seeding, we retrieve the top\-m=10m=10nodes by cosine similarity as the seed setS0S\_\{0\}in Eq\.[5](https://arxiv.org/html/2605.30712#S3.E5)\. After personalized PageRank expansion in Eq\.[7](https://arxiv.org/html/2605.30712#S3.E7), we construct the candidate set and apply utility\-aware ranking using Eqs\.[8](https://arxiv.org/html/2605.30712#S3.E8)–[9](https://arxiv.org/html/2605.30712#S3.E9)\. The final retrieved set contains the top\-K=10K=10experiences, as defined in Eq\.[10](https://arxiv.org/html/2605.30712#S3.E10)\. The UCB exploration coefficient is set toc=1\.0c=1\.0\. For online graph updates, we update the retrieval count and utility estimate according to Eqs\.[16](https://arxiv.org/html/2605.30712#S3.E16)–[17](https://arxiv.org/html/2605.30712#S3.E17), with exponential moving average rateβ=0\.1\\beta=0\.1\.

##### Experience summarization\.

Each completed trajectoryτ′=\(x,ξ′,ywith,swith\)\\tau^\{\\prime\}=\(x,\\xi^\{\\prime\},y\_\{\\mathrm\{with\}\},s\_\{\\mathrm\{with\}\}\)is summarized into a compact experience unite′=Summarize\(τ′\)e^\{\\prime\}=\\mathrm\{Summarize\}\(\\tau^\{\\prime\}\)following Eq\.[2](https://arxiv.org/html/2605.30712#S3.E2)\. The summarizer extracts reusable information rather than preserving the full trajectory\. High\-scoring trajectories are summarized into skills, such as successful reasoning patterns, planning strategies, or task\-specific heuristics\. Low\-scoring trajectories are summarized into lessons, such as failure modes, invalid actions, or constraints to avoid\. ForExpSuite\-Static, we use Qwen2\.5\-3B\-Instruct in zero\-shot mode as the summarizer\. ForExpSuite\-Agentic, we use Gemini\-3\.1\-Flash\-Lite to generate environment\-specific experience descriptions\. Detailed prompts are provided in Appendix[F](https://arxiv.org/html/2605.30712#A6)\.

##### Cold\-start initialization\.

Before RL training, we initialize the experience graph with a cold\-start procedure\. We sample a subset of training tasks, execute them with the frozen executorπexec\\pi\_\{\\mathrm\{exec\}\}without retrieved experiences, summarize the resulting trajectories, and insert the generated experience nodes into an initially empty graph\. This initialization provides the copilot with a non\-empty graph for early\-stage retrieval and stabilizes training\.

##### Executor LLMs\.

To evaluate whetherExpGraphis executor\-agnostic, we use different frozen executor LLMs across the two ExpSuite settings\. ForExpSuite\-Static, we use Llama\-3\.2\-3B\-Instruct as the small executor and Llama\-3\.1\-8B\-Instruct as the large executor, both accessed through the NVIDIA NIM API333[https://build\.nvidia\.com](https://build.nvidia.com/)\. ForExpSuite\-Agentic, we use Qwen3\-32B444[https://huggingface\.co/Qwen/Qwen3\-32B](https://huggingface.co/Qwen/Qwen3-32B)as the small executor and Gemini\-3\.1\-Flash\-Lite as the large executor\. Qwen3\-32B is accessed through the NVIDIA NIM API, while Gemini\-3\.1\-Flash\-Lite is accessed through the Google Gemini API\. In all cases, the executor is used only through its input\-output interface and is never updated during training or evaluation\.

##### Baseline memory construction\.

For fair comparison, all retrieval\-centric experience learning baselines are provided with the same historical trajectories asExpGraphwhenever applicable\. This includes ReasoningBank, ExpeL, LightMem, Mem0, AWM, and MemRL\. Each baseline converts the trajectories into its own memory representation using the corresponding memory construction strategy and the prompts described in Appendix[F](https://arxiv.org/html/2605.30712#A6)\. This ensures that performance differences mainly reflect how each method stores, retrieves, and exploits experience, rather than differences in the available historical data\.

##### Training and evaluation\.

During training, each sampled task is evaluated twice: once with retrieved experiences and once without retrieved experiences\. The two scores are used to compute the utility\-grounded reward in Eq\.[14](https://arxiv.org/html/2605.30712#S3.E14), which simultaneously updates the retrieval copilot through PPO and updates graph utility statistics through exponential moving average\. After each task, the completed trajectory is summarized into a new candidate experience, filtered for near\-duplicates, and inserted into the graph\. During evaluation, both the copilot and graph are fixed\. The copilot outputs deterministic control variables via greedy decoding, retrieves the top\-KKexperiences, and passes them to the frozen executor without any further graph or policy updates\.

##### Infrastructure\.

All copilot training experiments are conducted on 4 NVIDIA A6000 GPUs with 48GB memory each using BF16 mixed precision\. We use vLLM\[[22](https://arxiv.org/html/2605.30712#bib.bib99)\]for efficient rollout generation with tensor parallelism size 1 and GPU memory utilization set to0\.30\.3\. The experience graph server runs on CPU and communicates with the training loop through HTTP\. A single training run forExpSuite\-Agenticon ALFWorld with 500 steps takes approximately 40 hours, while a training run forExpSuite\-Staticwith 300 steps takes approximately 20 hours\.

## Appendix CTraining Procedure ofExpGraph

Algorithm[1](https://arxiv.org/html/2605.30712#alg1)summarizes the full training procedure ofExpGraph\. At each iteration, the copilot samples retrieval controls\(R,W\)\(R,W\)for a taskxxand uses them to retrieve experiences from the current graph\. The frozen executor is then evaluated with and without the retrieved experiences to compute the utility\-grounded reward\. This reward is used in two ways: it updates the retrieval copilot through PPO and updates the utility statistics of retrieved graph nodes through exponential moving average\. Finally, the completed trajectory is summarized into a new experience, filtered for near\-duplicates, and inserted into the graph\. At inference time, both the copilot and the graph are fixed; the copilot outputs deterministic control variables via greedy decoding, and the retrieved experiences are directly passed to the frozen executor without further updates\.

Algorithm 1Training Procedure ofExpGraph0:Task distribution

𝒟\\mathcal\{D\}, executor

πexec\\pi\_\{\\mathrm\{exec\}\}, initial copilot

πret\\pi\_\{\\mathrm\{ret\}\}, initial graph

G=\(V,E\)G=\(V,E\), reward weight

η\\eta, utility update rate

β\\beta
1:foreach training iterationdo

2:Sample task

x∼𝒟x\\sim\\mathcal\{D\}
3:// Retrieval

4:Sample

\(R,W\)∼πret\(⋅∣x\)\(R,W\)\\sim\\pi\_\{\\mathrm\{ret\}\}\(\\cdot\\mid x\)\(Eq\.[4](https://arxiv.org/html/2605.30712#S3.E4)\)

5:Compute seed set

S0S\_\{0\}via semantic seeding\(Eq\.[5](https://arxiv.org/html/2605.30712#S3.E5)\)

6:Expand candidate set

CCvia personalized PageRank\(Eq\.[7](https://arxiv.org/html/2605.30712#S3.E7)\)

7:Retrieve

ER,W\(x\)E\_\{R,W\}\(x\)via utility\-aware ranking\(Eqs\.[8](https://arxiv.org/html/2605.30712#S3.E8)–[10](https://arxiv.org/html/2605.30712#S3.E10)\)

8:// Execution

9:

swith←S\(x,πexec\(x,ER,W\(x\)\)\)s\_\{\\mathrm\{with\}\}\\leftarrow S\\bigl\(x,\\,\\pi\_\{\\mathrm\{exec\}\}\(x,\\,E\_\{R,W\}\(x\)\)\\bigr\)\(Eq\.[12](https://arxiv.org/html/2605.30712#S3.E12)\)

10:

swithout←S\(x,πexec\(x,∅\)\)s\_\{\\mathrm\{without\}\}\\leftarrow S\\bigl\(x,\\,\\pi\_\{\\mathrm\{exec\}\}\(x,\\,\\emptyset\)\\bigr\)\(Eq\.[13](https://arxiv.org/html/2605.30712#S3.E13)\)

11:// Reward

12:

r←\(swith−swithout\)\+η⋅swithr\\leftarrow\(s\_\{\\mathrm\{with\}\}\-s\_\{\\mathrm\{without\}\}\)\+\\eta\\cdot s\_\{\\mathrm\{with\}\}\(Eq\.[14](https://arxiv.org/html/2605.30712#S3.E14)\)

13:// Copilot update

14:Update

πret\\pi\_\{\\mathrm\{ret\}\}with PPO using reward

rr\(Eq\.[15](https://arxiv.org/html/2605.30712#S3.E15)\)

15:// Graph update

16:foreach retrieved node

v∈ER,W\(x\)v\\in E\_\{R,W\}\(x\)do

17:

nv←nv\+1n\_\{v\}\\leftarrow n\_\{v\}\+1\(Eq\.[16](https://arxiv.org/html/2605.30712#S3.E16)\)

18:

uv←\(1−β\)uv\+βru\_\{v\}\\leftarrow\(1\-\\beta\)\\,u\_\{v\}\+\\beta\\,r\(Eq\.[17](https://arxiv.org/html/2605.30712#S3.E17)\)

19:endfor

20:// Graph growth

21:Summarize

τ′=\(x,ξ′,ywith,swith\)\\tau^\{\\prime\}=\(x,\\xi^\{\\prime\},y\_\{\\mathrm\{with\}\},s\_\{\\mathrm\{with\}\}\)into experience

e′e^\{\\prime\}\(Eq\.[2](https://arxiv.org/html/2605.30712#S3.E2)\)

22:Filter near\-duplicates; evict low\-utility nodes if capacity exceeded

23:Insert

e′e^\{\\prime\}into

GG\(Eq\.[3](https://arxiv.org/html/2605.30712#S3.E3)\)

24:endfor

24:Trained copilot

πret\\pi\_\{\\mathrm\{ret\}\}, evolved experience graph

GG

## Appendix DDataset Descriptions

We describe all evaluation datasets in ExpSuite below, categorized by their corresponding settings in Table[4](https://arxiv.org/html/2605.30712#A4.T4)\. ExpSuite consists of two complementary groups: ExpSuite\-Static and ExpSuite\-Agentic\. ExpSuite\-Static evaluates experience reuse in single\-turn input\-output tasks, including question answering, mathematical reasoning, and code generation\. ExpSuite\-Agentic evaluates experience\-guided decision\-making in multi\-step interactive environments\. For all datasets, we report task\-specific performance metrics, and for interactive tasks we additionally report the number of environment steps to measure decision efficiency\.

Table 4:Detailed summary of datasets used in ExpSuite\.We categorize datasets by setting, task type, and evaluation metric\.DatasetTaskMetricExpSuite\-StaticARC\-CQuestion AnsweringAccuracyCommonsenseQAQuestion AnsweringAccuracyGPQAQuestion AnsweringAccuracyMMLUQuestion AnsweringAccuracyOBQAQuestion AnsweringAccuracyGSM8KMathematical ReasoningExact MatchGSM\-SymbolicMathematical ReasoningExact MatchMATHMathematical ReasoningExact MatchHumanEval\+Code GenerationPass@1MBPP\+Code GenerationPass@1ExpSuite\-AgenticALFWorld\-SeenSeen Interactive Task CompletionSR / \#StepsALFWorld\-UnseenUnseen Interactive Task CompletionSR / \#StepsAppWorld\-Test\-NNormal App\-based Workflow ExecutionSR / \#StepsAppWorld\-Test\-CCompositional App\-based Workflow ExecutionSR / \#Steps### D\.1ExpSuite\-Static

#### D\.1\.1Question Answering

ARC\(AI2 Reasoning Challenge\)\[[5](https://arxiv.org/html/2605.30712#bib.bib80)\]contains 7\.8K science questions drawn from grade\-school standardized exams\. Following common practice, we adopt the Challenge split \(ARC\-C\), which retains only questions that cannot be solved by simple retrieval or lexical co\-occurrence baselines and therefore demand genuine reasoning\.

CommonsenseQA\[[43](https://arxiv.org/html/2605.30712#bib.bib78)\]consists of 12\.2K multiple\-choice questions built on top of ConceptNet, where answering correctly requires drawing on implicit, everyday world knowledge about concepts and their relations rather than facts stated in the question itself\.

GPQA\(Graduate\-level Google\-Proof QA\)\[[38](https://arxiv.org/html/2605.30712#bib.bib51)\]is a challenging set of 448 multiple\-choice questions in biology, physics, and chemistry\. The questions are crafted so that web search is largely unhelpful while domain experts can still answer them, making the benchmark a probe of deep subject\-matter expertise rather than surface lookup\.

MMLU\(Massive Multitask Language Understanding\)\[[15](https://arxiv.org/html/2605.30712#bib.bib77)\]covers 57 subjects across STEM, the humanities, and the social sciences\. Its multiple\-choice questions span difficulties from elementary school to professional level, jointly testing factual recall and reasoning\.

OpenBookQA\[[31](https://arxiv.org/html/2605.30712#bib.bib79)\]contains 5\.9K elementary\-school science questions modeled after open\-book exams\. Solving each item requires combining a given scientific fact with additional commonsense knowledge, which probes multi\-hop reasoning over scientific concepts\.

#### D\.1\.2Mathematical Reasoning

GSM8K\[[6](https://arxiv.org/html/2605.30712#bib.bib73)\]is a collection of 8\.5K grade\-school math word problems, each solvable in 2–8 sequential arithmetic steps over the four basic operations\. We evaluate by exact match on the final numeric answer\.

GSM\-Symbolic\[[32](https://arxiv.org/html/2605.30712#bib.bib74)\]is a symbolic variant of GSM8K in which concrete numbers are replaced by variables, shifting the task from numerical calculation to symbolic manipulation and offering a cleaner test of mathematical understanding versus surface memorization\.

MATH\[[16](https://arxiv.org/html/2605.30712#bib.bib75)\]consists of 12\.5K competition\-level problems spanning algebra, geometry, number theory, probability, counting, and precalculus\. Drawn from contests such as AMC and AIME, the problems require both multi\-step reasoning and substantial mathematical background\.

#### D\.1\.3Code Generation

HumanEval\+is an extension of HumanEval\[[26](https://arxiv.org/html/2605.30712#bib.bib100)\]with a substantially expanded test suite that reduces false positives\. It comprises 164 Python programming problems specified by a function signature and docstring, where models must produce code that passes all hidden tests\.

MBPP\+\[[26](https://arxiv.org/html/2605.30712#bib.bib100)\]strengthens the Mostly Basic Python Problems benchmark with more rigorous tests, covering 974 crowd\-sourced entry\-level Python tasks that target standard algorithmic patterns and basic programming proficiency\.

### D\.2ExpSuite\-Agentic

#### D\.2\.1ALFWorld

ALFWorld\-Seen\[[42](https://arxiv.org/html/2605.30712#bib.bib20)\]is a text\-based interactive household benchmark in which an agent reads natural\-language observations and issues action commands step by step to complete household goals\. TheSeenevaluation split \(valid\_seen\) comprises 140 episodes spanning six task types \(e\.g\.,pick\-and\-place,pick\-heat\-then\-place,look\-at\-in\-light\); room layouts overlap with training scenarios, so the agent encounters familiar spatial configurations but with novel task instances, isolating generalization at the instruction level\.

ALFWorld\-Unseen\[[42](https://arxiv.org/html/2605.30712#bib.bib20)\]shares the same six task types but evaluates onvalid\_unseen: 134 episodes set in entirely new room layouts with novel object placements, providing a stricter out\-of\-distribution test of the agent’s ability to generalize beyond environments seen during training\. For both splits we report task success rate \(SR\) and average environment steps per episode \(\#Steps\)\.

#### D\.2\.2AppWorld

AppWorld\-Test\-N\[[46](https://arxiv.org/html/2605.30712#bib.bib90)\]is a controlled benchmark in which an agent executes day\-to\-day workflows across nine simulated applications \(e\.g\., Gmail, Spotify, Amazon, Todoist, Venmo\) by writing Python code that calls their APIs\. TheNormaltest split \(test\_normal\) consists of 56 scenarios \(168 tasks\) representing routine single\-app or straightforward cross\-app operations\.

AppWorld\-Test\-C\[[46](https://arxiv.org/html/2605.30712#bib.bib90)\]evaluates agents on theCompositionalsplit \(test\_challenge\): 139 scenarios \(417 tasks\) that require coordinating complex multi\-app sequences, resolving cross\-service dependencies, and handling compositional reasoning across 457 APIs—forming a substantially harder probe of planning and tool\-use capabilities\. We report task success rate \(SR\) and mean environment steps \(\#Steps\) for both splits\.

## Appendix EDataset Statistics

In this section, we present detailed statistics for each dataset\. Specifically, the statistics for ExpSuite\-Static and ExpSuite\-Agentic are provided in Table[5](https://arxiv.org/html/2605.30712#A5.T5)and Table[6](https://arxiv.org/html/2605.30712#A5.T6), respectively\. For each ExpSuite\-Static dataset, we randomly sample 1,500 instances and divide them into training, validation, and test sets following a 5:2:3 ratio; for datasets whose original size is smaller than 1,500 \(GPQA, HumanEval\+, and MBPP\+\), we apply the same ratio over all available instances\. For ExpSuite\-Agentic, we adopt the official splits released by each benchmark: ALFWorld provides separateSeenandUnseentest environments, while AppWorld shares a single training set across two test variants,Test\-N\(normal\) andTest\-C\(challenge\)\.

Table 5:ExpSuite\-Static Data Statistics\.HEval\+ denotes HumanEval\+\.SplitMathQACodeGSM8KGSM\-symMATHMMLUCSQAOBQAARC\-CGPQAHEval\+MBPP\+Train7507507507507507507509965132Valid300300300300300300300392652Test450450450450450450450603980Table 6:ExpSuite\-Agentic Data Statistics\.SplitALFWorldAppWorldSeenUnseenTest\-NTest\-CTrain3,5533,5539090Test140134168417
## Appendix FPrompt Usage

This section documents the prompt templates used across all baselines on both ExpSuite\-Static \(QA, Math, and Coding tasks\) and ExpSuite\-Agentic \(ALFWorld and AppWorld tasks\)\.

### F\.1Copilot Prompting

The retrieval copilotπret\\pi\_\{\\mathrm\{ret\}\}receives a task description and outputs two control parameters\(R,W\)\(R,W\)that govern how experiences are retrieved from the experience graph \(see §[3\.3](https://arxiv.org/html/2605.30712#S3.SS3)\)\. We design task\-category\-specific prompt templates that share a common structure but differ in role descriptions and task framing\. All templates instruct the copilot to output its parameters in the format<search\>R:W</search\>, optionally preceded by brief reasoning in<think\>\.\.\.</think\>tags\. Below we describe the three prompt categories and their design rationale\.

##### ExpSuite\-Agentic: ALFWorld\.

For ALFWorld, the copilot is framed as a*retrieval strategy controller for an embodied household agent*\. The prompt presents the task description \(e\.g\.,pick\_and\_place\-Mug\-None\-Cabinet\-312\) and explains howRRandWWcontrol graph diffusion scope and utility weighting, respectively\. Because ALFWorld tasks fall into a small number of recurring categories \(e\.g\., pick\-and\-place, heat\-then\-place, examine\-in\-light\), the copilot can learn to associate task types with effective retrieval strategies—for instance, using highWWfor familiar task types where historical utility is reliable, and lowRRfor novel configurations that benefit from broader exploration\. The full prompt is shown in Table[7](https://arxiv.org/html/2605.30712#A6.T7)\.

##### ExpSuite\-Agentic: AppWorld\.

For AppWorld, the copilot is framed as a*retrieval strategy controller for a coding agent in an app\-based environment*\. The prompt structure is identical to ALFWorld, with the task description replaced by a natural\-language instruction \(e\.g\., “What is the title of the most\-liked song in my Spotify playlists”\)\. AppWorld tasks are more diverse than ALFWorld in terms of required APIs and multi\-app dependencies, so the copilot must learn to vary its retrieval strategy across a wider range of task structures\. The full prompt is shown in Table[8](https://arxiv.org/html/2605.30712#A6.T8)\.

##### ExpSuite\-Static: QA, Math, and Code\.

For ExpSuite\-Static, the copilot is framed as a*retrieval strategy controller for a reasoning agent*\. Each task category uses a tailored role description and answer format \(letter for QA, numerical value for math, Python function for code\), but the retrieval control interface remains the same: the copilot outputs\(R,W\)\(R,W\)to control graph diffusion and utility\-aware ranking\. Unlike the agentic setting where the copilot and executor are separate models, in ExpSuite\-Static the copilot additionally performs the reasoning and answer generation itself after receiving the retrieved experiences\. The full prompts for question answering, mathematical reasoning, and code generation are shown in Tables[9](https://arxiv.org/html/2605.30712#A6.T9),[10](https://arxiv.org/html/2605.30712#A6.T10), and[11](https://arxiv.org/html/2605.30712#A6.T11), respectively\.

Table 7:Copilot prompt template for ALFWorld\(ExpSuite\-Agentic\)\. The copilot outputs retrieval parameters\(R,W\)\(R,W\)for graph\-based experience retrieval\.\{task\_description\}is replaced with the ALFWorld task identifier\.<\|im\_start\|\>userYou are a retrieval strategy controller for an embodied agent in the ALFRED household environment\.Task:\{task\_description\}You control how experiences are retrieved from a knowledge graph by setting two parameters:R \(0\-100\): Graph exploration scope\.Low R \(0\-30\): Explore widely through connected experiences in the graph \(broad discovery\)\.High R \(70\-100\): Stay close to the most similar experiences \(precise, local search\)\.W \(0\-100\): Selection preference between semantic similarity and proven effectiveness\.Low W \(0\-30\): Prefer experiences most similar to this task \(safe, like basic search\)\.High W \(70\-100\): Prefer experiences that historically led to task success \(trust past results\)\.When W=0, retrieval is identical to basic cosine search\. Increase W to leverage historical feedback\.Output format:<search\>R:W</search\>You may reason briefly in<think\>\.\.\.</think\>tags first\.Examples:\- New/unfamiliar task type→\\rightarrow<search\>20:10</search\>\- Common task with good past data→\\rightarrow<search\>80:70</search\>\- Moderate confidence→\\rightarrow<search\>50:40</search\>Output your retrieval parameters now\. Complete the tag:<search\><question\>\{task\_description\}</question\><\|im\_end\|\>Table 8:Copilot prompt template for AppWorld\(ExpSuite\-Agentic\)\. Structure is identical to ALFWorld; only the role description and task content differ\.<\|im\_start\|\>userYou are a retrieval strategy controller for a coding agent in the AppWorld environment\.Task:\{task\_description\}You control how experiences are retrieved from a knowledge graph by setting two parameters:R \(0\-100\): Graph exploration scope\.Low R \(0\-30\): Explore widely through connected experiences in the graph \(broad discovery\)\.High R \(70\-100\): Stay close to the most similar experiences \(precise, local search\)\.W \(0\-100\): Selection preference between semantic similarity and proven effectiveness\.Low W \(0\-30\): Prefer experiences most similar to this task \(safe, like basic search\)\.High W \(70\-100\): Prefer experiences that historically led to task success \(trust past results\)\.When W=0, retrieval is identical to basic cosine search\. Increase W to leverage historical feedback\.Output format:<search\>R:W</search\>You may reason briefly in<think\>\.\.\.</think\>tags first\.Examples:\- New/unfamiliar task type→\\rightarrow<search\>20:10</search\>\- Common task with good past data→\\rightarrow<search\>80:70</search\>\- Moderate confidence→\\rightarrow<search\>50:40</search\>Output your retrieval parameters now\. Complete the tag:<search\><question\>\{task\_description\}</question\><\|im\_end\|\>Table 9:Copilot prompt template for Question Answering\(ExpSuite\-Static\)\. The copilot first outputs\(R,W\)\(R,W\)to retrieve experiences, then reasons and answers using the retrieved context\.\{question\}and\{choices\}are replaced with the task content\.<\|im\_start\|\>systemYou are a helpful assistant skilled in logical reasoning and multi\-domain knowledge\. Solve problems step by step, showing your work clearly\.<\|im\_end\|\><\|im\_start\|\>userAnswer the following multiple\-choice question\.You must reason inside<think\>and</think\>\.You can search for relevant problem\-solving experiences using<search\>R:W</search\>, where R \(0\-100\) controls graph exploration scope and W \(0\-100\) controls the trade\-off between semantic similarity and historical utility\.The search returns lessons learned from similar questions solved before \(both successful strategies and common mistakes\)\.You can search multiple times\.When ready, provide ONLY the answer letter inside<answer\>and</answer\>\.For example:<answer\> B </answer\>Question:\{question\}\{choices\}Analyze each option and explain your reasoning\.Write your final answer as:<answer\> \[letter\] </answer\><\|im\_end\|\>Table 10:Copilot prompt template for Mathematical Reasoning\(ExpSuite\-Static\)\.\{question\}is replaced with the math problem\.<\|im\_start\|\>systemYou are a helpful assistant skilled in mathematics and numerical reasoning\. Solve problems step by step, showing your work clearly\.<\|im\_end\|\><\|im\_start\|\>userSolve the following math problem step by step\.You must reason inside<think\>and</think\>\.You can search for relevant problem\-solving experiences from past attempts using<search\>R:W</search\>, where R \(0\-100\) controls graph exploration scope and W \(0\-100\) controls the trade\-off between semantic similarity and historical utility\.The search returns lessons from similar problems that were solved before \(both successful strategies and common mistakes\)\.You can search multiple times\.When ready, provide ONLY the final numerical answer inside<answer\>and</answer\>\.For example:<answer\> 42 </answer\>Question:\{question\}Think step by step, showing your calculations\.Write your final answer as:<answer\> \[number\] </answer\><\|im\_end\|\>Table 11:Copilot prompt template for Code Generation\(ExpSuite\-Static\)\.\{task\},\{function\_signature\}, and\{test\_cases\}are replaced with the coding problem specification\.<\|im\_start\|\>systemYou are an expert Python programmer\. Solve coding problems step by step\.<\|im\_end\|\><\|im\_start\|\>userSolve the following coding problem\.You must reason inside<think\>and</think\>\.You can search for relevant coding experiences using<search\>R:W</search\>, where R \(0\-100\) controls graph exploration scope and W \(0\-100\) controls the trade\-off between semantic similarity and historical utility\.The search returns lessons from similar coding tasks solved before \(both successful patterns and common bugs\)\.You can search multiple times\.When ready, provide your complete Python function inside<answer\>and</answer\>\.Task:\{task\}Function signature:\{function\_signature\}Your code should pass these tests:\{test\_cases\}Think through your approach, then write the Python function\.Write your final code as:<answer\> \[python code\] </answer\><\|im\_end\|\>

### F\.2Experience Construction

Each baseline that maintains an experience bank processes training rollout trajectories via LLM calls to produce structured memory items, but the construction strategy differs substantially across methods\. For ExpSuite\-Static, all methods share a common set of construction prompts described below\. For ExpSuite\-Agentic, each method additionally uses environment\-specific prompts tailored to ALFWorld household manipulation and AppWorld API\-based workflows, described alongside the static prompts for each method below\.

ReasoningBankuses a unified extractor that processes each trajectory into a structured memory item\. For ExpSuite\-Static, successful trajectories are processed with the prompt in Table[12](https://arxiv.org/html/2605.30712#A6.T12); failed trajectories use Table[13](https://arxiv.org/html/2605.30712#A6.T13)\. For ExpSuite\-Agentic, ALFWorld trajectories use Table[14](https://arxiv.org/html/2605.30712#A6.T14), and AppWorld trajectories use Table[15](https://arxiv.org/html/2605.30712#A6.T15)\.

ExpeL\[[56](https://arxiv.org/html/2605.30712#bib.bib49)\]maintains a global rule set updated via two critique passes\. A*compare\-critique*pass pairs each success with failures from the same dataset and asks the LLM to emit ADD/EDIT/REMOVE/AGREE operations over existing rules \(Table[16](https://arxiv.org/html/2605.30712#A6.T16)\)\. An*all\-success*pass fires everyS=8S=8successes per dataset and similarly refines rules from positive evidence alone \(Table[17](https://arxiv.org/html/2605.30712#A6.T17)\)\. In addition, every rollout is stored verbatim as an exemplar memory item \(withtitle=‘‘exemplar’’\), and finalized rules are stored as items withtitle=‘‘rule’’\. For ExpSuite\-Agentic, ExpeL uses environment\-specific critique prompts for ALFWorld \(Table[18](https://arxiv.org/html/2605.30712#A6.T18)\) and AppWorld \(Table[19](https://arxiv.org/html/2605.30712#A6.T19)\)\.

LightMem\[[9](https://arxiv.org/html/2605.30712#bib.bib91)\]implements a three\-stage STM→\\toLTM pipeline\. A short\-term memory buffer \(sizeK=5K=5\) is periodically flushed to long\-term memory via a MemoryExtractor LLM call using Table[12](https://arxiv.org/html/2605.30712#A6.T12)and Table[13](https://arxiv.org/html/2605.30712#A6.T13)\. A separate offline consolidation pass then reconciles similar LTM entries using the UPDATE/DELETE/IGNORE prompt in Table[20](https://arxiv.org/html/2605.30712#A6.T20)\. For ExpSuite\-Agentic, LightMem uses environment\-specific extraction prompts for ALFWorld \(Table[21](https://arxiv.org/html/2605.30712#A6.T21)\) and AppWorld \(Table[22](https://arxiv.org/html/2605.30712#A6.T22)\)\.

Mem0\[[4](https://arxiv.org/html/2605.30712#bib.bib92)\]runs a two\-step write pipeline per rollout\. Step 1 extracts salient facts from the \(problem, response\) pair via Table[23](https://arxiv.org/html/2605.30712#A6.T23)\. Step 2 retrieves the top\-KKmost similar existing memories, then asks the LLM to emit ADD/UPDATE/DELETE/NONE operations via Table[24](https://arxiv.org/html/2605.30712#A6.T24)\. For successful rollouts, a third reflect step generates a procedural summary via Table[25](https://arxiv.org/html/2605.30712#A6.T25)\. For ExpSuite\-Agentic, Mem0 uses environment\-specific prompts for ALFWorld \(Table[26](https://arxiv.org/html/2605.30712#A6.T26)\) and AppWorld \(Table[27](https://arxiv.org/html/2605.30712#A6.T27)\)\.

AWM\[[48](https://arxiv.org/html/2605.30712#bib.bib93)\]induces a per\-dataset workflow from a pool of up toN=5N=5successful trajectories\. The induction prompt \(Table[28](https://arxiv.org/html/2605.30712#A6.T28)\) asks the LLM to extract repetitive reasoning patterns as step\-by\-step workflows with\{placeholder\}names; the resulting workflow string is stored as a plain key–value entry indexed by dataset name and prepended verbatim at inference time without retrieval\. For ExpSuite\-Agentic, AWM uses environment\-specific induction prompts for ALFWorld \(Table[29](https://arxiv.org/html/2605.30712#A6.T29)\) and AppWorld \(Table[30](https://arxiv.org/html/2605.30712#A6.T30)\)\.

ExpGraphuses a unified graph\-based extractor to construct structured experience graphs from trajectories\. For ExpSuite\-Static, the construction prompt is given in Table[31](https://arxiv.org/html/2605.30712#A6.T31)and Table[32](https://arxiv.org/html/2605.30712#A6.T32)\. For ExpSuite\-Agentic, ALFWorld trajectories use Table[33](https://arxiv.org/html/2605.30712#A6.T33)and Table[34](https://arxiv.org/html/2605.30712#A6.T34); AppWorld trajectories use Table[35](https://arxiv.org/html/2605.30712#A6.T35)\.

### F\.3Answering Prompting

At inference time, each method conditions the base language model on a different form of retrieved or pre\-compiled experience\. We describe the three inference regimes below and summarize the corresponding prompt templates in Tables[36](https://arxiv.org/html/2605.30712#A6.T36)–[51](https://arxiv.org/html/2605.30712#A6.T51)\.

##### No Memory\.

Chain\-of\-thought uses no retrieved experience; the model solves tasks directly with chain\-of\-thought instructions \(Table[36](https://arxiv.org/html/2605.30712#A6.T36)–[40](https://arxiv.org/html/2605.30712#A6.T40)\)\.

##### Retrieval\-centric methods\.

ReasoningBank, ExpeL, LightMem, Mem0, AWM, and IRCoT all share the same base inference prompt \(Table[41](https://arxiv.org/html/2605.30712#A6.T41)–[45](https://arxiv.org/html/2605.30712#A6.T45)\), differing only in the content of\{retrieved\_memories\}:

- •ReasoningBank: top\-KKmemory items retrieved by dense embedding similarity, formatted as\#\# Retrieved Memory: \.\.\.
- •ExpeL: distilled rules \(up to 10\) prepended as a numbered list, plus top\-10 exemplar trajectories retrieved by KNN\.
- •LightMem: LTM items after STM→\\toLTM periodic consolidation, retrieved by dense embedding\.
- •Mem0: top\-10 procedural summaries maintained via ADD/UPDATE/DELETE/NONE operations on prior \(problem, response\) pairs\.
- •AWM: full per\-dataset\-type workflow block prepended without retrieval \(O\(1\) dict lookup\)\.
- •IRCoT: BM25\-retrieved raw rollout pairs formatted asExample:\\nQuestion:\.\.\.\\nSolution:\.\.\., accumulated across up to 3 iterative rounds\.

##### Search\-o1\.

Search\-o1 instructs the model to issue explicit memory search queries via<\|begin\_search\_query\|\>tags\. Retrieved items are first summarized by a separate LLM call \(Table[46](https://arxiv.org/html/2605.30712#A6.T46)\) before being injected as search results\. Inference prompts follow the same structure \(Table[47](https://arxiv.org/html/2605.30712#A6.T47)–[51](https://arxiv.org/html/2605.30712#A6.T51)\)\.

Table 12:Prompt for extracting a memory item from a successful trajectory \(ReasoningBank; also used by LightMem’s MemoryExtractor\) in ExpSuite\-Static\.<\|im\_start\|\>systemYou will be given a user query and the corresponding trajectory that represents how an agent successfully accomplished the task\.GuidelinesYou need to extract and summarize useful insights in the format of memory items based on the agent’s successful trajectory\. The goal of summarized memory items is to be helpful and generalizable for future similar tasks\.Important Notes\- You must first think why the trajectory is successful, and then summarize the insights\.\- You can extract only 1 memory item from the trajectory\.\- You must not repeat similar or overlapping items\.\- Do not mention specific values or string contents, but rather focus on generalizable insights\.Output FormatYour output must strictly follow the Markdown format shown below:\# Memory Item 1\#\# Title:<the title of the memory item\>\#\# Description:<one\-sentence summary of the memory item\>\#\# Content:<1–3 sentences describing the insights learned\><\|im\_end\|\><\|im\_start\|\>userQuery:\{question\}Trajectory:\{reasoning\_trajectory\}<\|im\_end\|\>Table 13:Prompt for extracting a memory item from a failed trajectory \(ReasoningBank; also used by LightMem’s MemoryExtractor\) in ExpSuite\-Static\.<\|im\_start\|\>systemYou will be given a user query and the corresponding trajectory that represents how an agent attempted the task but failed\.GuidelinesYou need to extract and summarize useful insights based on the agent’s failed trajectory\. The goal of summarized memory items is to be helpful and generalizable for future similar tasks\.Important Notes\- You must first reflect on why the trajectory failed, then summarize lessons learned to prevent similar failures\.\- You can extract only 1 memory item from the trajectory\.\- You must not repeat similar or overlapping items\.\- Focus on generalizable insights, not specific values\.Output FormatYour output must strictly follow the Markdown format shown below:\# Memory Item 1\#\# Title:<the title of the memory item\>\#\# Description:<one\-sentence summary of the memory item\>\#\# Content:<1–3 sentences describing the lessons learned\><\|im\_end\|\><\|im\_start\|\>userQuery:\{question\}Expected Answer:\{expected\_answer\}Trajectory:\{reasoning\_trajectory\}<\|im\_end\|\>Table 14:Prompt for ReasoningBank extracting a memory item in ALFWorld\.You will be given a finished household trajectory \(task goal, action–observation log\) from a text\-based ALFWorld environment, plus its outcome \(success or failure\)\. Distill the trajectory into a compact*reasoning rationale*that a future agent can reuse on a structurally similar task\.Output exactly one Markdown block of the form:\#\# Memory ItemTitle:\{short reusable name\}Description:\{1–2 sentence task pattern\}Content:\{step\-by\-step heuristic; cite key receptacles and verbs as schema variables, e\.g\. “go to \{recep\}” / “cool \{X\} with fridge 1”\}Trajectory:\{trajectory\_text\}Outcome:\{success\|\|failure\}Table 15:Prompt for ReasoningBank extracting a memory item in AppWorld\.You are reading a finished AppWorld task trajectory: an instruction, a sequence ofpythoncode calls against the AppWorld API, and a finalcomplete\_task\(answer=…\)call \(or failure\)\. Distill this trajectory into a compact reusable rationale a future agent can retrieve when it sees a similar task\.Output exactly one Markdown block of the form:\#\# Memory ItemTitle:\{short reusable name\}Description:\{which app\(s\); what intent\}Content:\{step\-by\-step plan, naming the key APIs and parameter shapes; substitute concrete IDs/emails with \{placeholder\} variables\}Instruction:\{instruction\}Trajectory:\{code log\}Outcome:\{success\|\|failure\}Table 16:Prompt for ExpeL compare\-critique pass \(success vs\. failure pair\) in ExpSuite\-Static\.The LLM emits ADD/EDIT/REMOVE/AGREE operations over the current rule set\.\{existing\_rules\}is replaced by the numbered rule list;\{task\},\{success\_history\}, and\{fail\_history\}are filled from the rollout pair\.systemYou are an advanced reasoning agent that can add, edit or remove rules from your existing rule set, based on forming new critiques of past task trajectories\. You will be given two previous reasoning task trials: one successful and one unsuccessful\. You need to analyze what made the successful trial succeed and the unsuccessful trial fail\.userHere are the two previous trials to compare and critique:TRIAL TASK:\{task\}SUCCESSFUL TRIAL:\{success\_history\}FAILED TRIAL:\{fail\_history\}Here are the EXISTING RULES:\{existing\_rules\}By examining and contrasting to the successful trial, and the list of existing rules, you can perform the following operations: add, edit, remove, or agree so that the new list of rules is GENERAL and HIGH LEVEL critiques of the failed trial or proposed way of Thought so they can be used to avoid similar failures when encountered with different questions in the future\.The available operations are: AGREE \(if the existing rule is strongly relevant\), REMOVE \(if a rule is contradictory or duplicated\), EDIT \(if a rule can be enhanced\), ADD \(if a new, distinct rule is needed\)\. Format:AGREE <N\>: <rule\>REMOVE <N\>: <rule\>EDIT <N\>: <new rule\>ADD <N\>: <new rule\>Do not mention the trials in the rules — all rules should be GENERALLY APPLICABLE\. Each rule should be concise\. Do at most 4 operations; each existing rule can receive at most 1 operation\.<\|im\_end\|\>Table 17:Prompt for ExpeL all\-success critique pass in ExpSuite\-Static\(fired everyS=8S=8successes per dataset\)\.\{success\_history\}contains the concatenated successful trajectories;\{existing\_rules\}is the current numbered rule list\.systemYou are an advanced reasoning agent that can add, edit or remove rules from your existing rule set, based on forming new critiques of past task trajectories\. You will be given successful reasoning task trials\. Analyze them to extract general reasoning insights\.userHere are the trials:\{success\_history\}Here are the EXISTING RULES:\{existing\_rules\}By examining the successful trials, and the list of existing rules, you can perform the following operations: add, edit, remove, or agree so that the new list of rules are general and high level insights of the successful trials or proposed way of Thought so they can be used as helpful tips to different tasks in the future\.The available operations follow the same format as the compare\-critique pass \(AGREE / REMOVE / EDIT / ADD\)\. Focus on REMOVE rules first, and only ADD a rule if it is very insightful and distinct from existing rules\. Do not mention the trials in the rules\.<\|im\_end\|\>Table 18:Prompt for ExpeL extracting a memory item in ALFWorld\.You are an advanced reasoning agent that improves by reflecting on past ALFWorld trials\. You are given two trials on similar household tasks: one successful and one unsuccessful\. Compare them and decide which existing rules in your rule set should be*ADDED*,*EDITED*, or*REMOVED*, or which can stay \(*AGREE*\)\. Each rule is a one\-sentence heuristic for a structural sub\-problem \(e\.g\. “before placing two of the same item, locate both before relocating”\)\.Output one operation per line, in the form:ADD<rule text\>EDIT<rule\_id\>→<new text\>REMOVE<rule\_id\>AGREE<rule\_id\>Existing rules:\{rules\}Successful trial:\{traj\_succ\}Failed trial:\{traj\_fail\}Table 19:Prompt for ExpeL extracting a memory item in AppWorld\.You are an advanced reasoning agent that improves by reflecting on past AppWorld coding trials\. You are given two finished trials on similar API\-call tasks: one whose finalcomplete\_taskcall passed the test cases and one that did not\. Compare them and emit ADD/EDIT/REMOVE/AGREE operations over the existing rule set\. Each rule is a one\-sentence heuristic about API usage \(e\.g\. “always callloginbefore any per\-app endpoint”\)\.Output one operation per line, same format as the ALFWorld variant \(Table[18](https://arxiv.org/html/2605.30712#A6.T18)\)\.Existing rules:\{rules\}Successful trial:\{code\_log\_succ\}Failed trial:\{code\_log\_fail\}Table 20:Prompt for LightMem offline LTM consolidation in ExpSuite\-Static\.For each target memory with at least one similar candidate \(cosine similarity≥0\.8\\geq 0\.8\), the LLM decides whether to UPDATE, DELETE, or IGNORE the target\.\{target\_memory\}and\{candidate\_memories\}are filled from the LTM\.systemYou are a memory management assistant\. Your task is to decide whether the target memory should be updated, deleted, or ignored based on the candidate source memories\.Decision rules:1\.Update: If the target and candidate memories describe the same fact but are not fully consistent \(candidates provide more detail or refinements\), update the target by integrating the additional information\.2\.Delete: If the target and candidate memories contain a direct conflict, the candidates \(more recent\) take precedence\. Delete the target\.3\.Ignore: If target and candidates are unrelated, take no action\.Additional guidance: Use only the information provided\. Do not invent details\. Your operation always applies to the target memory; do not modify candidate memories\.The output must be a JSON object:\{"action": "update" \| "delete" \| "ignore", "new\_memory": \{\.\.\.\}\}\(new\_memoryrequired only whenaction = "update"\)userTarget memory:\{target\_memory\}Candidate memories:\{candidate\_memories\}Table 21:Prompt for LightMem extracting a memory item in ALFWorld\.You are a Personal Information Extractor\. The input is a buffer of recently observed ALFWorld turns, each tagged with a topic and a timestamp\. Extract*all*salient facts about the agent’s environment and behaviour: object locations, completed sub\-goals, durable preferences, errors\.Format each fact as:FACT:\{one\-sentence statement\}\[\{ts\}\]\(source\_id=\{seq\}\)\[…verbatim few\-shot examples omitted; full prompt atbaselines/lightmem/memory\.py:81…\]Conversation buffer:\{buffer\}Table 22:Prompt for LightMem extracting a memory item in AppWorld\.You are a Personal Information Extractor\. The input is a buffer of recently executed AppWorld turns: code chunks the agent ran, plus the JSON response from eachapis\.\*call\. Extract salient durable facts: app credentials/tokens, IDs of relevant entities \(user, contact, note, playlist, transaction\), API quirks, and successful subroutines\.Format each fact as:FACT:\{one\-sentence statement\}\[\{ts\}\]\(source\_id=\{seq\}\)\[…verbatim few\-shot examples omitted; full prompt atbaselines/lightmem/memory\.py:81…\]Conversation buffer:\{buffer\}Table 23:Prompt for Mem0 Step 1: fact extraction from a \(problem, response\) pair in ExpSuite\-Static\.The LLM returns a JSON object\{"facts": \[\.\.\.\]\}listing salient procedural facts\.systemYou are a Personal Information Organizer, specialized in accurately storing facts, user memories, and preferences\. Your primary role is to extract relevant pieces of information from conversations and organize them into distinct, manageable facts\.Extract facts and preferences and return them in JSON format:\{"facts": \["<fact 1\>", "<fact 2\>", \.\.\.\]\}Guidelines: store personal preferences, important details, plans, activity preferences, professional details, and other relevant information\. Do not extract anything from system messages\. Detect the user’s language and record facts in the same language\. If nothing relevant is found, return an empty list\.userProblem:\{question\}Solution:\{response\}Table 24:Prompt for Mem0 Step 2: operation decision \(ADD / UPDATE / DELETE / NONE\) in ExpSuite\-Static\.\{old\_memories\}contains the top\-KKretrieved existing memory entries as a JSON list;\{new\_facts\}contains the facts extracted in Step 1\.userYou are a smart memory manager which controls the memory of a system\. You can perform four operations: \(1\) add into the memory, \(2\) update the memory, \(3\) delete from the memory, and \(4\) no change\.Compare newly retrieved facts with the existing memory\. For each new fact, decide whether to ADD \(new information not present\), UPDATE \(same topic but different content; keep the more informative version\), DELETE \(direct contradiction; newer facts take precedence\), or NONE \(already present or irrelevant\)\.Current memory:\{old\_memories\}New retrieved facts:\{new\_facts\}Return your response strictly in the following JSON structure:\{"memory": \[\{"id": "<id\>", "text": "<content\>", "event": "ADD\|UPDATE\|DELETE\|NONE","old\_memory": "<old content\>"\}\]\}\(old\_memoryrequired only whenevent = "UPDATE"\)Do not return anything except the JSON\.Table 25:Prompt for Mem0 Step 3: procedural summary generation in ExpSuite\-Static\(applied to successful rollouts only\)\. The LLM produces a structured, step\-by\-step summary of the agent’s execution history\.systemYou are a memory summarization system that records and preserves the complete interaction history between a human and an AI agent\. You are provided with the agent’s execution history\. Your task is to produce a comprehensive summary that contains every detail necessary for the agent to continue the task without ambiguity\.Structure your output as follows:Overview: Task objective and progress status \(completion percentage and milestones\)\.Sequential Agent Actions\(numbered steps\): For each step include \(1\) the precise agent action and all parameters, \(2\) the exact, unaltered action result verbatim, and \(3\) embedded metadata: key findings, navigation history, errors and challenges, and current context\.Guidelines: preserve every output verbatim; use chronological order; include exact URLs, element indexes, error messages, and JSON responses\. Output only the structured summary with no additional commentary\.userProblem:\{question\}Solution:\{response\}Outcome: SUCCESSTable 26:Prompt for Mem0 extracting a memory item in ALFWorld\.You are a Personal Information Organizer specialised in storing facts about an agent inside a text\-based ALFWorld environment\. Read the recent action–observation log and extract all atomic facts of the form “\{subject\} \{predicate\} \{object\}”\. Examples include: object locations \(“soapbottle 1 is in cabinet 4”\), task progress \(“user has cleaned the dishsponge”\), and reusable verbs \(“coolXXrequires opening fridge 1 first”\)\.Today’s date is\{today\}\. Return a JSON object:\{"facts": \["\{fact 1\}", "\{fact 2\}", …\]\}\[…verbatim few\-shot examples omitted …\]Conversation:\{turns\}Table 27:Prompt for Mem0 extracting a memory item in AppWorld\.You are a Personal Information Organizer specialised in storing facts about a user’s apps and AppWorld task history\. Read the recent code\-and\-response log and extract all atomic facts of the form “\{subject\} \{predicate\} \{object\}”\. Examples: tokens \(“user has venmo access\_token”\), entities \(“simple\_note id 3084 is the songs note”\), API patterns \(“apis\.spotify\.update\_playlistacceptstitlekeyword”\)\.Today’s date is\{today\}\. Return a JSON object:\{"facts": \["\{fact 1\}", "\{fact 2\}", …\]\}\[…verbatim few\-shot examples omitted …\]Conversation:\{turns\}Table 28:Prompt for AWM workflow induction in ExpSuite\-Static\.Applied once per dataset to a pool of up toN=5N=5successful trajectories\.\{examples\}is a block ofProblem: \.\.\. / Solution: \.\.\.entries; the induced workflow is stored as a plain string and prepended verbatim at inference time\.userGiven a list of solved reasoning problems from the same category, your task is to extract the common workflows to solve these problems\.Each given problem contains a problem statement and a solution\. You need to find the repetitive reasoning patterns across multiple problems, and extract each as a workflow\.Each workflow should be a commonly\-reused reasoning strategy\. Do not generate similar or overlapping workflows\. Each workflow should have at least two steps\. Represent problem\-specific values with descriptive\{placeholder\}names\.Try to generate as many workflows as needed to cover all the problems in the input list\.\#\# Concrete Examples\{examples\}\#\# Summary WorkflowsTable 29:Prompt for AWM extracting a memory item in ALFWorld\.You are an Agent Workflow Memory inducer\. You are given up toN=5N\{=\}5successful ALFWorld trajectories that share the same task type \(e\.g\.pick\_clean\_then\_place\_in\_recep\)\. Extract the repetitive reasoning pattern as a step\-by\-step*workflow*with concrete\-value placeholders\.Output format:Workflow for \{task\_type\}:Step 1: \{instruction with \{object\} and \{recep\} placeholders\}Step 2: …Step k: \{final placement step\}Each workflow should have at least two steps\. Do not generate similar or overlapping workflows; represent task\-specific values with descriptive\{placeholder\}names\.Successful trajectories:\{trajectories\}Table 30:Prompt for AWM extracting a memory item in AppWorld\.You are an Agent Workflow Memory inducer\. You are given up toN=5N\{=\}5successful AppWorld trajectories that share the same task type\. Extract the repetitive reasoning pattern as a step\-by\-step*workflow*with concrete\-value placeholders\.Output format:Workflow for \{task\_type\}:Step 1: \{API call template with \{access\_token\}, \{user\_email\}, \{note\_id\} placeholders\}Step 2: …Step k: apis\.supervisor\.complete\_task\(answer=\{result\}\)Each workflow should have at least two steps\. Substitute concrete IDs / emails / strings with\{placeholder\}names\.Successful trajectories:\{trajectories\}Table 31:Prompt for ExpGraph extracting a memory item from a successful trajectory in ExpSuite\-Static\.<\|im\_start\|\>systemYou are an expert at distilling problem\-solving experiences into concise, reusable lessons\. Be brief and generalizable\.<\|im\_end\|\><\|im\_start\|\>userA search/QA question was answered correctly\.Question:\{question\}Thought:\{thought\}Answer:\{answer\}Extract a reusable search skill \(2–3 sentences\):\-planning\_pattern: Abstract the logic using generic terms\[Entity\],\[Attribute\],\[Time\_Period\]\- What key strategy led to success?Do NOT include specific names, numbers, or answers\. Focus on the transferable strategy\.Output format:SKILL: \[your skill text\]<\|im\_end\|\>Table 32:Prompt for ExpGraph extracting a memory item from a failed trajectory in ExpSuite\-Static\.<\|im\_start\|\>systemYou are an expert at distilling problem\-solving experiences into concise, reusable lessons\. Be brief and generalizable\.<\|im\_end\|\><\|im\_start\|\>userA search/QA question was answered incorrectly\.Question:\{question\}Thought:\{thought\}Incorrect answer:\{answer\}Extract a reusable lesson \(2–3 sentences\):\-trigger\_condition: What kind of question caused the error?\-bad\_action: What went wrong?Do NOT include specific names\. Focus on the transferable lesson\.Output format:SKILL: \[your lesson text\]<\|im\_end\|\>Table 33:Prompt for ExpGraph extracting a memory item from a successful ALFWorld trajectory\.<\|im\_start\|\>systemYou are an expert at analyzing household robot trajectories\. Extract specific, actionable lessons from the provided trajectory\.userAn ALFWorld household task was completed successfully\.Task:\{task\}Full trajectory \(action→\\toobservation\):\{trajectory\}Extract a reusable skill \(3–5 sentences\)\. Include:1\. The general task category \(pick\_and\_place,heat\_then\_place,clean\_then\_place,cool\_then\_place,examine\_in\_light,pick\_two\)2\. The concrete step\-by\-step strategy that worked3\. Common locations where target objects are found \(e\.g\. “soapbar is usually on countertop, bathtubbasin, or shelf”\)Be specific\. Use actual object/location types \(countertop,sinkbasin,microwave\)\.Output format:SKILL: \[your skill text\]<\|im\_end\|\>Table 34:Prompt for ExpGraph extracting a memory item from a failed ALFWorld trajectory\.<\|im\_start\|\>systemYou are an expert at analyzing household robot trajectories\. Extract specific, actionable lessons from the provided trajectory\.userAn ALFWorld household task failed after all steps\.Task:\{task\}Trajectory:\{trajectory\}Extract a reusable lesson \(3–5 sentences\)\. Include:1\. The general task category2\. What specific mistake was made3\. What the agent should have done differentlyOutput format:SKILL: \[your lesson text\]<\|im\_end\|\>Table 35:Prompt for ExpGraph extracting a memory item from an AppWorld episode\.\{STATUS\}is replaced bySUCCESSFUL,FAILED, orPARTIAL\.<\|im\_start\|\>userAnalyze this\{STATUS\}AppWorld code generation episode\.Task:\{task\}Trajectory:\{trajectory\}Extract ONE concise, actionable skill\. Focus on SPECIFIC API details, not generic advice\.Format:\[Short Title\]Specific principle\. When: trigger condition\.Good examples:\-\[Spotify Login Fields\]apis\.spotify\.login\(\)requiresemailandpasswordas keyword arguments…\-\[Venmo Transactions Return Dict\]apis\.venmo\.list\_transactions\(\)returns a dict with key"transactions"…\-\[File System Read Path\]Useapis\.file\_system\.read\_file\(file\_path=\.\.\.\)with absolute paths…BAD examples \(do NOT generate\):\- “Always inspect API docs first” — too generic\- “Verify API method names” — too vagueRules: Include SPECIFIC app names, API method names, parameter names, return value structures\.Output ONLY the skill line\.<\|im\_end\|\>Table 36:Prompt for No Memory in Question Answering\.<\|im\_start\|\>systemYou are a helpful assistant skilled in mathematics, logical reasoning, and programming\. Solve problems step by step, showing your work clearly\.CRITICAL FORMAT REQUIREMENT:\- You MUST end your response with your final answer wrapped in<answer\> </answer\>tags\.\- ALWAYS use<answer\> </answer\>tags\.\- Example:<answer\> A </answer\><\|im\_end\|\><\|im\_start\|\>userAnswer this question:Question:\{question\}\{choices\}Analyze each option and explain your reasoning\.Write your final answer as:<answer\> \[letter\] </answer\><\|im\_end\|\>Table 37:Prompt for No Memory in Math Reasoning\.<\|im\_start\|\>systemYou are a helpful assistant skilled in mathematics, logical reasoning, and programming\. Solve problems step by step, showing your work clearly\.CRITICAL FORMAT REQUIREMENT:\- You MUST end your response with\#\#\#\#followed by your final answer\.\- NEVER use \\boxed\{\} or anyLaTeXboxing format\.\- ALWAYS use\#\#\#\#format, even for mathematical expressions\.\- Example:\#\#\#\# 42or\#\#\#\# xˆ2 \+ 1<\|im\_end\|\><\|im\_start\|\>userSolve this math word problem:\{question\}Think step by step, showing your calculations\.Write your final answer as:\#\#\#\# \[number\]<\|im\_end\|\>Table 38:Prompt for No Memory in Code Generation\.<\|im\_start\|\>systemYou are an expert Python programmer\. Solve coding problems step by step\.CRITICAL FORMAT REQUIREMENT:\- Provide your final solution in a‘‘‘pythoncode block\.\- Example format:‘‘‘python\\ndef solution\(\): pass\\n‘‘‘<\|im\_end\|\><\|im\_start\|\>userTask:\{task\}Your code should pass these tests:\{test\_cases\}Think through your approach, then write the Python function\.REMEMBER: Provide your code in a‘‘‘pythoncode block\.<\|im\_end\|\>Table 39:Prompt for No Memory in ALFWorld\.\[SYSTEM\] You are controlling a text\-based ALFWorld environment\. Choose the NEXT action as ONE admissible command string\. Output only the command, copied verbatim from the admissible list\.\[USER\] Task:\{objective\}Interaction history so far:\{history\}Current observation:\{current\_obs\}Admissible actions:\{admissible\}Action:Table 40:Prompt for No Memory in AppWorld\.\[SYSTEM\] You are an autonomous AppWorld agent\. You complete each task by writing Python code that callsapis\.<app\>\.<endpoint\>\(…\)and finallyapis\.supervisor\.complete\_task\(answer=…\)\. Wrap each turn’s code in triple\-backtick fences\.\[USER\] Instruction:\{instruction\}Available apps and their docs:\{api\_docs\_summary\}Recent execution history:\{history\}Reason step by step, then write the next code block\.Table 41:Prompt for retrieval\-centric baselines \(ReasoningBank, ExpeL, LightMem, Mem0, AWM, IRCoT\) in Question Answering\.\{retrieved\_memories\}is replaced by the method\-specific memory context\.<\|im\_start\|\>systemYou are a helpful assistant skilled in mathematics, logical reasoning, and programming\. Solve problems step by step, showing your work clearly\.CRITICAL FORMAT REQUIREMENT:\- You MUST end your response with your final answer wrapped in<answer\> </answer\>tags\.\- ALWAYS use<answer\> </answer\>tags\.\- Example:<answer\> A </answer\><\|im\_end\|\><\|im\_start\|\>user\{retrieved\_memories\}Answer this question:Question:\{question\}\{choices\}Analyze each option and explain your reasoning\.Write your final answer as:<answer\> \[letter\] </answer\><\|im\_end\|\>Table 42:Prompt for retrieval\-centric baselines in Math Reasoning\.<\|im\_start\|\>systemYou are a helpful assistant skilled in mathematics, logical reasoning, and programming\. Solve problems step by step, showing your work clearly\.CRITICAL FORMAT REQUIREMENT:\- You MUST end your response with\#\#\#\#followed by your final answer\.\- NEVER use \\boxed\{\} or anyLaTeXboxing format\.\- ALWAYS use\#\#\#\#format, even for mathematical expressions\.\- Example:\#\#\#\# 42or\#\#\#\# xˆ2 \+ 1<\|im\_end\|\><\|im\_start\|\>user\{retrieved\_memories\}Solve this math word problem:\{question\}Think step by step, showing your calculations\.Write your final answer as:\#\#\#\# \[number\]<\|im\_end\|\>Table 43:Prompt for retrieval\-centric baselines in Code Generation\.<\|im\_start\|\>systemYou are an expert Python programmer\. Solve coding problems step by step\.CRITICAL FORMAT REQUIREMENT:\- Provide your final solution in a‘‘‘pythoncode block\.\- Example format:‘‘‘python\\ndef solution\(\): pass\\n‘‘‘<\|im\_end\|\><\|im\_start\|\>user\{retrieved\_memories\}Task:\{task\}Your code should pass these tests:\{test\_cases\}Think through your approach, then write the Python function\.REMEMBER: Provide your code in a‘‘‘pythoncode block\.<\|im\_end\|\>Table 44:Prompt for retrieval\-centric baselines in ALFWorld\.\[SYSTEM\] You are controlling a text\-based ALFWorld environment\. Choose the NEXT action as ONE admissible command string\. Output only the command, copied verbatim from the admissible list\.\[USER\] Task:\{objective\}\{retrieved\_memories\}Interaction history so far:\{history\}Current observation:\{current\_obs\}Admissible actions:\{admissible\}Action:Table 45:Prompt for retrieval\-centric baselines in AppWorld\.\[SYSTEM\] You are an autonomous AppWorld agent\. You complete each task by writing Python code that callsapis\.<app\>\.<endpoint\>\(…\)and finallyapis\.supervisor\.complete\_task\(answer=…\)\. Wrap each turn’s code in triple\-backtick fences\.\[USER\] Instruction:\{instruction\}\{retrieved\_memories\}Available apps and their docs:\{api\_docs\_summary\}Recent execution history:\{history\}Reason step by step, then write the next code block\.Table 46:Prompt for summarizing retrieved memories in Search\-o1\.<\|im\_start\|\>userTask Instruction:You are tasked with reading and analyzing memories based on the following inputs:Previous Reasoning Steps,Current Search Query, andSearched Memories\. Your objective is to extract relevant and helpful information for theCurrent Search Queryfrom theSearched Memoriesand seamlessly integrate this information into thePrevious Reasoning Stepsto continue reasoning for the original question\.Guidelines:1\. Carefully review the content of each searched memory\. Identify factual information that is relevant to the Current Search Query\.2\. Select the information that directly contributes to advancing the Previous Reasoning Steps\. Ensure that the extracted information is accurate and relevant\.Output Format:\- If the memories provide helpful information: present the information beginning with\*\*Final Information\*\*\*\*Final Information\*\*\[Helpful information\]\- If the memories do not provide helpful information:\*\*Final Information\*\*No helpful information found\.Inputs:\- Previous Reasoning Steps:\{prev\_reasoning\}\- Current Search Query:\{search\_query\}\- Searched Memories:\{retrieved\_memories\}<\|im\_end\|\>Table 47:Prompt for Search\-o1 in Question Answering\.<\|im\_start\|\>systemYou are a helpful assistant skilled in mathematics, logical reasoning, and programming\. You have the ability to perform memory searches to help you answer the user’s question accurately\.To perform a search: write<\|begin\_search\_query\|\>your query here<\|end\_search\_query\|\>\.The system will then search and analyze relevant memories, returning helpful information in the format<\|begin\_search\_result\|\>…results…<\|end\_search\_result\|\>\.Once you have all the information you need, continue your reasoning\.CRITICAL FORMAT REQUIREMENT:\- You MUST end your response with your final answer wrapped in<answer\> </answer\>tags\.\- ALWAYS use<answer\> </answer\>tags\.\- Example:<answer\> A </answer\><\|im\_end\|\><\|im\_start\|\>userAnswer this question:Question:\{question\}\{choices\}Analyze each option and explain your reasoning\.Write your final answer as:<answer\> \[letter\] </answer\><\|im\_end\|\>Table 48:Prompt for Search\-o1 in Math Reasoning\.<\|im\_start\|\>systemYou are a helpful assistant skilled in mathematics, logical reasoning, and programming\. You have the ability to perform memory searches to help you answer the user’s question accurately\.To perform a search: write<\|begin\_search\_query\|\>your query here<\|end\_search\_query\|\>\.The system will return helpful information in the format<\|begin\_search\_result\|\>…results…<\|end\_search\_result\|\>\.CRITICAL FORMAT REQUIREMENT:\- You MUST end your response with\#\#\#\#followed by your final answer\.\- NEVER use \\boxed\{\} or anyLaTeXboxing format\.\- Example:\#\#\#\# 42<\|im\_end\|\><\|im\_start\|\>userSolve this math word problem:\{question\}Think step by step, showing your calculations\.Write your final answer as:\#\#\#\# \[number\]<\|im\_end\|\>Table 49:Prompt for Search\-o1 in Code Generation\.<\|im\_start\|\>systemYou are an expert Python programmer\. You have the ability to perform memory searches to help you answer the user’s question accurately\.To perform a search: write<\|begin\_search\_query\|\>your query here<\|end\_search\_query\|\>\.The system will return helpful information in the format<\|begin\_search\_result\|\>…results…<\|end\_search\_result\|\>\.CRITICAL FORMAT REQUIREMENT:\- Provide your final solution in a‘‘‘pythoncode block\.<\|im\_end\|\><\|im\_start\|\>userTask:\{task\}Your code should pass these tests:\{test\_cases\}Think through your approach, then write the Python function\.REMEMBER: Provide your code in a‘‘‘pythoncode block\.<\|im\_end\|\>Table 50:Prompt for Search\-o1 in ALFWorld\.\[SYSTEM\] You are controlling a text\-based ALFWorld environment\. You may interleave reasoning with explicit memory queries\. To query the experience corpus, emit:⟨\\langle\|begin\_search\_query\|⟩\\rangle\{your sub\-question\}⟨\\langle\|end\_search\_query\|⟩\\rangle\. The system will return summarised search results inside⟨\\langle\|begin\_search\_result\|⟩\\rangle…⟨\\langle\|end\_search\_result\|⟩\\rangle\. After at mostS=3S\{=\}3rounds, output the next admissible action\.\[USER\] Task:\{objective\}Interaction history:\{history\}Current observation:\{current\_obs\}Admissible actions:\{admissible\}Reason and \(optionally\) search; then output Action:⟨\\langlecmd⟩\\rangle\.Table 51:Prompt for Search\-o1 in AppWorld\.\[SYSTEM\] You are an autonomous AppWorld agent\. You may interleave reasoning with explicit memory queries\. To query the experience corpus, emit:⟨\\langle\|begin\_search\_query\|⟩\\rangle\{your sub\-question\}⟨\\langle\|end\_search\_query\|⟩\\rangle\. The system will return summarised search results inside⟨\\langle\|begin\_search\_result\|⟩\\rangle…⟨\\langle\|end\_search\_result\|⟩\\rangle\. After at mostS=3S\{=\}3rounds, output the next code block\.\[USER\] Instruction:\{instruction\}Available apps and docs:\{api\_docs\_summary\}Recent execution history:\{history\}Reason and \(optionally\) search; then write the next code block in triple\-backtick fences\.

## Appendix GExperimental Result Analysis

### G\.1ExpGraphOutperforms General Prompt\-based Baselines and Experience Learning Methods

We evaluateExpGraphon ExpSuite, covering both single\-turn static tasks and multi\-step agentic environments\. Results are reported in Table[2](https://arxiv.org/html/2605.30712#S4.T2)and Table[3](https://arxiv.org/html/2605.30712#S4.T3)\. We have the following observations\.

ExpGraphAchieves the Best Overall Performance Across Static and Agentic Settings\.ExpGraphachieves the best average performance across all evaluated settings and executor models\. On ExpSuite\-Static,ExpGraphimproves the average score over the strongest baseline by 12\.2% with the small executor and 4\.7% with the large executor\. These gains are consistent across question answering, mathematical reasoning, and code generation tasks, rather than being driven by a single benchmark\. On ExpSuite\-Agentic,ExpGraphfurther improves the weighted average score over the strongest baseline by 21\.4% with the small executor and 12\.7% with the large executor\. Meanwhile, compared with the most efficient non\-ExpGraphbaseline,ExpGraphreduces average steps by 12\.7% and 21\.6%, respectively\. These results show thatExpGraphimproves both final task performance and multi\-step decision efficiency\.

Experience Reuse Is Especially Beneficial for Weaker Executors and Agentic Tasks\. The relative gains reveal two important trends\. First,ExpGraphbrings larger improvements to smaller executors\. On ExpSuite\-Static, the relative gain is 12\.2% for the small executor, compared with 4\.7% for the large executor; on ExpSuite\-Agentic, the gain is 21\.4% for the small executor, compared with 12\.7% for the large executor\. This suggests that external experience reuse is especially useful when the executor has weaker built\-in reasoning or planning ability, because retrieved experiences provide task\-solving guidance without modifying the executor\. Second, the gains are larger in agentic environments than in static tasks\. For the small executor, the relative improvement increases from 12\.2% on static tasks to 21\.4% on agentic tasks; for the large executor, it increases from 4\.7% to 12\.7%\. This indicates that experience reuse becomes more valuable as tasks require longer\-horizon decision\-making, where past successes and failures can guide actions, avoid repeated mistakes, and reduce unnecessary exploration\. Together, these results support the central motivation ofExpGraph: graph\-structured experience retrieval is most beneficial when the executor must solve complex tasks under limited internal adaptation, especially in interactive settings where prior trajectories directly improve both success and step efficiency\.

### G\.2ExpGraphExhibits Superior Generalization Capabilities Across Different LLM Executors

We evaluate executor transfer under three settings\.Small\-to\-large transferuses the small executor as the source and the large executor as the target, testing whether experience components learned with cheaper models can benefit stronger frozen LLMs\.Large\-to\-small transferuses the large executor as the source and the small executor as the target, testing whether high\-quality experiences learned from stronger models can still be exploited by weaker models\.Non\-reasoning\-to\-reasoning transferuses non\-reasoning or weak\-reasoning executors, Gemini\-3\.1\-Flash\-Lite and Qwen\-32B, as the source, and reasoning\-capable executors, Claude\-Sonnet\-4 and DeepSeek\-R1\-Distill\-Llama\-8B, as the target\. For each setting, we compare Graph\-only Transfer, Copilot\-only Transfer, Graph\+Copilot Transfer, and target\-specificExpGraph\.

ExpGraphEnables Small\-to\-Large Transfer with Minimal Performance Loss\. Experience components learned from smaller executors transfer effectively to larger executors\. As shown in Figure[2](https://arxiv.org/html/2605.30712#S4.F2)\(a\), Graph\+Copilot Transfer performs closest to target\-specificExpGraphacross all datasets\. Graph\-only Transfer reuses task\-level skills and failure lessons, while Copilot\-only Transfer preserves part of the learned retrieval behavior\. Transferring both components together is consistently more effective, suggesting that strong transfer requires preserving both the experience graph and the policy that decides how to use it\. This result shows thatExpGraphcan train experience components with cheaper small executors and reuse them with larger frozen LLMs without target\-side retraining\.

ExpGraphShows Large\-to\-Small Transfer Is Harder but Still Effective\. Transferring from stronger executors to weaker executors is more challenging, but still provides clear benefits\. As shown in Figure[2](https://arxiv.org/html/2605.30712#S4.F2)\(b\), all transfer variants perform below target\-specificExpGraph, and the overall radar area is smaller than in the small\-to\-large setting\. This is expected because experiences from stronger executors may contain more complex reasoning patterns, longer action plans, or denser constraints that weaker executors cannot fully exploit\. Even so, Graph\+Copilot Transfer remains the strongest zero\-shot variant across most domains, indicating that weaker executors can still benefit from high\-quality transferred experience when both the graph structure and retrieval policy are preserved\. The comparison with Figure[2](https://arxiv.org/html/2605.30712#S4.F2)\(a\) further shows that upward transfer is easier than downward transfer\.

ExpGraphTransfers Strongly from Non\-Reasoning to Reasoning Executors\. Experience components learned from non\-reasoning executors generalize well to reasoning\-capable executors\. As shown in Figure[2](https://arxiv.org/html/2605.30712#S4.F2)\(c\), Graph\+Copilot Transfer approaches target\-specificExpGraphand shows particularly strong performance on ALFWorld and AppWorld\. This suggests that reasoning\-capable executors can better interpret and adapt transferred experiences for multi\-step decision\-making\. The experience graph provides reusable procedural knowledge, including successful plans and failure lessons, while the reasoning executor integrates these experiences under new states and constraints\. These results indicate that the transferred components are not merely executor\-specific artifacts, but encode reusable experience utility across different reasoning capabilities\.

### G\.3Ablation Studies ValidateExpGraph’s Key Components

To understand the contribution of each component inExpGraph, we conduct ablation studies across five evaluation domains: QA, Reasoning, Coding, ALFWorld, and AppWorld\. These variants examine the effects of similarity\-aware experience management, graph\-structured experience organization, graph diffusion, and utility\-aware ranking\. Results are reported in Figure[3](https://arxiv.org/html/2605.30712#S4.F3)\.

- •w/o Similarity Filtering: Removes similarity\-based filtering during experience insertion, allowing redundant experiences to be repeatedly retained\.
- •Flat Experience: Replaces the experience graph with a flat experience pool and retrieves top\-KKexperiences purely by semantic similarity\.
- •w/o Graph Diffusion: Keeps the graph structure but disables graph expansion, retrieving only from the initial semantic seed nodes\.
- •w/o Utility Ranking: Ranks candidate experiences only by semantic similarity, without using historical utility statistics\.

As shown in Figure[3](https://arxiv.org/html/2605.30712#S4.F3), removing any component consistently degrades performance, confirming thatExpGraphbenefits from the joint design of experience management, graph structure, graph diffusion, and utility feedback\. Among all variants,Flat Experienceshows the largest drop across both small and large executors, indicating that treating experiences as isolated entries is insufficient for effective reuse\. This demonstrates the value of organizing experiences as a graph, which connects related skills, failure lessons, and transferable strategies beyond nearest\-neighbor retrieval\. Thew/o Graph Diffusionvariant also performs worse, especially on ALFWorld and AppWorld, suggesting that useful experiences are often structurally related rather than directly retrieved as semantic seeds\. RemovingUtility Rankingfurther hurts performance, showing that semantic relevance alone cannot reliably identify experiences that improve downstream execution\. Finally,w/o Similarity Filteringcauses a smaller but consistent decline, indicating that redundancy control helps maintain a compact and high\-quality experience graph\. Overall, these results validate all four design choices inExpGraph\.

## Appendix HCase Studies

This appendix presents representative successful examples for each baseline across the three ExpSuite\-Static task categories: Question Answering \(QA\), Math Reasoning, and Code Generation\. For each method, we report the input question, the ground truth answer, the retrieved memory context \(if applicable\), and the model\-generated response \(abbreviated where necessary\)\.

No Memory case studies are in Table[52](https://arxiv.org/html/2605.30712#A8.T52)–[56](https://arxiv.org/html/2605.30712#A8.T56), ReasoningBank in Table[57](https://arxiv.org/html/2605.30712#A8.T57)–[60](https://arxiv.org/html/2605.30712#A8.T60), ExpeL in Table[61](https://arxiv.org/html/2605.30712#A8.T61)–[64](https://arxiv.org/html/2605.30712#A8.T64), LightMem in Table[65](https://arxiv.org/html/2605.30712#A8.T65)–[69](https://arxiv.org/html/2605.30712#A8.T69), Mem0 in Table[70](https://arxiv.org/html/2605.30712#A8.T70)–[74](https://arxiv.org/html/2605.30712#A8.T74), AWM in Table[75](https://arxiv.org/html/2605.30712#A8.T75)–[79](https://arxiv.org/html/2605.30712#A8.T79), IRCoT in Table[80](https://arxiv.org/html/2605.30712#A8.T80)–[84](https://arxiv.org/html/2605.30712#A8.T84), Search\-o1 in Table[85](https://arxiv.org/html/2605.30712#A8.T85)–[89](https://arxiv.org/html/2605.30712#A8.T89), MeMRL in Table[90](https://arxiv.org/html/2605.30712#A8.T90)–[94](https://arxiv.org/html/2605.30712#A8.T94), S3 in Table[95](https://arxiv.org/html/2605.30712#A8.T95)–[99](https://arxiv.org/html/2605.30712#A8.T99), and ExpGraph in Table[100](https://arxiv.org/html/2605.30712#A8.T100)–[104](https://arxiv.org/html/2605.30712#A8.T104)\.

Table 52:No Memory case study in QA\.Question: A bowling ball with a mass of 8\.0 kg rolls down a bowling lane at 2\.0 m/s\. What is the momentum of the bowling ball?A\. 4\.0 kg×\\timesm/s B\. 6\.0 kg×\\timesm/s C\. 10\.0 kg×\\timesm/s D\. 16\.0 kg×\\timesm/sGround Truth: DUsing the formulap=mvp=mv:p=8\.0kg×2\.0m/s=16\.0kg⋅m/sp=8\.0\\,\\text\{kg\}\\times 2\.0\\,\\text\{m/s\}=16\.0\\,\\text\{kg\}\\cdot\\text\{m/s\}\.A \(4\.0\) and B \(6\.0\) and C \(10\.0\) are all incorrect; D matches the calculated value\.<answer\> D </answer\>Table 53:No Memory case study in Math\.Question: Trevor is a comic book illustrator\. In the past three months, he has drawn 220 pages\. The comic releases a new issue once a month\. The first and second months’ issues were the same number of pages, but the third month’s issue was four pages longer than either of the others\. How many pages was the first issue?Ground Truth: 72Let the first and second issue each havexxpages; the third hasx\+4x\+4pages\.x\+x\+\(x\+4\)=220⇒3x\+4=220⇒3x=216⇒x=72x\+x\+\(x\+4\)=220\\;\\Rightarrow\\;3x\+4=220\\;\\Rightarrow\\;3x=216\\;\\Rightarrow\\;x=72\.\#\#\#\# 72Table 54:No Memory case study in Code Generation\.Question: Return the largest prime factor ofn\. Assumen\>1n\>1and is not a prime\.Tests:assert largest\_prime\_factor\(15\) == 5,assert largest\_prime\_factor\(27\) == 3Ground Truth: Trial\-division algorithm \(divide by smallest prime factors iteratively\)\.Dividenby the smallest prime \(2\) while divisible, then try odd divisors up ton\\sqrt\{n\}; the remaining value is the largest prime factor\.\#\#\#\#‘‘‘pythondef largest\_prime\_factor\(n: int\) \-\> int:i = 2while i \* i <= n:if n % i: i \+= 1else: n //= ireturn n‘‘‘Table 55:No Memory case study in ALFWorld\.Task: put some remotecontrol on armchair\.Outcome: Score = 1\.0, Steps = 4Retrieved Memories:\(none — No\-Memory baseline; the agent acts from prompt \+ admissible actions only\.\)Action Trace:go to coffeetable 1 \(found remotecontrol 1, 2\)→\\totake remotecontrol 1→\\togo to armchair 1→\\tomove remotecontrol 1 to armchair 1✓\\checkmarkTable 56:No Memory case study in AppWorld\.Task: How much money have I sent to others on Venmo this month so far?Outcome: Score = 0\.50, Steps = 21Retrieved Memories:\(none — No\-Memory baseline\.\)Final Action:apis\.supervisor\.complete\_task\(answer="433\.0"\)Table 57:ReasoningBank case study in QA\.Question: A bowling ball with a mass of 8\.0 kg rolls down a bowling lane at 2\.0 m/s\. What is the momentum of the bowling ball?A\. 4\.0 kg×\\timesm/s B\. 6\.0 kg×\\timesm/s C\. 10\.0 kg×\\timesm/s D\. 16\.0 kg×\\timesm/sGround Truth: DRetrieved Memory 1 \(✓ Success Strategy\):Calculating Momentum— When calculating momentum, ensure to multiply the given mass and velocity accurately\. This example highlights the importance of correctly applying the formula and performing the multiplication to avoid common mistakes\.Retrieved Memory 2 \(✗ Lesson from Failure\):Impulse and Momentum in Collisions— In a collision, the total change in momentum of the system is zero, but the lighter object experiences a greater change in momentum due to its lower mass\.Usingp=mvp=mv:p=8\.0×2\.0=16\.0kg⋅m/sp=8\.0\\times 2\.0=16\.0\\,\\text\{kg\}\\cdot\\text\{m/s\}\. Options A, B, C are all incorrect\. D matches the calculated value\.<answer\> D </answer\>Table 58:ReasoningBank case study in Math\.Question: Trevor is a comic book illustrator\. In the past three months, he has drawn 220 pages\. The comic releases one issue per month\. The first and second months’ issues were equal; the third was four pages longer\. How many pages was the first issue?Ground Truth: 72Retrieved Memory 1 \(✗ Lesson from Failure\):Misinterpretation of the Problem— The mistake was in not accounting for all components described in the problem\. Carefully read each condition before setting up equations\.Retrieved Memory 2 \(✗ Lesson from Failure\):Misinterpretation of the Problem— Always verify that all constraints are captured in the equation before solving\.Letxx= pages in issue 1 and 2; issue 3 =x\+4x\+4\.x\+x\+\(x\+4\)=220⇒3x=216⇒x=72x\+x\+\(x\+4\)=220\\;\\Rightarrow\\;3x=216\\;\\Rightarrow\\;x=72\.\#\#\#\# 72Table 59:ReasoningBank case study in Code Generation\.Question: Return the largest prime factor ofn\. Assumen\>1n\>1and is not a prime\.Tests:assert largest\_prime\_factor\(15\) == 5,assert largest\_prime\_factor\(27\) == 3Ground Truth: Trial\-division algorithm\.Retrieved Memory 1 \(✓ Success Strategy\):Finding the Largest Perfect Square Factor— Decompose the number into its prime factors; select the highest even power for each prime\.Retrieved Memory 2 \(✗ Lesson from Failure\):Efficient Prime Factorization— Always use the square root to limit the search space for prime factors, and handle the remaining factor after the loop\.\#\#\#\#‘‘‘pythondef largest\_prime\_factor\(n: int\) \-\> int:while n % 2 == 0:max\_prime = 2; n //= 2for i in range\(3, int\(n\*\*0\.5\)\+1, 2\):while n % i == 0:max\_prime = i; n //= iif n \> 2: max\_prime = nreturn max\_prime‘‘‘Table 60:ReasoningBank case study in AppWorld\.Task: Give me a random funny quote from my SimpleNote note about it\. Just the quote, nothing else\.Outcome: Score = 0\.67, Steps = 13Retrieved Memories:\#\# Retrieved memory items \(top\-K cosine over reasoning rationales\):\- When asked for a quote from a SimpleNote, list and search notes for keyword “quote” or “funny”, then read the matching note’s content\.\- Useapis\.simple\_note\.list\_notes\(access\_token=\.\.\., title\_query="quote"\)to enumerate candidates before reading\.\- For “random” selections, pick uniformly from the matched lines using Pythonrandom\.choice\.\[… additional rationale items omitted …\]Final Action:apis\.supervisor\.complete\_task\(answer="\- I’m on a whiskey and chocolate diet\. I’ve lost three days, but I don’t care\."\)Table 61:ExpeL case study in QA\.Question: A bowling ball with a mass of 8\.0 kg rolls down a bowling lane at 2\.0 m/s\. What is the momentum of the bowling ball?A\. 4\.0 kg×\\timesm/s B\. 6\.0 kg×\\timesm/s C\. 10\.0 kg×\\timesm/s D\. 16\.0 kg×\\timesm/sGround Truth: DRetrieved Memory 1 \(✓ Success Strategy\):Calculating Momentum— When calculating momentum, ensure to multiply the given mass and velocity accurately\.Retrieved Memory 2 \(✓ Success Strategy\):Impulse\-Momentum Theorem for Impact Problems— Calculate the change in momentum by determining initial and final velocities\. Use the impulse\-momentum theorem to find the average force\.p=mv=8\.0×2\.0=16\.0kg⋅m/sp=mv=8\.0\\times 2\.0=16\.0\\,\\text\{kg\}\\cdot\\text\{m/s\}\.A \(4\.04\.0\), B \(6\.06\.0\), C \(10\.010\.0\) are all incorrect; D \(16\.016\.0\) matches\.\#\#\#\# DTable 62:ExpeL case study in Math\.Question: Trevor is a comic book illustrator \(220 pages total, 3 issues, first two equal, third is \+4 pages\)\. How many pages was the first issue?Ground Truth: 72Retrieved Memory 1 \(✓ Success Strategy\):Solving Equations with Word Problems— Identify variables and relationships, set up the equation, and solve it methodically\.Retrieved Memory 2 \(✓ Success Strategy\):Solving Word Problems with Equations— Express the unknown in terms of a variable, form an equation from the given information, simplify and solve\.Letxx= pages in issues 1 and 2; issue 3 =x\+4x\+4\.3x\+4=220⇒x=723x\+4=220\\;\\Rightarrow\\;x=72\.\#\#\#\# 72Table 63:ExpeL case study in Code Generation\.Question: Compute thenn\-th element of the Fib4 sequence:fib4\(n\) = fib4\(n\-1\) \+ fib4\(n\-2\) \+ fib4\(n\-3\) \+ fib4\(n\-4\), withfib4\(0\)=0, fib4\(1\)=0, fib4\(2\)=2, fib4\(3\)=0\. No recursion\.Ground Truth: Iterative window of size 4\.Retrieved Memory 1 \(✓ Success Strategy\):Identifying Periodic Patterns in Recursive Sequences— When dealing with recursive sequences, compute initial values and look for repeating patterns that can be exploited for efficiency\.Retrieved Memory 2 \(✓ Success Strategy\):Telescoping Series in Fibonacci Sums— For sum\-based recurrences, check if closed\-form or iterative window approaches avoid redundant computation\.‘‘‘pythondef fib4\(n: int\) \-\> int:if n in \(0, 1, 3\): return 0if n == 2: return 2w = \[0, 0, 2, 0\]for i in range\(4, n \+ 1\):w = \[w\[1\], w\[2\], w\[3\], sum\(w\)\]return w\[3\]‘‘‘Table 64:ExpeL case study in ALFWorld\.Task: put two spraybottle in garbagecan\.Outcome: Score = 1\.0, Steps = 8Retrieved Memories:\#\# ExpeL exemplars \+ induced rules \(for similar pick\-two tasks\):\- rule: For “put two X in Y” tasks, locate*both*X instances before placing — do not relocate the first to Y until the second is found, to avoid backtracking\.\- exemplar: Successful trial T2019… forput\-two\-spraybottle\-in\-garbagecan\-3: agent visited sidetable 1 first \(found two spraybottles together\), then garbagecan 1, completing in 7 steps\.\[… additional exemplars and rules omitted …\]Action Trace:go to sidetable 1 \(found spraybottle 1, spraybottle 2\)→\\totake spraybottle 1→\\togo to garbagecan 1→\\tomove spraybottle 1 to garbagecan 1→\\togo to sidetable 1→\\totake spraybottle 2→\\togo to garbagecan 1→\\tomove spraybottle 2 to garbagecan 1✓\\checkmarkTable 65:LightMem case study in QA\.Question: A bowling ball with a mass of 8\.0 kg rolls down a bowling lane at 2\.0 m/s\. What is the momentum of the bowling ball?A\. 4\.0 kg×\\timesm/s B\. 6\.0 kg×\\timesm/s C\. 10\.0 kg×\\timesm/s D\. 16\.0 kg×\\timesm/sGround Truth: DRetrieved LTM Item 1 \(✗ Lesson from Failure\):Incorrect Calculation of Change in Momentum— Carefully consider the velocity before and after the event and accurately apply the change\-in\-momentum equation\. Double\-check calculations and consider all factors that may affect velocity\.Retrieved LTM Item 2 \(✓ Success Strategy\):Energy Conversion in Projectile Motion— Identify the initial and final states of energy \(potential vs\. kinetic\) and apply conservation principles\.p=m×v=8\.0×2\.0=16\.0kg⋅m/sp=m\\times v=8\.0\\times 2\.0=16\.0\\,\\text\{kg\}\\cdot\\text\{m/s\}\.A, B, C are all inconsistent with this value; D is correct\.<answer\> D </answer\>Table 66:LightMem case study in Math\.Question: Trevor is a comic book illustrator \(220 pages total, 3 issues, first two equal, third is \+4 pages\)\. How many pages was the first issue?Ground Truth: 72Retrieved LTM Item 1 \(✓ Success Strategy\):Calculate the Relevant Subset Before Proceeding— Identify and exclude irrelevant components before setting up the main equation to avoid incorrect calculations\.Retrieved LTM Item 2 \(✓ Success Strategy\):Breaking Down Complex Problems into Manageable Steps— Decompose the problem into sub\-goals, solve each in sequence, and combine results\.Letxx= pages in issues 1 and 2; issue 3 =x\+4x\+4\.3x\+4=220⇒x=723x\+4=220\\;\\Rightarrow\\;x=72\.\#\#\#\# 72Table 67:LightMem case study in Code Generation\.Question: Return the largest prime factor ofn\. Assumen\>1n\>1and is not a prime\.Tests:assert largest\_prime\_factor\(15\) == 5,assert largest\_prime\_factor\(27\) == 3Ground Truth: Trial\-division algorithm\.Retrieved LTM Item 1 \(✓ Success Strategy\):Identifying the Largest Prime Factor in a Factorial— Start from the largest prime≤\\leqthe upper limit; for generalnn, use trial division up ton\\sqrt\{n\}and track the largest divisor\.Retrieved LTM Item 2 \(✗ Lesson from Failure\):Optimizing Prime Factorization— After the loop, if the remainingn\>1n\>1, it is itself a prime factor and should be recorded\.‘‘‘pythondef largest\_prime\_factor\(n: int\):largest\_factor = 1for i in range\(2, int\(n\*\*0\.5\) \+ 1\):while n % i == 0:largest\_factor = i; n //= iif n \> 1: largest\_factor = nreturn largest\_factor‘‘‘Table 68:LightMem case study in ALFWorld\.Task: put a soapbottle in countertop\.Outcome: Score = 1\.0, Steps = 12Retrieved Memories:Long\-term memory \(relevant facts, top\-20 by cosine\):\- \[2026\-04\-29 Wed\] Soapbottle items are commonly stored inside cabinets in bathroom layouts\.\- \[2026\-04\-28 Tue\] Cabinet 4 contains soapbottle 1, soapbottle 2, and candle 2 in this layout\.\- \[2026\-05\-06 Wed\] User moved cloth 1 to cabinet 1 in a similar pick\-and\-place task\.\- \[2026\-04\-29 Wed\] To complete a “put X in countertop” task: take X from its location, then move X to countertop 1\.\- \[2026\-04\-28 Tue\] Cabinet 1 typically holds candle, soapbar, spraybottle \(not soapbottle\)\.\[… additional LTM facts omitted …\]Action Trace:explore cabinets 2–3 \(no soapbottle\)→\\togo to cabinet 1 \(soapbar/spraybottle only\)→\\togo to cabinet 4 \(found soapbottle 1, 2\)→\\totake soapbottle 1→\\togo to countertop 1→\\tomove soapbottle 1 to countertop 1✓\\checkmarkTable 69:LightMem case study in AppWorld\.Task: Chelsea and I have been carpooling to work this month\. They have requested money for it on Venmo\. Accept it\.Outcome: Score = 0\.83, Steps = 17Retrieved Memories:Long\-term memory \(relevant facts, top\-20 by cosine\):\- \[2026\-05\-04 Mon\] User authenticated with venmo before viaapis\.venmo\.login\.\- \[2026\-05\-04 Mon\] Venmo balance can be checked viaapis\.venmo\.show\_account\_balance\(access\_token=\.\.\.\)\.\- \[2026\-05\-04 Mon\] Pending payment requests are listed viaapis\.venmo\.list\_pending\_payment\_requests\.\- \[2026\-05\-04 Mon\] To approve a request, callapis\.venmo\.approve\_payment\_request\(payment\_request\_id=\.\.\., access\_token=\.\.\.\)\.\[… additional LTM facts omitted …\]Final Action:print\(apis\.venmo\.approve\_payment\_request\(access\_token=venmo\_token,payment\_request\_id=6097\)\)apis\.supervisor\.complete\_task\(answer="The payment request from Chelsea for $520\.00 has been successfully approved\."\)Table 70:Mem0 case study in QA\.Question: A bowling ball with a mass of 8\.0 kg rolls down a bowling lane at 2\.0 m/s\. What is the momentum of the bowling ball?A\. 4\.0 kg×\\timesm/s B\. 6\.0 kg×\\timesm/s C\. 10\.0 kg×\\timesm/s D\. 16\.0 kg×\\timesm/sGround Truth: DRetrieved Procedural Summary \(✓\):procedural\_summary— Task Objective: Analyze options to determine the correct answer about energy conversion when a bowling ball is released\. Sequential actions: \(1\) analyze option A \(stored energy\); \(2\) analyze option B \(kinetic energy of pins\); \(3\) analyze option C; \(4\) select the option consistent with conservation of energy\.Usingp=m×vp=m\\times v:p=8\.0×2\.0=16\.0kg⋅m/sp=8\.0\\times 2\.0=16\.0\\,\\text\{kg\}\\cdot\\text\{m/s\}\.Comparing with the options, D \(16\.0 kg×\\timesm/s\) is correct\.<answer\> D </answer\>Table 71:Mem0 case study in Math\.Question: Trevor is a comic book illustrator \(220 pages total, 3 issues, first two equal, third is \+4 pages\)\. How many pages was the first issue?Ground Truth: 72Retrieved Procedural Summary \(✓\):procedural\_summary— Task Objective: Solve a word problem to determine average daily tickets\. Sequential actions: \(1\) calculate tickets given in first 15 days; \(2\) subtract from the monthly goal; \(3\) divide remaining by days left; \(4\) verify by checking the computed average\.Step 1: definexx= pages in issues 1 and 2; issue 3 =x\+4x\+4\.Step 2:x\+x\+\(x\+4\)=220⇒3x=216⇒x=72x\+x\+\(x\+4\)=220\\;\\Rightarrow\\;3x=216\\;\\Rightarrow\\;x=72\.\#\#\#\# 72Table 72:Mem0 case study in Code Generation\.Question: Return the largest prime factor ofn\. Assumen\>1n\>1and is not a prime\.Tests:assert largest\_prime\_factor\(15\) == 5,assert largest\_prime\_factor\(27\) == 3Ground Truth: Trial\-division algorithm\.Retrieved Procedural Summary \(✓\):procedural\_summary— Task: Find the largest prime factor of 97\!\. Actions: \(1\) understand that 97\! = product of all integers 1–97; \(2\) identify the largest prime≤97\\leq 97, which is 97 itself; \(3\) return 97\.‘‘‘pythondef largest\_prime\_factor\(n: int\):i = 2while i \* i <= n:if n % i: i \+= 1else: n //= ireturn n‘‘‘Table 73:Mem0 case study in ALFWorld\.Task: put some spraybottle on toilet\.Outcome: Score = 1\.0, Steps = 9Retrieved Memories:\#\# Relevant memories from past experience:\- \[2026\-05\-06\] Spraybottle items are commonly found inside cabinets in bathroom layouts\.\- \[2026\-05\-06\] Cabinet 2 contains spraybottle 1, candle 2, cloth 2 in this layout\.\- \[2026\-05\-06\] To complete a “put X on toilet” task: navigate to toilet 1 and callmove X to toilet 1\.\- \[2026\-05\-06\] Cabinet 1 typically holds soapbottle and toiletpaper, not spraybottles\.\[… additional fact entries omitted …\]Action Trace:go to cabinet 1 \(no spraybottle\)→\\togo to cabinet 3 \(toiletpaper only\)→\\togo to cabinet 2 \(found spraybottle 1\)→\\totake spraybottle 1→\\togo to toilet 1→\\tomove spraybottle 1 to toilet 1✓\\checkmarkTable 74:Mem0 case study in AppWorld\.Task: I keep a log of all my liked songs and respective artists in a note in simple\_note\. I want to add release month information for them as well\. \[…\]Outcome: Score = 0\.75, Steps = 31Retrieved Memories:\#\# Relevant memories from past experience:\- \[2026\-05\-06\] User has authenticated with simple\_note app viaapis\.simple\_note\.login\.\- \[2026\-05\-06\] Note IDs in simple\_note are queried viaapis\.simple\_note\.list\_notes\.\- \[2026\-05\-06\] Spotify track release date is inapis\.spotify\.show\_track\(track\_id=\.\.\.\)\.release\_date\.\- \[2026\-05\-06\] To update a simple\_note, callapis\.simple\_note\.update\_note\(note\_id=\.\.\., content=\.\.\.\)\.\[… additional fact entries omitted …\]Final Action:apis\.simple\_note\.update\_note\(access\_token=access\_token, note\_id=3084, content=updated\_content\)apis\.supervisor\.complete\_task\(answer="The release month information has been added to the songs in the note\."\)Table 75:AWM case study in QA\.Question: A bowling ball with a mass of 8\.0 kg rolls down a bowling lane at 2\.0 m/s\. What is the momentum of the bowling ball?A\. 4\.0 kg×\\timesm/s B\. 6\.0 kg×\\timesm/s C\. 10\.0 kg×\\timesm/s D\. 16\.0 kg×\\timesm/sGround Truth: DWorkflow forarc\_c:Workflow 1: Analyze Each Option and Explain Your Reasoning— \(1\) Identify the question\. \(2\) List all options\. \(3\) Analyze each option for relevance, feasibility, and potential outcomes\. \(4\) Evaluate strengths and weaknesses\. \(5\) Select the most suitable option\.Workflow 2: Identify the Key Characteristics…\[abbreviated\]Following Workflow 1: applyp=mv=8\.0×2\.0=16\.0p=mv=8\.0\\times 2\.0=16\.0\. A, B, C are all less than 16\.0; D matches\.<answer\> D </answer\>Table 76:AWM case study in Math\.Question: Trevor is a comic book illustrator \(220 pages total, 3 issues, first two equal, third is \+4 pages\)\. How many pages was the first issue?Ground Truth: 72Workflow forgsm8k:Workflow 3: Set Up and Solve an Equation— \(1\) Identify the unknown variable\. \(2\) Set up an equation from the problem statement\. \(3\) Simplify and solve\. \(4\) Verify the answer satisfies all constraints\.Workflow 3: letxx= first issue pages;x\+x\+\(x\+4\)=220x\+x\+\(x\+4\)=220\.3x=216⇒x=723x=216\\;\\Rightarrow\\;x=72\.\#\#\#\# 72Table 77:AWM case study in Code Generation\.Question: Return the largest prime factor ofn\. Assumen\>1n\>1and is not a prime\.Tests:assert largest\_prime\_factor\(15\) == 5,assert largest\_prime\_factor\(27\) == 3Ground Truth: Trial\-division algorithm\.Workflow forhumaneval\_plus:Dividennby the smallest prime \(2\) while divisible; then increment to the next divisor and repeat\. The last divisor used beforennreaches 1 is the largest prime factor\. Handle the case where the remainingn\>1n\>1after the loop\.‘‘‘pythondef largest\_prime\_factor\(n: int\):i = 2while i \* i <= n:if n % i == 0: n //= ielse: i \+= 1return n‘‘‘Table 78:AWM case study in ALFWorld\.Task: clean some dishsponge and put it in countertop\.Outcome: Score = 1\.0, Steps = 16Retrieved Memories:\#\# AWM workflow \(induced fromN=5N\{=\}5successful trials of pick\_clean\_then\_place\_in\_recep\):Step 1: navigate to candidate locations of \{object\} \(shelf, drawer, cabinet\)Step 2: take \{object\} \{idx\} from \{location\}Step 3: navigate to sinkbasin 1Step 4: clean \{object\} \{idx\} with sinkbasin 1Step 5: navigate to \{target\_recep\} 1Step 6: move \{object\} \{idx\} to \{target\_recep\} 1Action Trace:go to sinkbasin 2 \(empty\)→\\togo to shelf 1 \(found dishsponge 1\)→\\totake dishsponge 1→\\togo to sinkbasin 1→\\toclean dishsponge 1 with sinkbasin 1→\\togo to countertop 1→\\tomove dishsponge 1 to countertop 1✓\\checkmarkTable 79:AWM case study in AppWorld\.Task: How much have I paid in phone bill on Venmo this year so far?Outcome: Score = 0\.50, Steps = 40Retrieved Memories:\#\# AWM workflow \(induced fromN=5N\{=\}5successful trials of venmo aggregate\-amount queries\):Step 1: get supervisor token viaapis\.supervisor\.show\_account\_passwordsStep 2: log into venmo:apis\.venmo\.login\(username=\.\.\., password=\.\.\.\)Step 3: list transactions:apis\.venmo\.show\_transactions\(access\_token=\.\.\.\)Step 4: filter by description containing \{keyword\} and date≥\\geqyear\-startStep 5: sum the matching amounts and complete the taskFinal Action:print\(apis\.api\_docs\.show\_api\_descriptions\(app\_name="supervisor"\)\)\# \(workflow truncated at 40 steps; agent collected partial total\)Table 80:IRCoT case study in QA\.Question: A bowling ball with a mass of 8\.0 kg rolls down a bowling lane at 2\.0 m/s\. What is the momentum of the bowling ball?A\. 4\.0 kg×\\timesm/s B\. 6\.0 kg×\\timesm/s C\. 10\.0 kg×\\timesm/s D\. 16\.0 kg×\\timesm/sGround Truth: DBM25\-Retrieved Example \(Round 1\):Calculating Momentum— When calculating momentum, ensure to multiply the given mass and velocity accurately\.BM25\-Retrieved Example \(Round 2\):Calculating Momentum— Correctly apply the formulap=mvp=mvand perform the multiplication to avoid common mistakes\.B \(6\.06\.0\):8\.0×2\.0≠6\.08\.0\\times 2\.0\\neq 6\.0\. C \(10\.010\.0\): incorrect\. D \(16\.016\.0\):8\.0×2\.0=16\.0kg⋅m/s8\.0\\times 2\.0=16\.0\\,\\text\{kg\}\\cdot\\text\{m/s\}— correct\.\#\#\#\# DTable 81:IRCoT case study in Math\.Question: Trevor is a comic book illustrator \(220 pages total, 3 issues, first two equal, third is \+4 pages\)\. How many pages was the first issue?Ground Truth: 72BM25\-Retrieved Example \(Round 1\):Misinterpretation of the Problem— Always read all conditions before setting up equations\.BM25\-Retrieved Example \(Round 2\):Misinterpretation of the Problem— Faisal writes responses to questions; the problem involves two separate writers\.Letxx= pages in issues 1 and 2;3x\+4=220⇒x=723x\+4=220\\;\\Rightarrow\\;x=72\.\#\#\#\# 72Table 82:IRCoT case study in Code Generation\.Question: Return the largest prime factor ofn\. Assumen\>1n\>1and is not a prime\.Tests:assert largest\_prime\_factor\(15\) == 5,assert largest\_prime\_factor\(27\) == 3Ground Truth: Trial\-division algorithm\.BM25\-Retrieved Example \(Round 1\):Finding the Largest Perfect Square Factor— Decompose into prime factors; select highest even powers\.\#\#\#\#‘‘‘pythondef largest\_prime\_factor\(n: int\) \-\> int:i = 2while i \* i <= n:if n % i: i \+= 1else: n //= ireturn n‘‘‘Table 83:IRCoT case study in ALFWorld\.Task: put two spraybottle in cabinet\.Outcome: Score = 1\.0, Steps = 10Retrieved Memories:Retrieved past experience \(chained CoT, max\_hops==2, top\-K==10\):\- act: go to cabinet 1→\\toobs: You arrive at cabinet 1\. The cabinet 1 is open\. In it, you see soapbottle 3, soapbottle 2\.\- act: go to countertop 1→\\toobs: … a spraybottle 1 …\- act: take spraybottle 1 from countertop 1→\\toobs: You pick up the spraybottle 1\.\- \[completed OK\] task: put two spraybottle in cabinet\|\|actions: go to countertop 1→\\totake spraybottle 1→\\togo to cabinet 1→\\tomove spraybottle 1 to cabinet 1→\\togo to bathroom→\\totake spraybottle 2→\\to…\[… additional chunks omitted …\]Action Trace:explore cabinets 1–3→\\tofound spraybottle 2→\\totake andmove spraybottle 2 to cabinet 1→\\togo to countertop 1 \(found spraybottle 1\)→\\totake spraybottle 1→\\tomove spraybottle 1 to cabinet 1✓\\checkmarkTable 84:IRCoT case study in AppWorld\.Task: Approve all venmo payment requests from my roommates from this calendar month\.Outcome: Score = 0\.86, Steps = 3Retrieved Memories:Retrieved past experience \(chained CoT, max\_hops==2, top\-K==10\):\- act:roommate\_emails = \[u\["email"\] for u in apis\.supervisor\.show\_relations\(relation="roommate"\)\]→\\toobs: list of roommate emails returned\- act:requests = apis\.venmo\.list\_pending\_payment\_requests\(access\_token=\.\.\.\)→\\toobs: pending requests dict\- \[completed OK\] task: approve roommates’ venmo requests\|\|actions: log in→\\topull roommate emails→\\tofilter by current month→\\toapprove in loop\[… additional chunks omitted …\]Final Action:current\_month, current\_year = pendulum\.now\(\)\.month, pendulum\.now\(\)\.yearfor request in payment\_requests:created\_at = pendulum\.parse\(request\["created\_at"\]\)if request\["sender"\]\["email"\] in roommate\_emails and\\\\backslashcreated\_at\.month == current\_month and created\_at\.year == current\_year:apis\.venmo\.approve\_payment\_request\(access\_token=venmo\_access\_token, payment\_request\_id=\.\.\.\)apis\.supervisor\.complete\_task\(status="success"\)Table 85:Search\-o1 case study in QA\.Question: A bowling ball with a mass of 8\.0 kg rolls down a bowling lane at 2\.0 m/s\. What is the momentum of the bowling ball?A\. 4\.0 kg×\\timesm/s B\. 6\.0 kg×\\timesm/s C\. 10\.0 kg×\\timesm/s D\. 16\.0 kg×\\timesm/sGround Truth: D<\|begin\_search\_query\|\>A bowling ball with mass 8\.0 kg rolls at 2\.0 m/s\. What is its momentum? A\. 4\.0 B\. 6\.0 C\. 10\.0 D\. 16\.0 kg×\\timesm/s<\|end\_search\_query\|\><\|begin\_search\_result\|\>Final InformationMomentum Formula InsightMomentum is defined asp=mvp=mv\. Multiply the object’s mass by its velocity to get momentum in kg⋅\\cdotm/s\. Common errors include adding rather than multiplyingmmandvv\.<\|end\_search\_result\|\>To find the momentum I applyp=mvp=mv\.Given:m=8\.0kgm=8\.0\\,\\text\{kg\},v=2\.0m/sv=2\.0\\,\\text\{m/s\}\.p=8\.0×2\.0=16\.0kg⋅m/sp=8\.0\\times 2\.0=16\.0\\,\\text\{kg\}\\cdot\\text\{m/s\}\.Checking options:\- A \(4\.04\.0\): would require8\.0÷2\.08\.0\\div 2\.0, wrong operation\.\- B \(6\.06\.0\):8\.0\+2\.0−4\.08\.0\+2\.0\-4\.0, no physical basis\.\- C \(10\.010\.0\):8\.0\+2\.08\.0\+2\.0, wrong operation\.\- D \(16\.016\.0\):8\.0×2\.0=16\.08\.0\\times 2\.0=16\.0✓The answer is D\.Table 86:Search\-o1 case study in Math\.Question: Trevor is a comic book illustrator\. He has just started on a new series\. The first issue took him 220 pages to illustrate\. If the series is 3 issues long, the first two issues are the same length, and the third issue is 4 pages longer than the others, how many pages was the first issue?Ground Truth: 72<\|begin\_search\_query\|\>Comic book illustrator: 3 issues, 220 pages total\. First two equal, third is 4 pages longer\. How many pages in the first issue?<\|end\_search\_query\|\><\|begin\_search\_result\|\>Final InformationAlgebraic Equation StrategyFor problems with equal parts and a small offset, letxxdenote the equal part\. Write a linear equation summing all parts to the total, then solve forxx\. Watch for off\-by\-one errors when the offset applies to only one term\.<\|end\_search\_result\|\>Letxx= pages in issue 1 \(and issue 2\)\. Issue 3 =x\+4x\+4\.Total equation:x\+x\+\(x\+4\)=220x\+x\+\(x\+4\)=2203x\+4=2203x\+4=2203x=2163x=216x=72x=72The first issue had72pages\. Verification:72\+72\+76=22072\+72\+76=220✓72Table 87:Search\-o1 case study in Code Generation\.Question: Return the largest prime factor ofn\. Assumen\>1n\>1and is not a prime\.Tests:assert largest\_prime\_factor\(15\) == 5,assert largest\_prime\_factor\(27\) == 3Ground Truth: Trial\-division algorithm\.<\|begin\_search\_query\|\>Python function to return the largest prime factor of n, where n \> 1 and n is not prime\.<\|end\_search\_query\|\><\|begin\_search\_result\|\>Final InformationTrial Division StrategyTo find the largest prime factor: \(1\) dividennby 2 while even; \(2\) iterate oddiifrom 3 whilei2≤ni^\{2\}\\leq n, dividing out all multiples; \(3\) ifn\>1n\>1remains, it is the largest prime factor\. Time complexityO\(n\)O\(\\sqrt\{n\}\)\.<\|end\_search\_result\|\>I need to find the largest prime factor ofnn\.Algorithm \(trial division\):1\. Handle factor 2: whilennis even, divide by 2; track last factor\.2\. Handle odd factors: fori=3,5,…i=3,5,\\ldotswhilei2≤ni^\{2\}\\leq n, divide out all copies ofii\.3\. Ifn\>1n\>1after the loop,nnitself is the largest prime factor\.Trace onlargest\_prime\_factor\(15\):15/3=515/3=5;5\>15\>1⇒\\Rightarrowreturn 5 ✓Trace onlargest\_prime\_factor\(27\):27/3=927/3=9,9/3=39/3=3,3/3=13/3=1; last factor = 3 ✓The cleanest implementation uses a running variableiistarting at 2 and repeatedly dividesnn\.‘‘‘pythondef largest\_prime\_factor\(n: int\) \-\> int:i = 2while i \* i <= n:if n % i: i \+= 1else: n //= ireturn n‘‘‘Table 88:Search\-o1 case study in ALFWorld\.Task: cool some bread and put it in countertop\.Outcome: Score = 1\.0, Steps = 11Retrieved Memories:Search\-o1 trace \(planner\-driven iterative search,≤3\\leq 3rounds\):⟨\\langlesearch⟩\\ranglehow to cool an item in ALFWorld kitchen⟨\\langle/search⟩\\rangle→\\toretrieved: “open fridge 1→\\tocool \{X\} with fridge 1”\.⟨\\langlesearch⟩\\ranglewhere can bread be found in seen ALFWorld layouts⟨\\langle/search⟩\\rangle→\\toretrieved: “countertop 2 commonly contains bread, egg, pan, kettle in this layout”\.Action Trace:go to countertop 2 \(found bread 1\)→\\totake bread 1→\\togo to fridge 1→\\toopen fridge 1→\\tocool bread 1 with fridge 1→\\togo to countertop 1→\\tomove bread 1 to countertop 1✓\\checkmarkTable 89:Search\-o1 case study in AppWorld\.Task: Update all my Spotify playlist titles with the most common song genre in that playlist in this format: “<original\_title\>\|\|<most\_common\_genre\>”\. \[…\]Outcome: Score = 0\.83, Steps = 6Retrieved Memories:Search\-o1 trace \(planner\-driven iterative search\):⟨\\langlesearch⟩\\ranglehow to enumerate Spotify playlists for the current user⟨\\langle/search⟩\\rangle→\\toretrieved: “apis\.spotify\.show\_playlists\(access\_token=\.\.\.\)returnsplaylist\_id, title, tracks”\.⟨\\langlesearch⟩\\ranglehow to update a playlist title⟨\\langle/search⟩\\rangle→\\toretrieved: “apis\.spotify\.update\_playlist\(playlist\_id, title=\.\.\.\)”\.Final Action:\# \(loop body: for each playlist, compute mode of track genres, update title\)apis\.supervisor\.complete\_task\(answer="All Spotify playlist titles have been updated with the most common song genre\."\)Table 90:MeMRL case study in QA\.Question: The speed of a sound wave varies as it travels through different substances\. Which factor will most affect the speed of a sound wave?A\. the frequency of the wave B\. the wavelength of the wave C\. the source that created the sound D\. the distance between molecules in the substanceGround Truth: DRetrieved Memory 1 \(✓ Success, Q = 0\.82\):Sound Transmission in Media— Consider the properties of each medium \(rock, seawater, space, atmosphere\) and compare their speeds of sound transmission\. The speed of sound depends on molecular density and arrangement\.Retrieved Memory 2 \(✗ Failure, Q = 0\.31\):Wave Properties Confusion— Incorrectly assumed that frequency determines speed; frequency affects pitch, not propagation speed\.Retrieved Memory 3 \(✓ Success, Q = 0\.76\):Molecular Density and Sound— The closer the molecules in a substance, the faster sound can propagate\. This is why sound travels faster in solids than in gases\.The speed of sound depends on the medium’s properties, specifically the distance between molecules\. Frequency \(A\) affects pitch, wavelength \(B\) is derived from speed and frequency, and the source \(C\) creates the wave but doesn’t determine speed\. The molecular arrangement \(D\) is the key factor\.<answer\> D </answer\>Table 91:MeMRL case study in Math\.Question: There are 15 girls and 10 boys in Ms\. Smith’s class\. She divided 375 books equally among them\. How many books did all the girls get combined?Ground Truth: 225Retrieved Memory 1 \(✓ Success, Q = 0\.91\):Division and Grouping— First find the total number of students, then compute per\-student share, then multiply by the target group size\.Retrieved Memory 2 \(✓ Success, Q = 0\.85\):Boys and Girls Ratio— Correctly represent the relationship between groups using subtraction from the total, then allocate proportionally\.Retrieved Memory 3 \(✗ Failure, Q = 0\.22\):Misapplied Division— Divided books only among girls instead of all students; always check whether the division applies to the full group\.Total students=15\+10=25=15\+10=25\. Books per student=375÷25=15=375\\div 25=15\. Girls’ total=15×15=225=15\\times 15=225\.<answer\> 225 </answer\>Table 92:MeMRL case study in Code Generation\.Question: Given a class name and a list of extensions, find the strongest extension\. The strength isCAP−SM\\text\{CAP\}\-\\text\{SM\}, where CAP is the count of uppercase letters and SM is the count of lowercase letters in the extension name\. Return"ClassName\.StrongestExt"\.Tests:assert Strongest\_Extension\(’my\_class’, \[’AA’, ’Be’, ’CC’\]\) == ’my\_class\.AA’Ground Truth: Iterate extensions, compute CAP−\-SM for each, track the maximum\.Retrieved Memory 1 \(✓ Success, Q = 0\.78\):String Character Counting— Define the function signature correctly, iterate through characters, and useisupper\(\)/islower\(\)to count\.Retrieved Memory 2 \(✓ Success, Q = 0\.72\):Tracking Maximum with Index— Initialize the strongest extension as the first element, then compare each subsequent extension’s strength\.Retrieved Memory 3 \(✗ Failure, Q = 0\.18\):Incomplete Implementation— Correctly defined the function parameters but left the iteration logic as a placeholder; always complete the loop body\.def Strongest\_Extension\(class\_name, extensions\):strongest = extensions\[0\]max\_str = sum\(1 for c in extensions\[0\] if c\.isupper\(\)\)\- sum\(1 for c in extensions\[0\] if c\.islower\(\)\)for ext in extensions\[1:\]:s = sum\(1 for c in ext if c\.isupper\(\)\)\- sum\(1 for c in ext if c\.islower\(\)\)if s \> max\_str: max\_str = s; strongest = extreturn class\_name \+ ’\.’ \+ strongestTable 93:MeMRL case study in ALFWorld\.Task: put two cd in drawer\.Outcome: Score = 1\.0, Steps = 5Retrieved Memories:Past successful trajectories \(reward\-weighted, top\-similar\):Your task is to: put two cellphone in dresser\.\> go to desk 1You arrive at desk 1\. … a cellphone 1, cellphone 2, …\> take cellphone 1 from desk 1\> move cellphone 1 to dresser 1\> take cellphone 2 from desk 1\> move cellphone 2 to dresser 1—\[… additional reward\-weighted trajectories omitted …\]Action Trace:go to desk 1 \(found cd 1, cd 2 alongside other items\)→\\totake cd 1→\\tomove cd 1 to drawer 1→\\totake cd 2→\\tomove cd 2 to drawer 1✓\\checkmarkTable 94:MeMRL case study in AppWorld\.Task: I went on a trip with friends to Maui recently\. I have maintained a note of money I owe to others and others owe me from the trip in simple\_note\. Make private venmo payments or requests accordingly\. \[…\]Outcome: Score = 0\.86, Steps = 19Retrieved Memories:Past successful trajectories \(reward\-weighted, top\-similar\):Task: settle expense splits via venmo from a simple\_note ledger\.\> Read note via apis\.simple\_note\.show\_note\(note\_id=\.\.\.\)\> Parse "I owe X" / "X owes me" lines into two dicts\.\> Resolve names to emails via apis\.supervisor\.show\_relations\.\> For each entry, call apis\.venmo\.send\_money OR apis\.venmo\.create\_payment\_request with private=True\.\[… additional reward\-weighted trajectories omitted …\]Final Action:owe\_me\_map = \{"Cory": 212, "Edwin": 70, "Jeremy": 149\}for person, amount in owe\_me\_map\.items\(\):if person in user\_emails:apis\.venmo\.create\_payment\_request\(user\_email=user\_emails\[person\], amount=amount,access\_token=venmo\_token, description="For Maui trip", private=True\)apis\.supervisor\.complete\_task\(answer="Payments and requests for the Maui trip have been processed\."\)Table 95:S3 case study in QA\.Question: Which property best explains why a metal spoon feels colder than a wooden spoon at the same room temperature?Options: \(A\) higher density, \(B\) higher thermal conductivity, \(C\) lower melting point, \(D\) stronger magnetism\.Ground Truth: \(B\) higher thermal conductivity\.<\|begin\_search\_query\|\>Why does a metal spoon feel colder than a wooden spoon at room temperature?<\|end\_search\_query\|\><\|begin\_search\_result\|\>Retrieved Information: Metals conduct heat much faster than wood\. When touched, metal draws heat away from the skin more rapidly, making it feel colder even if both objects have the same temperature\.<\|end\_search\_result\|\>The retrieved passage is directly relevant\. The question asks about two objects at the same room temperature, so the answer cannot be about actual temperature difference\. The key factor is the rate of heat transfer from the hand to the object\. Since metal has higher thermal conductivity than wood, it removes heat from the skin faster and therefore feels colder\.Prediction: \(B\) higher thermal conductivity\. ✓Table 96:S3 case study in Math\.Question: A store sells notebooks for $3 each and pens for $2 each\. Alice buys 4 more pens than notebooks and spends $34 in total\. How many notebooks does she buy?Ground Truth: 6 notebooks\.<\|begin\_search\_query\|\>solve word problem notebooks 3 dollars pens 2 dollars 4 more pens total 34<\|end\_search\_query\|\><\|begin\_search\_result\|\>Retrieved Information: Let the smaller quantity bexx\. If pens are 4 more than notebooks, use notebooks=x=xand pens=x\+4=x\+4\. The cost equation is3x\+2\(x\+4\)=343x\+2\(x\+4\)=34\.<\|end\_search\_result\|\>Letxxbe the number of notebooks\. Then the number of pens isx\+4x\+4\. The total cost is3x\+2\(x\+4\)=343x\+2\(x\+4\)=34\. Thus3x\+2x\+8=343x\+2x\+8=34, so5x=265x=26, givingx=5\.2x=5\.2\. This is not an integer, so the retrieved setup suggests the problem may be inconsistent\. However, checking nearby integer values: ifx=6x=6, pens=10=10, cost=18\+20=38=18\+20=38, not 34; ifx=5x=5, pens=9=9, cost=15\+18=33=15\+18=33\.Prediction: No integer solution\.×\\timesObservation: S3 retrieves a useful equation pattern, but because the retrieved information is used mechanically, it exposes the inconsistency rather than correcting the task\. This case shows that search\-style retrieval can help formulate equations but does not always provide robust reasoning when the query requires verification or adjustment\.Table 97:S3 case study in Code Generation\.Question: Write a functioncount\_vowels\(s\)that returns the number of vowels in a string\. Treat both uppercase and lowercase vowels as valid\.Tests:assert count\_vowels\("Apple"\) == 2,assert count\_vowels\("WHY"\) == 0\.Ground Truth: Count characters whose lowercase form is in"aeiou"\.<\|begin\_search\_query\|\>Python count vowels in string uppercase lowercase<\|end\_search\_query\|\><\|begin\_search\_result\|\>Retrieved Information: Convert the string to lowercase and sum over characters that appear in the vowel set\{’a’,’e’,’i’,’o’,’u’\}\.<\|end\_search\_result\|\>The retrieved result gives the exact algorithm\. We can lowercase the input and count membership in the vowel set\. This handles uppercase vowels such as"A"in"Apple", while ignoring consonants such as"W","H", and"Y"\.‘‘‘pythondef count\_vowels\(s\):vowels = set\("aeiou"\)return sum\(1 for ch in s\.lower\(\) if ch in vowels\)‘‘‘Prediction: Passes the tests\. ✓Table 98:S3 case study in ALFWorld\.Task: Put a clean mug in the cabinet\.Ground Truth Plan: Find mug→\\rightarrowtake mug→\\rightarrowclean mug at sink→\\rightarrowgo to cabinet→\\rightarrowput mug in cabinet\.<\|begin\_search\_query\|\>ALFWorld clean mug put in cabinet strategy<\|end\_search\_query\|\><\|begin\_search\_result\|\>Retrieved Information: For clean\-and\-place tasks, first locate the object, take it, move to a sinkbasin, use the sink to clean it, then move to the target receptacle and put the object there\.<\|end\_search\_result\|\>The search result provides a useful high\-level plan\. The agent first searches the countertop and shelves, finds the mug, and takes it\. It then navigates to the sinkbasin and performs a cleaning action before moving to the cabinet\. However, after cleaning, the agent repeats several navigation actions because the retrieved plan does not specify which cabinet has already been observed\.Outcome: Success, but with redundant steps\.Observation: S3 retrieves a generally correct procedural template, but it does not maintain a structured memory of environment\-specific state\. This can lead to repeated exploration even when the high\-level strategy is correct\.Table 99:S3 case study in AppWorld\.Task: Find the user’s unread email from Bob about a dinner reservation and create a calendar event for the reservation time\.Ground Truth Plan: Search unread emails from Bob→\\rightarrowextract restaurant, date, and time→\\rightarrowcreate calendar event with the extracted details\.<\|begin\_search\_query\|\>AppWorld search unread email create calendar event from extracted time<\|end\_search\_query\|\><\|begin\_search\_result\|\>Retrieved Information: Use the email API to filter unread messages by sender or keyword, parse the relevant message body, then call the calendar API with title, start time, end time, and optional location\.<\|end\_search\_result\|\>The retrieved information gives a useful API\-level workflow\. The agent searches unread messages and identifies the relevant email\. It extracts the reservation time but creates the calendar event without including the restaurant name as the location\. Since the evaluator checks both the time and event details, the execution is only partially correct\.Outcome: Partial execution; fails strict validation\.Observation: S3 provides a plausible search\-derived workflow, but it does not explicitly preserve fine\-grained task constraints from past executions\. This can hurt AppWorld tasks where small missing fields cause failure\.Table 100:ExpGraph case study in QA\.Question: Which property best explains why a metal spoon feels colder than a wooden spoon at the same room temperature?Options: \(A\) higher density, \(B\) higher thermal conductivity, \(C\) lower melting point, \(D\) stronger magnetism\.Ground Truth: \(B\) higher thermal conductivity\.Retrieved Experience Graph Neighborhood:Seed experience: When two objects have the same temperature, perceived coldness often depends on how quickly heat leaves the skin, not on the object’s actual temperature\.Connected experience: Metals transfer heat faster than wood or plastic; this makes metal feel colder because it draws heat from the hand more rapidly\.Utility signal: Similar heat\-transfer explanations previously improved science QA accuracy, while density\-based distractors were often misleading\.The retrieved experiences separate the relevant physical mechanism from surface distractors\. Since the spoon and wooden spoon are at the same room temperature, the answer must explain the sensation of coldness\. The useful experience indicates that heat\-transfer rate is the key factor, and the connected experience identifies metals as high\-conductivity materials\. Therefore, the metal spoon feels colder because it conducts heat away from the skin faster\.Prediction: \(B\) higher thermal conductivity\. ✓Table 101:ExpGraph case study in Math\.Question: A store sells notebooks for $3 each and pens for $2 each\. Alice buys 4 more pens than notebooks and spends $38 in total\. How many notebooks does she buy?Ground Truth: 6 notebooks\.Retrieved Experience Graph Neighborhood:Seed experience: For two\-item purchase problems, define the smaller quantity asxxand express the larger quantity using the stated difference\.Connected experience: Always verify the solution by substituting the quantity back into the total\-cost equation\.Connected failure lesson: Do not answer immediately after solving the equation; check whether the result is an integer and satisfies all constraints\.Letxxbe the number of notebooks\. Then the number of pens isx\+4x\+4\. The total cost equation is3x\+2\(x\+4\)=383x\+2\(x\+4\)=38\. Thus3x\+2x\+8=383x\+2x\+8=38, so5x=305x=30andx=6x=6\. Verification: notebooks cost6×3=186\\times 3=18, pens cost10×2=2010\\times 2=20, and the total is3838\.Prediction: 6 notebooks\. ✓Table 102:ExpGraph case study in Code Generation\.Question: Write a functioncount\_vowels\(s\)that returns the number of vowels in a string\. Treat both uppercase and lowercase vowels as valid\.Tests:assert count\_vowels\("Apple"\) == 2,assert count\_vowels\("WHY"\) == 0\.Ground Truth: Count characters whose lowercase form is in"aeiou"\.Retrieved Experience Graph Neighborhood:Seed experience: For character\-counting tasks, normalize the input before applying membership tests\.Connected experience: Use a set for constant\-time membership when checking whether characters belong to a small category\.Connected failure lesson: Do not forget uppercase cases when the problem explicitly says the match should be case\-insensitive\.The graph retrieves both the algorithmic pattern and the common failure mode\. The implementation should lowercase the string before counting vowels, and a set is sufficient for checking membership\. This directly handles"Apple"because"A"becomes"a", and it returns zero for"WHY"because none of the characters are vowels after lowercasing\.‘‘‘pythondef count\_vowels\(s\):vowels = set\("aeiou"\)return sum\(1 for ch in s\.lower\(\) if ch in vowels\)‘‘‘Prediction: Passes the tests\. ✓Table 103:ExpGraph case study in ALFWorld\.Task: Put a clean mug in the cabinet\.Ground Truth Plan: Find mug→\\rightarrowtake mug→\\rightarrowclean mug at sink→\\rightarrowgo to cabinet→\\rightarrowput mug in cabinet\.Retrieved Experience Graph Neighborhood:Seed experience: For clean\-and\-place tasks, first secure the target object, then clean it at a sinkbasin before moving to the destination receptacle\.Connected experience: After cleaning, avoid re\-checking unrelated rooms; go directly to the target receptacle if it has already been observed\.Connected failure lesson: A common failure is putting the object away before cleaning it, which completes the wrong state and causes task failure\.The retrieved experiences provide both a high\-level plan and a constraint to avoid a known failure\. The agent first locates and takes the mug, moves to the sinkbasin, cleans the mug, and then navigates directly to the cabinet\. Because the graph also retrieves a failure lesson about premature placement, the agent avoids putting the mug into the cabinet before cleaning it\.Outcome: Success with fewer redundant exploration steps\.Table 104:ExpGraph case study in AppWorld\.Task: Find the user’s unread email from Bob about a dinner reservation and create a calendar event for the reservation time\.Ground Truth Plan: Search unread emails from Bob→\\rightarrowextract restaurant, date, and time→\\rightarrowcreate calendar event with the extracted details\.Retrieved Experience Graph Neighborhood:Seed experience: For email\-to\-calendar tasks, first filter by unread status and sender, then parse the message body for time, date, title, and location\.Connected experience: Calendar creation tasks often require all structured fields, including title, start time, end time, and location, not just the time\.Connected failure lesson: Missing optional\-looking fields such as restaurant name or location can still fail strict AppWorld validation\.The retrieved experiences emphasize both the API workflow and the strict validation requirement\. The agent searches unread emails from Bob, identifies the reservation email, extracts the restaurant name, date, and time, and creates a calendar event with a title such asDinner Reservation, the correct start and end time, and the restaurant as the location\. The failure lesson prevents the agent from creating an underspecified event\.Outcome: Success under strict validation\.
ExpGraph: Model-Agnostic Experience Learning with Graph-Structured Memory for LLM Agents

Similar Articles

Experience Memory Graph: One-Shot Error Correction for Agents

@neural_avb: Here's the latest paper on Graph Memory on LLM agents

Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents

SkillGraph: Skill-Augmented Reinforcement Learning for Agents via Evolving Skill Graphs

G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents

Submit Feedback

Similar Articles

Experience Memory Graph: One-Shot Error Correction for Agents
@neural_avb: Here's the latest paper on Graph Memory on LLM agents
Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents
SkillGraph: Skill-Augmented Reinforcement Learning for Agents via Evolving Skill Graphs
G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents