Multi-Agent Transactive Memory

arXiv cs.AI 06/20/26, 04:00 AM Papers
Summary
Proposes Multi-Agent Transactive Memory (MATM), a framework for population-level storage and retrieval of agent-generated trajectories to improve task performance and reduce interaction steps in interactive environments like ALFWorld and WebArena.
arXiv:2606.19911v1 Announce Type: new Abstract: The decentralized deployment of LLM agents with diverse capabilities across diverse tasks motivates infrastructure for knowledge sharing across heterogeneous agent populations. Just as search engines index human-generated artifacts to support human problem solving, retrieval systems can organize agent-generated artifacts for reuse across agent populations. We extend retrieval-augmented generation - which demonstrates the value of human-authored artifacts to individual agents - to retrieval of agent-generated artifacts supporting a population of agents. In particular, agent trajectories encode reusable procedural knowledge, yet these artifacts are typically discarded after a single use or retained only by the producing agent, forcing newly instantiated agents to repeatedly rediscover existing solutions. We propose Multi-Agent Transactive Memory (MATM), a framework for population-level storage and retrieval of agent-generated trajectories, where producer agents contribute trajectories to a shared repository and consumer agents retrieve them to improve task execution. We focus on interactive environments (ALFWorld and WebArena), where trajectories are long and encode especially rich procedural structure. Our experiments demonstrate that retrieving trajectories from MATM improves downstream task performance and reduces interaction steps without coordination or joint training. These results position MATM as a design pattern for population-level experience sharing in open agent ecosystems.
Original Article
View Cached Full Text
Cached at: 06/20/26, 02:34 PM
# Multi-Agent Transactive Memory
Source: [https://arxiv.org/html/2606.19911](https://arxiv.org/html/2606.19911)
To Eun Kim1\*Xuhong He1\*Dishank Jain1\* Ambuj Agrawal1Negar Arabzadeh2Fernando Diaz1 1Carnegie Mellon University2University of California, Berkeley

###### Abstract

The decentralized deployment of LLM agents with diverse capabilities across diverse tasks motivates infrastructure for knowledge sharing across heterogeneous agent populations\. Just as search engines index human\-generated artifacts to support human problem solving, retrieval systems can organize agent\-generated artifacts for reuse across agent populations\. We extend retrieval\-augmented generation—which demonstrates the value of human\-authored artifacts to individual agents—to retrieval of agent\-generated artifacts supporting a population of agents\. In particular, agent trajectories encode reusable procedural knowledge, yet these artifacts are typically discarded after a single use or retained only by the producing agent, forcing newly instantiated agents to repeatedly rediscover existing solutions\. We propose Multi\-Agent Transactive Memory \(MATM\), a framework for population\-level storage and retrieval of agent\-generated trajectories, whereproducer agentscontribute trajectories to a shared repository andconsumer agentsretrieve them to improve task execution\. We focus on interactive environments \(ALFWorld and WebArena\), where trajectories are long and encode especially rich procedural structure\. Our experiments demonstrate that retrieving trajectories from MATM improves downstream task performance and reduces interaction steps without coordination or joint training\. These results position MATM as a design pattern for population\-level experience sharing in open agent ecosystems\.

Multi\-Agent Transactive Memory

To Eun Kim1\*Xuhong He1\*Dishank Jain1\*Ambuj Agrawal1Negar Arabzadeh2Fernando Diaz11Carnegie Mellon University2University of California, Berkeley

††footnotetext:\*Denotes equal contribution\.††footnotetext:\*[https://github\.com/kimdanny/matm](https://github.com/kimdanny/matm)![Refer to caption](https://arxiv.org/html/2606.19911v1/x1.png)Figure 1:Multi\-Agent Transactive Memory \(MATM\)\. Traditional search serves humans retrieving human\-authored documents\. RAG extends this to agents retrieving from human\-generated corpora\. MATM takes the next step by letting agents retrieve agent\-generated artifacts such as interaction trajectories, which are atypical documents that differ fundamentally from human\-written text\. MATM can continually grow while serving a distributed population of agents\.## 1Introduction

As heterogeneous LLM agents are deployed across increasingly diverse domains, research on individual agent design must be complemented by methods for supporting decentralized populations of agents\. The need for population\-level infrastructure has motivated protocols to support agent\-tool interaction\(mcp2026\)as well as inter\-agent communication\(a2a\_protocol2026\)\. Beyond standards, tools such as search engines are beginning to be optimized for agents\(zamani:reml;salemi:se\-for\-machines\)\. Although retrieval\-augmented generation \(RAG\) demonstrates the value of human\-authored artifacts to individual agents, infrastructure for knowledge sharing amongst agents offers a compelling alternative\. Just as search engines index human\-generated artifacts to support human problem solving, retrieval systems can organize agent\-generated artifacts for reuse across agent populations \(Figure[1](https://arxiv.org/html/2606.19911#S0.F1)\)\.

Artifact sharing and reuse are essential for enabling scalable, efficient, and continually improving agent populations\. As agents operate across environments, they produce a number of intermediate artifacts, which contain rich procedural knowledge, such as action\-observation trajectories\(muennighoff\-etal\-2025\-s1\)\. Yet these artifacts are typically discarded after a single use or retained only by the producing agent\(zheng2024synapse\)\. The ability to efficiently reuse learned behaviors and continually acquire new knowledge or experience becomes critical for scalability and long\-term performance\(wang2025agentworkflow;liang2026skillnet;shi2025continual\)\. In contrast with RAG, agent\-generated artifacts can be more suitable for agent consumption compared to human\-authored documents\(chen2026agentir\)\. The need for population\-level reuse is further amplified by practical considerations\. Many modern agents rely on inference\-time scaling and generate a number of intermediate artifacts, incurring substantial computational cost\(kaplan2020scaling;yao2023tree;wu2024scaling;welleck2024from\)\. As a result, reusing those artifacts can reduce costs for reasoning and exploration\(ahmed2025retrieval\)\.

Existing approaches to artifact reuse are insufficient for heterogeneous agent ecosystems\. Prior work on reasoning or thought reuse\(zheng2024synapse;ouyang2025reasoningbank;ahmed2025retrieval\)improves cost\-efficiency and effectiveness within individual agents, but reuse remains limited to the original artifact\-producer; despite substantial overlap in the tasks agents solve, interaction trajectories are typically discarded after a single use\(zheng2024synapse;zhao2024expel\), causing newly instantiated agents to repeatedly rediscover solutions that already exist elsewhere in the ecosystem\. Related paradigms such as transfer learning\(konidaris:portable\-options;brunskill:multi\-task\-rl\-sample\-complexity\)and knowledge distillation\(li2025naturalthoughts;kang2025distilling\)require alignment between source and target domains and often demand additional training, making them impractical for diverse, dynamically instantiated populations of heterogeneous agents\. Centralized multi\-agent coordination methods\(dang2025multiagent\)further assumecooperative settingsand shared protocols, constraining their applicability in open ecosystems\(tranMultiAgentCollaborationMechanisms2025\)where agents can freely join at any time\. Indeed, based on analysis of Moltbook,liDoesSocializationEmerge2026identify shared social memory as a missing prerequisite for the development of agent societies\.

To address this gap, we propose Multi\-Agent Transactive Memory \(MATM\), a framework for population\-level storage and retrieval of agent\-generated artifacts, based on the concept of transactive memory\(wegner1987transactive\), in which human groups coordinate by distributing knowledge across individuals by using shared mechanisms for locating and retrieving relevant information\. Similarly, MATM maintains a shared repository to which agents can contribute artifacts produced during their own task execution \(producer agents\) and from which agents can retrieve procedural knowledge to improve their own task effectiveness and efficiency \(consumer agents\)\. Roles are not mutually exclusive: an agent may produce trajectories in one context and consume them in another\.

This producer\-consumer structure induces a two\-sided marketplace for agent\-generated procedural knowledge, with clear attribution between retrieved artifacts and their sources\. As more agents interact with the repository, MATM grows organically, accumulating a corpus across an increasingly diverse set of tasks and environments\. Operating as a specialized retrieval system over agent\-generated artifacts, MATM further enables retrieval functions that go beyond generic similarity search, including agent\-specific personalization, producer trust modeling, and periodic update of retriever as the population evolves\.

We empirically demonstrate the effectiveness of MATM in interactive environments \(ALFWorld\(shridhar2021alfworld\)and WebArena\(zhou2024webarena\)\)\. We first show that agents consistently benefit from a simple single\-stage retrieval pipeline: retrieving relevant trajectories from a MATM repository populated by diverse agents not only improves downstream task performance without requiring additional coordination or joint training, but also improves task efficiency as measured by a reduced number of interaction steps\. We further introduce an efficient yet powerful learning\-to\-rank \(LTR\)\-based trajectory reranking stage\. With simple featurization of trajectory information, reranking yields better retrieval quality, leading to improved task effectiveness and greater step efficiency\. Moreover, we find that retrieval benefit extends to both weaker and stronger agents, generalizes across tasks, and continues to improve as the repository grows\. Taken together, our results demonstrate that MATM provides a scalable mechanism for population\-level experience reuse, enabling agents to leverage collective trajectories rather than repeatedly rediscovering solutions in isolation\.

## 2Background \(Appendix[A](https://arxiv.org/html/2606.19911#A1)\)

Memory has long played a role in the development of AI agents\. Existing approaches can be understood as memory over various sources of data\.Memory of training dataprovides agents with access to knowledge explicitly or implicitly stored during optimization\. Explicit methods include nearest\-neighbor algorithms\(cover\-hart:nn;khandelwal:knnlm\)or case\-based reasoning\(kolodner:intro\-cbr;das:nl\-cbr\)or implicit behaviors\(carlini:memorization\)\.Memory over experience dataprovides agents with access to traces of their own interactions\. Historically, methods reflecting memory of experience data include early cognitive architectures like SOAR\(laird:soar\), reinforcement learning\(lin:experience\-replay\), and neural networks\(weston:memory\-networks\)\. In the context of LLM agents, recent extensions treat an agent’s own interaction history as retrievable context, giving rise to memory\-augmented generation where past conversations or execution traces guide future behavior\(shinn2023reflexion;majumder2024clin;zheng2024synapse\)\. Agents generate rich intermediate artifacts during problem solving, including action\-observation trajectories, thinking traces, plans, workflows, and reusable code analogous to options in reinforcement learning\(Garcia19compressionMacro;veeriah2021discovery\)\. At the trajectory level, Buffer of Thoughts\(yang2024buffer\)and Retrieval of Thought\(ahmed2025retrieval\)retrieve reasoning templates as in\-context guidance, whilezheng2024synapseandzhao2024expelreuse action\-observation trajectories for downstream decision\-making\. Beyond trajectories, works such as CLIN\(majumder2024clin\), Voyager\(wang2024voyager\), AWM\(wang2025agentworkflow\), MaestroMotif\(klissarov2025maestromotif\), ASI\(wang2025inducing\), ReasoningBank\(ouyang2025reasoningbank\), and T3\(arabzadeh2026thinkingtrace\)extract and reuse more abstract artifacts such as causal abstractions, workflows, skills, and executable code\. Agent artifacts can further serve as distillation signals to transfer competence across models\(yang2025supercorrect;li2025naturalthoughts;kang2025distilling\)\.Memory of external dataprovides agent with access to shared artifact repositories and is represented by retrieval\-augmented generation \(RAG\)\(lewis2020retrieval\), which enhances language models by conditioning generation on retrieved external context\(fan2024survey\)\.

## 3Multi\-Agent Transactive Memory

In existing memory systems, experience data is typically reused only by the same or homogeneous agent\(s\) that produced it, leaving valuable experience isolated and forcing less\-experienced agents to rediscover existing solutions\. In contrast, we propose a population\-level memory of experience data, providing a collective memory for a population of agents\. Rather than treating memory as private to each agent, we consider it a shared, structured resource that heterogeneous agents can both contribute to and retrieve from\. This shifts artifact reuse from an individual optimization mechanism to a collective knowledge infrastructure, enabling continual learning and cross\-agent transfer, reducing redundant exploration, and supporting cumulative capability growth at the population level\.

We consider a population ofnnLLM agents𝒜=\{Ai\}i=1n\\mathcal\{A\}=\\\{A\_\{i\}\\\}\_\{i=1\}^\{n\}each potentially pursuing heterogeneous goals and operating across one or more environmentsℰ=\{Ei\}i=1m\\mathcal\{E\}=\\\{E\_\{i\}\\\}\_\{i=1\}^\{m\}\. A task is specified by a descriptionx∈𝒳x\\in\\mathcal\{X\}, which corresponds to goal specification or initial state for an agent\. Given a task descriptionxxin environmentEiE\_\{i\}, an LLM agentAjA\_\{j\}performs a series of interleaved turns with the environment to solve the task\. During this process, we can record a variable\-length trajectory𝒯Ei,Aj=\(τt\)t=1H\\mathcal\{T\}\_\{E\_\{i\},A\_\{j\}\}=\(\\tau\_\{t\}\)\_\{t=1\}^\{H\}, where each stepτt\\tau\_\{t\}represents a unit of interaction\. For example, in a web navigation environment, eachτt\\tau\_\{t\}corresponds to an action\-observation pair \(e\.g\., a click action and the resulting HTML observation\) in the interaction sequence\. For simplicity, we denote this agent\-generated trajectory as𝒯\\mathcal\{T\}\.

As the agent population𝒜\\mathcal\{A\}operates across environmentsℰ\\mathcal\{E\}, these trajectories accumulate into a rich collection of intermediate artifacts\. We denote the population\-level artifact repository as𝒟=\{𝒯\}\\mathcal\{D\}=\\\{\\mathcal\{T\}\\\}and refer to this incrementally growing shared memory as Multi\-Agent Transactive Memory \(MATM\)\. Within this framework, we refer to agents that contribute trajectories to𝒟\\mathcal\{D\}asproducer agentsand those that retrieve from it to aid their own task\-solving asconsumer agents, where these roles are not mutually exclusive\. Our goal is to study how retrieval from this population\-level memory can be optimized to improve outcomes for the population of consumer agents\.

Although we focus on raw trajectories, this does not preclude higher\-level abstractions such as skills or induced policies\. We operate with trajectories as the lowest\-level and most universally available outputs produced by agents across environments, and therefore provide a natural foundation for studying indexing and retrieval in MATM while still allowing higher\-level abstractions such as skill induction\(klissarov2025maestromotif\)to be built on top\. Moreover, retrieval over interactive trajectories is itself non\-trivial\. Prompt\-like artifacts such asSKILL\.mdfiles can be indexed with standard RAG techniques\(liang2026skillnet\)or further transformed into more retrieval\-friendly forms\(arabzadeh2026thinkingtrace\), but state\-conditioned retrieval over action\-observation histories has received much less attention and is the setting we study\.

### 3\.1Transactive Memory Indexing & Retrieval

For action\-observation trajectories, we adopt a state\-conditioned key\-value indexing scheme in which recent interaction history serves as the retrieval key and the subsequent interaction segment as the stored value\. This allows consumer agents to retrieve continued guidance conditioned on their current state rather than only on the original task instruction\.

Given a window sizell, for each interaction stepttwe define the key𝐞key\(t\)=f\(x,τt−l\+1,…,τt\)\\mathbf\{e\}\_\{\\text\{key\}\}^\{\(t\)\}=f\\left\(x,\\tau\_\{t\-l\+1\},\\ldots,\\tau\_\{t\}\\right\)and the associated value as the nextllsteps,\(τt,…,τt\+l−1\)\\left\(\\tau\_\{t\},\\ldots,\\tau\_\{t\+l\-1\}\\right\)which serves as the documentddthat an agent retrieves at inference time, whereffis a shared embedding function, andτi\\tau\_\{i\}contains both an observation and an action\.

Given a task descriptionxxand MATM memory𝒟\\mathcal\{D\}, a trajectory retrieverℛ\\mathcal\{R\}forms a search queryqq, following the process described above and returns a ranked listπ=ℛ\(x,𝒟,K\)\\pi=\\mathcal\{R\}\(x,\\mathcal\{D\},K\)of candidate trajectory chunks, where higher\-ranked chunks are predicted to be more relevant for the current task and state\. The trajectory retrieverℛ\\mathcal\{R\}may be instantiated as a dense retriever, or a cascaded retrieval pipeline combining an initial retriever with a reranker\.

Although the embedding modelffcan in principle be tuned to better support artifact retrieval, we instead explore a simpler and underexamined approach that aligns retrieval results with consumer\-agent preferences using lightweight learning\-to\-rank \(LTR\) rerankers\(cao2007ltr\)\.

### 3\.2Learning To Rank Trajectories \(LTRT\)

Learning to rank pipelines consist of a retrieval stage known as candidate generation, followed by a feature\-based ranking stage that re\-orders the retrieved candidates\.

A feature mapϕ\\phiis designed to capture multiple complementary aspects of trajectory usefulness\. Letϕ\(q,d\)∈ℝz\\phi\(q,d\)\\in\\mathbb\{R\}^\{z\}be a feature map that extracts features by inspecting the queryqqand document keydd\. Letgθ:ℝz↦ℝg\_\{\\theta\}:\\mathbb\{R\}^\{z\}\\mapsto\\mathbb\{R\}be a parameterized scoring function, where larger outputs indicate greater predicted helpfulness of a documentddvalue for taskqq\.

In MATM, we define features in six categories:1producer agent metadata \(e\.g\.,relevant benchmark scores\);2consumer agent metadata \(e\.g\., agent ID\);3first\-stage retrieval features \(e\.g\., retrieval scores\);4query features \(e\.g\., query length\);5trajectory features \(e\.g\., trajectory length\); and6query\-trajectory interaction features \(e\.g\., query\-trajectory embedding similarity\)\.Two of these categories carry particular conceptual weight\. Producer\-agent metadata is designed to enable a form oftrust modeling, allowing the reranker to learn which producers’ trajectories are reliable for a given context\. Consumer\-agent metadata is designed to enablepersonalizationof retrieval to the individual consumer that has joined the MATM framework, since the same trajectory may be more or less useful depending on the consumer’s capabilities\.

Traininggθg\_\{\\theta\}requires supervision over which retrieved trajectories actually help\. Rather than treating relevance as semantic similarity, we label trajectory chunks by theirmarginal utility\(salemiEvaluatingRetrievalQuality2024\): a chunk is helpful to the extent that injecting it into the consumer agent improves task outcome relative to running the same agent with no retrieval\. The concrete procedure for collecting these labels is intertwined with how the memory itself is built, and we describe both jointly in Section[4\.1](https://arxiv.org/html/2606.19911#S4.SS1)and[4\.2](https://arxiv.org/html/2606.19911#S4.SS2)\.

## 4Experimental Setup

We instantiate MATM in two interactive benchmarks: ALFWorld\(shridhar2021alfworld\), a text\-based household\-task environment, and WebArena\(zhou2024webarena\), a web navigation\-based task environment\. Each benchmark yields its own MATM index, populated by trajectories from 35 to 37 producer agents and consumed by 34 consumer agents \(full population list in Appendix[C](https://arxiv.org/html/2606.19911#A3)\)\.

For ALFWorld, we use the official train and test split, using 3553 episodes from the official training set, and evaluating on all 274 official test episodes \(Appendix[D](https://arxiv.org/html/2606.19911#A4)\)\. For WebArena, which ships without a standard train/test partition, we construct a custom split that preserves the distribution of task intents, yielding 724 training and 88 test episodes \(Appendix[E](https://arxiv.org/html/2606.19911#A5)\)\. In both benchmarks, all MATM construction and LTRT training is performed strictly on the training partition, so the test set remains untouched throughout the MATM corpus construction phase\. As a result, the test set may contain questions whose task type or environment configuration overlaps with those seen during MATM construction \(e\.g\., similar map layouts in ALFWorld or shared website domains in WebArena\), but no test question is itself solved by a producer agent and inserted into the corpus\.

### 4\.1Transactive Memory Construction

To construct MATM emerging as a trajectory storage of a population of agents producing and consuming trajectories, we expand the MATM corpus through two phases\.Pre\-populationinitializes the index from existing trajectory sources, andincremental updategrows it as the producer and consumer agent population processes new training questions and contributessuccessfultrajectories back to the shared memory\. Both phases operate exclusively over the training partition\.

#### Pre\-Population\.

The pre\-population phase seeds an initial index𝒟0\\mathcal\{D\}\_\{0\}with publicly available trajectories\. For ALFWorld, we collect trajectories from a trained seq2seq model released by the benchmark authors, supplemented by trajectories generated by running Qwen3\-32B and GPT\-OSS 20B on the training set\. For WebArena, we collect publicly available trajectories produced by GPT\-4\-Turbo, GPT\-4\-Turbo\-Preview, and Claude\-3\.5\-Sonnet from the official benchmark runs, again supplemented by Qwen3\-32B and GPT\-OSS 20B trajectories generated on the training set\. In both cases, the collected trajectories are segmented into document chunks, encoded with the shared embedding functionff, and inserted into the dense index𝒟0\\mathcal\{D\}\_\{0\}, yielding 85,615 and 8,547 chunks for ALFWorld and WebArena respectively\.

#### Incremental Update\.

After pre\-population, MATM grows incrementally as the agent population operates over a stream of new training questions\. This phase serves a dual purpose\. It enriches the index with trajectories from a diverse set of producer agents, and it simultaneously creates the supervision signals needed to train LTRT rerankers\.

The training questions are organized into partitions\{𝒳p\}p=1P\\\{\\mathcal\{X\}\_\{p\}\\\}\_\{p=1\}^\{P\}processed sequentially\. Within each partition, every question is assigned to a producer agent via a deterministic allocation functionσ\(x,𝒜,p\)\\sigma\(x,\\mathcal\{A\},p\)that ensures balanced coverage across task categories and agents \(Appendix[F](https://arxiv.org/html/2606.19911#A6)\)\. For each assigned pair\(x,An\)\(x,A\_\{n\}\), the agent first attemptsxxwithout retrieval to obtain a baseline trajectory𝒯base\\mathcal\{T\}\_\{\\mathrm\{base\}\}and scoresbases\_\{\\mathrm\{base\}\}, which serves as the reference point for downstream marginal\-utility comparisons\. We then sampleTTbranching points\{t1,…,tT\}\\\{t\_\{1\},\\dots,t\_\{T\}\\\}randomly from the steps of𝒯base\\mathcal\{T\}\_\{\\mathrm\{base\}\}\. Followingchang2015learning, at each branching pointtt, weroll into the corresponding prefixht=\(τ1,…,τt\)h\_\{t\}=\(\\tau\_\{1\},\\dots,\\tau\_\{t\}\)and retrieve the top\-KKchunks from the current index most similar toxxcombined withhth\_\{t\}\. We thenroll out\|ℐ\|\|\\mathcal\{I\}\|one\-shot trajectory\-augmented generations fromhth\_\{t\}, one per selected rankj∈ℐ⊆\{1,…,K\}j\\in\\mathcal\{I\}\\subseteq\\\{1,\\ldots,K\\\}, scoring each resulting trajectory𝒯t\(j\)\\mathcal\{T\}\_\{t\}^\{\(j\)\}asst\(j\)s\_\{t\}^\{\(j\)\}\.

This loop produces two outputs simultaneously\. Any trajectory meeting a quality thresholdθ\\theta, including the baseline, is added to a trajectory bufferℬp\\mathcal\{B\}\_\{p\}\. After all questions in𝒳p\\mathcal\{X\}\_\{p\}have been processed, every trajectory inℬp\\mathcal\{B\}\_\{p\}is segmented, embedded withff, and added to the index, yielding𝒟p\\mathcal\{D\}\_\{p\}\.

After all partitions are processed, the final MATM corpus contains 86,833 chunks for ALFWorld and 20,102 chunks for WebArena \(Appendix[G](https://arxiv.org/html/2606.19911#A7)\)\.111All trajectories are available at[https://huggingface\.co/datasets/toeunkim/matm\-trajectories](https://huggingface.co/datasets/toeunkim/matm-trajectories)\.The full algorithm is given in Appendix[H](https://arxiv.org/html/2606.19911#A8)\.

### 4\.2LTRT Dataset & Reranker Training

The incremental construction procedure yields a labeled training dataset𝒮=\{\(q,d,ℓ\)\}\\mathcal\{S\}=\\\{\(q,d,\\ell\)\\\}for the LTRT reranker\. For each retrieved chunkdt\(j\)d\_\{t\}^\{\(j\)\}evaluated at branching pointtt, we record the tuple\(qt,dt\(j\),ℓ\)\(q\_\{t\},d\_\{t\}^\{\(j\)\},\\ell\)with labelℓ=st\(j\)−sbase\\ell=s\_\{t\}^\{\(j\)\}\-s\_\{\\mathrm\{base\}\}, capturing the chunk’s marginal utility relative to the no\-retrieval baseline\. WithQ=∑p\|𝒳p\|Q=\\sum\_\{p\}\|\\mathcal\{X\}\_\{p\}\|training questions,TTbranching points per question, and\|ℐ\|\|\\mathcal\{I\}\|ranks evaluated per branching point, the resulting dataset containsQ×T×\|ℐ\|Q\\times T\\times\|\\mathcal\{I\}\|labeled tuples\. In our experiments, we usedT=2T=2for ALFWorld,T=1T=1for WebArena\. We sample rank positionsℐ=\{1,5,10,15,20\}\\mathcal\{I\}=\\\{1,5,10,15,20\\\}for both benchmarks, exposing the LTRT model to candidates across the full retrieval depth while avoiding the cost of generating all twenty labeled episodes\.

We compute 44 features per\(q,d\)\(q,d\)pair, spanning the six categories introduced in §[3\.2](https://arxiv.org/html/2606.19911#S3.SS2)\. The full feature list is provided in Appendix[I](https://arxiv.org/html/2606.19911#A9)\. We train three reranker families spanning common LTR paradigms: a pointwise feed\-forward network \(FFN\), pairwise LambdaMART\(wu2010adapting\), and pairwise SVMRank\(joachims2006training\)\. 20% of the training set was used for the validation set for LTRT training\.

### 4\.3Inference\-Time Configuration & Baselines

Across both environments we use the E5\-Base embedding model\(wang2022text\)as the shared embedding functionff\. Trajectory chunks spanl=5l=5action\-observation steps under the key\-value scheme\. At inference time, a cascaded retrieval pipeline first retrieves the top 20 candidate trajectory chunks, after which an LTRT reranker selects the final top\-1 chunk\. The retrieval budget is therefore 1: the working agent conditions on a single retrieved trajectory unit per retrieval call\.

Model Setups\.We compare three configurations: a vanilla LLM without retrieval, MATM with single\-stage dense retrieval only, and MATM with an LTRT reranker \(LLM prompts in Appendix[M](https://arxiv.org/html/2606.19911#A13)\)\.RetrievalPlanner\.Each consumer agent is equipped with aRetrievalPlannerLLM that decides, at each interaction step, whether to issue a retrieval call against MATM\. This allows agents to call on shared memory selectively rather than on every step, which is important because indiscriminate retrieval can dilute the agent’s context with irrelevant guidance\.Metrics\.We evaluate both task performance and efficiency\. Task performance is measured by downstream success rate \(SR\) and efficiency by the number of interaction steps per episode \(\# steps\)\. To jointly capture both dimensions, we adopt return\-paired preference \(RPP\)\(diaz2026rpp\), which measures the Pareto\-dominance of trajectories between a candidate model and a fixed baseline \(Appendix[B](https://arxiv.org/html/2606.19911#A2)\)\. Since consumer agents operate at population scale, unless specified, all reported metrics reflect average performance across consumer models, referred to as consumer population welfare\.

## 5Results

We organize results around five research questions: whether MATM augmentation improves the downstream effectiveness and efficiency of consumer agents \(§[5\.1](https://arxiv.org/html/2606.19911#S5.SS1)\), whether a learned reranking model can further boost retrieval quality \(§[5\.2](https://arxiv.org/html/2606.19911#S5.SS2)\), whether MATM retrieval benefit is exclusive to certain model groups or distributed across the population \(§[5\.3](https://arxiv.org/html/2606.19911#S5.SS3)\), whether MATM generalizes across task types \(§[5\.4](https://arxiv.org/html/2606.19911#S5.SS4)\), and how consumer population performance scales with memory size \(§[5\.5](https://arxiv.org/html/2606.19911#S5.SS5)\)\.

Table 1:Evaluation of MATM\-augmented agents in interactive environments\. Success Rate \(SR\) and number of steps \(\# steps\) are used for measuring the effectiveness and efficiency\. Values are average of the five runs of randomized task\-model allocation\.### 5\.1MATM\-Augmentation improves effectiveness and efficiency

Table[1](https://arxiv.org/html/2606.19911#S5.T1)summarizes results for ALFWorld and WebArena under no\-retrieval and single\-stage retrieval from MATM\. Across both benchmarks, retrieval from the shared repository consistently improves task outcomes\.

On ALFWorld, success rate increases from 47% to 55% \(\+8\.0%p\), while average steps per episode decrease from 11\.77 to 11\.18\. The RPP score rises from−0\.16\-0\.16to−0\.05\-0\.05, indicating that the retrieval\-augmented population more frequently Pareto\-dominates the no\-retrieval baseline in terms of the joint success\-efficiency\. On WebArena, success rate improves from 18% to 20% \(\+2%p\), with average steps falling from 22\.0 to 20\.3 and RPP turning positive at 0\.03\. The improvement is more modest than ALFWorld, possibly due to WebArena’s longer task horizons and greater sensitivity to early\-step errors\. Together, these results show that a shared repository of heterogeneous agent trajectories improves consumer population welfare along both effectiveness and efficiency dimensions\.

### 5\.2Learning to Rank Trajectories further improves MATM participants’ welfare

Single\-stage retrieval selects trajectories by embedding similarity alone\. We next ask whether a learned reranker trained to predict downstream utility can improve over this baseline\. We experiment with three reranker configurations: a feed\-forward network \(FFN\), LambdaMART, and SVMRank\.

On ALFWorld, all three rerankers improve over single\-stage retrieval, and SVMRank achieves the strongest results across all metrics: success rate reaches 64\.3% \(\+9\.2%p over single\-stage, \+17\.2%p over no\-retrieval\), average steps fall to 10\.35, and RPP rises to 0\.15\. On WebArena, reranker effectiveness is more moderate\. FFN matches the success rate of single\-stage retrieval at 20\.5% while achieving the lowest step count \(19\.91\) and highest RPP \(0\.04\) among all methods, making it the most effective reranker for that environment\. LambdaMART, by contrast, reverts success rate to the no\-retrieval level on WebArena, suggesting that the features it relies on are better calibrated to ALFWorld’s task structure than WebArena’s\.

### 5\.3MATM benefits are distributed across the agent population \(Appendix[K](https://arxiv.org/html/2606.19911#A11)\)

![Refer to caption](https://arxiv.org/html/2606.19911v1/x2.png)Figure 2:Retrieval Advantage vs\. Producer\-Consumer Capability Gap on ALFWorld with SVMRank reranking\. Each point is one producer–consumer pair\.While the previous sections establish that MATM improves average consumer welfare, they leave open whether that benefit is concentrated among particular producer\-consumer pairings or distributed across the population\. To answer this, we measure two quantities for each \(producer, consumer\) pair: theretrieval advantage, defined as the gain in consumer success rate when retrieving from that producer relative to its no\-retrieval baseline, and thecapability gap, defined as the difference between the two agents’ aggregated benchmark scores \(AAI\(artificialanalysis2026\)\)\. Higher the capability gap value, the producer is more capable than the consumer\.

Figure[2](https://arxiv.org/html/2606.19911#S5.F2)shows that consumers benefit from retrieval regardless of whether the producer is weaker, comparable, or stronger than themselves, indicating MATM’s value cannot be reduced to a single strong producer\. Although the correlation between capability gap and retrieval advantage shows a slight positive trend — suggesting that stronger producers may yield marginally higher benefit — it remains small and insignificant across both benchmarks, indicating that retrieval utility is not primarily driven by producer\-to\-consumer competence transfer\. Finally, reranking lifts the entire retrieval advantage distribution, confirming the finding in §[5\.2](https://arxiv.org/html/2606.19911#S5.SS2): on ALFWorld, SVMRank roughly doubles the mean retrieval advantage, and the same direction of effect appears on WebArena under FFN reranking \(Appendix[K](https://arxiv.org/html/2606.19911#A11)\)\.

### 5\.4MATM offers cross\-task generalization

Table 2:MATM retrieval scope ablation results across three candidate pool restrictions\. Number of tasks in parentheses\.We study how well MATM generalizes across tasks by varying the retrieval scope\. We evaluate three conditions: \(i\)fullretrieval places no restriction on the candidate pool; \(ii\)same\-taskretrieval limits candidates to trajectories from the same task type as the query; and \(iii\)cross\-taskretrieval limits candidates exclusively to trajectories from different task types \(Appendix[L](https://arxiv.org/html/2606.19911#A12)\)\.

Table[2](https://arxiv.org/html/2606.19911#S5.T2)shows that full retrieval achieves the highest SR and RPP in both environments, confirming that unrestricted candidate diversity is beneficial\. Two findings point to genuine cross\-task generalization\. First, even under cross\-task retrieval, ALFWorld SR reaches 59\.9%, which remains well above the no\-retrieval baseline of 47\.1% from Table[1](https://arxiv.org/html/2606.19911#S5.T1)\. This shows that trajectories from structurally different task types still carry transferable utility\. Second, the effectiveness and efficiency gap between full and same\-task retrieval in both environments suggests that restricting the candidate pool to same\-type trajectories is itself a source of degradation, excluding useful candidates that happen to cross task boundaries\. However, for both benchmarks, same\-task retrieval outperforms cross\-task retrieval, indicating that task\-type alignment still carries a meaningful relevance signal\.

### 5\.5MATM scales with memory size

![Refer to caption](https://arxiv.org/html/2606.19911v1/x3.png)

![Refer to caption](https://arxiv.org/html/2606.19911v1/x4.png)

Figure 3:MATM memory scaling curves for ALFWorld \(top\) and WebArena \(bottom\)\. Success rate \(left axis\) and average steps per episode \(right axis\) as a function of index size\. The dotted line marks SR of the no\-retrieval baseline\. Results are averaged over five runs with different random seeds\.We study how downstream effectiveness and efficiency change as MATM grows in size\. We construct nested memory subsets at 10%, 25%, 50%, 75%, and 100% of the full index, with each subset preserving producer model composition and benchmark coverage to isolate the effect of memory size from shifts in data distribution\. Each subset is constructed with five different random seeds and results are averaged across runs\.

Figure[3](https://arxiv.org/html/2606.19911#S5.F3)shows the scaling curves for both environments\. ALFWorld exhibits monotonic improvement in both success rate and step efficiency with index size, confirming that larger memory consistently benefits the agent\. On WebArena, step efficiency also decreases monotonically, consistent with ALFWorld\. Success rate, however, exhibits a non\-monotonic pattern: it dips at the 50% index before recovering sharply to 20\.9% at full scale, the strongest result across all index sizes and the clearest margin above the no\-retrieval baseline of 18\.2%\. We hypothesize that at intermediate scales, the index is large enough to surface plausible but ultimately unhelpful trajectories, yet not diverse enough to reliably include high\-quality matches\. At full scale, sufficient coverage overcomes this noise, restoring and exceeding the gains observed at smaller index sizes\.

## 6Discussion

Our results support the hypothesis that transactive memory provides a viable architecture for population\-level agent memory\. The results in §[5\.1](https://arxiv.org/html/2606.19911#S5.SS1)and §[5\.2](https://arxiv.org/html/2606.19911#S5.SS2)confirm that MATM with learned reranking consistently improves consumer agent welfare in both effectiveness and efficiency\.

Our feature importance analysis \(Appendix[J](https://arxiv.org/html/2606.19911#A10)\) suggests that predictive signals for reranking extend beyond retrieval\-level similarity to include producer agent metadata such as benchmark scores that represents the producer agents’ capability\. This reframes trajectory selection as, in part, a problem of producer trust modeling: the reranker learns to prefer trajectories from agents whose competence profiles predict downstream utility for the consumer\. However, feature importance concentrates differently across environments — ALFWorld relies heavily on a small set of producer features while WebArena distributes importance more evenly — which likely explains why no single reranker dominates both, and reinforces the need for retrieval systems that adapt to task structure rather than a fixed ranking policy\.

The capability\-gap analysis in §[5\.3](https://arxiv.org/html/2606.19911#S5.SS3)further shows that retrieval benefit is broadly distributed across the population rather than driven by transfer from stronger to weaker agents\. This opens a natural future direction of per\-consumer \(group\) personalization of retrieval policy since the same shared repository may be optimally exploited differently by agents with different competence profiles and task preferences\.

The retrieval scope experiments in §[5\.4](https://arxiv.org/html/2606.19911#S5.SS4)provide direct evidence for why population\-level memory is necessary\. Trajectories from different task types carry transferable utility, and restricting the candidate pool to same\-type trajectories degrades performance relative to unrestricted retrieval\. This means that useful trajectories are not confined within task boundaries\. They encode reusable patterns of interaction that generalize across tasks\. An agent\-specific or task\-specific memory, by design, would exclude these candidates\. MATM’s value lies in making them accessible\.

Our scaling experiments in §[5\.5](https://arxiv.org/html/2606.19911#S5.SS5)demonstrate that the value of MATM improves as the repository grows, suggesting that incentives or remuneration for producer contributions will be an important challenge for such platforms\.

## 7Conclusion

We introduced MATM, a shared population\-level memory where heterogeneous agents contribute and retrieve trajectories to improve task performance\. Retrieval improves effectiveness and efficiency across the agent population regardless of capability, with reranking further amplifying gains\. Retrieved trajectories generalize across task boundaries and performance scales with memory size, suggesting shared artifact storage as a promising substrate for collective and continual intelligence among distributed agents\.

## Limitations and Future Work

Our experiments cover two interactive benchmarks \(ALFWorld and WebArena\) and a 34\-model consumer population\. While no single study can cover every environment or every model in a rapidly evolving landscape, this restricts the scope of our empirical claims\. Our experiments and rerankers are also trained and evaluated within the same benchmark, so cross\-benchmark reranker generalization remains untested\. Also, the LTR dataset uses sparse rank sampling at positionsℐ=\{1,5,10,15,20\}\\mathcal\{I\}=\\\{1,5,10,15,20\\\}, which does not fully cover the label distribution at all rank positions\. This was a practical choice given experiment budget, and we found it to yield strong LTR learning performance with a favorable cost\-quality tradeoff\. Finally, our work focuses entirely on the consumer side of MATM\. Because MATM is fundamentally a two\-sided market, evaluating producer\-side welfare is equally important\. Future work could draw on attribution fairness in RAG\(kim2025fairrag\)or marketplace evaluation frameworks\(kim2026evaluation\)to address this gap\. Relatedly, the current framework does not account for adversarial producers who may contribute malicious trajectories, potentially placing consumer agents at risk\.

## Acknowledgments

This work was supported by NSF grant 2402874\. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsors\.

## References

## Appendix AUnabridged Related Work

### A\.1Memory\-Augmented Agents

Retrieval\-augmented generation \(RAG\)\(lewis2020retrieval\), an instance of Retrieval\-Enhanced Machine Learning\(zamani:reml;kim2024reml\), enhances language models by conditioning generation on retrieved external context, most commonly human\-authored documents such as web pages or knowledge bases\(fan2024survey\)\. Recent extensions of this paradigm treat an agent’s own interaction history as retrievable context, giving rise to memory\-augmented generation, where past conversations or execution traces are indexed and reused to guide future behavior\(shinn2023reflexion;majumder2024clin;zheng2024synapse\)\.

MATM fits within this retrieval\- and memory\-augmented paradigm but differs in both content and scope: instead of retrieving web documents or single\-agent’s local history, MATM indexes agent\-generated trajectories and treats memory as a population\-level resource shared across agents, where agents can contribute to and retrieve from the shared repository\.

### A\.2Reuse of Agent Artifacts

Modern reasoning agents, particularly those employing high inference\-time scaling, generate rich intermediate artifacts during problem solving\. These include low\-level action\-observation trajectories and thinking traces, as well as higher\-level plans, strategies, workflows, and reusable code which are anologous to the notion of options in reinforcement learning\(Garcia19compressionMacro;veeriah2021discovery\)\. Theseagent artifactscan be leveraged to improve inference efficiency, generalization, and continual adaptation\.

At the trajectory level, prior work has explored reusing reasoning or action\-observation trajectories as in\-context guidance\. Buffer of Thoughts\(yang2024buffer\)maintains and retrieves reasoning templates to guide new problem instances, while Retrieval of Thought\(ahmed2025retrieval\)constructs thought templates on the fly by retrieving prior reasoning trajectories\. For action\-observation trajectories,zheng2024synapseandzhao2024expelreuse environment interaction trajectories as in\-context examples to improve downstream decision\-making\.

Beyond the trajectory level, several works extract and reuse more abstract artifacts such as plans, strategies, workflows, and skills\. CLIN\(majumder2024clin\)stores textual causal abstractions to support continual improvement\. Agent Workflow Memory\(wang2025agentworkflow\)distills reusable workflows from web interaction trajectories, and MaestroMotif\(klissarov2025maestromotif\)induces reusable skills via reinforcement learning\. ReasoningBank\(ouyang2025reasoningbank\)retrieves strategy\-level reasoning patterns to guide problem solving\.arabzadeh2026thinkingtracetransform math reasoning thinking trajectories into higher level structures and use them as retrievable objects\. Applied to a programming domain,wang2025inducingenable agents to induce, verify, and reuse program\-based skills on the fly in web\-based tasks, while Voyager\(wang2024voyager\)maintains a growing library of executable code for open\-ended task execution\. In a concurrent work, SkillNet\(liang2026skillnet\)assembles a collection of skills contributed by multiple agents, framing skill accumulation as a system\-design problem\. Collectively, these works frame artifact reuse as a mechanism for accumulating reusable competence over time\.

Agent artifacts can also serve as supervision signals for model distillation\. SuperCorrect\(yang2025supercorrect\)extracts thought templates from a teacher model to guide smaller models during reasoning\. Related approaches similarly distill structured reasoning artifacts to transfer competence across models\(li2025naturalthoughts;kang2025distilling\)\. In these settings, artifacts function not only as inference\-time memory but also as compressed representations of reasoning expertise\.

While these systems reuse different types of agent artifacts, such artifacts are typically reused only by the same or homogeneous agent\(s\) that produced them, with less consideration of emerging society of agents\(liDoesSocializationEmerge2026;wangSkillOrchestraLearningRoute2026\)\. As a result, valuable experience remains isolated, and newly instantiated agents repeatedly rediscover solutions that already existed elsewhere in the other systems\.

In contrast, MATM proposes apopulation\-levelartifact repository\. Rather than treating artifacts as private, per\-agent memory, we model them as shared, structured resources that heterogeneous agents can both contribute to and retrieve from\. This shifts artifact reuse from an individual optimization mechanism to a collective knowledge infrastructure, enabling continual learning and cross\-agent transfer, reducing redundant exploration, and supporting cumulative capability growth at the ecosystem level\. The most closely related concurrent work, SkillNet\(liang2026skillnet\), also assembles a collection of skills across heterogeneous agents\. However, it primarily addresses system design and does not evaluate the benefit of retrieval for consumer agents in a population setting, nor does it provide in\-depth analysis of artifact repository search\.

## Appendix BReturn\-Paired Preference\(diaz2026rpp\)

Given a set ofnntasks and two agents—a control or baseline agentAAand a treatment agentA′A^\{\\prime\}—we have a set ofnntrajectories for each agent\. For a task, we say thatA′A^\{\\prime\}is preferred toAAifA′A^\{\\prime\}is successful andAAis not or, if both are successful,A′A^\{\\prime\}reaches the success in fewer steps; similarly,AAis preferred toA′A^\{\\prime\}ifAAis successful whileA′A^\{\\prime\}is not or if it faster to success if both are successful\. In all other situations, we say that there is no preference betweenA′A^\{\\prime\}andAAfor that task\. If the value of the preference is11whenA′A^\{\\prime\}is preferred,−1\-1whenAAis preferred, and0if there is no preference, then the return\-paired preference metric for agentA′A^\{\\prime\}is themeanpreference value over allnntasks compared with all other agents\.

## Appendix CPopulation of LLM Agents

Table[3](https://arxiv.org/html/2606.19911#A3.T3)shows the list of agents used as a population in each benchmark\.

Table 3:Producer and consumer agents across ALFWorld \(A\) and WebArena \(W\)\. AW indicates the model serves in that role for both environments\.
## Appendix DALFWorld Description

ALFWorld\(shridhar2021alfworld\)contains interactive TextWorld environments\(cote18textworld\)that parallel embodied worlds in the ALFRED dataset\(ALFRED20\)\. The aligned environments allow agents to reason and learn high\-level policies in an abstract space before solving embodied tasks through low\-level actuation\. ALFWorld translates complex household tasks such as finding, cleaning, heating, or placing objects into textual observations and actions, allowing researchers to train and evaluate agents using natural language rather than raw visual input\. The dataset consists of 3553 tasks for training and a heldout test set of 274 tasks\. The tasks are grouped into 6 task types: Pick & Place, Examine in Light, Clean & Place, Heat & Place, Cool & Place, Pick Two & Place\. Within each task category there is significant variation: the embodied environment includes 120 rooms \(30 kitchens, 30 bedrooms, 30 bathrooms, 30 living rooms\), each dynamically populated with a set of portable objects \(e\.g\., apple, mug\), and static receptacles \(e\.g\., microwave, fridge\)\. For interaction, TextWorld environments allow 9 high\-level actions such as ’open’, ’heat’, etc\. For our experiments, we use the 3553 training episodes for populating MATM memory, and a sampled representative subset of 355 episodes for generating data for training LTRT rerankers\. We use the heldout 274 episodes as the test set\.

## Appendix EWebArena Description

WebArena\(zhou2024webarena\)is a standalone, self\-hostable web environment for building autonomous agents\. WebArena creates websites from five popular categories \(Ecommerce platforms, Social Forums, Maps, Content Management Systems and Collaborative Development Platforms for software development\) with functionality and data mimicking their real\-world equivalents\. The dataset consists of 812 examples consisting of high\-level natural language instructions that require interaction with the WebArena environment to solve\. The dataset was created by curating realistic intents to carry out complex and creative tasks within WebArena\. Annotators were guided to spend a few minutes exploring the websites to familiarize themselves with the websites’ content and functionalities\. Then the annotators are tasked with intent formulation\. At the end, 241 intents were curated and 812 tasks were created with different instantiations of these intents\. Figure[4](https://arxiv.org/html/2606.19911#A5.F4)shows the distribution of intents accross different sites\.

![Refer to caption](https://arxiv.org/html/2606.19911v1/x5.png)Figure 4:The intent distribution across different websites for WebArenaFor our experiments, the test set comprised of 88 tasks sampled from the total 812 tasks while maintaining the distribution of intents\. We treated the leftover 724 tasks as training data for populating MATM memory, and used a subset of 58 episodes for generating LTRT training data\. In the end, our test set consisted of 88 tasks\.

## Appendix FTask Allocation Function

To ensure that every model in the population produces trajectories across all task categories, we employ atask\-type\-aware stratified round\-robinassignment with an offset\. Given a partition𝒳p\\mathcal\{X\}\_\{p\}, questions are first grouped into buckets by task type \(e\.g\., Algebra, Geometry, Precalculus for mathematical problem solving, or task categories for interactive benchmarks\), and each bucket is sorted in a deterministic order\. Within each bucket, questions are assigned to agents by cycling through the ordered populationA1,A2,…,ANA\_\{1\},A\_\{2\},\\ldots,A\_\{N\}in round\-robin fashion, starting at an offseto∈\{0,…,N−1\}o\\in\\\{0,\\ldots,N\{\-\}1\\\}\. The offset shifts the starting agent but does not change the bucket composition, ensuring that different offset values produce complementary assignments across agents\.

## Appendix GMATM Index Statistics

Table[4](https://arxiv.org/html/2606.19911#A7.T4)shows the size of MATM index across environments\.

\(a\)ALFWorld
\(b\)WebArena

Table 4:MATM index statistics across benchmarks\.
## Appendix HIncremental Construction of MATM & LTRT Dataset

Algorithm 1Incremental MATM & LTRT Dataset Construction1:Agent population

𝒜\\mathcal\{A\}, question partitions

𝒳1,…,𝒳P\\mathcal\{X\}\_\{1\},\\dots,\\mathcal\{X\}\_\{P\}, pre\-warmed index

𝒟0\\mathcal\{D\}\_\{0\}, retrieval depth

KK, rank positions

ℐ⊆\{1,…,K\}\\mathcal\{I\}\\subseteq\\\{1,\\dots,K\\\}, branching points per question

TT, allocation

σ\\sigma, evaluatorEval, quality threshold

θ\\theta, embedding function

ff
2:Updated MATM index

𝒟P\\mathcal\{D\}\_\{P\}and LTRT dataset

𝒮\\mathcal\{S\}
3:

𝒮←∅\\mathcal\{S\}\\leftarrow\\emptyset⊳\\trianglerightLTRT Dataset

4:for

p=1,…,Pp=1,\\dots,Pdo⊳\\trianglerightProcess each partition

5:

ℬp←∅\\mathcal\{B\}\_\{p\}\\leftarrow\\emptyset⊳\\trianglerightTrajectory buffer

6:foreach question

x∈𝒳px\\in\\mathcal\{X\}\_\{p\}do

7:

An←σ\(x,𝒜,p\)A\_\{n\}\\leftarrow\\sigma\(x,\\mathcal\{A\},p\)⊳\\trianglerightAssign agent

8:

\(𝒯base,y^base\)←An\(x\)\(\\mathcal\{T\}\_\{\\mathrm\{base\}\},\\hat\{y\}\_\{\\mathrm\{base\}\}\)\\leftarrow A\_\{n\}\(x\)⊳\\trianglerightBaseline trajectory without retrieval

9:

sbase←Eval\(y^base,x\)s\_\{\\mathrm\{base\}\}\\leftarrow\\textsc\{Eval\}\(\\hat\{y\}\_\{\\mathrm\{base\}\},x\)⊳\\trianglerightReference score for marginal utility

10:if

sbase≥θs\_\{\\mathrm\{base\}\}\\geq\\thetathen

11:

ℬp←ℬp∪\{𝒯base\}\\mathcal\{B\}\_\{p\}\\leftarrow\\mathcal\{B\}\_\{p\}\\cup\\\{\\mathcal\{T\}\_\{\\mathrm\{base\}\}\\\}
12:endif

13:Sample

\{t1,…,tT\}⊆\{1,…,\|𝒯base\|\}\\\{t\_\{1\},\\dots,t\_\{T\}\\\}\\subseteq\\\{1,\\dots,\|\\mathcal\{T\}\_\{\\mathrm\{base\}\}\|\\\}uniformly at random⊳\\trianglerightBranching points

14:for

t∈\{t1,…,tT\}t\\in\\\{t\_\{1\},\\dots,t\_\{T\}\\\}do⊳\\trianglerightRoll\-in to stepttof𝒯base\\mathcal\{T\}\_\{\\mathrm\{base\}\}

15:

ht←\(τ1,…,τt\)h\_\{t\}\\leftarrow\(\\tau\_\{1\},\\dots,\\tau\_\{t\}\)from

𝒯base\\mathcal\{T\}\_\{\\mathrm\{base\}\}
16:

qt,\(dt\(1\),…,dt\(K\)\)←Retrieve\(x,ht,𝒟p−1,K\)q\_\{t\},\(d\_\{t\}^\{\(1\)\},\\\!\\dots,\\\!d\_\{t\}^\{\(K\)\}\)\\leftarrow\\textsc\{Retrieve\}\(x,h\_\{t\},\\mathcal\{D\}\_\{p\-1\},K\)
17:for

j∈ℐj\\in\\mathcal\{I\}do⊳\\trianglerightRoll\-out: one\-shot augmented generation per rank

18:

\(𝒯t\(j\),y^t\(j\)\)←An\(x∣ht,dt\(j\)\)\(\\mathcal\{T\}\_\{t\}^\{\(j\)\},\\hat\{y\}\_\{t\}^\{\(j\)\}\)\\leftarrow A\_\{n\}\(x\\mid h\_\{t\},d\_\{t\}^\{\(j\)\}\)
19:

st\(j\)←Eval\(y^t\(j\),x\)s\_\{t\}^\{\(j\)\}\\leftarrow\\textsc\{Eval\}\(\\hat\{y\}\_\{t\}^\{\(j\)\},x\)
20:if

st\(j\)≥θs\_\{t\}^\{\(j\)\}\\geq\\thetathen

21:

ℬp←ℬp∪\{𝒯t\(j\)\}\\mathcal\{B\}\_\{p\}\\leftarrow\\mathcal\{B\}\_\{p\}\\cup\\\{\\mathcal\{T\}\_\{t\}^\{\(j\)\}\\\}⊳\\trianglerightAdd successful trajectory

22:endif

23:

𝒮←𝒮∪\{\(qt,dt\(j\),st\(j\)−sbase\)\}\\mathcal\{S\}\\leftarrow\\mathcal\{S\}\\cup\\\{\(q\_\{t\},d\_\{t\}^\{\(j\)\},s\_\{t\}^\{\(j\)\}\-s\_\{\\mathrm\{base\}\}\)\\\}⊳\\trianglerightMarginal utility label

24:endfor

25:endfor

26:endfor

27:

𝒟p←IndexUpdate\(𝒟p−1,ℬp,f\)\\mathcal\{D\}\_\{p\}\\leftarrow\\textsc\{IndexUpdate\}\(\\mathcal\{D\}\_\{p\-1\},\\;\\mathcal\{B\}\_\{p\},\\;f\)⊳\\trianglerightChunk, embed, add

28:endfor

Algorithm[1](https://arxiv.org/html/2606.19911#alg1)describes formal procedure of incremental construction of MATM index and LTRT dataset for trajectory reranker training\.

## Appendix ILearning\-To\-Rank Features

Table[5](https://arxiv.org/html/2606.19911#A9.T5)shows the complete list of learning\-to\-rank features used across environments\.

CategoryFeatures \(\# features: 44\)Producer Agent Info \(\#: 13\)agent IDcontext\-windowagent benchmark scores \(11 features\):Artificial Analysis Intelligence \(AAI\) IndexGDPval\-AAτ2\\tau^\{2\}\-Bench TelecomTerminal\-Bench HardSciCodeAA\-LCRAA\-Omniscience AccuracyIFBenchHumanity’s Last Exam \(HLE\)GPQA\-DiamondCritPtConsumer Agent Info \(\#: 1\)agent ID1stStage Retrieval \(\#: 1\)1ststage retrieval scoreQuery Features \(\#: 2\)query lengthcurrent step numberTrajectory Features \(\#: 4\)retrieved chunk lengthnumber of steps in trajectorysuccess flagtrajectory lengthQuery–Trajectory Interaction Features \(\#: 23\)unigram text tfidf cosine similarityunigram goal tfidf cosine similarityunigram state tfidf cosine similarityunigram context tfidf cosine similaritybigram text tfidf cosine similaritybigram goal tfidf cosine similaritybigram state tfidf cosine similaritybigram context tfidf cosine similaritytext overlap ratiogoal overlap ratiostate overlap ratiocontext overlap ratiotext jaccard similaritygoal jaccard similaritystate jaccard similaritycontext jaccard similaritytext embedding similaritygoal embedding similaritystate embedding similaritycontext embedding similaritytask matchtask variation matchstep number difference

Table 5:Features used for Learning\-To\-Rank Trajectories \(LTRT\)\.
## Appendix JFeature Importance Test of the Trained Learning\-To\-Rank Trajectories \(LTRT\) Model

Table 6:Top feature rankings for LTRT model on each benchmark\. SVMRank for ALFWorld and FFN for WebArena\. For SVMRank, we report the feature importance as the weight learnt for SVMRank\. For FFN, we compute feature importance by removing the feature and measuring the drop in NDCG@10 score while training the LTRT reranker\.Table[6](https://arxiv.org/html/2606.19911#A10.T6)shows the top ten most important features for both benchmarks\.

## Appendix KExtended Results of Section §[5\.3](https://arxiv.org/html/2606.19911#S5.SS3)

### K\.1Formalism

We formalize the analysis of §[5\.3](https://arxiv.org/html/2606.19911#S5.SS3)as follows\. With some abuse of notation, let𝒫\\mathcal\{P\}denote the set of producer agents and𝒞\\mathcal\{C\}the set of consumer agents, with𝒳\\mathcal\{X\}the set of evaluation tasks\. For a consumerc∈𝒞c\\in\\mathcal\{C\}, letμ0\(c\)∈\[0,1\]\\mu\_\{0\}\(c\)\\in\[0,1\]denote its average final score on𝒳\\mathcal\{X\}without retrieval, and letμr\(p,c\)∈\[0,1\]\\mu\_\{r\}\(p,c\)\\in\[0,1\]denote its average final score on episodes where producerp∈𝒫p\\in\\mathcal\{P\}appeared among the retrieved source models\. Theretrieval advantageof the pair\(p,c\)\(p,c\)is

μα\(p,c\)=μr\(p,c\)−μ0\(c\),\\mu\_\{\\alpha\}\(p,c\)=\\mu\_\{r\}\(p,c\)\-\\mu\_\{0\}\(c\),the gain in consumer success rate attributable to retrieving from producerpprelative to that consumer’s own no\-retrieval baseline\.

Each agenti∈𝒫∪𝒞i\\in\\mathcal\{P\}\\cup\\mathcal\{C\}has a standalone capabilityκ\(i\)\\kappa\(i\), measured by its aggregated Artificial Analysis Intelligence Index score\(artificialanalysis2026\)\. Thecapability gapof a producer\-consumer pair is

κα\(p,c\)=κ\(p\)−κ\(c\),\\kappa\_\{\\alpha\}\(p,c\)=\\kappa\(p\)\-\\kappa\(c\),which is positive when the producer is stronger than the consumer in standalone capability, zero when they are matched, and negative when the producer is weaker\.

![Refer to caption](https://arxiv.org/html/2606.19911v1/x6.png)

![Refer to caption](https://arxiv.org/html/2606.19911v1/x7.png)

Figure 5:Retrieval Advantage vs\. Producer\-Consumer Capability Gap for ALFWorld \(top\) and WebArena \(bottom\)\.
### K\.2Full results

Figure[5](https://arxiv.org/html/2606.19911#A11.F5)showsμα\\mu\_\{\\alpha\}plotted againstκα\\kappa\_\{\\alpha\}for ALFWorld and WebArena, with the best\-performing reranker for each environment shown alongside the corresponding single\-stage baseline\. Across both benchmarks and both retrieval settings, the Pearson correlation betweenκα\\kappa\_\{\\alpha\}andμα\\mu\_\{\\alpha\}is small and not statistically significant:r=\+0\.04r=\+0\.04\(p=0\.49p=0\.49\) for ALFWorld single\-stage retrieval,r=\+0\.09r=\+0\.09\(p=0\.09p=0\.09\) for ALFWorld with SVMRank reranking,r=\+0\.03r=\+0\.03\(p=0\.79p=0\.79\) for WebArena single\-stage retrieval, andr=\+0\.08r=\+0\.08\(p=0\.35p=0\.35\) for WebArena with FFN reranking\.

Reranking consistently lifts the retrieval advantage distribution\. On ALFWorld, the meanμα\\mu\_\{\\alpha\}rises from\+0\.05\+0\.05under single\-stage retrieval to\+0\.1\+0\.1under SVMRank, and the fraction of pairs withμα\>0\\mu\_\{\\alpha\}\>0rises from51%51\\%to61%61\\%\. On WebArena, the conditional means follow the same pattern: pairs where the producer is stronger than the consumer show a meanμα\\mu\_\{\\alpha\}of\+0\.02\+0\.02under both single\-stage and FFN reranking, while the overall distribution shifts upward under reranking\. The correlation betweenκα\\kappa\_\{\\alpha\}andμα\\mu\_\{\\alpha\}also roughly doubles under reranking in both environments, fromr≈0\.03r\\approx 0\.03tor≈0\.08r\\approx 0\.08\. While neither correlation reaches statistical significance, their consistency across two independent benchmarks and two retrieval settings indicates a real but small structural effect: reranking incorporates producer\-capability information as one signal among many, consistent with the feature importance analysis in §[5\.2](https://arxiv.org/html/2606.19911#S5.SS2)\.

## Appendix LSection §[5\.4](https://arxiv.org/html/2606.19911#S5.SS4)Supplement

For ALFWorld, we adopt the six task types defined in the original benchmark; for WebArena, the 241 task intents\. Because WebArena’s task space is fine\-grained, some test tasks have no same\-type candidates in the index; we exclude such tasks from the WebArena experiments, leaving 47 tasks for this analysis\. All results use the best\-performing reranker per environment: SVMRank for ALFWorld and FFN for WebArena\. RPP in this section is computed relative to full retrieval, so negative RPP values indicate underperformance relative to the full\-scope condition\.

## Appendix MLanguage Model Prompts

### M\.1Retrieval Planner Prompt

SYSTEMYou are a retrieval planner for an agent\. Decide if a new retrieval of a successful trajectory is needed now\. Consider the goal, recent observations, and the current retrieved trajectory context\. Focus on identifying if the current situation is similar to the retrieved trajectory\. If the current situation and recent history is similar to the retrieved trajectory, return \[NoRetrieval\]\. If the current situation and recent history is not similar to the retrieved trajectory, return \[Retrieval\]\.

USER\{ "goal": "<goal\_text\>", "current\_observation": "<observation\>", "recent\_history": "<recent\_history\_str\>", "current\_retrieved\_context": "<trajectory\_context\>" \} Return only one of these exact tags: \[Retrieval\] or \[NoRetrieval\]\. Do not include any other text\.

### M\.2ALFWorld Baseline \(no\-retrieval\) Prompt

Note that the ‘ONE\-SHOT EXAMPLE’ used in this section is an illustrative example; the prompts are adjusted based on the task\.

SYSTEMInteract with a household to solve a task\. Imagine you are an intelligent agent in a household environment and your target is to perform actions to complete the task goal\. At the beginning of your interactions, you will be given the detailed description of the current environment and your goal to accomplish\. For each of your turn, you will be given the observation of the last turn\. Think step by step about what you observe and what action to take next\. The available actions are: 1\. go to recep 2\. take obj from recep 3\. put obj in/on recep 4\. open recep 5\. close recep 6\. toggle obj recep 7\. clean obj with recep 8\. heat obj with recep 9\. cool obj with recep where obj and recep correspond to objects and receptacles\. After your each turn, the environment will give you immediate feedback based on which you plan your next few steps\. if the envrionment output "Nothing happened", that means the previous action is invalid and you should try more options\. Reminder: 1\. The action must be chosen from the given available actions\. Any actions except provided available actions will be regarded as illegal\. 2\. Think when necessary, try to act directly more in the process\. RESPONSE FORMAT: For each turn, you must provide your response in JSON format with two fields: \- reasoning: MUST start with ‘Let’s think step by step\.’ followed by detailed reasoning about the situation and why this action is the best choice \- action: your chosen action \(must be exactly from admissible\_actions list\) CRITICAL RULES: 1\. Your action MUST be character\-for\-character identical to one item in admissible\_actions 2\. Do NOT modify, abbreviate, or paraphrase actions 3\. Do NOT use actions from retrieved trajectories unless they appear in current admissible\_actions 4\. If confused, pick a safe exploratory action like "look" or navigation ONE\-SHOT EXAMPLE: Task: clean some apple and put it in sidetable\. Turn 1: Observation: The fridge 1 is closed\. Response: \{"reasoning": "Let’s think step by step\. I should open fridge 1 to see what’s inside\.", "action": "open fridge 1"\} Turn 2: Observation: You open the fridge 1\. The fridge 1 is open\. In it, you see a lettuce 2, a mug 2, a potato 2, and a tomato 1\. Response: \{"reasoning": "Let’s think step by step\. I need to navigate to diningtable 1 to continue searching or complete the task\.", "action": "go to diningtable 1"\} Turn 3: Observation: On the diningtable 1, you see nothing\. Response: \{"reasoning": "Let’s think step by step\. I need to navigate to diningtable 2 to continue searching or complete the task\.", "action": "go to diningtable 2"\} \[TURNS SKIPPED FOR READABILITY\] Turn 11: Observation: On the sinkbasin 1, you see a cup 2, a egg 1, a mug 1, and a potato 3\. Response: \{"reasoning": "Let’s think step by step\. I need to clean this object as required by the task\.", "action": "clean apple 3 with sinkbasin 1"\} Turn 12: Observation: You clean the apple 3 using the sinkbasin 1\. Response: \{"reasoning": "Let’s think step by step\. Now I clean an apple \(3\)\. Next, I need to put it in/on sidetable 1\.", "action": "go to sidetable 1"\} Turn 13: Observation: On the sidetable 1, you see a cup 1, a lettuce 1, a peppershaker 3, a potato 1, and a saltshaker 1\. Response: \{"reasoning": "Let’s think step by step\. Now I have the object, I need to place it at the target location to complete the task\.", "action": "put apple 3 in/on sidetable 1"\}

USERGOAL: <goal\_text\> CURRENT STEP: <current\_step\> / <max\_steps\> \-\-\- RECENT HISTORY \(Previous Steps \- For Reference Only\) \-\-\- <recent\_history\_str\> \-\-\- End of Recent History \-\-\- \>\>\> CURRENT OBSERVATION \(Focus on This \- Current State\): <observation\> <<< End of Current Observation ADMISSIBLE ACTIONS \(<N\> total\): 1\. <action\_1\> 2\. <action\_2\> \.\.\. RESPONSE FORMAT: You MUST respond with valid JSON in this exact format: \{"reasoning": "Let’s think step by step\. \[your detailed reasoning\]", "action": "exact action from admissible\_actions"\} Where: \- reasoning: MUST start with ‘Let’s think step by step\.’ Then explain your thought process, what you observe, and why this action is best \- action: Must be EXACTLY one string from the admissible\_actions list above \(character\-for\-character match\) IMPORTANT: 1\. Your reasoning MUST begin with ‘Let’s think step by step\.’ 2\. Do not include any text before or after the JSON object\.

### M\.3ALFWorld Trajectory\-Augmented Prompt

SYSTEMSame asALFWorld Baseline \(no\-retrieval\) System Prompt

USERGOAL: <goal\_text\> CURRENT STEP: <current\_step\> / <max\_steps\> \-\-\- RECENT HISTORY \(Previous Steps \- For Reference Only\) \-\-\- <recent\_history\_str\> \-\-\- End of Recent History \-\-\- \-\-\- RETRIEVED TRAJECTORY GUIDANCE \(Reference Examples\) \-\-\- RETRIEVED TRAJECTORY: Task: <task\_description\> Retrieved successful trajectory sequence: Step 1: <action\_1\> Observation: <observation\_1\> Step 2: <action\_2\> Observation: <observation\_2\> \.\.\. Use this trajectory as a reference for your planning\. Consider: 1\. The sequence of actions that led to success 2\. The observations and their progression 3\. How to adapt this strategy to the current situation and your goal 4\. What steps might be different or similar in your current context \-\-\- End of Trajectory Guidance \-\-\- \>\>\> CURRENT OBSERVATION \(Focus on This \- Current State\): <observation\> <<< End of Current Observation ADMISSIBLE ACTIONS \(<N\> total\): 1\. <action\_1\> 2\. <action\_2\> \.\.\. RESPONSE FORMAT: You MUST respond with valid JSON in this exact format: \{"reasoning": "Let’s think step by step\. \[your detailed reasoning\]", "action": "exact action from admissible\_actions"\} Where: \- reasoning: MUST start with ‘Let’s think step by step\.’ Then explain your thought process, what you observe, and why this action is best \- action: Must be EXACTLY one string from the admissible\_actions list above \(character\-for\-character match\) IMPORTANT: 1\. Your reasoning MUST begin with ‘Let’s think step by step\.’ 2\. Do not include any text before or after the JSON object\.

### M\.4WebArena Baseline \(no\-retrieval\) Prompt

SYSTEMYou are an autonomous intelligent agent tasked with navigating a web browser\. You will be given web\-based tasks\. These tasks will be accomplished through the use of specific actions you can issue\. Here’s the information you’ll have: The user’s objective: This is the task you’re trying to complete\. The current web page’s accessibility tree: This is a simplified representation of the webpage, providing key information\. The current web page’s URL: This is the page you’re currently navigating\. The open tabs: These are the tabs you have open\. The previous action: This is the action you just performed\. It may be helpful to track your progress\. The actions you can perform fall into several categories: Page Operation Actions: ‘click \[id\]‘: This action clicks on an element with a specific id on the webpage\. ‘type \[id\] \[content\] \[press\_enter\_after=0\|1\]‘: Use this to type the content into the field with id\. By default, the "Enter" key is pressed after typing unless press\_enter\_after is set to 0\. ‘hover \[id\]‘: Hover over an element with id\. ‘press \[key\_comb\]‘: Simulates the pressing of a key combination on the keyboard \(e\.g\., Ctrl\+v\)\. ‘scroll \[direction=down\|up\]‘: Scroll the page up or down\. Tab Management Actions: ‘new\_tab‘: Open a new, empty browser tab\. ‘tab\_focus \[tab\_index\]‘: Switch the browser’s focus to a specific tab using its index\. ‘close\_tab‘: Close the currently active tab\. URL Navigation Actions: ‘goto \[url\]‘: Navigate to a specific URL\. ‘go\_back‘: Navigate to the previously viewed page\. ‘go\_forward‘: Navigate to the next page \(if a previous ’go\_back’ action was performed\)\. Completion Action: ‘stop \[answer\]‘: Issue this action when you believe the task is complete\. If the objective is to find a text\-based answer, provide the answer in the bracket\. Available Websites: You have access to the following websites: \- OneStopShop \(E\-commerce\): <SHOPPING\_URL\> \- Shopping Admin \(CMS\): <SHOPPING\_ADMIN\_URL\> \- Reddit \(Forum\): <REDDIT\_URL\> \- GitLab: <GITLAB\_URL\> \- Wikipedia: <WIKIPEDIA\_URL\> \- OpenStreetMap: <MAP\_URL\> RESPONSE FORMAT: For each turn, you must provide your response in JSON format with two fields: \- reasoning: Start with ’Let’s think step by step’ and then provide detailed step\-by\-step reasoning about the current situation, what you observe, and why the chosen action is the best next step \- action: your chosen action \(must be exactly from the available actions\) IMPORTANT: 1\. You should only issue an action that is valid given the current observation 2\. You should only issue one action at a time 3\. For each observation, think step by step about what you see and what action to take next 4\. Your action MUST be one of the valid action formats listed above Few\-shot examples and closing sentence\. Example 1: OBSERVATION: \[1744\] link ’HP CB782A\#ABA 640 Inkjet Fax Machine \(Renewed\)’ \[1749\] StaticText ’$279\.49’ \[1757\] button ’Add to Cart’ \[1760\] button ’Add to Wish List’ \[1761\] button ’Add to Compare’ URL: http://onestopmarket\.com/office\-products/office\-electronics\.html OBJECTIVE: What is the price of HP CB782A\#ABA 640 Inkjet Fax Machine \(Renewed\)? PREVIOUS ACTION: None Response: \{"reasoning": "Let’s think step by step\. I can see the product HP CB782A\#ABA 640 Inkjet Fax Machine \(Renewed\) is displayed on the page\. The price is clearly shown as $279\.49 in element \[1749\]\. Since the objective is to find the price, and I have found it, I should stop and provide the answer\.", "action": "stop \[$279\.49\]"\} Example 2: OBSERVATION: \[164\] textbox ’Search’ focused: True required: False \[171\] button ’Search’ URL: http://onestopmarket\.com OBJECTIVE: Search for USB\-C to USB\-B cable on the site\. PREVIOUS ACTION: None Response: \{"reasoning": "Let’s think step by step\. I need to search for USB\-C to USB\-B cable on this site\. I can see a search textbox with id \[164\] that is already focused\. I should type my search query into this textbox and press enter to execute the search\. I’ll use press\_enter\_after=1 to submit the search after typing\.", "action": "type \[164\] \[USB\-C to USB\-B cable\] \[1\]"\} Example 3: OBSERVATION: \[1124\] heading ’Certified Refurbished Kindle Paperwhite’ \[1136\] StaticText ’by Amazon’ \[1163\] StaticText ’Price: $79\.99’ \[1189\] button ’Add to Cart’ URL: http://onestopmarket\.com/dp/B00ZV 9RDKK OBJECTIVE: Show me the details of the Kindle Paperwhite\. PREVIOUS ACTION: click \[582\] Response: \{"reasoning": "Let’s think step by step\. I can see the Kindle Paperwhite product page with the heading, price \($79\.99\), and an Add to Cart button\. The objective is to show details of the product\. Currently, I can only see basic information\. To view more detailed information like specifications, features, or description, I should scroll down the page to see if there are additional details below\.", "action": "scroll \[down\]"\} Now, given the current observation, think step\-by\-step and provide your response\.

USERGOAL: <goal\_text\> CURRENT STEP: <current\_step\> / <max\_steps\> \-\-\- RECENT HISTORY \(Previous Steps \- For Reference Only\) \-\-\- <recent\_history\_str\> \-\-\- End of Recent History \-\-\- \>\>\> CURRENT OBSERVATION \(Focus on This \- Current State\): <observation\> <<< End of Current Observation CURRENT URL: <url\> RESPONSE FORMAT: You MUST respond with valid JSON in this exact format: \{"reasoning": "Let’s think step by step\. \[your detailed reasoning\]", "action": "action to be taken"\} Where: \- reasoning: MUST start with ‘Let’s think step by step\.’ Then explain your thought process, what you observe, and why this action is best \- action: action to be taken IMPORTANT: 1\. Your reasoning MUST begin with ‘Let’s think step by step\.’ 2\. Do not include any text before or after the JSON object\.

### M\.5WebArena Trajectory\-Augmented Prompt

SYSTEMSame asWebArena Baseline \(no\-retrieval\) System Prompt

USERGOAL: <goal\_text\> CURRENT STEP: <current\_step\> / <max\_steps\> \-\-\- RECENT HISTORY \(Previous Steps \- For Reference Only\) \-\-\- <recent\_history\_str\> \-\-\- End of Recent History \-\-\- \-\-\- RETRIEVED TRAJECTORY GUIDANCE \(Reference Examples\) \-\-\- RETRIEVED TRAJECTORY: Task: <task\_description\> Retrieved successful trajectory: Step 1: Action: <action\_1\> Observation: <observation\_1\> Step 2: Action: <action\_2\> Observation: <observation\_2\> \.\.\. Use this trajectory as a reference for your planning\. Consider: 1\. The sequence of actions taken 2\. How the agent navigated through the website 3\. What elements were clicked and in what order 4\. When the task was completed \-\-\- End of Trajectory Guidance \-\-\- \>\>\> CURRENT OBSERVATION \(Focus on This \- Current State\): <observation\> <<< End of Current Observation CURRENT URL: <url\> RESPONSE FORMAT: You MUST respond with valid JSON in this exact format: \{"reasoning": "Let’s think step by step\. \[your detailed reasoning\]", "action": "action to be taken"\} Where: \- reasoning: MUST start with ‘Let’s think step by step\.’ Then explain your thought process, what you observe, and why this action is best \- action: action to be taken IMPORTANT: 1\. Your reasoning MUST begin with ‘Let’s think step by step\.’ 2\. Do not include any text before or after the JSON object\.

## Appendix NDataset License

- •ALFWorld: MIT License
- •WebArena: Apache License 2\.0

## Appendix OComputational Budget

For retrieval from the MATM index, we use one NVIDIA L40S GPU\. For LLM inference in experiments, the OpenRouter API was used\. The total cost was approximately 2,000 USD\.

## Appendix PUse of AI Assistants

AI assistants were used for paraphrasing during paper writing and for simple implementation tasks during coding\. All outputs were thoroughly reviewed by the authors\.
Multi-Agent Transactive Memory

Similar Articles

AdMem: Advanced Memory for Task-solving Agents

Agentmemory

TMAS: Scaling Test-Time Compute via Multi-Agent Synergy

rohitg00/agentmemory

ActiveMem: Distributed Active Memory for Long-Horizon LLM Reasoning

Submit Feedback

Similar Articles

AdMem: Advanced Memory for Task-solving Agents
TMAS: Scaling Test-Time Compute via Multi-Agent Synergy
ActiveMem: Distributed Active Memory for Long-Horizon LLM Reasoning