
arXiv cs.AI Papers

Summary

This paper introduces MemToolAgent, a framework that enhances LLM agents' tool-using capabilities by integrating a memory system that stores and retrieves past experiences, achieving significant improvements on multiple benchmarks without requiring model fine-tuning.

arXiv:2606.07909v1 Announce Type: new

# MemToolAgent: Leveraging Memory for Tool Using Agents Based on Environment and User Feedback
Source: [https://arxiv.org/html/2606.07909](https://arxiv.org/html/2606.07909)
Suleyman Armagan Er\*,1,2, Danilo Ribeiro†,1, Yogesh Virkar1, Surafel Lakew1, Adi Kalyanpur1, James Gung1, Thomas Delteil1, Arshit Gupta1

1 AWS AI, 2 University of Washington

###### Abstract

Modern large language model \(LLM\) agents can use external tools to help users solve complex tasks\. However, for problems that require learning from long\-term historical events or from previous agent\-environment interactions, LLM agents are required to use memory mechanisms to store and retrieve experiences\. While sophisticated memory systems exist for dialogue agents, few studies have empirically examined how to improve agents’ tool\-using capabilities through past user\-agent conversations\. We propose MemToolAgent, a framework that improves tool use through memory management\. Our approach contains a memory extraction module that processes past experiences into structured memory entries, and a retrieval module that dynamically selects a subset of the stored memory entries\. This enables more personalized and accurate responses aligned with user preferences and feedback without requiring LLM fine\-tuning\. In summary, this work has three main contributions: \(1\) a unified memory entry format that improves both general\-purpose and personalized tool use without LLM fine\-tuning, \(2\) a reflection\-based memory extraction that uses environment and user feedback to distill wrong executions into critiques to store, and \(3\) a retrieval module that chooses how many past experiences to use based on the memory similarity distribution\. MemToolAgent achieves 29%, 80%, and 17% relative improvements compared to strong baselines on the WorkBench, NESTFUL, and PEToolBench benchmarks, respectively\.


\*Work done during internship at AWS AI

†Corresponding author: danilrib@amazon\.com

## 1 Introduction

Combining powerful LLMs with additional components like memory systems, workflow mechanisms, APIs, and other tools to create LLM agents is an emerging trend\. These agents are designed to solve complex tasks that require sequential reasoning and multi\-step planning\. In this context, tools are functions utilized by agents to interact with their environment\. Integrating LLMs with external tools and APIs allows these language models to leverage various systems, enabling end\-users to interact with those systems through a chat interface\. An LLM agent’s tool use ability, i\.e\., using appropriate tools for solving a task, is an important capability\.

Despite significant advancements in recent years, the fixed context windows of LLMs make it difficult to maintain coherence and thematic continuity during long real\-world conversations \(Tan et al\., [2025](https://arxiv.org/html/2606.07909#bib.bib11)\)\. When long sequences contain abundant irrelevant information, the attention mechanism loses its effectiveness, which is undesirable in critical domains such as education and healthcare\. As a result, integrating memory mechanisms into agents is crucial for long\-term interaction ability\. LLM agents need reliable external feedback to correct themselves \(Kamoi et al\., [2024](https://arxiv.org/html/2606.07909#bib.bib26)\), and memory can provide this feedback by storing and retrieving past interactions\. Using experiences allows agents to more accurately anticipate user needs and learn from mistakes\.

![Refer to caption](https://arxiv.org/html/2606.07909v1/use_cases_figure.png)

Figure 1: Integrating memory for two main tool use tasks

Integrating memory into an LLM agent supports two main types of tasks: personalized and general\-purpose tool use, as shown in Figure [1](https://arxiv.org/html/2606.07909#S1.F1)\. Different users may use the same tools in different ways, such as with different input parameters, or they may use different tools to achieve the same goal\. In scenarios where the agent has access to tools with similar functionalities, it requires user context to correctly execute the task\. Storing the past interactions in the memory allows for better personalization\. For instance, personalization can resolve tool\-level ambiguity when "create a ticket" could refer to either Zendesk ticket creation or Jira issue creation\. Regarding general\-purpose tool use, there are common error patterns in tool use agents, such as dependency violation, parameter hallucination, and premature tool calling\. Having past experiences in memory can help the agent use the correct parameters for tools and use the tools in the correct order\. For example, this is useful in scenarios where the agent utilizes the same set of tools for a group of users to solve generic tasks like forwarding emails\.

Our work offers three main contributions: \(1\) a unified memory entry format of storing past experiences for better general\-purpose and personalized tool use performance without LLM fine\-tuning, \(2\) an environment and user feedback\-based reflection approach for memory writing to distill failed past experiences into critiques, and \(3\) a retrieval module based on similarity distribution that dynamically selects relevant memory entries to guide the agent for the current task\.

## 2 Related Work

There has been a growing interest in memory management for LLM agents in recent years\. A related earlier idea is case\-based reasoning \(CBR\) \(Aamodt and Plaza, [1994](https://arxiv.org/html/2606.07909#bib.bib32)\), where systems retrieve and use similar past cases \(experiences\) to solve new problems\. MemGPT \(MemoryGPT\) \(Packer et al\., [2023](https://arxiv.org/html/2606.07909#bib.bib8)\) introduces an OS\-inspired memory hierarchy with two memory types called the main context and the external context\. More recent work includes graph\-based architectures to organize the information more flexibly and expressively\. In Mem0 \(Chhikara et al\., [2025](https://arxiv.org/html/2606.07909#bib.bib6)\), the memory is modeled as a directed knowledge graph to express complex relationships\. This allows dynamic and more advanced storage and retrieval\. A\-MEM \(Agentic Memory\) \(Xu et al\., [2025b](https://arxiv.org/html/2606.07909#bib.bib5)\) proposes a memory based on the Zettelkasten knowledge management system\. The memory entries are linked as interconnected networks, and the existing entries are updated based on new experiences\. Similarly, Zep \(Rasmussen et al\., [2025](https://arxiv.org/html/2606.07909#bib.bib9)\) and Cognitive Weave \(Vishwakarma et al\., [2025](https://arxiv.org/html/2606.07909#bib.bib10)\) are based on temporally aware knowledge graphs and Spatio\-Temporal Resonance Graphs, respectively\.

Reflexion agents \(Shinn et al\., [2023](https://arxiv.org/html/2606.07909#bib.bib33)\) turn task feedback into self\-reflections that are stored in an episodic memory for subsequent runs\. Reflective Memory Management \(RMM\) \(Tan et al\., [2025](https://arxiv.org/html/2606.07909#bib.bib11)\) uses a reinforcement learning based approach to rank memory entries based on their utilities in past executions\. PersonaAgent \(Zhang et al\., [2025a](https://arxiv.org/html/2606.07909#bib.bib4)\) combines episodic and semantic memory for an LLM agent with personalized memory that enables the agent to perform actions customized to the user\. Agent Workflow Memory \(Wang et al\., [2024b](https://arxiv.org/html/2606.07909#bib.bib13)\) stores workflows in the memory to guide the agent in web navigation tasks\. EXPEREPAIR \(Mu et al\., [2025](https://arxiv.org/html/2606.07909#bib.bib14)\) introduces an agent with a dual\-memory mechanism for code repair\. T1 \(Chakraborty et al\., [2025](https://arxiv.org/html/2606.07909#bib.bib12)\) uses a simple cache for agents to store tool call results; however, its main goal is to reuse previously accessed data\.

![Refer to caption](https://arxiv.org/html/2606.07909v1/overview_updated_v2.png)

Figure 2: MemToolAgent overview with a simple restaurant booking scenario where the agent retrieves similar memories, receives feedback on an invalid time format, and generates a reflection to update its memory

$Mem^{p}$ \(Fang et al\., [2025](https://arxiv.org/html/2606.07909#bib.bib19)\) is a procedural memory framework that extracts both fine\-grained instructions and high\-level abstractions from past agent trajectories\. Memento \(Zhou et al\., [2025](https://arxiv.org/html/2606.07909#bib.bib18)\) stores past experiences in an episodic memory and learns and continually updates a case\-selection policy based on environmental feedback\. Training\-Free GRPO \(Cai et al\., [2025](https://arxiv.org/html/2606.07909#bib.bib23)\) is a training\-free alternative to agentic RL that optimizes an experience library consisting of natural\-language experiential knowledge\. TOOLMEM \(Xiao et al\., [2025](https://arxiv.org/html/2606.07909#bib.bib20)\) stores memories of tool capabilities from previous interactions and helps the agent select the most suitable tool for a given task at inference\. LEGOMem \(Han et al\., [2025](https://arxiv.org/html/2606.07909#bib.bib21)\) is a procedural memory approach for multi\-agent systems that distills only successful task trajectories into memory entries with brief reflections\. ReMem agents \(Wei et al\., [2025](https://arxiv.org/html/2606.07909#bib.bib22)\) combine reasoning, action, and memory refinement and actively exploit, prune, and reorganize their memory while solving the current task\.

There are in\-context learning \(ICL\) approaches closely related to agents with memory\. For example, STE \(Wang et al\., [2024a](https://arxiv.org/html/2606.07909#bib.bib15)\) creates tool use trajectories offline through trial and error and stores them in a static memory for inference\. In\-Context Abstraction Learning \(ICAL\) \(Sarch et al\., [2024](https://arxiv.org/html/2606.07909#bib.bib25)\) creates memory abstractions from sub\-optimal trajectories based on human\-in\-the\-loop learning and self\-reflection\.

Unlike the related work, MemToolAgent addresses both personalized and general\-purpose tool use\. It proposes a general memory entry format and stores and leverages detailed information like tool trajectories and a reflection/critique about the execution\. Importantly, MemToolAgent also incorporates user feedback as a signal to improve the quality of memory entries\.

## 3 Method: MemToolAgent

MemToolAgent consists of two main components: the memory retrieval and memory extraction modules\. Figure [2](https://arxiv.org/html/2606.07909#S2.F2) illustrates the general overview of our proposed approach\. Memory entries, i\.e\., past experiences, are retrieved from the memory based on the semantic similarity between the query and the entries\. We propose Dynamic "n" Top\-n Embedding\-based Retrieval for this step\. Instead of retrieving a fixed number of memory entries for each query, the number of entries to retrieve is dynamically determined based on the cosine similarities between the embeddings of memory entries’ queries and the test query\. The retrieved entries are then used as in\-context examples to guide the agent for the current task\.

To add a new entry to the memory based on the current query, an LLM\-based "Memory Extraction" module provides a reflection/critique of the execution, which is then stored in the memory with the query and user feedback\.

MemToolAgent specifies what to store in memory with a general memory entry format\. Each memory entry $m$ is represented as a tuple $m = (q, a, f, r)$ where:

- $q$: the user query
- $a = \{a_1, a_2, \ldots, a_n\}$: the sequence of tool calls executed by the agent
- $f \in \{0, 1\}$: binary user feedback indicating task success \($f = 1$\) or failure \($f = 0$\)
- $r$: reflection or critique generated by the memory extractor when $f = 0$ \(otherwise $r = \emptyset$\)
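The tuple format above could be represented in code roughly as follows. This is a minimal sketch, not the authors' implementation: the class name `MemoryEntry`, its field names, the `to_prompt` layout, and the "Correct" label for successful entries are all assumptions made for illustration. The example values are taken from the WorkBench entry shown in Figure 3.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MemoryEntry:
    """One stored experience m = (q, a, f, r); field names are illustrative."""

    query: str                        # q: the user query
    tool_calls: List[str]             # a: ordered tool calls executed by the agent
    feedback: int                     # f: 1 = success, 0 = failure
    reflection: Optional[str] = None  # r: critique, expected only when f == 0

    def __post_init__(self) -> None:
        if self.feedback not in (0, 1):
            raise ValueError("feedback must be 0 or 1")

    def to_prompt(self) -> str:
        """Render the entry as text for in-context use (layout is an assumption)."""
        label = "Correct" if self.feedback == 1 else "Incorrect"
        lines = [
            f"Query: {self.query}",
            f"Answer (Tool Calls): {self.tool_calls}",
            f"User Feedback: {label}",
        ]
        if self.reflection:
            lines.append(f"Reflection: {self.reflection}")
        return "\n".join(lines)


entry = MemoryEntry(
    query="Was total visits more than 21 at any time in the last 1 weeks? "
          "If so, please plot it as a line chart",
    tool_calls=[
        'analytics.total_visits_count.func(time_min="2023-11-23", time_max="2023-11-30")',
        'analytics.create_plot.func(time_min="2023-11-23", time_max="2023-11-30", '
        'value_to_plot="total_visits", plot_type="line")',
    ],
    feedback=0,
    reflection="The agent created the line chart despite the condition not being met.",
)
print(entry.to_prompt())
```

Keeping the feedback as a plain integer mirrors the binary $f$ in the definition; a text or score feedback, which the paper mentions as a possible replacement, would only require changing that one field.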

With binary user feedback, we simulate the scenarios where the user does not know anything about the underlying tools/APIs and gives feedback based on the outcome\. However, this feedback could easily be replaced by any other type of feedback, such as text or a score\. Figure [3](https://arxiv.org/html/2606.07909#S3.F3) illustrates an example memory entry with reflection\.

Query: Was total visits more than 21 at any time in the last 1 weeks? If so, please plot it as a line chart

Answer \(Tool Calls\): `['analytics.total_visits_count.func(time_min="2023-11-23", time_max="2023-11-30")', 'analytics.create_plot.func(time_min="2023-11-23", time_max="2023-11-30", value_to_plot="total_visits", plot_type="line")']`

User Feedback: Incorrect

Reflection: The agent created the line chart despite the condition not being met\. The agent concluded correctly that the total visits did not exceed 21 at any point \(the highest was 19 visits on 2023\-11\-28\), but then proceeded to create the line chart anyway\. This contradicts the original instruction which specified to plot the data ONLY IF visits exceeded 21\.

Figure 3: Memory entry with negative feedback for WorkBench

### 3\.1 Memory Retrieval: Dynamic "n" Top\-n Embedding\-based Retrieval

Retrieving a fixed number of memory entries for every query is not optimal since some queries might not have relevant memory entries, while there might be many similar entries for other queries\. Prior work \(Taguchi et al\., [2025](https://arxiv.org/html/2606.07909#bib.bib30); Xu et al\., [2025c](https://arxiv.org/html/2606.07909#bib.bib31)\) in retrieval\-augmented generation \(RAG\) addresses the trade\-off between missing information with too small a k and adding redundancy with too large a k, where k is the number of retrieved documents\. These methods dynamically decide the number k and show that the optimal amount of retrieved context varies across queries\. Adaptive\-k \(Taguchi et al\., [2025](https://arxiv.org/html/2606.07909#bib.bib30)\) selects "k" at the largest gap between adjacent similarity scores\.
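To make the largest-gap idea concrete, the snippet below is a small sketch of the rule as it is described in the sentence above. It is not the reference Adaptive\-k implementation; the function name `largest_gap_k` and the example scores are made up for illustration.

```python
import numpy as np


def largest_gap_k(similarities):
    """Return k such that the top-k scores sit above the largest adjacent gap."""
    x = np.sort(np.asarray(similarities, dtype=float))[::-1]  # descending order
    if x.size < 2:
        return int(x.size)
    gaps = x[:-1] - x[1:]            # gap between each score and the next one
    return int(np.argmax(gaps)) + 1  # keep everything before the largest gap


# Illustrative scores: three close matches, then a clear drop.
scores = [0.91, 0.89, 0.88, 0.52, 0.50, 0.31]
print(largest_gap_k(scores))  # 3
```

Because only the single largest gap is considered, a curve with several moderate drops or with noise can move the cut\-off considerably, which is the weakness that motivates a smoothed alternative.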

Our Dynamic "n" approach has four main steps, illustrated in Algorithm [1](https://arxiv.org/html/2606.07909#alg1)\. The first two steps are calculating the cosine similarity values between the query and memory entries and sorting them in descending order\. The third step calculates a first\-order derivative estimate of the sorted similarities based on a sliding window approach, where $x$ is the array of sorted similarities and $R$ is the "radius" of the derivative estimate\. The final step is finding the peaks of this estimate, i\.e\., sudden drops in similarity, ideally showing a class boundary\. The selected peak corresponds to our “n” value for Dynamic "n" retrieval\. The entire process is inspired by the concept of finding “knees” for designing systems \(Satopaa et al\., [2011](https://arxiv.org/html/2606.07909#bib.bib17)\)\. The [`find_peaks`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks.html) function from the SciPy library is utilized for peak finding\.

Figure [4](https://arxiv.org/html/2606.07909#S3.F4) illustrates an example\. The blue line shows the sorted similarity values, while the red line shows the first derivative estimation\. As observed, the local maxima of the derivative estimate, i\.e\., the peaks, correspond to locations with a sudden drop in similarity\. Dynamic "n" retrieval has three parameters, namely R, prominence, and the peak number\. R is a smoothing parameter, and larger R values result in smoother derivative estimates\. Prominence is a parameter of the `find_peaks` function used to filter important peaks based on a comparison with neighboring values\. The peak number is simply which peak to select\.

Similarity curves can be noisy, with multiple knees and multi\-step drops rather than a single large gap\. We hypothesize that finding peaks based on the derivative estimate rather than raw similarity values results in a more robust approach than Adaptive\-k\. Dynamic "n" robustly finds points of change or cluster boundaries on noisy similarity curves, while Adaptive\-k finds the biggest drop and works well only when the curve has one clean knee\.

Algorithm 1: Dynamic "n" Retrieval

1: input: $q$ \(query\), $\mathcal{M}$ \(memory\), $R$ \(radius\), peak number, prominence

2: output: $n$ \(number of entries to select\)

3: $S \leftarrow [\text{CalculateSimilarity}(q, m) \text{ for } m \in \mathcal{M}]$

4: $x \leftarrow \text{SortDescending}(S)$

5: for $j = R$ to $|x| - R - 1$ do

6: $\quad y(j) \leftarrow \left[-\dfrac{\sum_{\tau=-R}^{R} \tau\, x(j+\tau)}{\sum_{\tau=-R}^{R} \tau^{2}}\right]$

7: end for

8: $\mathit{peaks} \leftarrow \text{find\_peaks}(y, \mathit{prominence})$

9: $n \leftarrow \mathit{peaks}[\mathit{peak\_number}]$

![Refer to caption](https://arxiv.org/html/2606.07909v1/dynamic_n_retrieval.png)

Figure 4: Descending similarity values \(blue\), derivative estimate \(red\), and the peaks
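Algorithm 1 could be sketched in Python as shown below, using NumPy and `scipy.signal.find_peaks`. Several details are assumptions that the text does not specify: the function name `dynamic_n`, the `fallback` value returned when the curve is too short or has no qualifying peak, and the choice to add `radius` to the peak index so that the result is expressed as a position in the sorted similarity array. The example curve and the small radius of 2 are made up to keep the illustration short; the paper reports using R = 10.

```python
import numpy as np
from scipy.signal import find_peaks


def dynamic_n(similarities, radius=10, prominence=1e-5, peak_number=0, fallback=1):
    """Pick how many entries to retrieve from the shape of the similarity curve."""
    x = np.sort(np.asarray(similarities, dtype=float))[::-1]  # step 2: descending
    if x.size < 2 * radius + 1:
        return min(fallback, x.size)  # assumption: too few entries for one window

    taus = np.arange(-radius, radius + 1)
    denom = float(np.sum(taus ** 2))
    # Step 3: negated slope estimate over a window of 2R + 1 points.
    # y[i] corresponds to position j = i + radius in x.
    y = np.array([
        -np.dot(taus, x[j - radius:j + radius + 1]) / denom
        for j in range(radius, x.size - radius)
    ])

    # Step 4: peaks of y mark sudden drops in similarity.
    peaks, _ = find_peaks(y, prominence=prominence)
    if peaks.size <= peak_number:
        return min(fallback, x.size)  # assumption: no qualifying peak found
    return int(peaks[peak_number]) + radius  # map back to an index in x


# Illustrative curve: 8 close matches, one in-between score, then 21 weak matches.
sims = (
    [0.90 - 0.001 * i for i in range(8)]
    + [0.65]
    + [0.40 - 0.001 * i for i in range(21)]
)
print(dynamic_n(sims, radius=2))  # 8
```

In this example the derivative estimate has a single prominent peak centered on the in\-between score, so the eight close matches are selected. Setting `peak_number=1` would instead select the second prominent peak when one exists, which matches the paper's description of the peak number as "which peak to select".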
### 3\.2 Memory Extraction Module

The agent needs to apply experience to the current task based on the user feedback and the context\. An important observation from our experiments is that agents struggle to effectively learn from retrieved negative experiences when they are presented as raw execution traces\. We hypothesize this is because the agent must simultaneously understand what went wrong in the past example while solving the current task\. To make the agent’s work easier, a "Memory Extraction" module is proposed\. It reflects on the context and user feedback to understand what was wrong\. For wrong executions, the reflection/critique is added to the memory instead of the raw context, which might be too complicated and contain irrelevant parts\. Doing the reflection as a separate step helps the agent focus only on the current task and learn from its mistakes\.

The memory extraction module is an LLM $\mathcal{E}$ that takes the following as input: \(1\) current query $q$, \(2\) set of available tools $\mathcal{T} = \{t_1, t_2, \ldots, t_m\}$, \(3\) agent’s response \(actions\) $a = \{a_1, a_2, \ldots, a_n\}$ including tool calls, \(4\) environment feedback $e$ including tool responses, \(5\) user feedback $f \in \{0, 1\}$, and \(6\) retrieved semantically similar memory entries $\mathcal{M}' \subset \mathcal{M}$\. The module outputs a reflection/critique $r = \mathcal{E}(q, \mathcal{T}, a, e, f, \mathcal{M}')$\.
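The interface $r = \mathcal{E}(q, \mathcal{T}, a, e, f, \mathcal{M}')$ might be wired up as in the sketch below. Everything here is hypothetical: the function names, the prompt wording, and the `llm` argument, which stands for any callable that maps a prompt string to a completion string. The choice to skip extraction when $f = 1$ follows the memory entry definition for the general\-purpose setting and is an assumption about control flow, not a description of the authors' code.

```python
from typing import Callable, Optional, Sequence


def build_extraction_prompt(
    query: str,
    tools: Sequence[str],
    actions: Sequence[str],
    env_feedback: str,
    user_feedback: int,
    similar_entries: Sequence[str],
) -> str:
    """Assemble the six inputs (q, T, a, e, f, M') into one prompt string."""
    verdict = "correct" if user_feedback == 1 else "incorrect"
    sections = [
        # Hypothetical instruction wording; the paper does not give its prompt here.
        "Explain what went wrong in this execution in a short critique.",
        f"Query: {query}",
        "Available tools:\n" + "\n".join(tools),
        "Agent actions:\n" + "\n".join(actions),
        f"Environment feedback: {env_feedback}",
        f"User feedback: {verdict}",
        "Similar past entries:\n" + ("\n---\n".join(similar_entries) or "(none)"),
    ]
    return "\n\n".join(sections)


def extract_reflection(
    llm: Callable[[str], str],
    query: str,
    tools: Sequence[str],
    actions: Sequence[str],
    env_feedback: str,
    user_feedback: int,
    similar_entries: Sequence[str],
) -> Optional[str]:
    """Return a critique for failed executions, or None for successful ones."""
    if user_feedback == 1:
        return None  # assumption: nothing to critique, so r stays empty
    prompt = build_extraction_prompt(
        query, tools, actions, env_feedback, user_feedback, similar_entries
    )
    return llm(prompt).strip()
```

Running the extraction as its own step, separate from the agent's task prompt, reflects the motivation given earlier in this section: the agent should not have to diagnose a past failure while solving the current task.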

Using retrieved memory entries that are semantically similar to the current user query is important\. The main use case for general\-purpose tool use is having different queries with common mistakes, which are only sometimes detected by the memory extractor\. A similar entry with a correct reflection could help the memory extractor produce a correct critique\. For personalized tool use, the memory extractor works slightly differently: semantically similar memory entries are used to find user patterns and preferences, and the module extracts high\-level user preferences, such as “User prefers tools under $API\_X$”\.

## 4 Evaluation

Memory can be applied to multiple problem settings in tool\-using agents\. Our experiments focus on two main tasks: \(1\) general\-purpose tool use and \(2\) personalization\. Three benchmarks, WorkBench \(Styles et al\., [2024](https://arxiv.org/html/2606.07909#bib.bib1)\), NESTFUL \(Basu et al\., [2024](https://arxiv.org/html/2606.07909#bib.bib16)\), and PEToolBench \(Xu et al\., [2025a](https://arxiv.org/html/2606.07909#bib.bib3)\), were used for evaluation\. WorkBench and NESTFUL are general\-purpose tool use benchmarks, while PEToolBench is a personalized tool orchestration benchmark\.

Table 1: MemToolAgent Results for General Purpose Tool Use Benchmarks

### 4\.1 Benchmarks

WorkBench is an outcome\-centric tool use benchmark that evaluates agents on workplace tasks such as sending emails and scheduling meetings\. It includes five databases and 27 tools\. WorkBench tasks are from different domains like Analytics, Calendar, etc\. The evaluation metric is accuracy, which is defined as the percentage of tasks where the outcome from the agent’s actions matches the expected outcome\. The correct outcome means the correct changes in the database values\.

WorkBench queries were created based on pre\-defined templates such as ’Delete an email from person X’ or ’Schedule a meeting with person Y’\. There are 10 queries per template\. Since the benchmark has no train\-test split, one query from each template is used to populate the memory\. This ensures that there is at least one directly relevant memory entry for each test query\. This setup is used to show the advantages of using the memory\. The test set consists of 108 Analytics, 99 Calendar, 72 CRM, 81 Email, and 189 Multi\-domain tasks\.
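The split described above can be checked with a little arithmetic. The sketch below assumes that, of the 10 queries per template, the one memory query is held out and the remaining 9 all go to the test set; the text implies this but does not state it explicitly. The variable names are illustrative.

```python
# Test-set sizes per domain, as listed in the text above.
TEST_TASKS = {"Analytics": 108, "Calendar": 99, "CRM": 72, "Email": 81, "Multi-domain": 189}

QUERIES_PER_TEMPLATE = 10  # stated in the text
MEMORY_PER_TEMPLATE = 1    # one query per template populates the memory
# Assumption: every remaining query of a template is used for testing.
TEST_PER_TEMPLATE = QUERIES_PER_TEMPLATE - MEMORY_PER_TEMPLATE

templates = {}
for domain, n_test in TEST_TASKS.items():
    # Each domain should split evenly into templates if the assumption holds.
    assert n_test % TEST_PER_TEMPLATE == 0, domain
    templates[domain] = n_test // TEST_PER_TEMPLATE

total_test = sum(TEST_TASKS.values())
memory_size = sum(templates.values()) * MEMORY_PER_TEMPLATE

print(templates)
print(total_test, memory_size)
```

Under this assumption every domain size is divisible by 9, giving 61 templates in total, a memory of 61 entries, and 549 test tasks.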

The NESTFUL benchmark evaluates LLMs on nested sequences of tool calls, where the output of one tool call is used as input for the next tool call\. It includes 1891 test queries based on the MathQA \(Amini et al\., [2019](https://arxiv.org/html/2606.07909#bib.bib28)\) and StarCoder2\-Instruct \(Wei et al\., [2024](https://arxiv.org/html/2606.07909#bib.bib29)\) benchmarks\. The first 500 tasks are used as the test set, while the next 500 tasks are used to populate the memory\. The benchmark uses five evaluation metrics: F1 scores for tool and parameter generation, Partial Sequence Matching, Full Sequence Matching, and Win Rate\. Partial Sequence Matching measures how many predicted APIs with their argument\-value pairs match the gold API sequence, Full Sequence Matching evaluates whether the model produces the exact full sequence, and Win Rate assesses whether the predicted APIs lead to the gold answer when executed\.

PEToolBench is a personalized tool orchestration benchmark\. Each query comes with an instruction history consisting of 80\-100 past user queries, where each past query includes a binary rating \(1 for correct/desired execution, 0 otherwise\)\. For our experiments, this instruction history forms the memory, where each memory entry consists of a past query, its corresponding tool call, and binary user feedback\. Test queries are designed such that the agent must rely on user preferences from memory for correct tool execution\. A subset of 591 queries is used as the test set after filtering the queries with sufficiently long instruction histories \(at least 80 past queries\)\. This makes sure that the agent needs a memory rather than using the full history in context\. The benchmark uses two evaluation metrics: Tool Accuracy, which evaluates the model’s ability to call the appropriate tool based on user preferences, and Parameter Accuracy, which assesses whether the model generates the correct parameters\.

### 4\.2 Experiment Results

This section presents the main experiments with details, ablation studies demonstrating the contribution of individual MemToolAgent parts, and a cost analysis\.

#### 4\.2\.1 Experiment Settings

Claude 3\.5 Sonnet \(Anthropic, [2024](https://arxiv.org/html/2606.07909#bib.bib35)\) is used for PEToolBench results, while NESTFUL and WorkBench experiments use Claude 3\.7 Sonnet \(Anthropic, [2025](https://arxiv.org/html/2606.07909#bib.bib36)\)\. The Qwen3\-Embedding\-8B \(Zhang et al\., [2025b](https://arxiv.org/html/2606.07909#bib.bib34)\) embedding model is used for memory retrieval\. All experimental results are based on a single run\.

Dynamic "n" retrieval hyperparameters are manually tuned\. R is set to 10 and prominence is set to $10^{-5}$\. Empirically, using the second peak performs better for PEToolBench, while the first peak is the best choice for WorkBench and NESTFUL\.

#### 4\.2\.2 General Purpose Tool Use Experiments

Table [1](https://arxiv.org/html/2606.07909#S4.T1) compares MemToolAgent with the no\-memory, extended thinking, and ReAct \(Yao et al\., [2022](https://arxiv.org/html/2606.07909#bib.bib2)\) baselines on NESTFUL and WorkBench\. For NESTFUL, MemToolAgent nearly doubles the Full Sequence Matching performance \(from 15\.6% to 30\.4%\) compared to the no\-memory baseline, while the Win Rate increases from 51% to 70%\. Enabling extended thinking improves Win Rate but yields only a slight increase in Full Sequence Matching\. These results demonstrate that extra reasoning alone is not enough, and the model requires concrete examples from memory to properly follow the tool calling format\. MemToolAgent demonstrates significant performance improvements on WorkBench, with Analytics task accuracy increasing from 18% to 86% and overall accuracy rising from 57% to 85% compared to the no\-memory baseline\.
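As a worked check of the figures quoted in this paragraph, the snippet below computes the relative improvement over the no\-memory baseline for each pair of numbers. The helper name `relative_improvement` is illustrative. Note that these are not the 29% and 80% relative improvements quoted in the abstract, which are measured against the strong baselines; the values for those baselines are not quoted in the running text, so they are not reproduced here.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative improvement in percent: (improved - baseline) / baseline * 100."""
    return (improved - baseline) / baseline * 100.0


# (no-memory baseline, MemToolAgent) pairs quoted in the paragraph above, in percent.
reported = {
    "NESTFUL Full Sequence Matching": (15.6, 30.4),
    "NESTFUL Win Rate": (51.0, 70.0),
    "WorkBench Analytics accuracy": (18.0, 86.0),
    "WorkBench overall accuracy": (57.0, 85.0),
}

for name, (base, ours) in reported.items():
    print(f"{name}: {relative_improvement(base, ours):.1f}% relative improvement")
```

The Full Sequence Matching gain works out to about 94\.9%, consistent with the "nearly doubles" wording, and the overall WorkBench gain over the no\-memory baseline is about 49\.1%.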

#### 4\.2\.3 Personalized Tool Orchestration Experiments

Table 2: MemToolAgent Results for PEToolBench

Table [2](https://arxiv.org/html/2606.07909#S4.T2) shows MemToolAgent’s performance compared to state\-of\-the\-art memory architectures A\-MEM and Mem0 on PEToolBench\. Due to the cost of LLM calls for memory addition in these methods and in MemToolAgent with the memory extractor, the experiments used a 50\-query subset consisting of the first 50 queries from the 591\-query test set\. Top\-15 embedding\-based retrieval was used for A\-MEM and Mem0\. MemToolAgent achieves a Tool Accuracy of 0\.82, representing a 17% relative improvement over A\-MEM and Mem0\. This demonstrates that MemToolAgent is more effective at leveraging memories for selecting the appropriate tool based on user preferences\. All methods perform equally well at generating the appropriate parameters\.

![Refer to caption](https://arxiv.org/html/2606.07909v1/topk_vs_dynamic_comparison.png)

Figure 5: Ablation studies comparing fixed top\-k memory retrieval with the Dynamic top\-n approach

\(a\) Effect of memory extraction module on MemToolAgent Performance

\(b\) Comparison of Different Memory Types for WorkBench

Table 3: Ablation Studies for memory extraction module and Memory Types

### 4\.3 Ablation Studies

#### 4\.3\.1 Comparing Fixed Top\-k Retrieval with Dynamic Top\-n

Figure [5](https://arxiv.org/html/2606.07909#S4.F5) compares fixed top\-k with the Dynamic top\-n memory approach on 100 PEToolBench queries \(the first 100 queries from the 591\-query test set\)\. Dynamic top\-n retrieval finds a sweet spot for the optimal number of memory entries\. The figure also demonstrates that Dynamic top\-n outperforms the fixed top\-k method across different k values on the first 50 queries from the NESTFUL test set\. The results confirm our hypothesis that the optimal number of memory entries depends on the test query, and the proposed Dynamic top\-n approach moves us in that direction\.

#### 4\.3\.2 Effect of Memory Extraction Module

Tables [3\(a\)](https://arxiv.org/html/2606.07909#S4.T3.st1) and [3\(b\)](https://arxiv.org/html/2606.07909#S4.T3.st2) show the performance improvements achieved by incorporating the memory extraction module into MemToolAgent\. Due to cost and time constraints, PEToolBench uses 50 queries while NESTFUL and WorkBench use full test sets\. Using the memory extractor to store high\-level user preferences in memory improves Tool Accuracy in PEToolBench\. Furthermore, NESTFUL experiments with top\-5 retrieval show that adding reflection/critique to memory significantly boosts performance\. WorkBench experiments with top\-1 retrieval further demonstrate that reflections enable better learning from mistakes than raw execution context alone\.

#### 4\.3\.3 Hard Preference Guiding and Simple Ranking for PEToolBench

Table [4](https://arxiv.org/html/2606.07909#S4.T4) demonstrates the effectiveness of the simple ranker and Hard Preference Guiding through ablation studies\. The original approach for PEToolBench provides the agent with the entire instruction history, which is not optimal, as irrelevant queries can make the task more challenging\. Embedding\-based Retrieval\_v1 improves this by retrieving only the top\-10 semantically similar queries from memory\. Retrieval\_v2 builds on v1 by introducing a simple ranker that separates memory entries by rating \(1 vs\. 0\) in the system prompt to emphasize user preferences\. This is a small modification for personalization tasks\. Although it substantially increases the tool accuracy, there is a decrease in parameter accuracy\. This is due to binary feedback, which only provides correct/incorrect information: an entry can be rated 0 because the tool itself was wrong even though its parameters were correct, so negative feedback can lead the agent to treat correct parameters as incorrect\.

Error analysis revealed cases where the agent selected tools despite negative user preferences from the memory\. To address this, we introduce an alternative baseline called Hard Preference Guiding, which removes tools with a rating of 0 from the system prompt\. This is a substitute for the memory extraction module’s approach of providing user preferences as soft guidance\. It provides a cost\-efficient alternative to the LLM\-based memory extraction module, eliminating the need for additional LLM calls during memory addition\. Hard Preference Guiding extends MemToolAgent w/o memory extractor and improves Tool Accuracy from 0\.72 to 0\.84, as shown in Table [4](https://arxiv.org/html/2606.07909#S4.T4)\. Although less general than memory extraction, Hard Preference Guiding is effective for scenarios like PEToolBench and could be used as a complementary approach\.
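The simple ranker and Hard Preference Guiding could look roughly like the sketch below. All names are hypothetical, including the dictionary keys, the prompt layout, and the example tool names, which borrow the Zendesk/Jira ambiguity from the introduction. The text does not say whether the filter looks at the retrieved entries or the whole memory, nor how a tool with both positive and negative ratings is handled; this sketch filters on whatever entries it is given and removes any tool that has at least one rating of 0.

```python
from typing import Dict, List, Tuple, Union

Entry = Dict[str, Union[str, int]]  # assumed keys: "query", "tool", "rating"


def rank_by_rating(entries: List[Entry]) -> Tuple[List[Entry], List[Entry]]:
    """Simple ranker: separate entries into rating 1 and rating 0 groups."""
    preferred = [e for e in entries if e["rating"] == 1]
    rejected = [e for e in entries if e["rating"] == 0]
    return preferred, rejected


def format_preferences(entries: List[Entry]) -> str:
    """Render the two groups as separate system prompt sections (layout assumed)."""
    preferred, rejected = rank_by_rating(entries)
    lines = ["Preferred by the user (rating 1):"]
    lines += [f"- {e['query']} -> {e['tool']}" for e in preferred]
    lines.append("Rejected by the user (rating 0):")
    lines += [f"- {e['query']} -> {e['tool']}" for e in rejected]
    return "\n".join(lines)


def hard_preference_filter(tools: List[str], entries: List[Entry]) -> List[str]:
    """Hard Preference Guiding: drop tools that received a rating of 0."""
    rejected_tools = {e["tool"] for e in entries if e["rating"] == 0}
    return [t for t in tools if t not in rejected_tools]


history = [
    {"query": "create a ticket for the login bug", "tool": "jira.create_issue", "rating": 1},
    {"query": "open a ticket about billing", "tool": "zendesk.create_ticket", "rating": 0},
]
available = ["jira.create_issue", "zendesk.create_ticket"]

print(format_preferences(history))
print(hard_preference_filter(available, history))  # ['jira.create_issue']
```

The filter needs no LLM call, which is the cost advantage the paragraph above describes, but it only makes sense when a rating of 0 reliably reflects a disliked tool rather than a one\-off error.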

Table 4:MemToolAgent Ablation Results for PEToolBench
#### 4\.3\.4Experiments with Different Memory Entry Types

Table [3(b)](https://arxiv.org/html/2606.07909#S4.T3.st2) compares different memory entry formats for WorkBench. The "MemToolAgent w/ paraphrased queries and w/o memory extractor" setting populates memory with paraphrased versions of the queries normally used to generate memory entries, while ensuring they keep the same ground truth (GT). Everything else remains identical to MemToolAgent w/o memory extractor. Because the original memory entries and the test queries are similar to each other, this setting simulates a scenario where memory entries differ slightly in wording. Different query formulations can expand the agent’s exploration space, and paraphrased queries lead to better performance than the original queries on Multi-domain and Calendar tasks.

The "MemToolAgent w/ synthetic queries and w/o memory extractor" setting addresses the cold\-start scenario where no queries exist to populate the memory before inference\. Given the tool descriptions, an LLM generates synthetic queries, which are then executed and added to the memory\. Using synthetic queries improves results compared to the no\-memory baseline, except on multi\-domain tasks\. For multi\-domain tasks, it is challenging to generate queries that are both complex enough and sufficiently similar to the test set\. Nevertheless, this still demonstrates that synthetic queries can help avoid starting with an empty memory\.
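The cold-start procedure can be sketched in Python as below. The function names and the entry layout are hypothetical, and the LLM query generator and the agent are replaced by offline stand-ins so that the example runs; how the synthetic executions are rated before storage is not specified here and is left out.

```python
from typing import Callable, Dict, List


def seed_memory(
    tool_descriptions: List[str],
    generate_queries: Callable[[List[str]], List[str]],
    run_agent: Callable[[str], List[str]],
) -> List[Dict]:
    """Fill an empty memory with executions of synthetic queries."""
    memory = []
    for query in generate_queries(tool_descriptions):
        memory.append({"query": query, "answer": run_agent(query)})
    return memory


# Stand-ins for the LLM query generator and the agent (assumptions, not real APIs).
def fake_generator(tools: List[str]) -> List[str]:
    return [f"Please use {tool.split(':')[0]} for me" for tool in tools]


def fake_agent(query: str) -> List[str]:
    return [f"call({query!r})"]


seeded = seed_memory(
    ["email.send: send an email", "calendar.create_event: book a meeting"],
    fake_generator,
    fake_agent,
)
print(len(seeded), seeded[0]["query"])
```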

### 4.4 Token Usage Analysis

Table [5](https://arxiv.org/html/2606.07909#S4.T5) shows the total number of input and output tokens used by Claude 3.7 Sonnet for 50 NESTFUL test queries. Although MemToolAgent increases the total number of input tokens, its output-token count rises much less than it does for the Extended Thinking and ReAct baselines. This is an advantage because output tokens tend to be more expensive and contribute more to overall response latency.
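The cost argument can be made concrete with a small Python sketch. Every number in it is a placeholder chosen for illustration: the token counts are not taken from Table 5, and the per-million-token prices are not real provider prices.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Total cost, given prices expressed per million tokens."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


# Placeholder prices: output tokens cost five times as much as input tokens.
IN_PRICE, OUT_PRICE = 1.0, 5.0

# Hypothetical run A: many input tokens, few output tokens.
cost_a = request_cost(2_000_000, 100_000, IN_PRICE, OUT_PRICE)
# Hypothetical run B: fewer input tokens, many more output tokens.
cost_b = request_cost(1_000_000, 600_000, IN_PRICE, OUT_PRICE)
print(cost_a, cost_b)
```

Under these assumed prices, the run with twice the input tokens is still cheaper because its output is much shorter.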

Table 5: Total number of input and output tokens

## 5 Conclusion

MemToolAgent is a memory framework that enhances the tool\-using capabilities of LLM agents by using past experiences\. The results show significant improvements across multiple benchmarks in both personalized and general\-purpose tool use scenarios\. The main contributions of the framework include \(1\) a retrieval mechanism that dynamically determines the number of relevant memory entries to retrieve, \(2\) a memory extraction module that generates reflections on incorrect executions to help avoid similar mistakes, and \(3\) an overall effective approach for both general\-purpose and personalized tool use based on user and environment feedback\.

## Limitations

Despite the improvements in general-purpose and personalized tool use performance achieved by MemToolAgent, several limitations remain for future work. One limitation is the cold-start problem of memory. The benchmarks used contain fewer than 40 tools, whereas real-life applications typically have more. As the number of tools increases, it becomes harder to cover the usage of all tools with memory, which could make the cold-start problem more severe. Our experiments with synthetically generated queries show that even a simple approach can mitigate the cold-start problem.

Another limitation emerges in conversational scenarios with multiple turns. The benchmarks used in our experiments are single-turn. For single-turn tasks, memory retrieval is done once at the beginning of the conversation. This might not be optimal for multi-turn tasks where the agent needs to obtain the necessary information from the user across several turns. One possible solution is to extend MemToolAgent to retrieve from memory after each user turn, so that the agent can focus on the sub-task associated with that specific turn.
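The proposed per-turn extension could look like the Python sketch below. It is a sketch of a possible future design, not something evaluated in this work. Token-overlap (Jaccard) similarity stands in for an embedding model, and the memory entries and the `email.delete_email` tool name are hypothetical.

```python
from typing import Dict, List


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for an embedding model."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def retrieve(turn: str, memory: List[Dict], k: int = 1) -> List[Dict]:
    """Return the k memory entries whose query is most similar to this turn."""
    return sorted(memory, key=lambda e: jaccard(turn, e["query"]), reverse=True)[:k]


def retrieve_per_turn(user_turns: List[str], memory: List[Dict], k: int = 1) -> List[List[Dict]]:
    """Run retrieval once per user turn instead of once per conversation."""
    return [retrieve(turn, memory, k) for turn in user_turns]


memory = [
    {"query": "book a meeting with Raj tomorrow", "answer": ["calendar.create_event"]},
    {"query": "delete my last email from Nadia", "answer": ["email.delete_email"]},
]
turns = ["please book a meeting tomorrow", "also delete the email from Nadia"]
per_turn = retrieve_per_turn(turns, memory)
print([entries[0]["answer"] for entries in per_turn])
```

Each turn pulls the entry that matches its own sub-task, whereas a single retrieval at the start would only see the first turn.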

## Ethical Considerations

The proposed MemToolAgent approach has been implemented and evaluated using the publicly available NESTFUL, WorkBench, and PEToolBench benchmarks\.

## References

- A. Aamodt and E. Plaza (1994) Case-based reasoning: foundational issues, methodological variations, and system approaches. AI Communications 7 (1), pp. 39–59. Cited by: [§2](https://arxiv.org/html/2606.07909#S2.p1.1).
- A. Amini, S. Gabriel, S. Lin, R. Koncel-Kedziorski, Y. Choi, and H. Hajishirzi (2019) Mathqa: towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers), pp. 2357–2367. Cited by: [§4.1](https://arxiv.org/html/2606.07909#S4.SS1.p3.1).
- Anthropic (2024) Claude 3.5 sonnet. Note: Large language model. External Links: [Link](https://www.anthropic.com/news/claude-3-5-sonnet). Cited by: [§4.2.1](https://arxiv.org/html/2606.07909#S4.SS2.SSS1.p1.1).
- Anthropic (2025) Claude 3.7 sonnet. Note: Large language model. External Links: [Link](https://www.anthropic.com/news/claude-3-7-sonnet). Cited by: [§4.2.1](https://arxiv.org/html/2606.07909#S4.SS2.SSS1.p1.1).
- K. Basu, I. Abdelaziz, K. Kate, M. Agarwal, M. Crouse, Y. Rizk, K. Bradford, A. Munawar, S. Kumaravel, S. Goyal, et al. (2024) Nestful: a benchmark for evaluating llms on nested sequences of api calls. arXiv preprint arXiv:2409.03797. Cited by: [§4](https://arxiv.org/html/2606.07909#S4.p1.1).
- Y. Cai, S. Cai, Y. Shi, Z. Xu, L. Chen, Y. Qin, X. Tan, G. Li, Z. Li, H. Lin, et al. (2025) Training-free group relative policy optimization. arXiv preprint arXiv:2510.08191. Cited by: [§2](https://arxiv.org/html/2606.07909#S2.p3.1).
- A. Chakraborty, P. Dashore, N. Bathaee, A. Jain, A. Das, S. Zhang, S. Sahu, M. Naphade, and G. I. Winata (2025) T1: a tool-oriented conversational dataset for multi-turn agentic planning. arXiv preprint arXiv:2505.16986. Cited by: [§2](https://arxiv.org/html/2606.07909#S2.p2.1).
- P. Chhikara, D. Khant, S. Aryan, T. Singh, and D. Yadav (2025) Mem0: building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413. Cited by: [§2](https://arxiv.org/html/2606.07909#S2.p1.1).
- R. Fang, Y. Liang, X. Wang, J. Wu, S. Qiao, P. Xie, F. Huang, H. Chen, and N. Zhang (2025) Memp: exploring agent procedural memory. arXiv preprint arXiv:2508.06433. Cited by: [§2](https://arxiv.org/html/2606.07909#S2.p3.1).
- D. Han, C. Couturier, D. M. Diaz, X. Zhang, V. Rühle, and S. Rajmohan (2025) Legomem: modular procedural memory for multi-agent llm systems for workflow automation. arXiv preprint arXiv:2510.04851. Cited by: [§2](https://arxiv.org/html/2606.07909#S2.p3.1).
- R. Kamoi, Y. Zhang, N. Zhang, J. Han, and R. Zhang (2024) When can llms actually correct their own mistakes? a critical survey of self-correction of llms. Transactions of the Association for Computational Linguistics 12, pp. 1417–1440. Cited by: [§1](https://arxiv.org/html/2606.07909#S1.p2.1).
- F. Mu, J. Wang, L. Shi, S. Wang, S. Li, and Q. Wang (2025) EXPEREPAIR: dual-memory enhanced llm-based repository-level program repair. arXiv preprint arXiv:2506.10484. Cited by: [§2](https://arxiv.org/html/2606.07909#S2.p2.1).
- C. Packer, V. Fang, S. Patil, K. Lin, S. Wooders, and J. Gonzalez (2023) MemGPT: towards llms as operating systems. Cited by: [§2](https://arxiv.org/html/2606.07909#S2.p1.1).
- P. Rasmussen, P. Paliychuk, T. Beauvais, J. Ryan, and D. Chalef (2025) Zep: a temporal knowledge graph architecture for agent memory. arXiv preprint arXiv:2501.13956. Cited by: [§2](https://arxiv.org/html/2606.07909#S2.p1.1).
- G. Sarch, L. Jang, M. Tarr, W. W. Cohen, K. Marino, and K. Fragkiadaki (2024) Vlm agents generate their own memories: distilling experience into embodied programs of thought. Advances in Neural Information Processing Systems 37, pp. 75942–75985. Cited by: [§2](https://arxiv.org/html/2606.07909#S2.p4.1).
- V. Satopaa, J. Albrecht, D. Irwin, and B. Raghavan (2011) Finding a "kneedle" in a haystack: detecting knee points in system behavior. In 2011 31st international conference on distributed computing systems workshops, pp. 166–171. Cited by: [§3.1](https://arxiv.org/html/2606.07909#S3.SS1.p2.2).
- N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao (2023) Reflexion: language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems 36, pp. 8634–8652. Cited by: [§2](https://arxiv.org/html/2606.07909#S2.p2.1).
- O. Styles, S. Miller, P. Cerda-Mardini, T. Guha, V. Sanchez, and B. Vidgen (2024) Workbench: a benchmark dataset for agents in a realistic workplace setting. arXiv preprint arXiv:2405.00823. Cited by: [§4](https://arxiv.org/html/2606.07909#S4.p1.1).
- C. Taguchi, S. Maekawa, and N. Bhutani (2025) Efficient context selection for long-context qa: no tuning, no iteration, just adaptive-k. arXiv preprint arXiv:2506.08479. Cited by: [§3.1](https://arxiv.org/html/2606.07909#S3.SS1.p1.1).
- Z. Tan, J. Yan, I. Hsu, R. Han, Z. Wang, L. T. Le, Y. Song, Y. Chen, H. Palangi, G. Lee, et al. (2025) In prospect and retrospect: reflective memory management for long-term personalized dialogue agents. arXiv preprint arXiv:2503.08026. Cited by: [§1](https://arxiv.org/html/2606.07909#S1.p2.1), [§2](https://arxiv.org/html/2606.07909#S2.p2.1).
- A. Vishwakarma, H. Lee, M. Suresh, P. S. Sharma, R. Vishwakarma, S. Gupta, and Y. A. Chauhan (2025) Cognitive weave: synthesizing abstracted knowledge with a spatio-temporal resonance graph. arXiv preprint arXiv:2506.08098. Cited by: [§2](https://arxiv.org/html/2606.07909#S2.p1.1).
- B. Wang, H. Fang, J. Eisner, B. Van Durme, and Y. Su (2024a) Llms in the imaginarium: tool learning through simulated trial and error. arXiv preprint arXiv:2403.04746. Cited by: [§2](https://arxiv.org/html/2606.07909#S2.p4.1).
- Z. Z. Wang, J. Mao, D. Fried, and G. Neubig (2024b) Agent workflow memory. arXiv preprint arXiv:2409.07429. Cited by: [§2](https://arxiv.org/html/2606.07909#S2.p2.1).
- T. Wei, N. Sachdeva, B. Coleman, Z. He, Y. Bei, X. Ning, M. Ai, Y. Li, J. He, E. H. Chi, et al. (2025) Evo-memory: benchmarking llm agent test-time learning with self-evolving memory. arXiv preprint arXiv:2511.20857. Cited by: [§2](https://arxiv.org/html/2606.07909#S2.p3.1).
- Y. Wei, F. Cassano, J. Liu, Y. Ding, N. Jain, Z. Mueller, H. de Vries, L. Von Werra, A. Guha, and L. Zhang (2024) Selfcodealign: self-alignment for code generation. Advances in Neural Information Processing Systems 37, pp. 62787–62874. Cited by: [§4.1](https://arxiv.org/html/2606.07909#S4.SS1.p3.1).
- Y. Xiao, Y. Li, H. Wang, Y. Tang, and Z. Z. Wang (2025) ToolMem: enhancing multimodal agents with learnable tool capability memory. arXiv preprint arXiv:2510.06664. Cited by: [§2](https://arxiv.org/html/2606.07909#S2.p3.1).
- Q. Xu, Y. Li, H. Xia, F. Liu, M. Yang, and W. Li (2025a) PEToolLLM: towards personalized tool learning in large language models. arXiv preprint arXiv:2502.18980. Cited by: [§4](https://arxiv.org/html/2606.07909#S4.p1.1).
- W. Xu, K. Mei, H. Gao, J. Tan, Z. Liang, and Y. Zhang (2025b) A-mem: agentic memory for llm agents. arXiv preprint arXiv:2502.12110. Cited by: [§2](https://arxiv.org/html/2606.07909#S2.p1.1).
- Y. Xu, V. Gupta, R. Aggarwal, V. Mahadevan, and B. Krishnamachari (2025c) Cluster-based adaptive retrieval: dynamic context selection for rag applications. arXiv preprint arXiv:2511.14769. Cited by: [§3.1](https://arxiv.org/html/2606.07909#S3.SS1.p1.1).
- S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao (2022) React: synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629. External Links: [Document](https://doi.org/10.48550/arXiv.2210.03629). Cited by: [§4.2.2](https://arxiv.org/html/2606.07909#S4.SS2.SSS2.p1.1).
- W. Zhang, X. Zhang, C. Zhang, L. Yang, J. Shang, Z. Wei, H. P. Zou, Z. Huang, Z. Wang, Y. Gao, et al. (2025a) Personaagent: when large language model agents meet personalization at test time. arXiv preprint arXiv:2506.06254. Cited by: [§2](https://arxiv.org/html/2606.07909#S2.p2.1).
- Y. Zhang, M. Li, D. Long, X. Zhang, H. Lin, B. Yang, P. Xie, A. Yang, D. Liu, J. Lin, et al. (2025b) Qwen3 embedding: advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176. Cited by: [§4.2.1](https://arxiv.org/html/2606.07909#S4.SS2.SSS1.p1.1).
- H. Zhou, Y. Chen, S. Guo, X. Yan, K. H. Lee, Z. Wang, K. Y. Lee, G. Zhang, K. Shao, L. Yang, et al. (2025) Memento: fine-tuning llm agents without fine-tuning llms. arXiv preprint arXiv:2508.16153. Cited by: [§2](https://arxiv.org/html/2606.07909#S2.p3.1).

## Appendix A Appendix

Table 6: WorkBench Accuracy Results with Claude 4 Sonnet

You are an agent responsible for analyzing the INCORRECT execution of a tool-use agent.

Today’s date is Thursday, 2023-11-30 and the current time is 00:00:00.

Remember the current date and time when answering queries. Meetings must not start before 9am or end after 6pm.

The agent had access to these tools:

{TOOL DESCRIPTIONS}

Here are similar queries, their executions, and user feedback:

{SIMILAR ENTRIES FROM MEMORY}

For each query where user feedback indicates incorrect execution, reflect on what went wrong and use these insights.

The whole agent execution including tool calls and corresponding responses is:

{AGENT EXECUTION}

Check if the agent used a correct order of tools

Check if the agent hallucinated any tool parameters

Check if the agent used wrong values for tool parameters

Check if the agent only used the information from tools or directly from user.

Check if the agent had a correct temporal and conditional reasoning.

Before sending any error message, verify you’re not contradicting yourself

Remember this is a wrong response, there is at least one mistake in tool usage

Reflect on the query, response, and other similar queries to catch the mistake, try to keep your response short.

Figure 6: Memory extraction module’s system prompt for general purpose tool use (WorkBench)

Query: We’re moving Dakota Wilson to Raj. Can you make that change in the crm?

Answer: ['search_customers.func(customer_name="Dakota Wilson")', 'find_email_address.func(name="Raj")', 'update_customer.func(customer_id="00000076", field="assigned_to_email", new_value="raj.patel@atlas.com")']

User Feedback:

Execution: Correct

Figure 7: Memory entry with positive feedback for WorkBench

Test query:

I haven’t spoken to Quinn Thomas in a while. Can you check if it’s been over 14 days? If so, book a 30-minute meeting with whoever is assigned to them called 'Update on Quinn Thomas' at the first time I’m free tomorrow

Related memory entry:

Query: If we haven’t spoke to Cameron Anderson in the past fortnight book a 30-minute meeting with whoever is assigned to them called 'Update on Cameron Anderson' at the first time I’m free tomorrow

Answer: ['customer_relationship_manager.search_customers.func(customer_name="Cameron Anderson")', 'calendar.search_events.func(time_min="2023-12-01 09:00:00", time_max="2023-12-01 18:00:00")', 'calendar.create_event.func(event_name="Update on Cameron Anderson", participant_email="sofia.santos@atlas.com", event_start="2023-12-01 13:00:00", duration="30")']

User Feedback:

Execution: Correct

Incorrect response of the agent without memory:

['customer_relationship_manager.search_customers.func(customer_name="Quinn Thomas")', 'company_directory.find_email_address.func(name="Akira Sato")', 'company_directory.find_email_address.func(name="Akira")', 'calendar.create_event.func(event_name="Update on Quinn Thomas", participant_email="akira.sato@atlas.com", event_start="2023-12-01 09:00:00", duration="30")']

Correct response of the agent with memory:

['customer_relationship_manager.search_customers.func(customer_name="Quinn Thomas")', 'calendar.search_events.func(time_min="2023-12-01 09:00:00", time_max="2023-12-01 18:00:00")', 'calendar.create_event.func(event_name="Update on Quinn Thomas", participant_email="akira.sato@atlas.com", event_start="2023-12-01 13:00:00", duration="30")']

Figure 8: Example from WorkBench where a positive memory entry helps

Test query: Can you make a line chart of total visits since September 24 if it was more than 21 since at any time September 24?

Related memory entry:

Query: was total visits less than 3 at any time since September 13? If so, please plot it as a line chart

Answer: ['analytics.total_visits_count.func(time_min="2023-09-13", time_max="2023-11-30")', 'analytics.create_plot.func(time_min="2023-09-13", time_max="2023-11-30", value_to_plot="total_visits", plot_type="line")']

User Feedback: Incorrect

Reflection: The primary mistake was a logical error in interpreting the data and understanding the user’s question. The agent correctly retrieved the data showing no days had less than 3 visits but incorrectly reported October 1 as having exactly 3 visits (which is correct but irrelevant). It generated the plot as requested despite the condition "If so" not being met (no days had less than 3 visits). The agent should have clearly stated that the condition was not met.

Incorrect response of the agent without memory:

['analytics.total_visits_count.func(time_min="2023-09-24", time_max="2023-11-30")', 'analytics.create_plot.func(time_min="2023-09-24", time_max="2023-11-30", value_to_plot="total_visits", plot_type="line")']

Correct response of the agent with memory:

['analytics.total_visits_count.func(time_min="2023-09-24", time_max="2023-11-30")']

Figure 9: Example from WorkBench benchmark where a negative memory entry helps

Today’s date is Thursday, 2023-11-30 and the current time is 00:00:00. Remember the current date and time when answering queries. Meetings must not start before 9am or end after 6pm.

Here are similar queries, their executions, and user feedback:

{RETRIEVED MEMORIES}

For each query where user feedback indicates incorrect execution, reflect on what went wrong and use these insights to avoid similar mistakes. Consider these examples when solving the current task.

Figure 10: System Prompt for WorkBench

1\) Delete my last email from nadia → Remove the most recent email I received from Nadia

2\) All my emails from yuki from the last 3 days need to be deleted. Can you do that? → Remove Yuki’s emails that arrived in the previous three days

3\) carlos needs all the emails from chenwei last week about 'Update on Supply Chain Enhancement Workshop'. Can you forward them? → Could you forward to Carlos every email Chenwei sent last week about 'Update on Supply Chain Enhancement Workshop'?

Figure 11: Examples of original WorkBench email queries and their paraphrased forms

You are an expert system designed to generate realistic queries that are executable using available tools to create high-quality training data for tool-use agents.

\#\# Your Objectives:

Generate diverse, realistic user queries that can be solved using the available tools

\#\# Available Tools:

{TOOLS}

\#\# Generation Guidelines:

Today’s date is Thursday, 2023-11-30 and the current time is 00:00:00.

Remember the current date and time when answering queries. Meetings must not start before 9am or end after 6pm.

\#\#\# Query Diversity:

\- Include multi-tool scenarios

\- Create queries that require different reasoning patterns (sequential, parallel, conditional)

\#\#\# Query Quality Standards:

\- Queries should be realistic and represent actual user needs

\- Use natural language that real users would employ

\- Avoid overly artificial or contrived scenarios

\#\#\# Execution Requirements:

\- Follow proper tool calling conventions and error handling

\- Provide clear reasoning for tool selection and parameter choices

\- Each query should include at least one write operation

\- Queries should not give explicit information

Figure 12: System prompt for the synthetic WorkBench query generation

1\) Find emails from Lena about homepage alignment issues from October, delete them since that project is done, and send her an email about the new homepage design meeting scheduled for December 5th at 11am.

2\) Forward Fatima the latest quarterly tech roundup email from Dmitri. I want to make sure she’s up to speed before our meeting tomorrow.

3\) I need to reply to Santiago’s latest email about the UX feedback submission improvements. Tell him I’ve reviewed it and it looks great, but I’d like to discuss the project timeline adjustments he mentioned next week.

Figure 13: Example synthetic queries for WorkBench email domain

Your task is to use a tool that not only meets real-time user instructions but also aligns with user preferences.

These preferences can be inferred from the user’s interaction history provided below, including past user instructions and corresponding tool calls, with user’s binary ratings reflecting the user’s satisfaction with the tool calls.

Interaction history is:

instruction1: I want to search for the TV show 'Stranger Things' with an offset of 0 and limit suggestions to 20 and limit titles to 50

tool_call1:

{"tool_name": "<Data\>.<NetflixData\>.<search\>, "parameters": {"query": "stranger", "offset": 0, "limit_suggestions": 20, "limit_titles": 50}}, "rating": 1},

instruction2: .....

tool_call2: ....

.

.

.

instructionN: .....

tool_callN: ....

Available tools you can call with input parameters are listed here:

{TOOLS}

Remember:

Generate only the tool call in JSON format without any additional outputs.

You will now receive a user instruction; please generate the tool call.

Begin!

Figure 14: Original system prompt for PEToolBench

You are a preference analysis assistant that examines tool usage patterns across queries to identify user preferences based on binary ratings (1 = positive, 0 = negative). Your task is to analyze the current query and historical queries with their tool executions and binary ratings.

Current query, its execution and binary rating is:

{current_execution}

Historical queries, their executions and binary ratings:

{retrieved_memories}

Rules:

IMPORTANT: Only consider it a preference pattern if a domain, API or tool appears in at least TWO queries with positive ratings (1) or negative ratings (0))

\- Only report preferences where there is a clear pattern

\- If no clear preferences exist at any level, state "No clear preference identified"

\- Only consider the preferences related to the current query

Output your analysis in the following format:

\*\*\*PREFERENCES\*\*\*:

\- Domain Preferences: [List domains that show consistent positive or negative ratings]

\- API Preferences: [List APIs that show consistent positive or negative ratings]

Example Output

\*\*\*PREFERENCES\*\*\*:

\- Domain Preferences: <Sports\>

\- API Preferences: <Sports\>.<FootballData\>

\- Domain Nonpreferences (if there is any): <Health\>

\- API Nonpreferences (if there is any): <Health\>.<HeartRhythmData\>

Figure 15: Memory extraction module’s system prompt for PEToolBench

Domain Preferences: <Business\>

API Preferences: <Business\>.<Idealista_v2\>, <Business\>.<Bayut\>

Figure 16: Example memory extractor output for PEToolBench

Example Query: A piece of work can finish by a certain number of men in 100 days. If however, there were 10 men less, it would take 10 days more for the work to be finished. How many men were there originally?

Tool Calls Used:

[{"name": "multiply", "label": "\$var_1", "arguments": {"arg_0": 100, "arg_1": "x"}},

{"name": "subtract", "label": "\$var_2", "arguments": {"arg_0": "x", "arg_1": 10}},

{"name": "multiply", "label": "\$var_3", "arguments": {"arg_0": "\$var_2.result\$", "arg_1": 110}},

{"name": "divide", "label": "\$var_4", "arguments": {"arg_0": "\$var_1.result\$", "arg_1": 100}},

{"name": "divide", "label": "\$var_5", "arguments": {"arg_0": "\$var_3.result\$", "arg_1": 110}},

{"name": "subtract", "label": "\$var_6", "arguments": {"arg_0": "\$var_5.result\$", "arg_1": "\$var_4.result\$"}},

{"name": "divide", "label": "\$var_7", "arguments": {"arg_0": 1000, "arg_1": 10}},

{"name": "divide", "label": "\$var_8", "arguments": {"arg_0": "\$var_7.result\$", "arg_1": 10}}]

Reflection/Critic:

The agent’s approach to solve this work problem contains multiple errors. Let’s examine the problem and the agent’s solution:

Problem Understanding

We have a certain number of men (let’s call it x) who can finish work in 100 days

With 10 fewer men (x-10), the same work takes 110 days (100+10 days)

Correct Approach

The key insight is that work = men $\times$ days, and the total work is constant. So:

$x \times 100 = (x - 10) \times 110$

Errors in the Agent’s Solution:

Parameter Type Error: In the first function call, the agent uses "x" as a string instead of a numeric variable, which would cause execution issues.

Mathematical Logic Error: The agent’s sequence of operations doesn’t correctly set up or solve the equation. The proper approach requires solving the equation $x \times 100 = (x - 10) \times 110$.

Unnecessary Operations: The last two divisions with 1000 and 10 don’t relate to the problem’s solution and appear arbitrary.

Missing Final Calculation: The agent doesn’t actually solve for the original number of men.

Correct Solution Would Be:

Set up the equation: $x \times 100 = (x - 10) \times 110$

Expand: $100x = 110x - 1100$

Rearrange: $100x - 110x = -1100$

Simplify: $-10x = -1100$

Solve: $x = 110$

The original number of men was 110.

The agent failed to properly formulate and solve the algebraic equation needed for this problem.

Figure 17: Example memory entry for NESTFUL

Query: Two employees X and Y are paid a total of Rs. 440 per week by their employer. If X is paid 120 percent of the sum paid to Y, how much is Y paid per week?

You are a helpful assistant with access to the following function calls. Your task is to produce a sequence of function calls necessary to generate response to the user utterance. Here is a list of functions in JSON format that you can invoke: {TOOLS}

DO NOT try to answer the user question, just invoke the tools needed to respond to the user, if any. The output MUST strictly adhere to the following JSON format: [{"name": "func_name1", "arguments": {"argument1": "value1", "argument2": "value2"}, "label": "\$var_1"}, ... (more tool calls as required)]. Please make sure the parameter type is correct and follow the documentation for parameter format. If no function call is needed, please directly output an empty list.

Here are some examples and what was wrong with the agent response:

Example Query: Two employees X and Y are paid a total of Rs. 506 per week by their employer. If X is paid 120 percent of the sum paid to Y, how much is Y paid per week?

Tool Calls Used:

[{"name": "add", "label": "\$var_1", "arguments": {"arg_0": 1, "arg_1": 1.2}},

{"name": "divide", "label": "\$var_2", "arguments": {"arg_0": 506, "arg_1": "\$var_1.result\$"}},

{"name": "multiply", "label": "\$var_3", "arguments": {"arg_0": "\$var_2.result\$", "arg_1": 1}}]

Reflection/Critic: correct tool calls

========================================

Example Query: Rs. 385 were divided among X, Y, Z in such a way that X had Rs. 20 more than Y and Z had Rs 15 more than X. How much was Y’s share?

Tool Calls Used:

[{"name": "add", "label": "\$var_1", "arguments": {"arg_0": 20, "arg_1": 15}},

{"name": "subtract", "label": "\$var_2", "arguments": {"arg_0": 385, "arg_1": "\$var_1.result\$"}},

{"name": "divide", "label": "\$var_3", "arguments": {"arg_0": "\$var_2.result\$", "arg_1": 3}}]

Reflection/Critic: I need to analyze the incorrect tool execution for solving this problem about dividing Rs. 385 among X, Y, and Z with specific conditions.

The mistake in the agent’s execution:

The agent tried to solve the problem by:

1\. Adding 20 + 15 = 35

2\. Subtracting this from 385 to get 350

3\. Dividing 350 by 3

This approach is incorrect because it misunderstands the relationships between X, Y, and Z.

The correct approach should account for:

\- X has Rs. 20 more than Y (so X = Y + 20)

\- Z has Rs. 15 more than X (so Z = X + 15 = Y + 20 + 15 = Y + 35)

If we call Y’s share Y, then:

\- X = Y + 20

\- Z = Y + 35

The total is 385:

Y + (Y + 20) + (Y + 35) = 385

3Y + 55 = 385

3Y = 330

Y = 110

The agent should have:

1\. Added 20 + 15 = 35 (for the total of additional money beyond Y)

2\. Added 35 to Y to represent Z (not needed as a separate step)

3\. Realized that the equation is 3Y + 55 = 385

4\. Subtracted 55 from 385 to get 330

5\. Divided 330 by 3 to get Y = 110

The mistake was in not properly setting up the equation to account for all three shares in terms of Y before solving.

Use these insights to avoid similar mistakes. Consider these examples when solving the current task.

Figure 18: Example NESTFUL system prompt with a test query and retrieved memory entries
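The nested call format in the NESTFUL examples can be made concrete with a small Python interpreter sketch. It is not the benchmark's official executor: it assumes only the four arithmetic tools seen in the examples, positional arguments ordered by key name (`arg_0`, `arg_1`), and references of the form `$var_k.result$`. It replays the two tool-call sequences shown in the system prompt example.

```python
import operator
from typing import Any, Dict, List

# Only the arithmetic tools that appear in the examples are registered here.
TOOLS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
    "divide": operator.truediv,
}


def resolve(value: Any, results: Dict[str, Any]) -> Any:
    """Replace a "$var_k.result$" reference with the stored output of call k."""
    if isinstance(value, str) and value.startswith("$") and value.endswith(".result$"):
        label = value[: -len(".result$")]  # "$var_1.result$" -> "$var_1"
        return results[label]
    return value


def execute(calls: List[Dict[str, Any]]) -> Any:
    """Run the calls in order and return the output of the last one."""
    results: Dict[str, Any] = {}
    output = None
    for call in calls:
        args = [resolve(call["arguments"][key], results) for key in sorted(call["arguments"])]
        output = TOOLS[call["name"]](*args)
        results[call["label"]] = output
    return output


# Sequence marked "correct tool calls": Y = 506 / (1 + 1.2), approximately 230.
correct_calls = [
    {"name": "add", "label": "$var_1", "arguments": {"arg_0": 1, "arg_1": 1.2}},
    {"name": "divide", "label": "$var_2", "arguments": {"arg_0": 506, "arg_1": "$var_1.result$"}},
    {"name": "multiply", "label": "$var_3", "arguments": {"arg_0": "$var_2.result$", "arg_1": 1}},
]

# Sequence criticized in the reflection: it computes (385 - 35) / 3, not Y = 110.
wrong_calls = [
    {"name": "add", "label": "$var_1", "arguments": {"arg_0": 20, "arg_1": 15}},
    {"name": "subtract", "label": "$var_2", "arguments": {"arg_0": 385, "arg_1": "$var_1.result$"}},
    {"name": "divide", "label": "$var_3", "arguments": {"arg_0": "$var_2.result$", "arg_1": 3}},
]

print(execute(correct_calls), execute(wrong_calls))
```

Replaying the criticized sequence gives about 116.67 rather than 110, which is the discrepancy the stored reflection explains.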
