Learning to Retrieve: Dual-Level Long-Term Memory for Text-to-SQL Agents
Summary
This paper proposes MERIT, a dynamic multi-horizon memory retrieval framework for interactive text-to-SQL agents that uses episode-level and turn-level memory with learned retrieval policies optimized via reinforcement learning and a process reward model for dense rewards. Experiments on BIRD-Interact and Spider2-Snow show that MERIT outperforms static and single-horizon dynamic baselines in success rate while requiring fewer interaction turns.
# Dual-Level Long-Term Memory for Text-to-SQL Agents
Source: [https://arxiv.org/html/2606.00547](https://arxiv.org/html/2606.00547)
Yibo Wang¹, Nikki Lijing Kuang², Philip S. Yu¹, Zhewei Yao², Yuxiong He²
¹University of Illinois Chicago, ²Snowflake AI Research
{ywang633, psyu}@uic.edu, {nikki.kuang, zhewei.yao, yuxiong.he}@snowflake.com
###### Abstract
Interactive text-to-SQL agents solve database tasks through multi-turn interactions involving schema exploration, query execution, feedback interpretation, and decision revision. Long-term memory helps agents reuse past experiences, but existing retrieval methods remain limited. Static methods rely on fixed similarity heuristics that do not optimize downstream utility, while dynamic methods often learn from sparse final outcomes and retrieve memories at a single decision horizon. This is insufficient when memory usefulness changes across interaction stages, since memories useful for initial planning may differ from those needed for local, state-conditioned execution. We propose MERIT, a dynamic multi-horizon memory retrieval framework. MERIT maintains episode-level memory for global strategic guidance and turn-level memory for local decision support. Both levels use learned retrieval policies optimized with reinforcement learning. To train turn-level retrieval despite limited intermediate supervision, MERIT uses a lightweight Process Reward Model to provide dense proxy rewards for local memory selection. Experiments on BIRD-Interact show that MERIT outperforms no-memory, static-retrieval, and dynamic-retrieval baselines in success rate while reducing average interaction turns. Transfer results on Spider2-Snow further show positive cross-benchmark transfer without benchmark-specific tuning. These results suggest that multi-horizon retrieval improves experience reuse in interactive text-to-SQL agents.
## 1 Introduction
Figure 1: Conceptual overview and performance advantages of MERIT. (a) Conceptual overview of MERIT: MERIT performs horizon-aware retrieval by decoupling memory into episode-level (global strategy) and turn-level (local hints), using a PRM to provide the dense rewards required for turn-level policy learning. (b) Performance comparison with baselines: MERIT outperforms static and single-horizon dynamic baselines in success rates while requiring fewer interaction turns.

Large language models (LLMs) are increasingly deployed as multi-step interactive agents that act, observe, and refine their behavior to solve complex tasks (Li, [2025](https://arxiv.org/html/2606.00547#bib.bib11); Mohammadi et al., [2025](https://arxiv.org/html/2606.00547#bib.bib16); Xu et al., [2025](https://arxiv.org/html/2606.00547#bib.bib28); Yuan et al., [2025](https://arxiv.org/html/2606.00547#bib.bib33)). In text-to-SQL, this shift moves the task beyond single-pass semantic parsing. Instead of producing SQL in a single step, an agent may need to inspect schemas, execute candidate SQL queries, ask clarifying questions, observe feedback, and revise its plan over time. Thus, success depends not only on generating a valid final query but also on making reliable intermediate decisions under evolving context (Wang et al., [2025b](https://arxiv.org/html/2606.00547#bib.bib25); Shao et al., [2025](https://arxiv.org/html/2606.00547#bib.bib20)).
Long\-term memory provides a natural mechanism for improving such decisions\. Interactive text\-to\-SQL tasks often share recurring structures and solution patterns, including join templates, aggregation logics, clarification strategies, and debugging routines\. Reusing these experiences can help agents avoid redundant exploration and make informed decisions in new interactions\. However, existing memory\-augmented approaches commonly rely on static similarity heuristics\(Maharana et al\.,[2024](https://arxiv.org/html/2606.00547#bib.bib15); Sun et al\.,[2026](https://arxiv.org/html/2606.00547#bib.bib22); Salama et al\.,[2025](https://arxiv.org/html/2606.00547#bib.bib19); Yan et al\.,[2025](https://arxiv.org/html/2606.00547#bib.bib30); Kweon et al\.,[2025](https://arxiv.org/html/2606.00547#bib.bib9)\), such as dense embedding similarity, lexical matching, or hand\-designed schema heuristics\. These heuristics are useful for finding related candidates, yet do not directly optimize whether a retrieved memory will improve the next decision\.
Recent dynamic frameworks address this limitation by learning memory selection from downstream feedback\. For example, Memento\(Zhou et al\.,[2025](https://arxiv.org/html/2606.00547#bib.bib38)\)and MemRL\(Zhang et al\.,[2026](https://arxiv.org/html/2606.00547#bib.bib36)\)update retrieval policies or memory utilities based on final outcomes\. However, these methods mainly retrieve at the episode level, selecting full trajectories based on the initial request and learning from sparse terminal rewards\. This mismatches interactive text\-to\-SQL, where the agent’s state evolves through schema exploration, query execution, observations, and plan revision\. Thus, retrieval should match the decision horizon\. For example, a trajectory useful for initial planning may not help after a SQL error, where a local debugging pattern is more relevant\. Moreover, a final success or failure signal provides weak supervision for determining which memory was useful at a specific intermediate turn\.
Therefore, we propose a dynamic Multi-horizon mEmory Retrieval framework using Intermediate and Terminal rewards (MERIT). MERIT formulates memory retrieval as a learning problem across two decision horizons. As illustrated in Figure [1(a)](https://arxiv.org/html/2606.00547#S1.F1.sf1), the episode-level retriever selects past trajectories before an interaction begins, providing global strategic guidance for planning. The turn-level retriever selects compact state-action-observation memories during the interaction, providing local hints conditioned on the current state. This dual-level design separates long-range task strategy from short-range execution support.
MERIT uses a two-stage retrieval pipeline at both levels. An embedding retriever first generates a candidate set, and a learned selector then chooses which memories to expose to the agent. Both selectors are optimized with reinforcement learning (RL), shaping retrieval by downstream task utility rather than fixed similarity alone. The episode-level selector is trained with terminal rewards since its retrieval decision affects the overall trajectory. The turn-level selector requires denser supervision because intermediate turns do not have reliable direct outcome labels (Choudhury, [2025](https://arxiv.org/html/2606.00547#bib.bib3); Xi et al., [2025](https://arxiv.org/html/2606.00547#bib.bib26)). To address this credit-assignment challenge, MERIT introduces a lightweight Process Reward Model (PRM) that evaluates the utility of retrieved turn-level memories with a structured rubric. The resulting dense rewards enable RL for state-conditioned turn-level retrieval.
We evaluate MERIT on the interactive text-to-SQL benchmark BIRD-Interact (Huo et al., [2025](https://arxiv.org/html/2606.00547#bib.bib6)) for in-domain interactive performance and on Spider2-Snow (Lei et al., [2025](https://arxiv.org/html/2606.00547#bib.bib10)) for cross-benchmark transfer. As summarized in Figure [1(b)](https://arxiv.org/html/2606.00547#S1.F1.sf2), MERIT achieves higher success rates than no-memory, static-retrieval, and dynamic-retrieval baselines, while reducing the average number of interaction turns in the in-domain setting. On Spider2-Snow, MERIT achieves positive transfer gains without benchmark-specific training or tuning, suggesting that the learned retrieval policies capture reusable memory-selection behavior beyond the source benchmark.
Our contributions are summarized as follows:
- We propose MERIT, which aligns memory retrieval with decision horizon through episode-level global guidance and turn-level state-conditioned local support.
- We formulate memory retrieval as learned policies at both decision horizons, allowing memory selection to be optimized by downstream task utility rather than fixed similarity alone.
- We introduce a lightweight PRM that provides dense structured rewards for turn-level retrieval, mitigating the credit-assignment problem caused by sparse terminal outcomes.
- We evaluate MERIT in both in-domain and transfer settings, showing improved success rates, interaction efficiency, and positive cross-benchmark transfer.
## 2 Related Work
### 2.1 Text-to-SQL
Early neural text\-to\-SQL research typically formulates the task as single\-pass semantic parsing from a natural\-language question and database schema directly to SQL\(Zhong et al\.,[2017](https://arxiv.org/html/2606.00547#bib.bib37); Xu et al\.,[2017](https://arxiv.org/html/2606.00547#bib.bib29); Yu et al\.,[2018](https://arxiv.org/html/2606.00547#bib.bib32); Lin et al\.,[2020](https://arxiv.org/html/2606.00547#bib.bib12)\)\. Recent LLM\-based approaches move beyond one\-shot generation by introducing interaction, tool use, and iterative refinement\(Zhang et al\.,[2024](https://arxiv.org/html/2606.00547#bib.bib35); Tian et al\.,[2023](https://arxiv.org/html/2606.00547#bib.bib23); Xiong et al\.,[2024](https://arxiv.org/html/2606.00547#bib.bib27); Wang et al\.,[2025a](https://arxiv.org/html/2606.00547#bib.bib24)\)\. These studies show that text\-to\-SQL agents benefit from intermediate feedback and adaptive decision\-making\. We build on this interactive setting and study experience reuse through learned long\-term memory retrieval\.
### 2.2 Memory Systems in LLM Agents
Long\-term memory enables cross\-episode experience reuse\(Park et al\.,[2023](https://arxiv.org/html/2606.00547#bib.bib18); Lu et al\.,[2023](https://arxiv.org/html/2606.00547#bib.bib14)\), and prior work has explored long\-term memory writing and organization, including operating\-system\-like memory management\(Packer et al\.,[2023](https://arxiv.org/html/2606.00547#bib.bib17)\), semantic compression\(Liu et al\.,[2026](https://arxiv.org/html/2606.00547#bib.bib13)\), graph\-based memory representations\(Chhikara et al\.,[2025](https://arxiv.org/html/2606.00547#bib.bib2)\), trajectory distillation\(Fang et al\.,[2025](https://arxiv.org/html/2606.00547#bib.bib4); Yan et al\.,[2025](https://arxiv.org/html/2606.00547#bib.bib30)\), and structural memory representations\(Zeng et al\.,[2024](https://arxiv.org/html/2606.00547#bib.bib34)\)\. These studies improve the storage and representation of agent experience\. MERIT instead focuses on retrieving stored memories in the interactive setting\. Most memory\-augmented agents retrieve memories with fixed heuristics, such as dense embedding similarity, lexical matching, topic coverage, or predefined structural relations\(Arslan et al\.,[2024](https://arxiv.org/html/2606.00547#bib.bib1); Kweon et al\.,[2025](https://arxiv.org/html/2606.00547#bib.bib9); Kagaya et al\.,[2024](https://arxiv.org/html/2606.00547#bib.bib8); Zeng et al\.,[2024](https://arxiv.org/html/2606.00547#bib.bib34)\)\. These methods are effective for finding related candidates, yet their retrieval criteria are fixed and do not directly optimize downstream utility\. Recent dynamic methods learn memory utility from feedback\. Memory\-R1\(Yan et al\.,[2025](https://arxiv.org/html/2606.00547#bib.bib30)\)jointly optimizes memory management and retrieval with the downstream agent\. 
MemRL\(Zhang et al\.,[2026](https://arxiv.org/html/2606.00547#bib.bib36)\)updates memory Q\-values through runtime learning, and Memento\(Zhou et al\.,[2025](https://arxiv.org/html/2606.00547#bib.bib38)\)learns a memory\-selection policy without fine\-tuning base LLMs\. They move beyond static similarity, but primarily operate at the episode level and rely on sparse terminal feedback\. MERIT instead separates episode\-level retrieval for global strategy from turn\-level retrieval for local support, and trains the turn\-level retriever with dense PRM\-based rewards\.
## 3 Preliminary
We formulate interactive text-to-SQL as a sequential decision-making process. Given an initial natural-language request $x$, the agent interacts with a database environment and, when necessary, a user simulator. Unlike single-pass semantic parsing, the agent may take multiple actions before producing the final SQL query, including schema inspection, knowledge retrieval, SQL execution, clarification, and final submission.
At turn $t$, the agent has an interaction history

$$h_{t-1} = \big((a_1, o_1), (a_2, o_2), \ldots, (a_{t-1}, o_{t-1})\big),$$

where $a_t$ is the agent action and $o_t$ is the corresponding observation. The agent observes a state

$$s_t = (x, h_{t-1}, b_t),$$

where $b_t$ is the remaining interaction budget. Conditioned on $s_t$, the agent selects an action $a_t$, receives an observation $o_t$, and appends $(a_t, o_t)$ to the history. An episode yields a trajectory $\tau$ that ends at the terminal turn $T$. The episode terminates when the agent submits a final SQL query, exhausts its interaction budget, or reaches the maximum allowable number of turns. The final query is evaluated by execution correctness against the target SQL semantics.
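The interaction loop described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the names `State`, `run_episode`, `policy`, and `env_step`, as well as the convention that the environment returns an `(observation, cost, done)` triple, are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # one (action, observation) pair


@dataclass
class State:
    """s_t = (x, h_{t-1}, b_t) from the formulation above."""
    request: str                                      # x
    history: List[Step] = field(default_factory=list)  # h_{t-1}
    budget: int = 0                                   # b_t


def run_episode(
    request: str,
    policy: Callable[[State], str],
    env_step: Callable[[str], Tuple[str, int, bool]],
    budget: int,
    max_turns: int,
) -> List[Step]:
    """Roll out one episode and return its list of (a_t, o_t) pairs.

    Assumed (hypothetical) contract: env_step(action) returns the observation,
    the budget cost of the action, and whether the action was a final submission.
    """
    history: List[Step] = []
    for _ in range(max_turns):          # stop at the maximum number of turns
        if budget <= 0:                 # stop when the budget is exhausted
            break
        state = State(request, list(history), budget)
        action = policy(state)
        observation, cost, done = env_step(action)
        history.append((action, observation))
        budget -= cost
        if done:                        # stop on final SQL submission
            break
    return history
```

The three exit conditions in the loop mirror the three termination cases in the text: submission, budget exhaustion, and the turn limit.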
## 4 Data Augmentation
Learning retrieval policies requires diverse executable SQL\. Since high\-quality interactive text\-to\-SQL data is limited, we construct additional training data through an execution\-grounded augmentation pipeline\. Rather than perturbing natural\-language requests directly, we first synthesize executable SQL targets from seed examples and then generate user requests grounded in those targets\. This SQL\-first design reduces semantic drift and encourages executable SQL, faithful SQL\-query alignment, and structural diversity\.
Given a seed query, we apply constrained SQL\-level perturbations, including compatible column substitutions, numeric literal perturbations, sorting\-direction changes, and limit\-value modifications\. These operations diversify projections, grouping, thresholds, rankings, and result sizes while preserving executability\. For each augmented SQL target, an LLM generates a fully specified request, an intentionally underspecified request, and grounding metadata that links underspecified phrases to corresponding SQL fragments\. We discard examples whose generated text or grounding metadata cannot be verified against the SQL logic\. Each retained example therefore provides an executable text\-to\-SQL task with aligned SQL, requests, and grounding metadata for retrieval\-policy training\. Additional details and prompts are provided in Appendix[A](https://arxiv.org/html/2606.00547#A1)\.
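As a rough illustration of the SQL-level perturbations listed above, the sketch below applies a sorting-direction change, a limit-value modification, and a numeric literal perturbation with regular expressions, then checks executability against an SQLite database. The function names and the regex-based rewriting are assumptions made for this example. The paper does not specify how its perturbations are implemented, and a production pipeline would more plausibly operate on a parsed SQL tree.

```python
import re
import sqlite3
from typing import List


def flip_sort_direction(sql: str) -> str:
    """Swap ASC and DESC keywords (naive: ignores identifiers named asc/desc)."""
    def swap(match: re.Match) -> str:
        return "DESC" if match.group(0).upper() == "ASC" else "ASC"
    return re.sub(r"\b(ASC|DESC)\b", swap, sql, flags=re.IGNORECASE)


def change_limit(sql: str, new_limit: int) -> str:
    """Replace the value of every LIMIT clause."""
    return re.sub(r"\bLIMIT\s+\d+", f"LIMIT {new_limit}", sql, flags=re.IGNORECASE)


def scale_threshold(sql: str, factor: float) -> str:
    """Scale numeric literals that directly follow a comparison operator."""
    def repl(match: re.Match) -> str:
        value = float(match.group(2)) * factor
        return f"{match.group(1)} {value:g}"
    return re.sub(r"(<=|>=|<>|!=|<|>|=)\s*(\d+(?:\.\d+)?)", repl, sql)


def augment(seed_sql: str) -> List[str]:
    """Produce a few perturbed variants of a seed query."""
    return [
        flip_sort_direction(seed_sql),
        change_limit(seed_sql, 10),
        scale_threshold(seed_sql, 1.5),
    ]


def is_executable(sql: str, conn: sqlite3.Connection) -> bool:
    """Keep only variants that the database can actually run."""
    try:
        conn.execute(sql).fetchall()
        return True
    except sqlite3.Error:
        return False
```

Filtering each variant through `is_executable` reflects the execution-grounded aspect of the pipeline: a perturbation is only useful as a training target if the resulting query still runs.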
## 5 Methodology
Figure 2: Overview of the proposed dual-level long-term memory framework for interactive text-to-SQL. Episode-level memory is retrieved at episode start for global strategic guidance and updated at episode end, while turn-level memory is retrieved and updated per turn to provide state-conditioned local hints. Both levels employ a two-stage pipeline: embedding-based candidate generation followed by an RL-trained selector. A Process Reward Model (PRM), trained offline with SFT and GRPO, provides dense online rewards for turn-level retrieval, while episode-level retrieval is optimized with terminal task reward.

We present MERIT by first introducing the dual-level memory architecture that separates global strategic guidance from local state-conditioned support (§[5.1](https://arxiv.org/html/2606.00547#S5.SS1)). We then formulate memory retrieval at both levels as learned policies optimized by reinforcement learning (§[5.2](https://arxiv.org/html/2606.00547#S5.SS2)). Finally, we describe the Process Reward Model (PRM), which provides dense supervision for turn-level retrieval (§[5.3](https://arxiv.org/html/2606.00547#S5.SS3)). The complete architecture is shown in Figure [2](https://arxiv.org/html/2606.00547#S5.F2).
### 5.1 Dual-Level Long-Term Memory
MERIT maintains two long-term memory banks. Episode-level memory stores complete interactions, and turn-level memory stores compact state-action-observation snippets for local decisions.
#### Episode\-level memory\.
After each completed episode $i$, we write an episode-level memory

$$m_i^E = (k_i^E, v_i^E),$$

where $k_i^E$ is the embedding of user request $x_i$ and

$$v_i^E = (x_i, \tau_i, \ell_i, z_i, g_i)$$

stores the original request $x_i$, the full trajectory $\tau_i$, the terminal outcome label $\ell_i$, distilled episode insights $z_i$, and structured task-specific constraints or facts $g_i$ resolved during interaction. The distilled insights summarize reusable strategic knowledge, such as successful plans, common failure modes, task characteristics, and guidance for similar future tasks. The prompt for insights extraction is in Appendix [D.4](https://arxiv.org/html/2606.00547#A4.SS4).
For a new request $x$, we retrieve episode-level candidates by cosine similarity over the stored episode keys $\{k_i^E\}$. Conditioned on $x$ and compact candidate summaries, the episode-level selector chooses a subset of memories to inject into the initial agent prompt. These memories provide global guidance before the interaction begins.
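The first retrieval stage, candidate generation by cosine similarity over stored keys, can be sketched in a few lines. The function names and the tie-breaking rule (lower index first) are assumptions for illustration. The embedding model that produces the keys is treated as given, and the learned selector that filters these candidates is not shown.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_candidates(
    query_key: Sequence[float],
    memory_keys: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Return indices of the k stored keys most similar to the query key."""
    scored = [(cosine(query_key, key), i) for i, key in enumerate(memory_keys)]
    # Sort by descending similarity, breaking ties by insertion order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]
```

The same routine applies unchanged to the turn-level keys, since both memory banks are searched by cosine similarity before a selector is applied.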
#### Turn\-level memory\.
Turn-level memory is indexed by a compact state signature rather than the full interaction state. At turn $t$, we define the state signature

$$\sigma_t = g(s_t),$$

where $g(\cdot)$ maps the full state $s_t$ into a textual retrieval key. The signature contains the original request, the current query context, a coarse description of the previous action, and the previous observation. We keep the previous action coarse in the retrieval key so that similar decision states can match even when their exact SQL queries differ.
After agent turn $j$, we write a turn memory

$$m_j^T = (k_j^T, v_j^T),$$

where $k_j^T$ is the embedding of $\sigma_j$, and

$$v_j^T = (\sigma_j, \tilde{a}_j, \tilde{o}_j, \ell_j^T)$$

stores the state signature, an abstracted action representation $\tilde{a}_j$, a short observation snippet $\tilde{o}_j$, and an execution-status label $\ell_j^T$. The label records immediate action outcomes, such as invalid tool calls and SQL syntax errors, but does not capture whether the action contributes to broader task progress. The abstracted action may include normalized SQL skeletons or other tool-call details, allowing reusable local patterns to transfer across queries.
Before intermediate turn $t$, we retrieve turn-level candidates by cosine similarity over $\{k_j^T\}$. Conditioned on $\sigma_t$ and compact candidate summaries, the turn selector returns a subset of memories to inject as brief hints for the next agent action.
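One possible shape for the signature function $g(\cdot)$ is sketched below. Everything concrete here is an assumption made for illustration: actions are modeled as `(tool, argument)` pairs, the coarse action description is the tool name plus the leading SQL keyword, and the signature is a four-line text block. The paper only states which pieces of information the signature contains, not how they are serialized.

```python
from typing import List, Tuple

Action = Tuple[str, str]          # (tool name, argument such as a SQL string)
Turn = Tuple[Action, str]         # (action, observation)


def coarse_action(tool: str, argument: str) -> str:
    """Describe an action by tool and leading keyword, dropping the exact SQL."""
    words = argument.split()
    keyword = words[0].upper() if words else ""
    return f"{tool}:{keyword}" if keyword else tool


def state_signature(
    request: str,
    query_context: str,
    history: List[Turn],
    max_obs_chars: int = 200,
) -> str:
    """Build a textual retrieval key from the current state."""
    if history:
        (tool, argument), observation = history[-1]
        prev_action = coarse_action(tool, argument)
        prev_obs = observation[:max_obs_chars]   # keep the observation short
    else:
        prev_action, prev_obs = "none", "none"
    return "\n".join([
        f"request: {request}",
        f"context: {query_context}",
        f"prev_action: {prev_action}",
        f"prev_observation: {prev_obs}",
    ])
```

Because only the coarse action enters the key, two states that failed with the same error after different `SELECT` statements map to the same signature, which is the matching behavior the text motivates.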
### 5.2 Retrieval Policy Learning
MERIT learns retrieval policies that decide which candidate memories to expose to the agent. At both levels, similarity-based retrieval first obtains candidates, and then a learned selector chooses a subset for prompt injection. We optimize both selectors with Proximal Policy Optimization (PPO), using reward signals aligned with their respective decision horizons.
#### Episode\-level retrieval policy learning\.
Given an initial request $x$ and an episode-level candidate set $C^E(x)$, the episode-level selector samples a retrieval decision

$$d^E \sim \pi_\theta^E(\cdot \mid x, C^E(x)),$$

where $d^E$ specifies the selected memory subset $\hat{M}^E \subseteq C^E(x)$. The selected memories are injected before the interaction begins. After the agent completes trajectory $\tau$, the selector receives a terminal reward $R^E(\tau)$, derived from final execution correctness. We optimize $\pi_\theta^E$ with PPO to learn episode-level retrieval from downstream outcomes.
#### Turn\-level retrieval policy learning\.
Given a state signature $\sigma_t$ and a turn-level candidate set $C^T(\sigma_t)$, the turn-level selector samples

$$d_t^T \sim \pi_\phi^T(\cdot \mid \sigma_t, C^T(\sigma_t)),$$

where $d_t^T$ specifies the selected memory subset $\hat{M}_t^T \subseteq C^T(\sigma_t)$. The selected memories are injected as local hints before the next agent action. Turn-level retrieval cannot be trained directly from immediate environment rewards, because no reliable signal is available at that granularity: a memory selected at an intermediate turn may affect later actions, while its contribution may only be reflected in the final episode outcome. To mitigate this credit-assignment challenge, we use the PRM to estimate the utility of selected turn-level memories. The PRM scores each selected state-memory pair, and the turn-level reward is computed as

$$r_t^T = \frac{1}{|\hat{M}_t^T|} \sum_{m \in \hat{M}_t^T} f_{\mathrm{PRM}}(\sigma_t, m),$$

where $f_{\mathrm{PRM}}$ maps a state signature and a memory to a scalar utility score. We optimize $\pi_\phi^T$ with PPO using these dense rewards.
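The turn-level reward is simply the mean PRM score over the selected memories, which the following sketch makes concrete. The PRM is passed in as a plain callable, and the `empty_reward` value returned when no memory is selected is an assumption: the formula above is undefined for an empty subset and the text does not say how that case is handled.

```python
from typing import Callable, Sequence


def turn_reward(
    signature: str,
    selected: Sequence[str],
    prm_score: Callable[[str, str], float],
    empty_reward: float = 0.0,
) -> float:
    """Average f_PRM(sigma_t, m) over the selected memory subset."""
    if not selected:
        # Assumed convention: an empty selection yields a fixed reward.
        return empty_reward
    total = sum(prm_score(signature, memory) for memory in selected)
    return total / len(selected)
```

Averaging rather than summing keeps the reward on the same scale regardless of how many memories are selected, so the selector is not rewarded merely for injecting more hints.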
### 5.3 Process Reward Model for Turn-Level Supervision
While turn-level retrieval requires dense supervision, terminal rewards provide delayed, noisy feedback on intermediate choices. MERIT therefore uses a Process Reward Model (PRM) to estimate candidate memory utility given the current state.
#### Structured Utility Prediction\.
Given a state-memory pair $(\sigma_t, m)$, the PRM predicts a structured rubric

$$y_t = f_\psi(\sigma_t, m).$$

The rubric evaluates memory usefulness along seven dimensions: state match, actionability value, pattern generalizability, outcome reliability, clarity for agent, confidence in assessment, and overall utility. Each dimension contains a score in $[0, 5]$ and a brief rationale. The rubric also includes binary decisions on whether to inject the memory and whether to use it as a warning, along with concise textual feedback on how to use it. The full rubric schema and prompt are provided in Appendix [D.5](https://arxiv.org/html/2606.00547#A4.SS5).
For turn-level retrieval policy optimization, we use the trained PRM to score each state-memory pair. We denote the resulting scalar utility score as $f_{\mathrm{PRM}}(\sigma_t, m)$, which is extracted from the predicted rubric using the overall utility score and the injection decision. This score serves as the dense reward for turn-level PPO, while the structured rubric grounds the reward in interpretable aspects of memory usefulness.
#### Two\-Stage Training\.
We train the PRM from teacher\-annotated state\-memory pairs\. For each pair, a stronger teacher model produces the full rubric\. We first apply supervised fine\-tuning \(SFT\) to imitate the teacher rubrics, then refine the PRM with GRPO using a structured matching reward\.
Let $\hat{y}$ denote the PRM-generated rubric and $y^*$ denote the teacher rubric. If $\hat{y}$ is not a valid JSON object with all required fields, we assign

$$R_{\mathrm{PRM}}(\hat{y}, y^*) = -1.$$

Otherwise, we first compute a structured matching score $S_{\mathrm{PRM}}$, and then clip it to obtain the PRM training reward:

$$S_{\mathrm{PRM}} = 0.7 R_{\mathrm{dim}} + 0.2 R_{\mathrm{bin}} + 0.1 R_{\mathrm{text}} - 0.5\, \mathbb{I}_{\mathrm{extra}},$$

$$R_{\mathrm{PRM}}(\hat{y}, y^*) = \mathrm{clip}(S_{\mathrm{PRM}}, -1, 1),$$

where $\mathbb{I}_{\mathrm{extra}} = 1$ if the output contains text outside the JSON object, and $0$ otherwise.
The dimension-level reward compares the seven scored rubric dimensions. For each dimension $d$, let $\hat{s}_d, s_d^* \in [0, 5]$ be the predicted and teacher numeric scores, and let $\hat{r}_d, r_d^*$ be the corresponding short reasons. We define

$$R_{\mathrm{dim}} = \frac{1}{|\mathcal{D}|} \sum_{d \in \mathcal{D}} \Bigg[ 0.7 \cdot \max\!\left(0, 1 - \frac{|\hat{s}_d - s_d^*|}{5}\right) + 0.3 \cdot J(\hat{r}_d, r_d^*) \Bigg],$$

where $\mathcal{D}$ is the set of the seven rubric dimensions and $J(\cdot, \cdot)$ denotes Jaccard similarity between token sets. The binary term $R_{\mathrm{bin}}$ is the average exact-match accuracy over the two binary decisions, and $R_{\mathrm{text}}$ is the average similarity over the two free-text feedback fields. This reward encourages the PRM to produce valid structured outputs that match teacher utility scores, decisions, and rationales.
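The structured matching reward can be written out directly from the formulas above. The sketch below is one possible reading, with several labeled assumptions: the JSON key names (`DIMENSIONS`, `BINARY`, `TEXT`) are hypothetical stand-ins for the rubric schema in the appendix, tokens are lowercased whitespace-separated words, the text similarity in $R_{\mathrm{text}}$ is taken to be Jaccard similarity as well, and "text outside the JSON object" is detected by looking before the first `{` and after the last `}`.

```python
import json
from typing import Any, Dict, Optional, Tuple

# Hypothetical key names; the actual schema is given in the paper's appendix.
DIMENSIONS = ["state_match", "actionability", "generalizability",
              "outcome_reliability", "clarity", "confidence", "overall_utility"]
BINARY = ["inject", "use_as_warning"]
TEXT = ["usage_feedback", "summary_feedback"]


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between lowercased whitespace token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def parse_rubric(output: str) -> Tuple[Optional[Dict[str, Any]], bool]:
    """Extract the JSON object and flag any text surrounding it."""
    start, end = output.find("{"), output.rfind("}")
    if start == -1 or end < start:
        return None, False
    try:
        rubric = json.loads(output[start:end + 1])
    except json.JSONDecodeError:
        return None, False
    extra = bool(output[:start].strip() or output[end + 1:].strip())
    return rubric, extra


def is_valid(rubric: Dict[str, Any]) -> bool:
    """Check that every required field is present with a usable type."""
    try:
        return (
            all(0 <= float(rubric[d]["score"]) <= 5
                and isinstance(rubric[d]["reason"], str) for d in DIMENSIONS)
            and all(isinstance(rubric[b], bool) for b in BINARY)
            and all(isinstance(rubric[t], str) for t in TEXT)
        )
    except (KeyError, TypeError, ValueError):
        return False


def prm_reward(output: str, teacher: Dict[str, Any]) -> float:
    """R_PRM: -1 for invalid output, otherwise the clipped matching score."""
    pred, extra = parse_rubric(output)
    if pred is None or not is_valid(pred):
        return -1.0
    r_dim = sum(
        0.7 * max(0.0, 1 - abs(float(pred[d]["score"]) - float(teacher[d]["score"])) / 5)
        + 0.3 * jaccard(pred[d]["reason"], teacher[d]["reason"])
        for d in DIMENSIONS
    ) / len(DIMENSIONS)
    r_bin = sum(pred[b] == teacher[b] for b in BINARY) / len(BINARY)
    r_text = sum(jaccard(pred[t], teacher[t]) for t in TEXT) / len(TEXT)
    s_prm = 0.7 * r_dim + 0.2 * r_bin + 0.1 * r_text - 0.5 * (1.0 if extra else 0.0)
    return max(-1.0, min(1.0, s_prm))
```

With these weights, a rubric that matches the teacher exactly scores 1, while the same rubric wrapped in extra prose scores 0.5, so the formatting penalty is large relative to small score differences.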
## 6 Experiments

| Category | Method | Phase 1 WRITE | Phase 1 READ | Phase 1 Overall | Phase 2 WRITE | Phase 2 READ | Phase 2 Overall | Avg. #Turns ↓ WRITE | Avg. #Turns ↓ READ | Avg. #Turns ↓ Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| No Memory | Vanilla GPT-5 | 24.71 (-18.97) | 8.69 (-6.33) | 13.33 (-10.00) | 12.07 (-8.04) | 3.05 (-7.28) | 5.67 (-7.50) | - | - | - |
| No Memory | No Memory | 43.68 | 15.02 | 23.33 | 20.11 | 10.33 | 13.17 | 8.24 | 9.21 | 8.93 |
| No Memory | Demo | 54.60 (+10.92) | 18.31 (+3.29) | 28.83 (+5.50) | 28.16 (+8.05) | 11.97 (+1.64) | 16.67 (+3.50) | 8.44 (+0.20) | 9.93 (+0.72) | 9.50 (+0.57) |
| Static Memory | BM25 | 52.87 (+9.19) | 19.48 (+4.46) | 29.17 (+5.84) | 28.74 (+8.63) | 13.85 (+3.52) | 18.17 (+5.00) | 7.65 (-0.59) | 8.53 (-0.68) | 8.27 (-0.66) |
| Static Memory | Cosine Similarity | 51.72 (+8.04) | 19.48 (+4.46) | 28.83 (+5.50) | 28.74 (+8.63) | 13.62 (+3.29) | 18.00 (+4.83) | 7.63 (-0.61) | 8.79 (-0.42) | 8.45 (-0.48) |
| Static Memory | TopicK | 48.85 (+5.17) | 18.31 (+3.29) | 27.17 (+3.84) | 25.86 (+5.75) | 11.97 (+1.64) | 16.00 (+2.83) | 7.46 (-0.78) | 8.64 (-0.57) | 8.30 (-0.63) |
| Dynamic Memory | Memento | 50.57 (+6.89) | 19.72 (+4.70) | 28.67 (+5.34) | 28.74 (+8.63) | 13.38 (+3.05) | 17.83 (+4.66) | 7.80 (-0.44) | 8.91 (-0.30) | 8.59 (-0.34) |
| Dynamic Memory | MemRL | 51.15 (+7.47) | 19.95 (+4.93) | 29.00 (+5.67) | 30.46 (+10.35) | 14.08 (+3.75) | 18.83 (+5.66) | 7.58 (-0.66) | 8.79 (-0.42) | 8.44 (-0.49) |
| Ours | MERIT | 53.49 (+9.81) | 24.64 (+9.62) | 33.01 (+9.68) | 34.30 (+14.19) | 17.22 (+6.89) | 22.17 (+9.00) | 7.09 (-1.15) | 8.15 (-1.06) | 7.84 (-1.09) |

Table 1: Performance comparison of MERIT against baselines on success rates (%) across two phases and average number of turns. Performance is partitioned into queries with WRITE operations and READ-only operations. The backbone of the main agent is GPT-5 and the user simulator is GPT-4o for all methods. Values in parentheses denote absolute differences relative to the No Memory baseline.

| Category | Method | Success Rate | Avg. #Turns |
|---|---|---|---|
| No Memory | Vanilla GPT-5 | 37.29 (-28.71) | - |
| No Memory | No Memory | 66.00 | 13.29 |
| Static Memory | BM25 | 67.09 (+1.09) | 14.58 (+1.29) |
| Static Memory | Cosine Similarity | 68.19 (+2.19) | 13.48 (+0.19) |
| Static Memory | TopicK | 66.73 (+0.73) | 14.54 (+1.25) |
| Dynamic Memory | Memento | 68.56 (+2.56) | 14.33 (+1.04) |
| Dynamic Memory | MemRL | 66.91 (+0.91) | 14.95 (+1.66) |
| Ours | MERIT | 69.29 (+3.29) | 13.44 (+0.15) |

Table 2: Transferability comparison on Spider2-Snow. The backbone of the main agent is GPT-5. TopicK, Memento, MemRL, and MERIT use the trained checkpoints/artifacts. Values in parentheses denote absolute differences relative to the No Memory baseline.

### 6.1 Experimental Settings
#### Datasets and Metrics\.
We evaluate MERIT on BIRD-Interact (Huo et al., [2025](https://arxiv.org/html/2606.00547#bib.bib6)) for in-domain interactive text-to-SQL and on Spider2-Snow (Lei et al., [2025](https://arxiv.org/html/2606.00547#bib.bib10)) for cross-benchmark transfer. For BIRD-Interact, we train the retrieval policies and the PRM on BIRD-Interact-Lite and evaluate on the unseen BIRD-Interact-Full dataset. Since BIRD-Interact-Lite contains only 236 base tasks (excluding 64 overlapped examples with BIRD-Interact-Full), we apply the data augmentation pipeline in §[4](https://arxiv.org/html/2606.00547#S4) and Appendix [A](https://arxiv.org/html/2606.00547#A1) to construct approximately 7,000 training examples, with benchmark-specific conversion details in Appendix [B](https://arxiv.org/html/2606.00547#A2).
BIRD-Interact reports results in two benchmark-defined phases. Phase 1 evaluates the initial user request, while Phase 2 evaluates state-dependent follow-up requests. We measure success by execution correctness and report the average number of interaction turns as an efficiency metric. We also report READ and WRITE results separately, where READ tasks answer analytical queries without modifying the database, while WRITE tasks require state-changing operations. This split tests whether MERIT helps different interaction types that may require different strategies. For Spider2-Snow, all retrieval artifacts and learned policies are trained only on BIRD-Interact and transferred without Spider2-Snow-specific training or tuning.
#### Baselines\.
We compare MERIT with no-memory, static-retrieval, and dynamic-retrieval baselines. The no-memory baselines include Vanilla GPT-5 (a non-agentic LLM baseline), the No Memory Agent (the base interactive agent without historical trajectories), and the Demo Agent, which uses the officially released static demonstrations (available only for BIRD-Interact). Static-retrieval baselines include BM25, Cosine Similarity with text-embedding-3-small, and TopicK (Kweon et al., [2025](https://arxiv.org/html/2606.00547#bib.bib9)). Dynamic-retrieval baselines include Memento (Zhou et al., [2025](https://arxiv.org/html/2606.00547#bib.bib38)) and MemRL (Zhang et al., [2026](https://arxiv.org/html/2606.00547#bib.bib36)), which learn memory selection or utility but operate primarily at the episode level. For applicable memory-based methods, we keep the episode-level memory writing mechanism identical to isolate the effect of retrieval.
#### Model Configurations\.
The base reasoning agent uses GPT-5 (Singh et al., [2025](https://arxiv.org/html/2606.00547#bib.bib21)) with low reasoning effort, and the user simulator uses GPT-4o (Hurst et al., [2024](https://arxiv.org/html/2606.00547#bib.bib7)). The PRM is initialized from DeepSeek-R1-Distill-Qwen-7B (Guo et al., [2025](https://arxiv.org/html/2606.00547#bib.bib5)), and the retrieval selectors use LoRA-tuned Qwen3-4B-Instruct-2507 (Yang et al., [2025](https://arxiv.org/html/2606.00547#bib.bib31)). We use 5,346 teacher-annotated state-memory pairs for PRM SFT, 42,768 pairs for GRPO, and 5,346 pairs for PRM evaluation.
More details are provided in Appendix[B](https://arxiv.org/html/2606.00547#A2)\.
### 6.2 Experimental Results
#### Main Results.
Table [1](https://arxiv.org/html/2606.00547#S6) compares MERIT with no-memory, static-retrieval, and dynamic-retrieval baselines on BIRD-Interact. MERIT achieves the best overall success rates in both Phase 1 and Phase 2, reaching 33.01% and 22.17%, respectively. The gains over the No Memory baseline hold across both READ and WRITE tasks, indicating that the learned retrieval policies help both analytical extraction and state-changing operations. Furthermore, MERIT achieves the lowest average number of interaction turns (7.84 overall, versus 8.93 for the No Memory baseline). These results indicate that MERIT improves task success while reducing interaction turns.
#### Leaderboard Results\.
We additionally transfer the retrieval artifacts from the main experiments \(trained with GPT\-5 as the reasoning agent, GPT\-4o as the user simulator, and the same Qwen\-based retrievers\) to two stronger agent configurations\. With GPT\-5\.4 as the reasoning agent and GPT\-4o as the user simulator, and with Claude\-Opus\-4\.6 as the reasoning agent and Claude\-Haiku\-4\-5 as the user simulator, MERIT achieves state\-of\-the\-art results on the BIRD\-Interact leaderboard \([https://bird\-interact\.github\.io/](https://bird-interact.github.io/)\) as of May 2026\.
#### Transferability\.
Table [6](https://arxiv.org/html/2606.00547#S6) evaluates whether retrieval policies trained on BIRD\-Interact transfer to Spider2\-Snow without Spider2\-Snow\-specific training or tuning\. MERIT achieves the highest success rate among all methods, outperforming the strongest baseline\. The gain is smaller than on BIRD\-Interact, as expected under cross\-benchmark transfer\. Nevertheless, the positive improvement suggests that MERIT learns memory\-selection behavior that is not limited to the source benchmark\. MERIT also keeps average turns close to the No Memory Agent and below all memory\-retrieval baselines, indicating improved success without substantial interaction cost\.
Figure 3: Retrieval robustness and retrieval size\. \(a\) Pass@$k$ analysis\. \(b\) Number of retrieved memories\. MERIT consistently outperforms the non\-RL variant across $k$, while the retrieval\-size analysis shows that injecting too many memories can introduce noisy or redundant guidance\.
Table 3: Ablation studies evaluating the contribution of individual framework components \(episode\-level strategy, turn\-level hints, and RL optimization\) on task success rates and interaction efficiency\.
#### Ablation Studies\.
Table [3](https://arxiv.org/html/2606.00547#S6.T3) ablates each MERIT component\. Removing episode\-level memory causes the largest drop, reducing overall success to 23\.50% in Phase 1 and 12\.70% in Phase 2\. This shows that episode\-level experience is important in interactive text\-to\-SQL\. Removing turn\-level memory also degrades performance, reducing overall success to 28\.50% and 17\.20% in the two phases and increasing the average number of turns to 8\.39\. This suggests that turn\-level hints complement episode\-level memory by supporting decisions as the interaction state evolves\. Finally, removing RL optimization has a modest effect on Phase 1 but lowers Phase 2 success from 22\.17% to 19\.50%\. This suggests that utility\-based selector optimization is most beneficial for state\-dependent follow\-up interactions, where memory usefulness depends more strongly on the evolving context\.
Table 4: Performance of the PRM\.
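To make the relative size of these drops easier to compare, the short sketch below recomputes them in percentage points from the overall success rates quoted in this section\. The dictionary names and the helper `absolute_drop` are illustrative only, and the Phase 1 value for the variant without RL is left as `None` because the text does not state it\.

```python
# Overall success rates (%) quoted in this section.
# None marks a value that the text does not state.
FULL = {"phase1": 33.01, "phase2": 22.17}
ABLATIONS = {
    "w/o episode-level memory": {"phase1": 23.50, "phase2": 12.70},
    "w/o turn-level memory": {"phase1": 28.50, "phase2": 17.20},
    "w/o RL optimization": {"phase1": None, "phase2": 19.50},
}


def absolute_drop(full, ablated):
    """Percentage-point drop relative to the full system, or None if unknown."""
    if ablated is None:
        return None
    return round(full - ablated, 2)


drops = {
    name: {phase: absolute_drop(FULL[phase], value) for phase, value in scores.items()}
    for name, scores in ABLATIONS.items()
}

for name, per_phase in drops.items():
    print(name, per_phase)
```

Under these numbers, removing episode\-level memory costs roughly twice as many points as removing turn\-level memory in both phases\.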
#### PRM Evaluation\.
Table [4](https://arxiv.org/html/2606.00547#S6.T4) evaluates whether the PRM can produce structured utility judgments aligned with teacher annotations\. Compared with the vanilla base model and the SFT\-only model, SFT\+GRPO achieves 100% valid formatting, higher annotation agreement, and lower numeric\-score errors\. Specifically, SFT\+GRPO improves average accuracy to 73\.59% and reduces MAE and MSE to 0\.29 and 0\.34, respectively\. These results show that GRPO improves agreement with the teacher utility rubrics; whether this reward actually helps retrieval is a separate question that we test downstream\. Since direct intermediate utility labels are unavailable, we use this teacher\-aligned score as a structured proxy reward and examine its downstream effects through the turn\-level and RL ablations\.
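The four quantities reported for the PRM \(valid formatting, annotation agreement, MAE, and MSE\) can be illustrated with a minimal sketch\. The output schema is an assumption: we pretend each PRM output is a JSON object with a categorical `label` and a numeric `score`, which the source does not specify, and we count unparseable outputs as disagreements\. The actual rubric fields and scoring rules may differ\.

```python
import json


def parse_judgment(raw):
    """Return (label, score) if raw is JSON with both fields, else None.

    The "label"/"score" schema is hypothetical, chosen for illustration.
    """
    try:
        obj = json.loads(raw)
        return str(obj["label"]), float(obj["score"])
    except (ValueError, KeyError, TypeError):
        return None


def prm_metrics(raw_outputs, teacher):
    """Compare PRM outputs against teacher (label, score) annotations."""
    parsed = [parse_judgment(r) for r in raw_outputs]
    valid = [(p, t) for p, t in zip(parsed, teacher) if p is not None]
    n = len(raw_outputs)
    errors = [p[1] - t[1] for p, t in valid]
    return {
        # Fraction of outputs that follow the expected structure.
        "valid_format": len(valid) / n,
        # Label agreement; invalid outputs count as misses (an assumption).
        "accuracy": sum(p[0] == t[0] for p, t in valid) / n,
        # Numeric-score errors are computed over valid outputs only.
        "mae": sum(abs(e) for e in errors) / len(errors) if errors else float("nan"),
        "mse": sum(e * e for e in errors) / len(errors) if errors else float("nan"),
    }


raw = [
    '{"label": "helpful", "score": 2}',
    '{"label": "harmful", "score": 0}',
    "not json",
    '{"label": "neutral", "score": 1}',
]
teacher = [("helpful", 2.0), ("neutral", 1.0), ("helpful", 2.0), ("neutral", 1.5)]
metrics = prm_metrics(raw, teacher)
print(metrics)
```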
#### Retrieval Behavior Analysis\.
We further analyze the robustness of learned retrieval and the effect of retrieval size\. Figure [3\(a\)](https://arxiv.org/html/2606.00547#S6.F3.sf1) compares MERIT with its non\-RL variant under different Pass@$k$ budgets\. MERIT achieves higher success across all $k$, indicating that RL\-based retrieval remains beneficial across different generation budgets\. Figure [3\(b\)](https://arxiv.org/html/2606.00547#S6.F3.sf2) studies the number of retrieved memories\. Retrieving one to three memories yields comparable performance, whereas larger retrieval budgets reduce success rates\. This suggests that excessive memory context may introduce redundant or distracting information\. We provide an additional analysis of memory composition in Appendix [C](https://arxiv.org/html/2606.00547#A3)\.
Table 5: Token and interaction efficiency comparison\.
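The text does not state how Pass@$k$ is computed\. One common choice, shown here only as an assumption, is the unbiased estimator widely used in code\-generation evaluation: with $n$ sampled attempts per task of which $c$ succeed, the estimate is $1 - \binom{n-c}{k} / \binom{n}{k}$\.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n attempts with c successes.

    This is the standard combinatorial estimator; whether the paper uses it
    is an assumption made for this illustration.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# One success out of four attempts, evaluated at two budgets.
for k in (1, 2):
    print(k, pass_at_k(4, 1, k))
```

A per\-task estimate of this kind would then be averaged over tasks to give one point on a curve such as the one in Figure 3\(a\)\.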
#### Computational and Token Efficiency\.
Although MERIT introduces learned retrieval modules, the additional cost is limited because the selectors are lightweight and operate on compact state signatures and memory summaries\. Table [5](https://arxiv.org/html/2606.00547#S6.T5) compares token usage and interaction turns across ablations\. MERIT uses 100\.4M tokens, fewer than the variants without turn\-level memory or RL optimization, which require 108\.5M and 104\.2M tokens, respectively\. The variant without episode\-level memory uses fewer tokens but has substantially lower success rates and more interaction turns\. Overall, MERIT offers a favorable success\-efficiency trade\-off by improving task performance while reducing unnecessary interactions and tokens\.
## 7 Conclusion
We proposed MERIT, a dual\-level long\-term memory framework that learns episode\-level and turn\-level retrieval policies for interactive text\-to\-SQL agents\. By separating global strategic guidance from local state\-conditioned hints and using a lightweight PRM to provide dense rewards for turn\-level retrieval, MERIT enables horizon\-aware memory selection\. Experiments on BIRD\-Interact show improved success rates with fewer interaction turns, while Spider2\-Snow results provide evidence of transfer beyond the source benchmark\.
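As a quick worked check of the token figures above, the sketch below converts them into relative savings\. Only the three totals stated in the text are used; the names `TOKENS_M` and `relative_saving` are illustrative\.

```python
# Total tokens in millions, as stated in the paragraph above.
TOKENS_M = {
    "MERIT": 100.4,
    "w/o turn-level memory": 108.5,
    "w/o RL optimization": 104.2,
}


def relative_saving(baseline, ours):
    """Fraction of the baseline's tokens saved by the full system."""
    return (baseline - ours) / baseline


savings_pct = {
    name: round(100 * relative_saving(tokens, TOKENS_M["MERIT"]), 2)
    for name, tokens in TOKENS_M.items()
    if name != "MERIT"
}
print(savings_pct)
```

With these totals, the full system uses about 7\.5% fewer tokens than the variant without turn\-level memory and about 3\.6% fewer than the variant without RL optimization\.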
## Limitations
First, the PRM is trained from teacher\-generated rubrics rather than ground\-truth labels of turn\-level memory utility\. This is a practical limitation of interactive retrieval settings, where the usefulness of a memory at an intermediate turn is difficult to observe directly because the agent state, future actions, and environment feedback all evolve after the retrieval decision\. We therefore use a strong teacher model to provide structured utility annotations as a proxy supervision signal\. While this does not provide causal ground\-truth attribution for each retrieved memory, the downstream results and ablation studies suggest that the PRM\-trained turn\-level selector improves state\-conditioned retrieval and contributes to better follow\-up interaction performance\.
Second, our current evaluation uses a strong proprietary backbone for the main agent and teacher annotation\. This setting allows us to study memory retrieval under a capable interactive agent, but it leaves open whether the same gains hold across open\-source agent backbones or weaker reasoning models\. Evaluating MERIT with different base agents, teacher models, and simulator configurations is an important direction for future work\.
## References
- Arslan et al\. \(2024\)Muhammad Arslan, Hussam Ghanem, Saba Munawar, and Christophe Cruz\. 2024\.A survey on rag with llms\.*Procedia computer science*, 246:3781–3790\.
- Chhikara et al\. \(2025\)Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav\. 2025\.Mem0: Building production\-ready ai agents with scalable long\-term memory\.*arXiv preprint arXiv:2504\.19413*\.
- Choudhury \(2025\)Sanjiban Choudhury\. 2025\.Process reward models for llm agents: Practical framework and directions\.*arXiv preprint arXiv:2502\.10325*\.
- Fang et al\. \(2025\)Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, and Ningyu Zhang\. 2025\.Memp: Exploring agent procedural memory\.*arXiv preprint arXiv:2508\.06433*\.
- Guo et al\. \(2025\)Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, and 1 others\. 2025\.Deepseek\-r1: Incentivizing reasoning capability in llms via reinforcement learning\.*arXiv preprint arXiv:2501\.12948*\.
- Huo et al\. \(2025\)Nan Huo, Xiaohan Xu, Jinyang Li, Per Jacobsson, Shipei Lin, Bowen Qin, Binyuan Hui, Xiaolong Li, Ge Qu, Shuzheng Si, and 1 others\. 2025\.Bird\-interact: Re\-imagining text\-to\-sql evaluation for large language models via lens of dynamic interactions\.*arXiv preprint arXiv:2510\.05318*\.
- Hurst et al\. \(2024\)Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, and 1 others\. 2024\.Gpt\-4o system card\.*arXiv preprint arXiv:2410\.21276*\.
- Kagaya et al\. \(2024\)Tomoyuki Kagaya, Thong Jing Yuan, Yuxuan Lou, Jayashree Karlekar, Sugiri Pranata, Akira Kinose, Koki Oguri, Felix Wick, and Yang You\. 2024\.Rap: Retrieval\-augmented planning with contextual memory for multimodal llm agents\.*arXiv preprint arXiv:2402\.03610*\.
- Kweon et al\. \(2025\)Wonbin Kweon, SeongKu Kang, Runchu Tian, Pengcheng Jiang, Jiawei Han, and Hwanjo Yu\. 2025\.Topic coverage\-based demonstration retrieval for in\-context learning\.In*Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pages 19911–19923\.
- Lei et al\. \(2025\)Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, and 1 others\. 2025\.Spider 2\.0: Evaluating language models on real\-world enterprise text\-to\-sql workflows\.In*International Conference on Learning Representations*, volume 2025, pages 28691–28735\.
- Li \(2025\)Xinzhe Li\. 2025\.A review of prominent paradigms for llm\-based agents: Tool use, planning \(including rag\), and feedback learning\.In*Proceedings of the 31st international conference on computational linguistics*, pages 9760–9779\.
- Lin et al\. \(2020\)Xi Victoria Lin, Richard Socher, and Caiming Xiong\. 2020\.[Bridging textual and tabular data for cross\-domain text\-to\-SQL semantic parsing](https://doi.org/10.18653/v1/2020.findings-emnlp.438)\.In*Findings of the Association for Computational Linguistics: EMNLP 2020*, pages 4870–4888, Online\. Association for Computational Linguistics\.
- Liu et al\. \(2026\)Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, and Huaxiu Yao\. 2026\.Simplemem: Efficient lifelong memory for llm agents\.*arXiv preprint arXiv:2601\.02553*\.
- Lu et al\. \(2023\)Junru Lu, Siyu An, Mingbao Lin, Gabriele Pergola, Yulan He, Di Yin, Xing Sun, and Yunsheng Wu\. 2023\.Memochat: Tuning llms to use memos for consistent long\-range open\-domain conversation\.*arXiv preprint arXiv:2308\.08239*\.
- Maharana et al\. \(2024\)Adyasha Maharana, Dong\-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang\. 2024\.Evaluating very long\-term conversational memory of llm agents\.In*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 13851–13870\.
- Mohammadi et al\. \(2025\)Mahmoud Mohammadi, Yipeng Li, Jane Lo, and Wendy Yip\. 2025\.Evaluation and benchmarking of llm agents: A survey\.In*Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V\. 2*, pages 6129–6139\.
- Packer et al\. \(2023\)Charles Packer, Vivian Fang, Shishir G Patil, Kevin Lin, Sarah Wooders, and Joseph E Gonzalez\. 2023\.Memgpt: towards llms as operating systems\.
- Park et al\. \(2023\)Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein\. 2023\.Generative agents: Interactive simulacra of human behavior\.In*Proceedings of the 36th annual acm symposium on user interface software and technology*, pages 1–22\.
- Salama et al\. \(2025\)Rana Salama, Jason Cai, Michelle Yuan, Anna Currey, Monica Sunkara, Yi Zhang, and Yassine Benajiba\. 2025\.Meminsight: Autonomous memory augmentation for llm agents\.In*Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pages 33124–33140\.
- Shao et al\. \(2025\)Zhihui Shao, Shubin Cai, Rongsheng Lin, and Zhong Ming\. 2025\.Enhancing text\-to\-sql with question classification and multi\-agent collaboration\.In*Findings of the Association for Computational Linguistics: NAACL 2025*, pages 4340–4349\.
- Singh et al\. \(2025\)Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El\-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, and 1 others\. 2025\.Openai gpt\-5 system card\.*arXiv preprint arXiv:2601\.03267*\.
- Sun et al\. \(2026\)Haoran Sun, Shaoning Zeng, and Bob Zhang\. 2026\.H\-mem: Hierarchical memory for high\-efficiency long\-term reasoning in llm agents\.In*Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 341–350\.
- Tian et al\. \(2023\)Yuan Tian, Zheng Zhang, Zheng Ning, Toby Jia\-Jun Li, Jonathan K\. Kummerfeld, and Tianyi Zhang\. 2023\.[Interactive text\-to\-SQL generation via editable step\-by\-step explanations](https://doi.org/10.18653/v1/2023.emnlp-main.1004)\.In*Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 16149–16166, Singapore\. Association for Computational Linguistics\.
- Wang et al\. \(2025a\)Bing Wang, Changyu Ren, Jian Yang, Xinnian Liang, Jiaqi Bai, LinZheng Chai, Zhao Yan, Qian\-Wen Zhang, Di Yin, Xing Sun, and Zhoujun Li\. 2025a\.[MAC\-SQL: A multi\-agent collaborative framework for text\-to\-SQL](https://aclanthology.org/2025.coling-main.36/)\.In*Proceedings of the 31st International Conference on Computational Linguistics*, pages 540–557, Abu Dhabi, UAE\. Association for Computational Linguistics\.
- Wang et al\. \(2025b\)Bing Wang, Changyu Ren, Jian Yang, Xinnian Liang, Jiaqi Bai, Linzheng Chai, Zhao Yan, Qian\-Wen Zhang, Di Yin, Xing Sun, and 1 others\. 2025b\.Mac\-sql: A multi\-agent collaborative framework for text\-to\-sql\.In*Proceedings of the 31st International Conference on Computational Linguistics*, pages 540–557\.
- Xi et al\. \(2025\)Zhiheng Xi, Chenyang Liao, Guanyu Li, Yajie Yang, Wenxiang Chen, Zhihao Zhang, Binghai Wang, Senjie Jin, Yuhao Zhou, Jian Guan, and 1 others\. 2025\.Agentprm: Process reward models for llm agents via step\-wise promise and progress\.*arXiv preprint arXiv:2511\.08325*\.
- Xiong et al\. \(2024\)Guanming Xiong, Junwei Bao, Hongfei Jiang, Yang Song, and Wen Zhao\. 2024\.Interactive\-t2s: Multi\-turn interactions for text\-to\-sql with large language models\.*arXiv e\-prints*, pages arXiv–2408\.
- Xu et al\. \(2025\)Weikai Xu, Chengrui Huang, Shen Gao, and Shuo Shang\. 2025\.Llm\-based agents for tool learning: A survey\.*Data Science and Engineering*, pages 1–31\.
- Xu et al\. \(2017\)Xiaojun Xu, Chang Liu, and Dawn Song\. 2017\.Sqlnet: Generating structured queries from natural language without reinforcement learning\.*arXiv preprint arXiv:1711\.04436*\.
- Yan et al\. \(2025\)Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Jinhe Bi, Kristian Kersting, Jeff Z Pan, and 1 others\. 2025\.Memory\-r1: Enhancing large language model agents to manage and utilize memories via reinforcement learning\.*arXiv preprint arXiv:2508\.19828*\.
- Yang et al\. \(2025\)An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and 1 others\. 2025\.Qwen3 technical report\.*arXiv preprint arXiv:2505\.09388*\.
- Yu et al\. \(2018\)Tao Yu, Zifan Li, Zilin Zhang, Rui Zhang, and Dragomir Radev\. 2018\.[TypeSQL: Knowledge\-based type\-aware neural text\-to\-SQL generation](https://doi.org/10.18653/v1/N18-2093)\.In*Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 \(Short Papers\)*, pages 588–594, New Orleans, Louisiana\. Association for Computational Linguistics\.
- Yuan et al\. \(2025\)Siyu Yuan, Kaitao Song, Jiangjie Chen, Xu Tan, Yongliang Shen, Kan Ren, Dongsheng Li, and Deqing Yang\. 2025\.Easytool: Enhancing llm\-based agents with concise tool instruction\.In*Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\)*, pages 951–972\.
- Zeng et al\. \(2024\)Ruihong Zeng, Jinyuan Fang, Siwei Liu, and Zaiqiao Meng\. 2024\.On the structural memory of llm agents\.*arXiv preprint arXiv:2412\.15266*\.
- Zhang et al\. \(2024\)Hanchong Zhang, Ruisheng Cao, Hongshen Xu, Lu Chen, and Kai Yu\. 2024\.[CoE\-SQL: In\-context learning for multi\-turn text\-to\-SQL with chain\-of\-editions](https://doi.org/10.18653/v1/2024.naacl-long.361)\.In*Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\)*, pages 6487–6508, Mexico City, Mexico\. Association for Computational Linguistics\.
- Zhang et al\. \(2026\)Shengtao Zhang, Jiaqian Wang, Ruiwen Zhou, Junwei Liao, Yuchen Feng, Zhuo Li, Yujie Zheng, Weinan Zhang, Ying Wen, Zhiyu Li, and 1 others\. 2026\.Memrl: Self\-evolving agents via runtime reinforcement learning on episodic memory\.*arXiv preprint arXiv:2601\.03192*\.
- Zhong et al\. \(2017\)Victor Zhong, Caiming Xiong, and Richard Socher\. 2017\.Seq2sql: Generating structured queries from natural language using reinforcement learning\.*arXiv preprint arXiv:1709\.00103*\.
- Zhou et al\. \(2025\)Huichi Zhou, Yihang Chen, Siyuan Guo, Xue Yan, Kin Hei Lee, Zihan Wang, Ka Yiu Lee, Guchun Zhang, Kun Shao, Linyi Yang, and 1 others\. 2025\.Memento: Fine\-tuning llm agents without fine\-tuning llms\.*arXiv preprint arXiv:2508\.16153*\.
## Appendix A Data Augmentation Details
Learning robust memory retrieval policies requires diverse executable tasks from which meaningful interaction trajectories can be collected\. Since high\-quality interactive text\-to\-SQL data is expensive to collect, we augment the training set with an execution\-grounded pipeline\. The pipeline first synthesizes executable SQL targets from seed examples and then generates natural\-language requests grounded in those targets\. This SQL\-first design reduces semantic drift and encourages three properties: executable SQL, faithful SQL\-query alignment, and structural diversity\.
#### SQL\-level perturbation\.
Given a seed SQL query, we parse its abstract syntax tree to recover alias\-to\-table mappings and identify qualified columns within key clauses such as `SELECT` and `GROUP BY`\. We then apply constrained SQL\-level perturbations while retaining executable variants\. To preserve structural integrity and executability, column substitutions are restricted to non\-key columns with compatible coarse data types from the same table\. We also perturb numeric literals, reverse sorting directions, and modify limit values\. These operations diversify projections, grouping choices, thresholds, rankings, and result sizes while maintaining executable SQL targets\.
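The literal, sort\-direction, and limit perturbations described above can be sketched as follows\. This is a deliberately simplified illustration and not the pipeline itself: it uses regular expressions instead of an abstract syntax tree, omits column substitution, ignores string literals, and checks executability against an in\-memory SQLite table as a stand\-in for the PostgreSQL sandbox\. All function names and the toy schema are hypothetical\.

```python
import re
import sqlite3

# Integer literal that directly follows a comparison operator (not a decimal).
_CMP_INT = re.compile(r"(<=|>=|<>|!=|=|<|>)(\s*)(\d+)\b(?!\.)")


def reverse_sort_directions(sql):
    """Swap explicit ASC and DESC keywords."""
    swap = {"ASC": "DESC", "DESC": "ASC"}
    return re.sub(r"\b(ASC|DESC)\b", lambda m: swap[m.group(1).upper()],
                  sql, flags=re.IGNORECASE)


def set_limit(sql, new_limit):
    """Replace the value of every LIMIT clause."""
    return re.sub(r"\bLIMIT\s+\d+", f"LIMIT {new_limit}", sql, flags=re.IGNORECASE)


def shift_comparison_literals(sql, delta):
    """Add delta to integer thresholds used in comparisons."""
    return _CMP_INT.sub(
        lambda m: f"{m.group(1)}{m.group(2)}{int(m.group(3)) + delta}", sql)


def perturb(sql):
    """Generate one variant per perturbation type."""
    return [reverse_sort_directions(sql), set_limit(sql, 10),
            shift_comparison_literals(sql, 50)]


def is_executable(conn, sql):
    """Keep only variants that the database can actually run."""
    try:
        conn.execute(sql).fetchall()
        return True
    except sqlite3.Error:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price INTEGER)")
seed = "SELECT name, price FROM products WHERE price > 100 ORDER BY price DESC LIMIT 5"
variants = [v for v in perturb(seed) if v != seed and is_executable(conn, v)]
for v in variants:
    print(v)
```

A parser\-based implementation would be needed to respect aliases, key columns, and type compatibility, which is why the source operates on the syntax tree rather than on raw text\.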
#### Natural\-language generation\.
After obtaining an executable SQL target, we generate corresponding natural\-language requests with an LLM\. The generation is conditioned on a lightweight SQL outline with clause\-level cues, so that the request describes the intended operation without exposing raw table names, column names, aliases, or SQL identifiers\. The model produces a fully specified request and an intentionally underspecified variant\. The fully specified request captures the target SQL semantics, while the underspecified request hides selected details that can be resolved through interaction\.
#### Grounding metadata\.
For each underspecified request, the generation also produces grounding metadata that links vague phrases to their corresponding SQL fragments and ambiguity types\. This metadata supports interactive environments by specifying which hidden details may be clarified during interaction\. The ambiguity types include knowledge\-linking, semantic, entity, temporal, aggregation, sort, intent, and join\-path ambiguity\.
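One way to picture this metadata is as a small record type\. The sketch below is an assumption about shape only: the eight ambiguity types come from the text, while the class names, field names, and the example request are hypothetical\.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class AmbiguityType(Enum):
    """The eight ambiguity types listed in the text."""
    KNOWLEDGE_LINKING = "knowledge-linking"
    SEMANTIC = "semantic"
    ENTITY = "entity"
    TEMPORAL = "temporal"
    AGGREGATION = "aggregation"
    SORT = "sort"
    INTENT = "intent"
    JOIN_PATH = "join-path"


@dataclass(frozen=True)
class GroundingEntry:
    """Links one vague phrase to the SQL fragment it hides."""
    vague_phrase: str
    sql_fragment: str
    ambiguity_type: AmbiguityType


@dataclass
class AugmentedExample:
    """An executable SQL target with its two requests and grounding."""
    sql: str
    full_request: str
    underspecified_request: str
    grounding: List[GroundingEntry] = field(default_factory=list)


example = AugmentedExample(
    sql="SELECT cust_id FROM cust_orders ORDER BY order_ts DESC LIMIT 3",
    full_request="List the three customers with the most recent orders.",
    underspecified_request="List a few customers with the latest orders.",
    grounding=[
        GroundingEntry("a few", "LIMIT 3", AmbiguityType.INTENT),
        GroundingEntry("latest", "ORDER BY order_ts DESC", AmbiguityType.SORT),
    ],
)
print([g.ambiguity_type.value for g in example.grounding])
```

A user simulator could then answer a clarification question about "a few" by looking up the linked fragment, without exposing the rest of the SQL target\.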
#### Verification and filtering\.
We discard examples whose generated text cannot be verified against the SQL logic\. Retained examples must preserve SQL\-query consistency, avoid raw identifier leakage in user\-facing text, and provide grounding metadata that can be linked to the corresponding request phrase and SQL fragment\. Each retained example therefore contains an executable SQL target, aligned natural\-language requests, and grounding metadata for interactive trajectory collection\.
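Two of these filters, identifier leakage and grounding linkage, are mechanical enough to sketch\. The code below is a rough illustration with hypothetical names and a toy example; it does not attempt the SQL\-request consistency check, which would require semantic verification, and its token\-level leak test would wrongly flag schema names that are also ordinary words\.

```python
import re


def leaks_identifiers(text, identifiers):
    """True if a raw schema identifier appears as a whole token in the text."""
    tokens = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text.lower()))
    return any(ident.lower() in tokens for ident in identifiers)


def grounding_is_linked(request, sql, grounding):
    """Each (phrase, fragment) pair must occur in the request and SQL."""
    return bool(grounding) and all(
        phrase.lower() in request.lower() and fragment.lower() in sql.lower()
        for phrase, fragment in grounding
    )


def keep_example(example, identifiers):
    """Apply the leakage and linkage filters to one augmented example."""
    texts = (example["full_request"], example["underspecified_request"])
    if any(leaks_identifiers(t, identifiers) for t in texts):
        return False
    return grounding_is_linked(
        example["underspecified_request"], example["sql"], example["grounding"])


identifiers = ["cust_orders", "cust_id", "order_ts"]
good = {
    "sql": "SELECT cust_id FROM cust_orders ORDER BY order_ts DESC LIMIT 3",
    "full_request": "List the three customers with the most recent orders.",
    "underspecified_request": "List a few customers with the latest orders.",
    "grounding": [("a few", "LIMIT 3")],
}
leaky = dict(good, underspecified_request="List a few cust_id values with the latest orders.")
unlinked = dict(good, grounding=[("recently", "LIMIT 3")])

print([keep_example(e, identifiers) for e in (good, leaky, unlinked)])
```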
#### Prompt\.
The full prompt used for natural\-language generation and ambiguity annotation is shown in Appendix [D\.2](https://arxiv.org/html/2606.00547#A4.SS2)\.
Table 6: Performance comparison between using positive\-only experiences and using all experiences\.
## Appendix B More Experimental Settings
#### Datasets and evaluation metrics\.
We evaluate MERIT on BIRD\-Interact \(Huo et al\.,[2025](https://arxiv.org/html/2606.00547#bib.bib6)\) and Spider2\-Snow \(Lei et al\.,[2025](https://arxiv.org/html/2606.00547#bib.bib10)\)\. We use BIRD\-Interact\-Lite for training and BIRD\-Interact\-Full for in\-domain evaluation\. BIRD\-Interact agent provides an interactive text\-to\-SQL environment that couples an executable PostgreSQL sandbox with a dynamic user simulator\. To train the retrieval policies and the PRM, we apply the data augmentation pipeline described in §[4](https://arxiv.org/html/2606.00547#S4) and Appendix [A](https://arxiv.org/html/2606.00547#A1), expanding the initial base tasks to approximately 7,000 examples\.
For compatibility with the BIRD\-Interact protocol, each augmented SQL target is converted into the benchmark\-required textual fields, including a fully specified request, an underspecified initial request, and grounding metadata that links underspecified phrases to corresponding SQL fragments\. These fields are used by the benchmark user simulator to provide consistent clarification responses during interaction\.
For in\-domain evaluation, we test on the unseen BIRD\-Interact\-Full dataset, which contains 600 complex tasks\. BIRD\-Interact reports results in two benchmark\-defined phases\. Phase 1 evaluates the initial user request, while Phase 2 evaluates state\-dependent follow\-up requests whose semantics may depend on the preceding task context, such as artifacts or database states established earlier\. These phases are part of the benchmark protocol and are distinct from the episode\-level and turn\-level retrieval horizons in MERIT\. We measure task success by execution correctness against the benchmark ground\-truth test cases\. Following the benchmark protocol, we also report results separately for READ tasks, which focus on analytical extraction, and WRITE tasks, which require database modification or other state\-changing operations\. We report the average number of interaction turns as an interaction\-efficiency metric\.
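The reporting scheme above \(success rate per phase, split by READ and WRITE, plus average turns\) amounts to a simple aggregation\. The sketch below assumes one record per task and phase with hypothetical field names, and averages turns over all records; the benchmark's exact bookkeeping may differ\.

```python
from collections import defaultdict


def summarize(results):
    """Success rate (%) per (phase, task group) and mean interaction turns."""
    buckets = defaultdict(list)
    for r in results:
        buckets[(r["phase"], "ALL")].append(r["success"])
        buckets[(r["phase"], r["task_type"])].append(r["success"])
    rates = {key: 100.0 * sum(flags) / len(flags) for key, flags in buckets.items()}
    avg_turns = sum(r["turns"] for r in results) / len(results)
    return rates, avg_turns


# Toy records; "success" stands for passing the ground-truth test cases.
results = [
    {"phase": 1, "task_type": "READ", "success": True, "turns": 6},
    {"phase": 1, "task_type": "WRITE", "success": False, "turns": 9},
    {"phase": 2, "task_type": "READ", "success": False, "turns": 8},
    {"phase": 2, "task_type": "WRITE", "success": True, "turns": 5},
]
rates, avg_turns = summarize(results)
print(rates, avg_turns)
```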
To evaluate cross\-benchmark transfer, we further test on Spider2\-Snow\. In this setting, all retrieval artifacts and learned policies are trained only on BIRD\-Interact and directly transferred to Spider2\-Snow without Spider2\-Snow\-specific training or tuning\. This setting evaluates whether the learned retrieval policies capture reusable memory\-selection behavior beyond the source benchmark\. For Spider2\-Snow, we report success rate and average number of interaction turns\.
#### Baselines\.
We compare MERIT with three classes of baselines\. To ensure a controlled comparison, all interactive agent baselines use the same base agent prompt, user simulator, interaction budgets, and evaluation protocol\. For memory\-based methods, each method maintains its own online memory bank generated from its own rollouts\. Candidate retrieval is performed from the method\-specific memory bank with the same candidate size and memory budget as MERIT whenever applicable; differences therefore come from the retrieval or memory\-selection strategy rather than from the agent prompt or evaluation setup\.
- No Memory\. We include three baselines without experience retrieval\. Vanilla GPT\-5 is a standard non\-agentic LLM baseline\. No Memory Agent is the base interactive agent without access to historical trajectories\. Demo Agent is the interactive agent using the standard static demonstrations released with BIRD\-Interact\.
- Static Retrieval\. We include fixed retrieval methods: BM25, Cosine Similarity with text\-embedding\-3\-small, and TopicK \(Kweon et al\.,[2025](https://arxiv.org/html/2606.00547#bib.bib9)\)\. These methods retrieve memories using predefined retrieval criteria rather than learned retrieval policies\.
- Dynamic Retrieval\. We compare with dynamic memory methods that learn memory utility or selection policies, including Memento \(Zhou et al\.,[2025](https://arxiv.org/html/2606.00547#bib.bib38)\) and MemRL \(Zhang et al\.,[2026](https://arxiv.org/html/2606.00547#bib.bib36)\)\. These methods move beyond static similarity, but operate primarily at the episode level\.
#### Model configurations\.
The base reasoning agent uses GPT\-5 \(Singh et al\.,[2025](https://arxiv.org/html/2606.00547#bib.bib21)\) with low reasoning effort, using the base text\-to\-SQL prompt in Appendix [D\.1](https://arxiv.org/html/2606.00547#A4.SS1)\. The user simulator, which provides clarification responses and execution feedback, is powered by GPT\-4o \(Hurst et al\.,[2024](https://arxiv.org/html/2606.00547#bib.bib7)\)\. Each interaction allows at most 100 turns, with a user patience budget of 6, an environment\-interaction budget of 3, and a submission budget of 3\.
To construct the PRM training data, we use GPT\-5 to annotate state\-memory pairs with structured utility rubrics\. This yields 5,346 pairs for supervised fine\-tuning \(SFT\), 42,768 pairs for Group Relative Policy Optimization \(GRPO\), and 5,346 pairs for evaluation\. The PRM is initialized from DeepSeek\-R1\-Distill\-Qwen\-7B \(Guo et al\.,[2025](https://arxiv.org/html/2606.00547#bib.bib5)\); during retrieval\-policy training, turn\-level rewards are computed with the trained PRM checkpoint\. For the dual\-level retrieval selectors, we use LoRA\-tuned Qwen3\-4B\-Instruct\-2507 \(Yang et al\.,[2025](https://arxiv.org/html/2606.00547#bib.bib31)\)\. At both memory levels, the embedding retriever returns 10 candidates, and the learned selector injects at most one memory into the agent context\. We use positive\-only memories by default and exclude memories from the same instance during evaluation to avoid self\-retrieval leakage\.
Both episode\- and turn\-level selectors are optimized with PPO for 4 epochs, using batch size 128, mini\-batch size 16, learning rate $1\times 10^{-7}$, PPO epoch count 2, clipping range 0\.2, value\-function coefficient 0\.2, entropy coefficient 0\.01, initial KL coefficient 0\.02, and one rollout per example\. Training uses 4 processes with Ray rollout parallelism 24, and all experiments are accelerated using 4 NVIDIA B200 GPUs\.
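The two\-stage retrieval described above \(an embedding retriever that proposes 10 candidates, then a selector that injects at most one\) can be sketched as follows\. This is an illustration under stated assumptions: the memory records, the field names, and the toy 2\-dimensional vectors are hypothetical, and the learned selector is replaced by an arbitrary scoring callable with an abstention threshold, which is our own simplification of "at most one"\.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_candidates(query_vec, bank, instance_id, k=10):
    """Top-k positive memories by similarity, excluding the same instance."""
    eligible = [m for m in bank if m["success"] and m["instance_id"] != instance_id]
    ranked = sorted(eligible, key=lambda m: cosine(query_vec, m["vec"]), reverse=True)
    return ranked[:k]


def select_memory(candidates, score_fn, threshold=0.0):
    """Return the best-scoring candidate, or None if none clears the threshold."""
    if not candidates:
        return None
    best = max(candidates, key=score_fn)
    return best if score_fn(best) > threshold else None


bank = [
    {"instance_id": "t1", "success": True, "vec": [1.0, 0.0], "text": "A"},
    {"instance_id": "t2", "success": True, "vec": [0.9, 0.1], "text": "B"},
    {"instance_id": "t3", "success": False, "vec": [1.0, 0.0], "text": "C"},
    {"instance_id": "t4", "success": True, "vec": [0.0, 1.0], "text": "D"},
]
query = [1.0, 0.0]
candidates = retrieve_candidates(query, bank, instance_id="t1")


def score_fn(memory):
    # Stand-in for the learned selector: reuse similarity as the utility score.
    return cosine(query, memory["vec"])


chosen = select_memory(candidates, score_fn)
print([m["text"] for m in candidates], chosen["text"] if chosen else None)
```

Memory A is dropped because it comes from the instance being evaluated, and memory C because it comes from an unsuccessful trajectory, which mirrors the self\-retrieval and positive\-only rules stated above\.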
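For reference, the hyperparameters above can be collected into a single configuration object, together with the standard per\-sample clipped PPO surrogate that the clipping range refers to\. The field names are hypothetical and not tied to any training library, and the update count assumes that each PPO epoch sweeps every mini\-batch once, which the text does not confirm\.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelectorPPOConfig:
    """Selector training settings as listed in the text (names are ours)."""
    epochs: int = 4
    batch_size: int = 128
    mini_batch_size: int = 16
    learning_rate: float = 1e-7
    ppo_epochs: int = 2
    clip_range: float = 0.2
    vf_coef: float = 0.2
    entropy_coef: float = 0.01
    init_kl_coef: float = 0.02
    rollouts_per_example: int = 1

    @property
    def mini_batches_per_batch(self):
        return self.batch_size // self.mini_batch_size

    @property
    def updates_per_batch(self):
        # Assumes one optimizer step per mini-batch per PPO epoch.
        return self.mini_batches_per_batch * self.ppo_epochs


def clipped_surrogate(ratio, advantage, clip_range):
    """Standard PPO clipped objective for one sample (to be maximized)."""
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


cfg = SelectorPPOConfig()
print(cfg.mini_batches_per_batch, cfg.updates_per_batch)
print(clipped_surrogate(1.5, 1.0, cfg.clip_range))
```

With a clipping range of 0\.2, a probability ratio of 1\.5 on a positive\-advantage sample is capped at 1\.2, which limits how far a single update can move the selector away from the policy that collected the rollout\.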
## Appendix C Effect of Memory Composition
MERIT uses positive memories by default in the main experiments\. Although unsuccessful trajectories can contain useful warning signals, naively mixing successful and unsuccessful experiences may introduce noisy, contradictory, or overly conservative guidance into the agent context\. We therefore compare two memory\-bank construction strategies\. The first strategy, MERIT \(positive only\), retrieves only memories from successful trajectories\. The second strategy, MERIT \(all\), retrieves memories from both successful and unsuccessful trajectories\.
As shown in Table [6](https://arxiv.org/html/2606.00547#A1.T6), positive\-only retrieval achieves higher success rates in both phases\. Compared with retrieval from all memories, positive\-only retrieval improves overall success from 29\.20% to 33\.01% in Phase 1 and from 18\.70% to 22\.17% in Phase 2\. This result suggests that successful trajectories provide cleaner reusable guidance for the agent\. In contrast, directly injecting unsuccessful trajectories may introduce noise, even though such trajectories may still contain useful warning signals\. Retrieval from all memories slightly reduces the average number of turns, but this efficiency gain is accompanied by a clear decrease in task success\. Based on this result, we use positive memories in the main experiments\.
## Appendix D Prompts
### D\.1 Base Prompts
For the base agent prompt, we adapt elements from the prompt released in BIRD\-INTERACT \(Huo et al\.,[2025](https://arxiv.org/html/2606.00547#bib.bib6)\), shown below\.
`Base Agent System Prompt`
### D\.2 Data Augmentation Prompt
`Data Augmentation Prompt`
### D\.3 Memory Selector Prompts
The episode\-level and turn\-level memory selector prompts are shown below\.
`Episode-level Memory Selector Prompt`
`Turn-level Memory Selector Prompt`
### D\.4 Episode Insight Extraction Prompt
`Episode Insight Extraction Prompt`
### D\.5 PRM Judging Prompt
`PRM Judging Prompt`