Progress-SQL: Improving Reinforcement Learning for Text-to-SQL via Progressive Rewards

arXiv cs.CL Papers

Summary

Progress-SQL introduces a multi-turn reinforcement learning framework with progressive rewards for Text-to-SQL, using an Oracle-guided Diagnostic Tree to provide dense reward signals and improve SQL query generation on benchmarks like BIRD and Spider.

arXiv:2606.06825v1 Announce Type: new Abstract: Reinforcement learning has recently shown promise in improving large language models for Text-to-SQL generation, yet existing methods typically optimize one-shot rewards defined over a single SQL state. Such rewards provide limited guidance for iterative SQL correction and are insufficient to capture the improvement of multi-turn SQL refinement. In this paper, we propose Progress-SQL, a multi-turn reinforcement learning framework with progressive rewards for Text-to-SQL. Our approach introduces an Oracle-guided Diagnostic Tree (ODT), which abstracts SQL queries into clause-level structural profiles and produces diagnostic feedback for next-turn refinement. To provide dense and robust reward signals, we combine ODT-based structural alignment with lexical alignment and define a progressive reward that measures the improvement from the initial SQL to the final SQL. We further incorporate a progression latency reward that favors earlier correctness and an execution status reward that encourages recovery from the invalid SQL. Experiments on BIRD, Spider, and Spider robustness variants demonstrate that our method consistently improves Text-to-SQL performance across both primary and robustness evaluations.

Cached at: 06/08/26, 09:20 AM

# Progress-SQL: Improving Reinforcement Learning for Text-to-SQL via Progressive Rewards
Source: [https://arxiv.org/html/2606.06825](https://arxiv.org/html/2606.06825)
Shihao Zhang, Xiaoman Wang, Yuan Liu, Yunshi Lan, Weining Qian East China Normal University shzhang@stu\.ecnu\.edu\.cn,yslan@dase\.ecnu\.edu\.cn

###### Abstract

Reinforcement learning has recently shown promise in improving large language models for Text\-to\-SQL generation, yet existing methods typically optimize one\-shot rewards defined over a single SQL state\. Such rewards provide limited guidance for iterative SQL correction and are insufficient to capture the improvement of multi\-turn SQL refinement\. In this paper, we propose Progress\-SQL, a multi\-turn reinforcement learning framework with progressive rewards for Text\-to\-SQL\. Our approach introduces an Oracle\-guided Diagnostic Tree \(ODT\), which abstracts SQL queries into clause\-level structural profiles and produces diagnostic feedback for next\-turn refinement\. To provide dense and robust reward signals, we combine ODT\-based structural alignment with lexical alignment and define a progressive reward that measures the improvement from the initial SQL to the final SQL\. We further incorporate a progression latency reward that favors earlier correctness and an execution status reward that encourages recovery from the invalid SQL\. Experiments on BIRD, Spider, and Spider robustness variants demonstrate that our method consistently improves Text\-to\-SQL performance across both primary and robustness evaluations\. Our code is released at[https://github\.com/YooYoo67/ProgressSQL](https://github.com/YooYoo67/ProgressSQL)\.


Shihao Zhang, Xiaoman Wang, Yuan Liu, Yunshi Lan (corresponding author), Weining Qian. East China Normal University. shzhang@stu.ecnu.edu.cn, yslan@dase.ecnu.edu.cn

## 1 Introduction

Large Language Models (LLMs) have significantly advanced Text-to-SQL parsing (Li et al., [2024](https://arxiv.org/html/2606.06825#bib.bib31), [2025](https://arxiv.org/html/2606.06825#bib.bib27)). Recent Reinforcement Learning (RL) methods further improve these models by optimizing one-shot rewards (Pourreza et al., [2025b](https://arxiv.org/html/2606.06825#bib.bib1); Zhang et al., [2025](https://arxiv.org/html/2606.06825#bib.bib14); Weng et al., [2025](https://arxiv.org/html/2606.06825#bib.bib24); Ma et al., [2026](https://arxiv.org/html/2606.06825#bib.bib5)), where a single-turn rollout is scored by a one-shot reward computed solely from the execution result of the SQL. However, such a reward is often sparse and provides limited guidance on SQL generation. This makes the search for correct SQL inefficient, especially for complex queries involving joins and aggregation (Pourreza et al., [2025b](https://arxiv.org/html/2606.06825#bib.bib1)).

![Refer to caption](https://arxiv.org/html/2606.06825v1/x1.png)

Figure 1: Comparison of reward paradigms. (a) Single-turn Rollout: the policy model generates a single SQL and receives a reward signal after execution. (b) Multi-turn Rollout with Progressive Reward (Ours): the policy model iteratively refines its SQL over $T$ turns guided by the ODT engine. The progressive reward measures improvement from the first SQL to the final SQL.

A recent study, SkyRL-SQL ([Liu et al.](https://arxiv.org/html/2606.06825#bib.bib20)), introduced multi-turn rollout for the Text-to-SQL task, in which the LLM interacts with the database engine over multiple turns. The LLM generates a sequence of SQLs for a question. At each turn, the LLM obtains the execution result from the engine and takes it into account when generating the next turn. Nevertheless, SkyRL-SQL computes the reward from the last-turn generation only, so it does not escape the limitation of the one-shot reward. In other words, a one-shot reward is not enough to capture the dynamic behavior of multi-turn rollout.

To address this limitation, we propose Progress\-SQL, a multi\-turn RL framework with progressive rewards for Text\-to\-SQL, as shown in Figure[1](https://arxiv.org/html/2606.06825#S1.F1)\. Specifically, we first introduce an Oracle\-guided Diagnostic Tree \(ODT\), which abstracts SQL queries into clause\-level structural profiles and generates diagnostic feedback for next\-turn refinement\. By comparing the predicted ODT with the gold ODT during training, the model can revise its SQL prediction according to the structured feedback\. Unlike one\-shot rewards, our progressive reward is defined over the SQL trajectory and measures whether the final SQL improves over the initial SQL in terms of structural and lexical alignment\. Together with a progression latency reward and an execution status reward, the objective favors trajectories that improve effectively, reach correctness earlier, and recover from invalid SQL predictions\. We evaluate our RL method on widely used Text\-to\-SQL benchmarks, including BIRD and Spider\. Based on the 7B backbone, our method improves the base model by an average of 8\.5% in execution accuracy across BIRD Dev, Spider Dev, and Spider Test, and by 6\.3% in test\-suite accuracy on Spider Dev\. Compared with LLMs tuned by existing RL methods, our method achieves competitive or superior performance after fine\-tuning\.

Our contributions are summarized as follows:

- We propose Progress-SQL, a multi-turn RL framework for Text-to-SQL. By defining an ODT for clause-level SQL diagnosis, Progress-SQL collects fine-grained feedback for next-turn SQL generation.
- We design a progressive reward that explicitly measures the improvement from the initial SQL to the final SQL, complemented by early-correctness and execution-status rewards for efficient and robust refinement.
- Extensive experiments on multiple Text-to-SQL benchmarks demonstrate that our method consistently improves both execution accuracy and test-suite accuracy with different base models.

## 2 Related Work

### 2.1 Text-to-SQL with Large Language Models

Driven by the strong performance of LLMs across NLP tasks, Text-to-SQL systems have shifted from heuristic rules and deep learning to LLMs: Zelle and Mooney ([1996](https://arxiv.org/html/2606.06825#bib.bib32)); Popescu et al. ([2003](https://arxiv.org/html/2606.06825#bib.bib33)); Li and Jagadish ([2014](https://arxiv.org/html/2606.06825#bib.bib34)); Yu et al. ([2018a](https://arxiv.org/html/2606.06825#bib.bib35)); Wang et al. ([2020](https://arxiv.org/html/2606.06825#bib.bib36)). In the early stages, LLM-based Text-to-SQL systems relied heavily on In-Context Learning (ICL) and structured prompting strategies. DIN-SQL (Pourreza and Rafiei, [2023](https://arxiv.org/html/2606.06825#bib.bib15)) uses prompting instructions to decompose complex queries into sub-problems, while DAIL-SQL (Gao et al., [2024](https://arxiv.org/html/2606.06825#bib.bib16)) constructs efficient few-shot demonstrations via question skeleton matching. More recent pipelines such as CHASE-SQL (Pourreza et al., [2025a](https://arxiv.org/html/2606.06825#bib.bib19)) and XiYan-SQL (Liu et al., [2026](https://arxiv.org/html/2606.06825#bib.bib18)) further combine candidate generation and selection strategies to push accuracy on competitive benchmarks. Alongside prompting, post-training has become a standard paradigm (Li et al., [2024](https://arxiv.org/html/2606.06825#bib.bib31), [2025](https://arxiv.org/html/2606.06825#bib.bib27)), improving the Text-to-SQL performance of open-source models on curated training splits.

### 2.2 Reinforcement Learning for Text-to-SQL

Regarding the post-training procedure, recent studies have framed Text-to-SQL as a sequential decision-making problem optimized with reinforcement learning, where reward design is crucial to measuring SQL quality. The most direct signal is binary execution accuracy (EX), but it is highly sparse because rewards are obtained only when the generated SQL matches the gold execution result. To provide denser supervision, Reward-SQL (Zhang et al., [2025](https://arxiv.org/html/2606.06825#bib.bib14)) introduces a Process Reward Model for step-wise reasoning supervision, while Reasoning-SQL (Pourreza et al., [2025b](https://arxiv.org/html/2606.06825#bib.bib1)) designs SQL-specific partial rewards such as schema-linking accuracy, n-gram similarity, and syntax validity. Graph-Reward-SQL (Weng et al., [2025](https://arxiv.org/html/2606.06825#bib.bib24)) further incorporates structural tree matching to capture logical alignment. SkyRL-SQL ([Liu et al.](https://arxiv.org/html/2606.06825#bib.bib20)) extends RL to multi-turn SQL refinement, but its reward is still derived from the final SQL state. Overall, these methods mainly define rewards over single or final SQL states, leaving trajectory-level revision behavior underexplored.

## 3 Preliminaries

### 3.1 Problem Definition

The Text-to-SQL task can be formulated as a semantic parsing problem that translates a natural language question into an executable SQL query. Formally, let $q=\{q_1,q_2,\dots,q_l\}$ denote the natural language question and $S$ denote the corresponding database schema (comprising tables, columns, and foreign key constraints). Given the input context $x=(q,S)$, the goal is to generate a target SQL query $y=\{w_1,w_2,\dots,w_m\}$, which consists of a sequence of tokens and correctly retrieves the answers from the database engine $\mathcal{E}$.

### 3.2 Reinforcement Learning Protocol

Existing studies (Schulman et al., [2017](https://arxiv.org/html/2606.06825#bib.bib21)) view the generation process through a parameterized policy $\pi_{\theta}(y\mid x)$, which is initialized from a pre-trained instruction-tuned LLM. We denote this initial model as the reference policy $\pi_{\text{ref}}$. The Text-to-SQL task can be solved via the reinforcement learning protocol, which optimizes the policy model by maximizing:

$$\mathcal{J}(\theta)=\mathbb{E}_{x\sim P(x),\,y\sim\pi_{\theta}}\left[\mathcal{R}(y,y^{*})-\beta\,\mathbb{D}_{\text{KL}}\left(\pi_{\theta}(\cdot\mid x)\,\|\,\pi_{\text{ref}}(\cdot\mid x)\right)\right],$$

where $x$ is sampled from a Text-to-SQL dataset, $\pi_{\text{ref}}$ is the initial reference model, and $\pi_{\theta}$ is iteratively updated. The KL divergence penalty prevents the updated policy from degrading its fundamental language capabilities during training. The reward function measures the distance between the generated SQL query $y$ and the gold SQL query $y^{*}$. Similar protocols with different RL algorithms, such as GRPO (Shao et al., [2024](https://arxiv.org/html/2606.06825#bib.bib22)) and GSPO (Zheng et al., [2025](https://arxiv.org/html/2606.06825#bib.bib12)), are also widely used to solve the task.

### 3.3 One-shot Reward Design

In the RL protocol, one of the core objectives is to maximize the reward $\mathcal{R}(y,y^{*})$, which can be regarded as a one-shot measurement between the generated SQL and the gold SQL. A number of one-shot rewards have been proposed for the Text-to-SQL task; we summarize them below.

Execution Matching. Measuring the consistency of the execution results between the generated SQL and the gold SQL is an intuitive way to evaluate the policy model. The standard approach formulates it as a binary reward ($1$ for an exact match, $0$ otherwise) (Pourreza et al., [2025b](https://arxiv.org/html/2606.06825#bib.bib1)). Because this reward signal is sparse, recent studies refine it with fractional execution rewards based on the proportion of matching columns and cells (Hao et al., [2025](https://arxiv.org/html/2606.06825#bib.bib2); Papicchio et al., [2025](https://arxiv.org/html/2606.06825#bib.bib3)).
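The binary execution-matching reward can be sketched as follows. This is a minimal illustration, not the implementation of any cited paper: the helper names, the toy `students` table, and the choice to compare results as unordered multisets of rows are all assumptions made for the example.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Return the result rows as a multiset, or None if the SQL fails to execute."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_match_reward(conn, predicted_sql, gold_sql):
    """Binary reward: 1.0 if both queries return the same multiset of rows, else 0.0."""
    predicted = run_query(conn, predicted_sql)
    gold_rows = run_query(conn, gold_sql)
    if predicted is None or gold_rows is None:
        return 0.0
    return 1.0 if predicted == gold_rows else 0.0


# Toy database (hypothetical schema, for illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    INSERT INTO students VALUES (1, 'Ann', 19), (2, 'Bob', 22), (3, 'Cy', 25);
    """
)
gold = "SELECT name FROM students WHERE age > 20"

# Different SQL text, same execution result: rewarded.
print(execution_match_reward(conn, "SELECT name FROM students WHERE age >= 22", gold))
# Executable but wrong result, and a syntax error: both receive zero.
print(execution_match_reward(conn, "SELECT name FROM students", gold))
print(execution_match_reward(conn, "SELEC name FROM students", gold))
```

The last two calls illustrate the sparsity discussed above: a nearly correct query and an unparsable one receive the same reward.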

Query Matching. To construct a denser reward landscape, researchers incorporate static structural similarities between $y$ and $y^{*}$. Reasoning-SQL (Pourreza et al., [2025b](https://arxiv.org/html/2606.06825#bib.bib1)) calculates the Jaccard similarity of extracted schema items and 2-grams between the generated and the annotated SQLs, thus guiding query alignment beyond database execution. Besides semantic matching, syntax is also considered in the reward (Pourreza et al., [2025b](https://arxiv.org/html/2606.06825#bib.bib1); Ali et al., [2025](https://arxiv.org/html/2606.06825#bib.bib4)).

Format and Process Regularization. With the advent of reasoning models (*e.g.*, DeepSeek-R1 (Guo et al., [2025](https://arxiv.org/html/2606.06825#bib.bib26))), enforcing specific Chain-of-Thought (CoT) behaviors has become integral to reward design. Models are incentivized via format rewards to encapsulate their reasoning processes within specific tags (*e.g.*, <think\> and <sql\>) (Ma et al., [2026](https://arxiv.org/html/2606.06825#bib.bib5); Papicchio et al., [2025](https://arxiv.org/html/2606.06825#bib.bib3); Pourreza et al., [2025b](https://arxiv.org/html/2606.06825#bib.bib1)). Furthermore, to prevent reward hacking and excessive verbosity, regularization terms such as tag count (Papicchio et al., [2025](https://arxiv.org/html/2606.06825#bib.bib3)) and length penalties (Ma et al., [2026](https://arxiv.org/html/2606.06825#bib.bib5)) are applied to penalize redundant reasoning. Some studies also incorporate schema keywords and runtime logs into the rewards (Berdnyk and Collery, [2025](https://arxiv.org/html/2606.06825#bib.bib6); Ma et al., [2026](https://arxiv.org/html/2606.06825#bib.bib5)).

## 4 Methodology

![Refer to caption](https://arxiv.org/html/2606.06825v1/x2.png)

Figure 2: Overall framework of Progress-SQL, our multi-turn reinforcement learning method for Text-to-SQL. The policy model iteratively generates SQL queries and receives ODT-based diagnostic feedback after each execution. The final trajectory is optimized using progressive rewards that combine structural/lexical alignment improvement, progression latency reward, execution-status transition reward, and format reward.

### 4.1 Multi-turn Rollout with ODT Feedback

Standard RL methods for Text-to-SQL usually implement single-turn rollout, where the policy generates a SQL query once and receives sparse rewards from the single decoded SQL. To address this limitation, we extend the standard rollout process into a multi-turn SQL debugging trajectory. Inspired by SkyRL-SQL ([Liu et al.](https://arxiv.org/html/2606.06825#bib.bib20)), we allow the policy model to iteratively revise its SQL prediction based on feedback from previous turns, as shown in Figure [2](https://arxiv.org/html/2606.06825#S4.F2). Notably, we introduce an Oracle-guided Diagnostic Tree (ODT) as the structural feedback after each attempt to refine the next-turn rollout.

Formally, given a user question $q$ and database schema $S$, the initial input is defined as:

$$x^{(1)}=(q,S).$$

At the $t$-th turn, the policy model $\pi_{\theta}$ generates a new SQL prediction as the rollout, conditioned on the input $x^{(t)}$:

$$y^{(t)}\sim\pi_{\theta}(\cdot\mid x^{(t)}).$$

To form $x^{(t)}$, we append the ODT structural feedback generated from the previous prediction:

$$\begin{aligned}
f^{(t-1)}&=\mathrm{ODT}(y^{(t-1)},y^{*}),\\
x^{(t)}&=(q,S,y^{(1)},f^{(1)},\dots,y^{(t-1)},f^{(t-1)}).
\end{aligned}$$

The rollout continues until the model produces a correct executable SQL or the maximum interaction budget $T$ is reached. This process yields a multi-turn trajectory $Y=(y^{(1)},y^{(2)},\dots,y^{(T)})$, which is then evaluated by our progressive rewards.
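The rollout procedure above can be sketched as a short loop. This is an illustrative sketch only: `policy`, `odt_feedback`, and `is_correct` are hypothetical callables standing in for the policy model, the ODT engine, and the execution-match check, and the prompt layout is an assumption rather than the paper's template.

```python
from typing import Callable, List, Tuple


def multi_turn_rollout(
    question: str,
    schema: str,
    gold_sql: str,
    policy: Callable[[str], str],
    odt_feedback: Callable[[str, str], str],
    is_correct: Callable[[str, str], bool],
    max_turns: int = 4,
) -> Tuple[List[str], List[str]]:
    """Collect the SQL trajectory and the feedback strings shown to the policy."""
    context = [f"Question: {question}", f"Schema: {schema}"]  # x(1) = (q, S)
    trajectory: List[str] = []
    feedbacks: List[str] = []
    for turn in range(1, max_turns + 1):
        sql = policy("\n".join(context))  # y(t) sampled from the policy given x(t)
        trajectory.append(sql)
        # Stop at the first correct SQL or when the turn budget is exhausted.
        if is_correct(sql, gold_sql) or turn == max_turns:
            break
        feedback = odt_feedback(sql, gold_sql)  # f(t) = ODT(y(t), y*)
        feedbacks.append(feedback)
        context += [f"SQL: {sql}", f"Feedback: {feedback}"]  # extends x(t) to x(t+1)
    return trajectory, feedbacks


# Stub components, purely for demonstration.
canned = [
    "SELECT name FROM students",
    "SELECT name FROM students WHERE age > 20",
]
gold_sql = canned[-1]


def scripted_policy(prompt: str) -> str:
    # Replays canned answers based on how many earlier attempts are in the prompt.
    turn_index = prompt.count("\nSQL: ")
    return canned[min(turn_index, len(canned) - 1)]


trajectory, feedbacks = multi_turn_rollout(
    "Which students are older than 20?",
    "students(id, name, age)",
    gold_sql,
    scripted_policy,
    odt_feedback=lambda pred, gold: "WHERE_MISSING",
    is_correct=lambda pred, gold: pred.strip() == gold.strip(),
)
print(trajectory)
print(feedbacks)
```

In this stub run the second attempt matches the gold SQL, so the loop stops after two turns with one feedback string in the context.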

Oracle-guided Diagnostic Tree. Directly comparing a predicted SQL with the gold SQL is non-trivial due to the compositional and nested syntax of SQL queries. Following the common practice of using Abstract Syntax Trees (ASTs) to capture SQL syntactic structures, we parse both the predicted SQL $y$ and the gold SQL $y^{*}$ into ASTs. Rather than directly matching raw AST nodes, we abstract each SQL into an Oracle-guided Diagnostic Tree (ODT). As shown in Figure [2](https://arxiv.org/html/2606.06825#S4.F2), each node in the tree represents a clause-level structural profile, such as selected columns, join signatures, filtering predicates, grouping columns, ordering clauses, or nested subqueries. By comparing the ODTs of $y$ and $y^{*}$, the diagnostic module produces two outputs: a structural similarity score and a set of clause-level error tags. The error tags are verbalized into diagnostic feedback for the next-turn refinement, while the structural similarity score serves as the structural alignment term in our reward function (refer to Section [4.2](https://arxiv.org/html/2606.06825#S4.SS2)). For example, in Figure [2](https://arxiv.org/html/2606.06825#S4.F2), the first-turn SQL only selects student names with an age predicate, while the gold SQL requires department-level aggregation by joining `students` with `departments` and grouping by department. The ODT comparison detects mismatches in selection, join, and grouping structures, producing tags such as `SELECT_ERROR`, `JOIN_MISSING`, and `GROUP_BY_MISSING`. These tags are verbalized as feedback to guide the next-turn SQL refinement.

ODT acts as a fixed, non-differentiable diagnostic component of the environment. It compares the predicted SQL with the gold SQL and appends discrete error tags as next-turn observations; these tags do not participate in back-propagation, and gradients are computed only through the policy model over generated tokens. Details of the multi-turn algorithm and ODT construction are provided in Appendix [C](https://arxiv.org/html/2606.06825#A3) and Appendix [D](https://arxiv.org/html/2606.06825#A4), respectively.
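To make the tag-generation step concrete, the sketch below compares hand-written clause profiles and emits error tags. It is a simplified stand-in, not the ODT construction of Appendix D: the profiles are written by hand instead of being derived from an AST, the tagging rule (`_MISSING`, `_EXTRA`, `_ERROR`) and the feedback wording are assumptions, and only the three tag names quoted above come from the paper.

```python
CLAUSES = ["SELECT", "JOIN", "WHERE", "GROUP_BY", "ORDER_BY"]


def diagnose(pred_profile, gold_profile):
    """Compare clause-level feature sets and return a list of error tags."""
    tags = []
    for clause in CLAUSES:
        pred = pred_profile.get(clause, set())
        gold = gold_profile.get(clause, set())
        if pred == gold:
            continue
        if not pred:
            tags.append(f"{clause}_MISSING")  # gold uses the clause, prediction does not
        elif not gold:
            tags.append(f"{clause}_EXTRA")    # hypothetical tag for a superfluous clause
        else:
            tags.append(f"{clause}_ERROR")    # both use the clause, contents differ
    return tags


def verbalize(tags):
    """Turn tags into a feedback sentence (the wording is an assumption)."""
    if not tags:
        return "No structural mismatch detected."
    return "Please revise the following clauses: " + ", ".join(tags) + "."


# Hand-written profiles loosely following the Figure 2 example;
# column names and literal placeholders are illustrative.
pred_profile = {
    "SELECT": {"students.name"},
    "WHERE": {"students.age > ?"},
}
gold_profile = {
    "SELECT": {"departments.name", "COUNT(*)"},
    "JOIN": {"students.dept_id = departments.id"},
    "WHERE": {"students.age > ?"},
    "GROUP_BY": {"departments.name"},
}

tags = diagnose(pred_profile, gold_profile)
print(tags)
print(verbalize(tags))
```

Because the comparison returns only discrete tags and text, it fits the description of ODT as a non-differentiable part of the environment.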

### 4.2 Progressive Reward Formulation

Given a multi-turn SQL trajectory $Y=(y^{(1)},y^{(2)},\dots,y^{(T)})$ and the gold SQL $y^{*}$, our goal is not only to evaluate the final prediction, but also to measure whether the trajectory makes meaningful progress. We therefore design a comprehensive reward composed of four parts: Progressive Alignment Reward, Progression Latency Reward, Execution Status Reward, and Format Reward.

#### Progressive Alignment Reward\.

To measure the similarity between the predicted and ground truth SQLs, we define both structural and lexical aspects of alignment:

- Structural Alignment. The structural alignment score is computed by the ODT described above. Specifically, we abstract both the generated SQL and the gold SQL into ODTs and compare their clause-level structural profiles. To focus on structural correspondence, lexical values are normalized or replaced with placeholders before matching. For each node, we compute a similarity score by combining local clause similarity and child-subtree similarity:

  $$s_{\mathrm{node}}=\alpha\cdot s_{\mathrm{local}}+(1-\alpha)\cdot s_{\mathrm{child}},$$

  where $s_{\mathrm{local}}$ measures the weighted Jaccard similarity over clause-level feature sets. For nested structures, $s_{\mathrm{child}}$ is computed recursively by matching child subtrees under node-type compatibility constraints. The final structural alignment score is given by the root node score:

  $$\mathcal{F}_{\mathrm{struct}}(y,y^{*})=s_{\mathrm{root}}.$$

  Hence, structural alignment measures clause-level structural consistency between $y$ and $y^{*}$ while reducing sensitivity to surface lexical differences. The detailed ODT construction and scoring are provided in Appendix [D](https://arxiv.org/html/2606.06825#A4).
- Lexical Alignment. Structural parsing may fail when the generated SQL contains severe syntax errors, especially during early training. To avoid zero-reward regions in such cases, we further introduce a lexical fallback score. Specifically, we compute the 2-gram Jaccard similarity between the tokenized generated SQL and the gold SQL, denoted as $\mathcal{F}_{\mathrm{lex}}(y,y^{*})$. The lexical score provides a dense signal when structural parsing is unavailable.
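The two alignment scores can be sketched as follows. Several simplifications are assumed here and are not taken from the paper: plain (unweighted) Jaccard replaces the weighted variant, a leaf node is scored by its local similarity alone, child subtrees are matched greedily by node type, $\alpha=0.5$ is a placeholder value, and SQL is tokenized by whitespace.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    kind: str                      # clause type, e.g. "QUERY", "SELECT", "WHERE"
    features: Set[str]             # clause-level features, literals already normalized
    children: List["Node"] = field(default_factory=list)


def jaccard(a, b):
    """Plain Jaccard similarity; two empty sets count as identical."""
    union = a | b
    return 1.0 if not union else len(a & b) / len(union)


def node_score(pred, gold, alpha=0.5):
    """s_node = alpha * s_local + (1 - alpha) * s_child, computed recursively."""
    s_local = jaccard(pred.features, gold.features)
    if not pred.children and not gold.children:
        return s_local  # assumption: a leaf has no child term
    unused = list(pred.children)
    matched = 0.0
    for g in gold.children:
        # Node-type compatibility: only children of the same kind may be matched.
        candidates = [p for p in unused if p.kind == g.kind]
        if candidates:
            scores = [node_score(p, g, alpha) for p in candidates]
            best = max(range(len(candidates)), key=scores.__getitem__)
            matched += scores[best]
            unused.remove(candidates[best])
    s_child = matched / max(len(pred.children), len(gold.children))
    return alpha * s_local + (1 - alpha) * s_child


def bigrams(sql):
    tokens = sql.lower().split()
    return set(zip(tokens, tokens[1:]))


def lexical_score(pred_sql, gold_sql):
    """2-gram Jaccard similarity over whitespace tokens."""
    return jaccard(bigrams(pred_sql), bigrams(gold_sql))


pred_tree = Node("QUERY", {"select", "from"}, [Node("SELECT", {"name"})])
gold_tree = Node(
    "QUERY",
    {"select", "from", "where"},
    [Node("SELECT", {"name"}), Node("WHERE", {"age > ?"})],
)
f_struct = node_score(pred_tree, gold_tree)
f_lex = lexical_score(
    "SELECT name FROM students",
    "SELECT name FROM students WHERE age > 20",
)
print(round(f_struct, 4), round(f_lex, 4))
```

In this toy case the prediction shares three of the seven distinct 2-grams with the gold SQL, so the lexical score is 3/7, and the missing `WHERE` subtree halves the child term of the structural score.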

Rather than rewarding only endpoint correctness, we explicitly measure whether the final SQL improves over the initial prediction\. The final alignment score is defined as:

$$\begin{aligned}
\mathcal{F}(y,y^{*})&=\frac{1}{2}\left(\mathcal{F}_{\mathrm{struct}}(y,y^{*})+\mathcal{F}_{\mathrm{lex}}(y,y^{*})\right),\\
\Delta&=\mathcal{F}(y^{(T)},y^{*})-\mathcal{F}(y^{(1)},y^{*}).
\end{aligned}$$

This endpoint-difference design reduces the influence of intermediate oscillations and encourages the trajectory to end with a better SQL query than it starts with.

The progressive reward is defined as:

$$\mathcal{R}_{\mathrm{align}}=\begin{cases}\omega_{\mathrm{align}}^{+}\cdot\Delta,&\text{if }\Delta>0,\\ \omega_{\mathrm{align}}^{-},&\text{if }\Delta\leq 0,\end{cases}\tag{1}$$

where $\omega_{\mathrm{align}}^{+}$ controls the reward for positive progress and $\omega_{\mathrm{align}}^{-}$ penalizes stagnant trajectories. This encourages the model to perform meaningful revisions rather than repeating its previous output.
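Equation (1) translates directly into a few lines of code. The weight values below are placeholders chosen for illustration (the paper's values are given in its Appendix B, which is not reproduced here); only the sign convention, a negative $\omega_{\mathrm{align}}^{-}$ acting as a penalty, follows the text.

```python
def progressive_alignment_reward(f_first, f_final, w_pos=1.0, w_neg=-0.5):
    """Reward the improvement in alignment score from the first to the final SQL.

    f_first and f_final are the combined alignment scores F(y(1), y*) and F(y(T), y*).
    w_pos and w_neg are assumed placeholder weights, not the paper's settings.
    """
    delta = f_final - f_first
    if delta > 0:
        return w_pos * delta  # scaled by how much the trajectory improved
    return w_neg              # flat penalty for stagnation or regression


improved = progressive_alignment_reward(0.25, 0.75)
stagnant = progressive_alignment_reward(0.5, 0.5)
regressed = progressive_alignment_reward(0.75, 0.25)
print(improved, stagnant, regressed)
```

Note that a trajectory that gets worse receives the same flat penalty as one that does not move, since the penalty branch does not depend on the size of $\Delta$.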

#### Progression Latency Reward\.

To encourage prompt correction, we terminate the rollout once the generated SQL first matches the execution result of the gold SQL. Let $k^{*}$ denote the first successful turn, where success means that the execution result matches that of the gold SQL. We apply a geometric decay to late corrections:

$$\mathcal{R}_{\mathrm{late}}=\begin{cases}\omega_{\mathrm{acc}}\cdot\gamma^{k^{*}-1},&1\leq k^{*}\leq T,\\ 0,&\text{otherwise},\end{cases}\tag{2}$$

where $\omega_{\mathrm{acc}}>0$ controls the base reward for execution correctness and $\gamma\in(0,1)$ is the per-turn decay factor. Earlier successful turns therefore receive larger rewards, encouraging the model to produce a correct SQL faster.
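A small sketch of Equation (2) shows how quickly the reward decays. The defaults $T=4$ and $\gamma=0.5$ match the settings reported in the implementation details; $\omega_{\mathrm{acc}}=1.0$ is an assumed placeholder, and `None` is used here to represent a trajectory that never succeeds.

```python
def latency_reward(first_correct_turn, max_turns=4, w_acc=1.0, gamma=0.5):
    """Geometric decay over the first turn whose execution result matches the gold SQL."""
    if first_correct_turn is None or not (1 <= first_correct_turn <= max_turns):
        return 0.0  # the trajectory never reached a correct SQL within the budget
    return w_acc * gamma ** (first_correct_turn - 1)


rewards_by_turn = [latency_reward(k) for k in (1, 2, 3, 4)]
print(rewards_by_turn)        # [1.0, 0.5, 0.25, 0.125]
print(latency_reward(None))   # 0.0
```

With $\gamma=0.5$, each additional turn needed to reach a correct SQL halves this component, so a fix at turn 4 earns one eighth of a first-turn success.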

#### Execution Status Reward\.

Execution correctness can be too sparse, but executability itself provides useful information about whether the model is recovering from invalid SQL. We therefore reward status transitions in executability between the initial and final predictions. Let $\mathcal{E}(y)$ be a boolean indicator that returns true if $y$ can be executed by the database engine without syntax or runtime errors. We define:

$$\mathcal{R}_{\mathrm{exec}}=\begin{cases}\omega_{\mathrm{keep}},&\text{if }\mathcal{E}(y^{(1)})\land\mathcal{E}(y^{(T)}),\\ \omega_{\mathrm{rec}},&\text{if }\neg\mathcal{E}(y^{(1)})\land\mathcal{E}(y^{(T)}),\\ \omega_{\mathrm{det}},&\text{if }\mathcal{E}(y^{(1)})\land\neg\mathcal{E}(y^{(T)}),\\ 0,&\text{otherwise}.\end{cases}\tag{3}$$

Here, $\omega_{\mathrm{keep}}>0$ rewards SQLs that remain executable across refinement, $\omega_{\mathrm{rec}}>0$ rewards recovery from an initially unexecutable SQL to an executable one, and $\omega_{\mathrm{det}}<0$ penalizes deterioration from an executable SQL to an invalid one.
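The four transition cases of Equation (3) can be sketched with SQLite as the engine. The weight magnitudes and the toy table are assumptions for illustration; only the signs of the weights follow the definition above.

```python
import sqlite3


def is_executable(conn, sql):
    """E(y): True if the engine runs the query without a syntax or runtime error."""
    try:
        conn.execute(sql).fetchall()
        return True
    except sqlite3.Error:
        return False


def execution_status_reward(first_ok, final_ok, w_keep=0.1, w_rec=0.3, w_det=-0.3):
    """Reward the executability transition between the first and final SQL."""
    if first_ok and final_ok:
        return w_keep   # stayed executable
    if not first_ok and final_ok:
        return w_rec    # recovered from an invalid SQL
    if first_ok and not final_ok:
        return w_det    # deteriorated into an invalid SQL
    return 0.0          # invalid at both ends


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER, name TEXT, age INTEGER)")

first_sql = "SELECT nme FROM students"   # misspelled column: fails at runtime
final_sql = "SELECT name FROM students"  # valid

first_ok = is_executable(conn, first_sql)
final_ok = is_executable(conn, final_sql)
print(first_ok, final_ok, execution_status_reward(first_ok, final_ok))
```

Here the trajectory starts with an invalid column reference and ends with a runnable query, so it falls into the recovery case.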

#### Format Reward\.

Following prior endpoint-reward designs (Ma et al., [2026](https://arxiv.org/html/2606.06825#bib.bib5); Papicchio et al., [2025](https://arxiv.org/html/2606.06825#bib.bib3); Pourreza et al., [2025b](https://arxiv.org/html/2606.06825#bib.bib1)), we define the format reward as $\mathcal{R}_{\mathrm{fmt}}=\omega_{\mathrm{fmt}}$ if the final output follows the required <think\> and <sql\> templates, and $\mathcal{R}_{\mathrm{fmt}}=0$ otherwise.
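A format check of this kind is often implemented with a regular expression. The sketch below assumes that the template is one reasoning block followed by one SQL block, each with a closing tag; the exact template and the weight value are assumptions, since they are not spelled out in this section.

```python
import re

# Assumed template: <think>...</think> followed by <sql>...</sql>, nothing else.
# The pattern is deliberately permissive and does not reject nested or repeated tags.
_TEMPLATE = re.compile(r"\s*<think>.+?</think>\s*<sql>.+?</sql>\s*", re.DOTALL)


def format_reward(output, w_fmt=0.1):
    """Return w_fmt if the whole output matches the assumed template, else 0.0."""
    return w_fmt if _TEMPLATE.fullmatch(output) else 0.0


well_formed = "<think>Filter by age.</think>\n<sql>SELECT name FROM students WHERE age > 20</sql>"
missing_tags = "SELECT name FROM students WHERE age > 20"
unclosed = "<think>Filter by age.<sql>SELECT name FROM students</sql>"

print(format_reward(well_formed), format_reward(missing_tags), format_reward(unclosed))
```

Using `fullmatch` instead of `search` means that any text outside the two blocks makes the output fail the check.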

#### Overall Reward\.

The final trajectory\-level reward is the sum of all components:

$$\mathcal{R}(Y)=\mathcal{R}_{\mathrm{fmt}}+\mathcal{R}_{\mathrm{exec}}+\mathcal{R}_{\mathrm{late}}+\mathcal{R}_{\mathrm{align}}.\tag{4}$$

This reward jointly evaluates format validity, executability transition, endpoint correctness, and the improvement of progressive alignment over the multi-turn trajectory.
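As a worked illustration of Equation (4), the sketch below scores a trajectory from a handful of summary statistics. Every weight in `DEFAULT_WEIGHTS` is a placeholder (apart from $\gamma=0.5$, which matches the reported setting), and the function signature is hypothetical, so the printed totals show only how the components combine, not the values used in training.

```python
# Placeholder weights; the paper's actual values are listed in its Appendix B.
DEFAULT_WEIGHTS = {
    "fmt": 0.1, "keep": 0.1, "rec": 0.3, "det": -0.3,
    "acc": 1.0, "gamma": 0.5, "align_pos": 1.0, "align_neg": -0.5,
}


def trajectory_reward(f_first, f_final, first_ok, final_ok,
                      first_correct_turn, format_ok, weights=DEFAULT_WEIGHTS):
    """Sum the four reward components for one multi-turn trajectory."""
    w = weights
    delta = f_final - f_first
    r_align = w["align_pos"] * delta if delta > 0 else w["align_neg"]
    r_late = (w["acc"] * w["gamma"] ** (first_correct_turn - 1)
              if first_correct_turn else 0.0)
    if final_ok:
        r_exec = w["keep"] if first_ok else w["rec"]
    else:
        r_exec = w["det"] if first_ok else 0.0
    r_fmt = w["fmt"] if format_ok else 0.0
    parts = {"fmt": r_fmt, "exec": r_exec, "late": r_late, "align": r_align}
    parts["total"] = sum(parts.values())
    return parts


# A trajectory that starts invalid and is fixed at turn 3.
recovered = trajectory_reward(0.25, 1.0, first_ok=False, final_ok=True,
                              first_correct_turn=3, format_ok=True)
# A trajectory that stays executable but never improves or succeeds.
stagnant = trajectory_reward(0.5, 0.5, first_ok=True, final_ok=True,
                             first_correct_turn=None, format_ok=True)
print(recovered)
print(stagnant)
```

Under these placeholder weights the recovering trajectory collects positive terms from all four components, while the stagnant one is dominated by the alignment penalty.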

Reward Analysis. Figure [2](https://arxiv.org/html/2606.06825#S4.F2) illustrates the theoretical upper bound of the reward for different first correct turns $k^{*}$, with the format and execution status held fixed. We configure the weights (refer to Appendix [B](https://arxiv.org/html/2606.06825#A2)) so that this upper bound shows an overall downward trend, as shown in Figure [2](https://arxiv.org/html/2606.06825#S4.F2). The trend shows that the reward design explicitly favors earlier positive progression and encourages success in earlier turns of the rollout.

## 5 Experiments

### 5.1 Experimental Setup

Datasets\. We conduct experiments on widely recognized Text\-to\-SQL benchmarks:

- BIRD (Li et al., [2023](https://arxiv.org/html/2606.06825#bib.bib7)): A large-scale, cross-domain dataset featuring complex reasoning questions and real-world database schemas. It is highly challenging and serves as our primary testbed for multi-turn reasoning capabilities.
- Spider & its variants: We also evaluate on the standard Spider (Yu et al., [2018b](https://arxiv.org/html/2606.06825#bib.bib8)) dataset to assess cross-domain generalization. To further test the model's robustness against schema synonyms and domain knowledge perturbations, we report results on its challenging variants: Spider-Syn (Gan et al., [2021a](https://arxiv.org/html/2606.06825#bib.bib9)), Spider-Realistic (Deng et al., [2021](https://arxiv.org/html/2606.06825#bib.bib10)), and Spider-DK (Gan et al., [2021b](https://arxiv.org/html/2606.06825#bib.bib11)).

Table 1: Performance comparison on the primary Text-to-SQL benchmarks. All comparative 7B baselines post-trained with RL algorithms are based on Qwen2.5-Coder-7B-Instruct. We report EX and TS where available. Numbers in red indicate absolute gains over the same-scale base model. † indicates results reproduced by us following the data construction described in the corresponding papers and without additional task-specific supervised fine-tuning.

Evaluation Metrics. We employ two standard automatic evaluation metrics used in recent Text-to-SQL research:

- Execution accuracy (EX): Measures whether the execution output of the generated SQL exactly matches the result of the ground-truth SQL on the target database.
- Test-Suite accuracy (TS): A more rigorous metric that evaluates the generated SQL across multiple augmented database test cases to ensure generalization and prevent false positives caused by coincidental execution matches.

Baselines\. We compare our models against two categories of strong baselines, corresponding to the groupings in Table[1](https://arxiv.org/html/2606.06825#S5.T1)\.

- Base and Supervised Fine-tuned Models: This category includes our backbone models, Qwen2.5-Coder-7B/14B-Instruct, evaluated before applying our RL training, as well as supervised fine-tuned Text-to-SQL models such as OmniSQL (Li et al., [2025](https://arxiv.org/html/2606.06825#bib.bib27)) and SFT CodeS (Li et al., [2024](https://arxiv.org/html/2606.06825#bib.bib31)). This comparison measures the gains brought by our RL framework beyond standard supervised or instruction-tuned models.
- Comparative RL methods: This category covers recent Text-to-SQL systems trained with reinforcement learning or reasoning-oriented post-training objectives, including both single-turn and multi-turn RL methods. All models in this group are based on Qwen2.5-Coder-7B-Instruct, making them directly comparable to our 7B model under the same backbone family. Specifically, we include Reasoning-SQL (Pourreza et al., [2025b](https://arxiv.org/html/2606.06825#bib.bib1)), Reward-SQL (Zhang et al., [2025](https://arxiv.org/html/2606.06825#bib.bib14)), Graph-Reward-SQL (Weng et al., [2025](https://arxiv.org/html/2606.06825#bib.bib24)), SQL-R1 (Ma et al., [2026](https://arxiv.org/html/2606.06825#bib.bib5)), SkyRL-SQL ([Liu et al.](https://arxiv.org/html/2606.06825#bib.bib20)), and SQL-Trail (Hua et al., [2026](https://arxiv.org/html/2606.06825#bib.bib30)). The first four methods use single-turn rollouts with one-shot rewards, while SkyRL-SQL and SQL-Trail introduce multi-turn refinement but still optimize rewards mainly associated with final SQL states. These baselines provide direct comparisons to our progressive multi-turn RL framework.

Implementation Details. We adopt Qwen2.5-Coder-7B/14B-Instruct (Hui et al., [2024](https://arxiv.org/html/2606.06825#bib.bib29)) as base models and train them with GRPO (Shao et al., [2024](https://arxiv.org/html/2606.06825#bib.bib22)) using the `verl` framework (Sheng et al., [2024](https://arxiv.org/html/2606.06825#bib.bib13)). Following prior multi-turn Text-to-SQL training setups (Hua et al., [2026](https://arxiv.org/html/2606.06825#bib.bib30)), we sample $G=8$ rollouts per input with temperature $\tau=1.0$. Method-specific hyperparameters are selected by small-scale pilot experiments on the development set and then kept fixed across all experiments, with the maximum interaction turn set to $T=4$ and the per-turn decay factor set to $\gamma=0.5$. ODT is used only during training for feedback construction and progressive reward computation; evaluation follows standard single-turn inference without ODT feedback, with transfer behavior analyzed in Appendix [E.4](https://arxiv.org/html/2606.06825#A5.SS4). Test results are reported with majority voting over 8 samples (Vote@8). Full hyperparameters are provided in Appendix [B](https://arxiv.org/html/2606.06825#A2).
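For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage that it derives from the $G=8$ trajectory rewards sampled for one input. This is a generic illustration and not the `verl` implementation: the reward values are made up, and the use of the population standard deviation and the `eps` constant are assumptions (implementations differ on these details).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward by the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical trajectory-level rewards R(Y) for G = 8 rollouts of one question.
group_rewards = [1.4, 0.2, -0.5, 1.1, 0.0, 0.6, -0.5, 1.4]
advantages = group_relative_advantages(group_rewards)
print([round(a, 3) for a in advantages])

# If every rollout receives the same reward, the group carries no learning signal.
print(group_relative_advantages([0.7] * 8))
```

The second call illustrates why dense, trajectory-level rewards matter in this setting: when all sampled rollouts receive an identical sparse reward, every advantage is zero and the group contributes no gradient.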

### 5.2 Main Results

Table [1](https://arxiv.org/html/2606.06825#S5.T1) reports the performance on the widely recognized Text-to-SQL benchmarks. We make the following observations:

(1) LLMs fine-tuned with our RL method consistently improve over their corresponding base models at both model scales. For the 7B backbone, Progress-SQL achieves an average EX improvement of 8.5% across the three benchmarks, together with a 7.1% TS improvement on Spider Dev. For the 14B backbone, Progress-SQL achieves an average EX improvement of 4.4% across the three benchmarks, together with a 3.4% TS improvement on Spider Dev.

(2) Compared with existing post-training methods, our approach achieves competitive or superior performance across the evaluated benchmarks. In particular, Progress-SQL-7B reaches 87.1% EX on Spider Dev and 87.8% EX on Spider Test, surpassing recent 7B SQL RL methods (e.g., SQL-R1-7B and SkyRL-SQL-7B) on these benchmarks. Progress-SQL-14B further improves the results, suggesting that our progressive rewards can scale to stronger base models.

Table 2: Robustness evaluation on Spider variants. All comparative 7B RL baselines are based on Qwen2.5-Coder-7B-Instruct. We report EX and TS if they are available. Numbers in red denote absolute gains over the same-scale base model. † indicates results reproduced by us following the data construction described in the corresponding papers and without additional task-specific supervised fine-tuning.

### 5.3 Further Analysis

Robustness Evaluation. Table [2](https://arxiv.org/html/2606.06825#S5.T2) presents the robustness evaluation on Spider-Syn, Spider-Realistic, and Spider-DK, which perturb the Spider benchmark in different ways. Our models consistently improve over the corresponding Qwen2.5-Coder base models across all three variants, which demonstrates the robustness of our method in different Text-to-SQL scenarios. Moreover, our models achieve competitive results compared with existing post-training methods across the Spider variants, suggesting that the benefits of our post-training extend beyond the standard benchmarks.

![Refer to caption](https://arxiv.org/html/2606.06825v1/x3.png) (a) Effects of per-turn decay.
![Refer to caption](https://arxiv.org/html/2606.06825v1/x4.png) (b) Effects of maximum interaction turns $T$.

Figure 3: Effects of per-turn decay and interaction budget. (a) Removing per-turn decay leads to less stable multi-turn optimization and longer correction trajectories. (b) Performance improves up to $T=4$ and then decreases when more interaction turns are allowed.

Table 3: Ablation study on BIRD Dev and Spider Dev under Vote@8 decoding. Numbers in blue in parentheses denote absolute drops compared with the full reward.

Ablation Study. We conduct an ablation study on BIRD Dev and Spider Dev. Table [3](https://arxiv.org/html/2606.06825#S5.T3) reports the results under Vote@8 decoding. Removing components from the full reward leads to performance drops on both benchmarks. The Single-turn RL variant removes the multi-turn rollout and the progressive alignment reward $\mathcal{R}_{\mathrm{align}}$, reducing our framework to a standard one-shot RL setting where the model generates a single SQL and is optimized using only final-state rewards. Among all the variants, removing per-turn decay causes the largest performance drop. This indicates that per-turn decay is critical for preventing the policy from over-relying on late feedback. Figure [3(a)](https://arxiv.org/html/2606.06825#S5.F3.sf1) further illustrates the training dynamics behind this degradation. Without per-turn decay, the model is not penalized for solving the query in later turns and therefore tends to rely more heavily on ODT feedback to progressively repair its SQL. This increases the average number of interaction turns and allows the trajectory-level reward to improve, but weakens the model's first-attempt SQL generation ability. Consequently, when evaluated under the standard single-turn protocol without ODT feedback, the no-decay model performs substantially worse. More detailed analysis is provided in Appendix [E.2](https://arxiv.org/html/2606.06825#A5.SS2).
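To make the role of per-turn decay concrete, the sketch below assumes that the decayed accuracy term takes the geometric form $\gamma^{k^{*}-1}$ for a rollout that first succeeds at turn $k^{*}$, and 0 for a rollout that never succeeds. This form is an illustrative assumption; the exact definition is the one given in Section 4.2. The function name `decayed_accuracy` is hypothetical.

```python
def decayed_accuracy(first_success_turn, gamma=0.5):
    """Accuracy reward discounted by how late the rollout first succeeds.

    Assumed form: gamma ** (k* - 1); None means the rollout never succeeded.
    """
    if first_success_turn is None:
        return 0.0
    return gamma ** (first_success_turn - 1)


# With gamma = 0.5 and T = 4, success at turns 1..4 earns 1.0, 0.5, 0.25, 0.125.
with_decay = [decayed_accuracy(k) for k in range(1, 5)]

# gamma = 1.0 removes the decay: every successful turn earns the same reward,
# so this term no longer favors a correct first attempt over a late repair.
no_decay = [decayed_accuracy(k, gamma=1.0) for k in range(1, 5)]
```

Under this assumed form, a policy trained without decay is indifferent between solving the query at turn 1 and at turn 4, which is consistent with the longer trajectories reported for the no-decay variant.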

Impact of Maximum Interaction Turns. We study the maximum interaction budget $T$, which controls the number of refinement attempts in each rollout. As shown in Figure [3(b)](https://arxiv.org/html/2606.06825#S5.F3.sf2), performance improves as $T$ increases up to 4, suggesting that additional turns help correct SQL errors. However, larger budgets lead to performance degradation, indicating that more refinement is not always beneficial. Therefore, we set $T=4$ in all main experiments. Detailed analysis is provided in Appendix [E.3](https://arxiv.org/html/2606.06825#A5.SS3).

## 6 Conclusion

In this work, we proposed Progress\-SQL, a multi\-turn reinforcement learning framework with progressive rewards for Text\-to\-SQL\. Progress\-SQL leverages ODT\-based diagnostic feedback to guide iterative SQL refinement and optimizes a progressive reward that jointly encourages alignment improvement, early correctness, and recovery from invalid SQL\. Experiments on multiple Text\-to\-SQL benchmarks show consistent gains across primary and robustness evaluations, demonstrating the effectiveness of our framework for improving Text\-to\-SQL reasoning and refinement\.

## Limitations

Despite the effectiveness of our framework, two limitations remain. First, our ODT-based diagnostic feedback relies on the availability of gold SQL during training. ODT is therefore mainly designed as a training-time reinforcement learning signal rather than a test-time feedback module, since gold SQL is unavailable during standard inference. Nevertheless, our experiments show that such oracle-guided training feedback can be internalized by the policy model and improve its single-turn inference ability without using ODT at test time. Second, the ODT scorer depends on SQL parsing and structural abstraction. Although lexical alignment is introduced as a fallback signal for ill-formed SQL predictions, severely malformed queries or dialect-specific SQL constructs may still reduce the precision of structural diagnostics. Extending the framework to more robust SQL parsers, diverse SQL dialects, and real-world interactive database settings remains an important direction for future work.

## References

- A\. Ali, A\. Baheti, J\. Chang, T\. Chi, B\. Cui, A\. Drozdov, J\. Frankle, A\. Gupta, P\. Koppol, S\. Kulinski,et al\.\(2025\)A state\-of\-the\-art sql reasoning model using rlvr\.arXiv preprint arXiv:2509\.21459\.Cited by:[§3\.3](https://arxiv.org/html/2606.06825#S3.SS3.p3.2)\.
- M\. Berdnyk and M\. Collery \(2025\)Llm\-based sql generation with reinforcement learning\.InThe First Workshop on Neural Reasoning and Mathematical Discovery at AAAI’2025\. Workshop Paper,Cited by:[§3\.3](https://arxiv.org/html/2606.06825#S3.SS3.p4.1)\.
- X\. Deng, A\. H\. Awadallah, C\. Meek, O\. Polozov, H\. Sun, and M\. Richardson \(2021\)Structure\-grounded pretraining for text\-to\-sql\.InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL\-HLT 2021, Online, June 6\-11, 2021,K\. Toutanova, A\. Rumshisky, L\. Zettlemoyer, D\. Hakkani\-Tür, I\. Beltagy, S\. Bethard, R\. Cotterell, T\. Chakraborty, and Y\. Zhou \(Eds\.\),pp\. 1337–1350\.External Links:[Link](https://doi.org/10.18653/v1/2021.naacl-main.105),[Document](https://dx.doi.org/10.18653/V1/2021.NAACL-MAIN.105)Cited by:[2nd item](https://arxiv.org/html/2606.06825#S5.I1.i2.p1.1)\.
- Y\. Gan, X\. Chen, Q\. Huang, M\. Purver, J\. R\. Woodward, J\. Xie, and P\. Huang \(2021a\)Towards robustness of text\-to\-SQL models against synonym substitution\.Online,pp\. 2505–2515\.External Links:[Link](https://aclanthology.org/2021.acl-long.195),[Document](https://dx.doi.org/10.18653/v1/2021.acl-long.195)Cited by:[2nd item](https://arxiv.org/html/2606.06825#S5.I1.i2.p1.1)\.
- Y\. Gan, X\. Chen, and M\. Purver \(2021b\)Exploring underexplored limitations of cross\-domain text\-to\-sql generalization\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7\-11 November, 2021,M\. Moens, X\. Huang, L\. Specia, and S\. W\. Yih \(Eds\.\),pp\. 8926–8931\.External Links:[Link](https://doi.org/10.18653/v1/2021.emnlp-main.702),[Document](https://dx.doi.org/10.18653/V1/2021.EMNLP-MAIN.702)Cited by:[2nd item](https://arxiv.org/html/2606.06825#S5.I1.i2.p1.1)\.
- D\. Gao, H\. Wang, Y\. Li, X\. Sun, Y\. Qian, B\. Ding, and J\. Zhou \(2024\)Text\-to\-sql empowered by large language models: A benchmark evaluation\.Proc\. VLDB Endow\.17\(5\),pp\. 1132–1145\.External Links:[Link](https://www.vldb.org/pvldb/vol17/p1132-gao.pdf),[Document](https://dx.doi.org/10.14778/3641204.3641221)Cited by:[§2\.1](https://arxiv.org/html/2606.06825#S2.SS1.p1.1)\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi, X\. Zhang, X\. Yu, Y\. Wu, Z\. F\. Wu, Z\. Gou, Z\. Shao, Z\. Li, Z\. Gao, A\. Liu, B\. Xue, B\. Wang, B\. Wu, B\. Feng, C\. Lu, C\. Zhao, C\. Deng, C\. Ruan, D\. Dai, D\. Chen, D\. Ji, E\. Li, F\. Lin, F\. Dai, F\. Luo, G\. Hao, G\. Chen, G\. Li, H\. Zhang, H\. Xu, H\. Ding, H\. Gao, H\. Qu, H\. Li, J\. Guo, J\. Li, J\. Chen, J\. Yuan, J\. Tu, J\. Qiu, J\. Li, J\. L\. Cai, J\. Ni, J\. Liang, J\. Chen, K\. Dong, K\. Hu, K\. You, K\. Gao, K\. Guan, K\. Huang, K\. Yu, L\. Wang, L\. Zhang, L\. Zhao, L\. Wang, L\. Zhang, L\. Xu, L\. Xia, M\. Zhang, M\. Zhang, M\. Tang, M\. Zhou, M\. Li, M\. Wang, M\. Li, N\. Tian, P\. Huang, P\. Zhang, Q\. Wang, Q\. Chen, Q\. Du, R\. Ge, R\. Zhang, R\. Pan, R\. Wang, R\. J\. Chen, R\. L\. Jin, R\. Chen, S\. Lu, S\. Zhou, S\. Chen, S\. Ye, S\. Wang, S\. Yu, S\. Zhou, S\. Pan, S\. S\. Li, S\. Zhou, S\. Wu, T\. Yun, T\. Pei, T\. Sun, T\. Wang, W\. Zeng, W\. Liu, W\. Liang, W\. Gao, W\. Yu, W\. Zhang, W\. L\. Xiao, W\. An, X\. Liu, X\. Wang, X\. Chen, X\. Nie, X\. Cheng, X\. Liu, X\. Xie, X\. Liu, X\. Yang, X\. Li, X\. Su, X\. Lin, X\. Q\. Li, X\. Jin, X\. Shen, X\. Chen, X\. Sun, X\. Wang, X\. Song, X\. Zhou, X\. Wang, X\. Shan, Y\. K\. Li, Y\. Q\. Wang, Y\. X\. Wei, Y\. Zhang, Y\. Xu, Y\. Li, Y\. Zhao, Y\. Sun, Y\. Wang, Y\. Yu, Y\. Zhang, Y\. Shi, Y\. Xiong, Y\. He, Y\. Piao, Y\. Wang, Y\. Tan, Y\. Ma, Y\. Liu, Y\. Guo, Y\. Ou, Y\. Wang, Y\. Gong, Y\. Zou, Y\. He, Y\. Xiong, Y\. Luo, Y\. You, Y\. Liu, Y\. Zhou, Y\. X\. Zhu, Y\. Huang, Y\. Li, Y\. Zheng, Y\. Zhu, Y\. Ma, Y\. Tang, Y\. Zha, Y\. Yan, Z\. Z\. Ren, Z\. Ren, Z\. Sha, Z\. Fu, Z\. Xu, Z\. Xie, Z\. Zhang, Z\. Hao, Z\. Ma, Z\. Yan, Z\. Wu, Z\. Gu, Z\. Zhu, Z\. Liu, Z\. Li, Z\. Xie, Z\. Song, Z\. Pan, Z\. Huang, Z\. Xu, Z\. Zhang, and Z\. Zhang \(2025\)DeepSeek\-r1 incentivizes reasoning in llms through reinforcement learning\.Nature645\(8081\),pp\. 633–638\.External Links:ISSN 1476\-4687,[Link](http://dx.doi.org/10.1038/s41586-025-09422-z),[Document](https://dx.doi.org/10.1038/s41586-025-09422-z)Cited by:[§3\.3](https://arxiv.org/html/2606.06825#S3.SS3.p4.1)\.
- H\. Hao, W\. Hu, O\. Verkholyak, D\. A\. Tarzanagh, B\. Gutow, S\. Didari, M\. Faraki, H\. Moon, and S\. Min \(2025\)PaVeRL\-sql: text\-to\-sql via partial\-match rewards and verbal reinforcement learning\.arXiv preprint arXiv:2509\.07159\.Cited by:[§3\.3](https://arxiv.org/html/2606.06825#S3.SS3.p2.2)\.
- H\. Hua, Z\. Han, Z\. Shen, J\. Lee, P\. Guan, Q\. Zhu, S\. Jeoung, Y\. Chen, Y\. Bai, S\. Wang,et al\.\(2026\)SQL\-trail: multi\-turn reinforcement learning with interleaved feedback for text\-to\-sql\.arXiv preprint arXiv:2601\.17699\.Cited by:[2nd item](https://arxiv.org/html/2606.06825#S5.I3.i2.p1.1),[§5\.1](https://arxiv.org/html/2606.06825#S5.SS1.p5.4),[Table 1](https://arxiv.org/html/2606.06825#S5.T1.2.2.1),[Table 2](https://arxiv.org/html/2606.06825#S5.T2.2.2.1)\.
- B\. Hui, J\. Yang, Z\. Cui, J\. Yang, D\. Liu, L\. Zhang, T\. Liu, J\. Zhang, B\. Yu, K\. Lu,et al\.\(2024\)Qwen2\. 5\-coder technical report\.arXiv preprint arXiv:2409\.12186\.Cited by:[§5\.1](https://arxiv.org/html/2606.06825#S5.SS1.p5.4)\.
- F\. Li and H\. V\. Jagadish \(2014\)Constructing an interactive natural language interface for relational databases\.\.Proc\. VLDB Endow\.8\(1\),pp\. 73–84\.Cited by:[§2\.1](https://arxiv.org/html/2606.06825#S2.SS1.p1.1)\.
- H\. Li, S\. Wu, X\. Zhang, X\. Huang, J\. Zhang, F\. Jiang, S\. Wang, T\. Zhang, J\. Chen, R\. Shi, H\. Chen, and C\. Li \(2025\)OmniSQL: synthesizing high\-quality text\-to\-sql data at scale\.Proc\. VLDB Endow\.18\(11\),pp\. 4695–4709\.External Links:[Link](https://www.vldb.org/pvldb/vol18/p4695-li.pdf),[Document](https://dx.doi.org/10.14778/3749646.3749723)Cited by:[Appendix B](https://arxiv.org/html/2606.06825#A2.p1.1),[Appendix B](https://arxiv.org/html/2606.06825#A2.p3.1),[§1](https://arxiv.org/html/2606.06825#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.06825#S2.SS1.p1.1),[1st item](https://arxiv.org/html/2606.06825#S5.I3.i1.p1.1),[Table 1](https://arxiv.org/html/2606.06825#S5.T1.2.8.6.1),[Table 2](https://arxiv.org/html/2606.06825#S5.T2.2.8.6.1)\.
- H\. Li, J\. Zhang, H\. Liu, J\. Fan, X\. Zhang, J\. Zhu, R\. Wei, H\. Pan, C\. Li, and H\. Chen \(2024\)Codes: towards building open\-source language models for text\-to\-sql\.Proceedings of the ACM on Management of Data2\(3\),pp\. 1–28\.Cited by:[§1](https://arxiv.org/html/2606.06825#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.06825#S2.SS1.p1.1),[1st item](https://arxiv.org/html/2606.06825#S5.I3.i1.p1.1),[Table 1](https://arxiv.org/html/2606.06825#S5.T1.2.9.7.1),[Table 2](https://arxiv.org/html/2606.06825#S5.T2.2.9.7.1)\.
- J\. Li, B\. Hui, G\. Qu, J\. Yang, B\. Li, B\. Li, B\. Wang, B\. Qin, R\. Geng, N\. Huo, X\. Zhou, C\. Ma, G\. Li, K\. C\. Chang, F\. Huang, R\. Cheng, and Y\. Li \(2023\)Can LLM already serve as A database interface? A big bench for large\-scale database grounded text\-to\-sqls\.InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \- 16, 2023,A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2023/hash/83fc8fab1710363050bbd1d4b8cc0021-Abstract-Datasets%5C_and%5C_Benchmarks.html)Cited by:[Appendix B](https://arxiv.org/html/2606.06825#A2.p1.1),[1st item](https://arxiv.org/html/2606.06825#S5.I1.i1.p1.1)\.
- S\. Liu, A\. Zhu, S\. Hegde, S\. Cao, S\. Yuan, S\. Suwito, T\. Griggs, M\. Zaharia, J\. E\. Gonzalez, and I\. Stoica\. SkyRL\-sql: multi\-turn sql data agents via rl\.InFirst Workshop on Multi\-Turn Interactions in Large Language Models,Cited by:[§1](https://arxiv.org/html/2606.06825#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.06825#S2.SS2.p1.1),[§4\.1](https://arxiv.org/html/2606.06825#S4.SS1.p1.1),[2nd item](https://arxiv.org/html/2606.06825#S5.I3.i2.p1.1),[Table 1](https://arxiv.org/html/2606.06825#S5.T1.2.14.12.1),[Table 2](https://arxiv.org/html/2606.06825#S5.T2.2.12.10.1)\.
- Y\. Liu, Y\. Zhu, Y\. Gao, Z\. Luo, X\. Li, X\. Shi, Y\. Hong, J\. Gao, Y\. Li, B\. Ding, and J\. Zhou \(2026\)XiYan\-sql: A novel multi\-generator framework for text\-to\-sql\.IEEE Trans\. Knowl\. Data Eng\.38\(4\),pp\. 2474–2487\.External Links:[Link](https://doi.org/10.1109/TKDE.2026.3657851),[Document](https://dx.doi.org/10.1109/TKDE.2026.3657851)Cited by:[§2\.1](https://arxiv.org/html/2606.06825#S2.SS1.p1.1)\.
- P\. Ma, X\. Zhuang, C\. Xu, X\. Jiang, R\. Chen, and J\. Guo \(2026\)Sql\-r1: training natural language to sql reasoning model by reinforcement learning\.Advances in Neural Information Processing Systems38,pp\. 174505–174537\.Cited by:[§1](https://arxiv.org/html/2606.06825#S1.p1.1),[§3\.3](https://arxiv.org/html/2606.06825#S3.SS3.p4.1),[§4\.2](https://arxiv.org/html/2606.06825#S4.SS2.SSS0.Px4.p1.2),[2nd item](https://arxiv.org/html/2606.06825#S5.I3.i2.p1.1),[Table 1](https://arxiv.org/html/2606.06825#S5.T1.1.1.1),[Table 2](https://arxiv.org/html/2606.06825#S5.T2.1.1.1)\.
- S\. Papicchio, S\. Rossi, L\. Cagliero, and P\. Papotti \(2025\)Think2sql: reinforce llm reasoning capabilities for text2sql\.arXiv preprint arXiv:2504\.15077\.Cited by:[§3\.3](https://arxiv.org/html/2606.06825#S3.SS3.p2.2),[§3\.3](https://arxiv.org/html/2606.06825#S3.SS3.p4.1),[§4\.2](https://arxiv.org/html/2606.06825#S4.SS2.SSS0.Px4.p1.2)\.
- A\. Popescu, O\. Etzioni, and H\. Kautz \(2003\)Towards a theory of natural language interfaces to databases\.InProceedings of the 8th international conference on Intelligent user interfaces,pp\. 149–157\.Cited by:[§2\.1](https://arxiv.org/html/2606.06825#S2.SS1.p1.1)\.
- M\. Pourreza, H\. Li, R\. Sun, Y\. Chung, S\. Talaei, G\. T\. Kakkar, Y\. Gan, A\. Saberi, F\. Ozcan, and S\. Ö\. Arik \(2025a\)CHASE\-SQL: multi\-path reasoning and preference optimized candidate selection in text\-to\-sql\.InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025,External Links:[Link](https://openreview.net/forum?id=CvGqMD5OtX)Cited by:[§2\.1](https://arxiv.org/html/2606.06825#S2.SS1.p1.1)\.
- M\. Pourreza and D\. Rafiei \(2023\)DIN\-SQL: decomposed in\-context learning of text\-to\-sql with self\-correction\.InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \- 16, 2023,A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2023/hash/72223cc66f63ca1aa59edaec1b3670e6-Abstract-Conference.html)Cited by:[§2\.1](https://arxiv.org/html/2606.06825#S2.SS1.p1.1)\.
- M\. Pourreza, S\. Talaei, R\. Sun, X\. Wan, H\. Li, A\. Mirhoseini, A\. Saberi, S\. Arik,et al\.\(2025b\)Reasoning\-sql: reinforcement learning with sql tailored partial rewards for reasoning\-enhanced text\-to\-sql\.arXiv preprint arXiv:2503\.23157\.Cited by:[§1](https://arxiv.org/html/2606.06825#S1.p1.1),[§2\.2](https://arxiv.org/html/2606.06825#S2.SS2.p1.1),[§3\.3](https://arxiv.org/html/2606.06825#S3.SS3.p2.2),[§3\.3](https://arxiv.org/html/2606.06825#S3.SS3.p3.2),[§3\.3](https://arxiv.org/html/2606.06825#S3.SS3.p4.1),[§4\.2](https://arxiv.org/html/2606.06825#S4.SS2.SSS0.Px4.p1.2),[2nd item](https://arxiv.org/html/2606.06825#S5.I3.i2.p1.1),[Table 1](https://arxiv.org/html/2606.06825#S5.T1.2.11.9.1),[Table 2](https://arxiv.org/html/2606.06825#S5.T2.2.11.9.1)\.
- J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\)Proximal policy optimization algorithms\.arXiv preprint arXiv:1707\.06347\.Cited by:[§3\.2](https://arxiv.org/html/2606.06825#S3.SS2.p1.2)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. K\. Li, Y\. Wu, and D\. Guo \(2024\)DeepSeekMath: pushing the limits of mathematical reasoning in open language models\.External Links:2402\.03300,[Link](https://arxiv.org/abs/2402.03300)Cited by:[§E\.6](https://arxiv.org/html/2606.06825#A5.SS6.p1.1),[§3\.2](https://arxiv.org/html/2606.06825#S3.SS2.p1.7),[§5\.1](https://arxiv.org/html/2606.06825#S5.SS1.p5.4)\.
- G\. Sheng, C\. Zhang, Z\. Ye, X\. Wu, W\. Zhang, R\. Zhang, Y\. Peng, H\. Lin, and C\. Wu \(2024\)HybridFlow: a flexible and efficient rlhf framework\.arXiv preprint arXiv: 2409\.19256\.Cited by:[§5\.1](https://arxiv.org/html/2606.06825#S5.SS1.p5.4)\.
- B\. Wang, R\. Shin, X\. Liu, O\. Polozov, and M\. Richardson \(2020\)Rat\-sql: relation\-aware schema encoding and linking for text\-to\-sql parsers\.InProceedings of the 58th annual meeting of the association for computational linguistics,pp\. 7567–7578\.Cited by:[§2\.1](https://arxiv.org/html/2606.06825#S2.SS1.p1.1)\.
- H\. Weng, P\. Wu, L\. Cui, Y\. Zhan, B\. Liu, Y\. Song, D\. Zeng, Y\. Yang, Q\. Zhang, D\. Huang, X\. Yin, Y\. Sun, and X\. Chen \(2025\)Graph\-reward\-sql: execution\-free reinforcement learning for text\-to\-sql via graph matching and stepwise reward\.InFindings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, November 4\-9, 2025,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),pp\. 12917–12943\.External Links:[Link](https://aclanthology.org/2025.findings-emnlp.694/)Cited by:[§1](https://arxiv.org/html/2606.06825#S1.p1.1),[§2\.2](https://arxiv.org/html/2606.06825#S2.SS2.p1.1),[2nd item](https://arxiv.org/html/2606.06825#S5.I3.i2.p1.1),[Table 1](https://arxiv.org/html/2606.06825#S5.T1.2.13.11.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[§E\.7](https://arxiv.org/html/2606.06825#A5.SS7.p1.1)\.
- T\. Yu, Z\. Li, Z\. Zhang, R\. Zhang, and D\. Radev \(2018a\)Typesql: knowledge\-based type\-aware neural text\-to\-sql generation\.InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 \(Short Papers\),pp\. 588–594\.Cited by:[§2\.1](https://arxiv.org/html/2606.06825#S2.SS1.p1.1)\.
- T\. Yu, R\. Zhang, K\. Yang, M\. Yasunaga, D\. Wang, Z\. Li, J\. Ma, I\. Li, Q\. Yao, S\. Roman, Z\. Zhang, and D\. R\. Radev \(2018b\)Spider: A large\-scale human\-labeled dataset for complex and cross\-domain semantic parsing and text\-to\-sql task\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 \- November 4, 2018,E\. Riloff, D\. Chiang, J\. Hockenmaier, and J\. Tsujii \(Eds\.\),pp\. 3911–3921\.External Links:[Link](https://doi.org/10.18653/v1/d18-1425),[Document](https://dx.doi.org/10.18653/V1/D18-1425)Cited by:[Appendix B](https://arxiv.org/html/2606.06825#A2.p1.1),[2nd item](https://arxiv.org/html/2606.06825#S5.I1.i2.p1.1)\.
- J\. M\. Zelle and R\. J\. Mooney \(1996\)Learning to parse database queries using inductive logic programming\.InProceedings of the national conference on artificial intelligence,pp\. 1050–1055\.Cited by:[§2\.1](https://arxiv.org/html/2606.06825#S2.SS1.p1.1)\.
- Y\. Zhang, M\. Fan, J\. Fan, M\. Yi, Y\. Luo, J\. Tan, and G\. Li \(2025\)Reward\-sql: boosting text\-to\-sql via stepwise reasoning and process\-supervised rewards\.External Links:2505\.04671,[Link](https://arxiv.org/abs/2505.04671)Cited by:[§1](https://arxiv.org/html/2606.06825#S1.p1.1),[§2\.2](https://arxiv.org/html/2606.06825#S2.SS2.p1.1),[2nd item](https://arxiv.org/html/2606.06825#S5.I3.i2.p1.1),[Table 1](https://arxiv.org/html/2606.06825#S5.T1.2.12.10.1)\.
- C\. Zheng, S\. Liu, M\. Li, X\. Chen, B\. Yu, C\. Gao, K\. Dang, Y\. Liu, R\. Men, A\. Yang, J\. Zhou, and J\. Lin \(2025\)Group sequence policy optimization\.External Links:2507\.18071,[Link](https://arxiv.org/abs/2507.18071)Cited by:[§E\.6](https://arxiv.org/html/2606.06825#A5.SS6.p1.1),[§3\.2](https://arxiv.org/html/2606.06825#S3.SS2.p1.7)\.

## Appendix A Use of Large Language Models

The research presented in this paper, including the core ideas, experimental design, and quantitative results, is the original work of the authors\. A large language model was used as a writing assistant for tasks such as polishing prose, improving clarity, and correcting grammatical errors in the manuscript\. All final content was reviewed and edited by the authors to ensure it accurately reflects our research and contributions\.

## Appendix B Implementation Details

Training Data Construction. The RL training corpus is constructed from three sources: the BIRD training set (Li et al., [2023](https://arxiv.org/html/2606.06825#bib.bib7)), the Spider training set (Yu et al., [2018b](https://arxiv.org/html/2606.06825#bib.bib8)), and SynSQL-2.5M (Li et al., [2025](https://arxiv.org/html/2606.06825#bib.bib27)), a large-scale synthetic dataset covering diverse SQL complexity patterns. We randomly sample a total of 20,000 training instances from the union of these sources for reinforcement learning.

Training Setup and Infrastructure. We adopt Qwen2.5-Coder-7B/14B-Instruct as our base models for all reinforcement learning experiments. The training pipeline is built on the verl framework and runs on 8 NVIDIA H100 (80GB) GPUs. We employ Group Relative Policy Optimization (GRPO) as our core algorithm, which does not require a separate value network (critic), thereby substantially reducing memory overhead and improving training efficiency. During the rollout phase, we sample $G=8$ outputs per input query with a temperature of $\tau=1.0$ and a maximum sequence length of 8192 tokens. The maximum number of interaction turns with the database engine is set to $T=4$. The full hyperparameters for both the multi-turn rollout and policy optimization phases are summarized in Table [4](https://arxiv.org/html/2606.06825#A2.T4).

Evaluation Setup. We follow the evaluation scripts and protocol released by OmniSQL (Li et al., [2025](https://arxiv.org/html/2606.06825#bib.bib27); [https://github.com/RUCKBReasoning/OmniSQL](https://github.com/RUCKBReasoning/OmniSQL)) to ensure comparability with recent Text-to-SQL baselines. At evaluation time, ODT-based diagnostic feedback is not used, and all models are evaluated under standard single-turn inference.

Reward Hyperparameters. The scalar weights for our multi-turn reward formulation (defined in Section [4.2](https://arxiv.org/html/2606.06825#S4.SS2)) are listed in Table [4](https://arxiv.org/html/2606.06825#A2.T4).

Table 4: Hyperparameters for RL training and progressive reward formulation.

## Appendix C Multi-turn Rollout with ODT Feedback

Algorithm [1](https://arxiv.org/html/2606.06825#alg1) summarizes the training-time rollout procedure. At each turn, the policy generates a SQL prediction and receives ODT-based diagnostic feedback if the prediction is not yet correct. Here, $T$ denotes the maximum turn budget, $k^{*}$ denotes the first successful turn if execution correctness is reached, and $K$ denotes the actual stopping turn. If the rollout succeeds early, then $K=k^{*}$; otherwise, $K=T$.

Algorithm 1: Multi-turn SQL Refinement with ODT Feedback

Input: Question $x$, database $D$, gold SQL $y^{*}$, policy model $\pi_{\theta}$, maximum turns $T$

Output: Trajectory $Y=\{y^{(1)},\ldots,y^{(K)}\}$ and final SQL $y^{(K)}$, where $K\leq T$

1. Initialize prompt $p^{(1)}\leftarrow x$, trajectory $Y\leftarrow\emptyset$
2. Initialize first successful turn $k^{*}\leftarrow\varnothing$ and stopping turn $K\leftarrow T$
3. for $t=1$ to $T$ do
4. Generate SQL prediction $y^{(t)}\sim\pi_{\theta}(\cdot\mid p^{(t)})$ and append it to $Y$
5. Execute $y^{(t)}$ on database $D$ to obtain execution status $\mathcal{E}(y^{(t)})$
6. if $y^{(t)}$ matches the execution result of gold SQL $y^{*}$ then
7. Set $k^{*}\leftarrow t$ and $K\leftarrow t$; break
8. end if
9. if $t<T$ then
10. Compare $y^{(t)}$ with $y^{*}$ using the ODT diagnostic module
11. Obtain clause-level diagnostic tags and verbalize them into feedback $f^{(t)}$
12. Update prompt $p^{(t+1)}\leftarrow\mathrm{Append}(p^{(t)},y^{(t)},f^{(t)})$
13. end if
14. end for
15. Compute alignment improvement $\Delta=\mathcal{F}(y^{(K)},y^{*})-\mathcal{F}(y^{(1)},y^{*})$
16. Compute total reward using alignment, decayed accuracy, execution status, and formatting terms
17. return Trajectory $Y$ and final SQL $y^{(K)}$
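The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the training code: the callables `policy`, `execute`, and `diagnose` are hypothetical stand-ins for the policy model, the database engine, and the ODT feedback module, and the reward computation of steps 15 and 16 is omitted.

```python
def rollout(question, gold_sql, policy, execute, diagnose, max_turns=4):
    """Run one multi-turn refinement rollout following Algorithm 1.

    policy(prompt) -> SQL string; execute(sql) -> comparable result, or None
    if the SQL fails; diagnose(pred, gold) -> feedback text.
    Returns (trajectory, first_success_turn k*, stopping_turn K).
    """
    prompt = question
    trajectory = []
    first_success = None   # k*
    stop_turn = max_turns  # K
    gold_result = execute(gold_sql)
    for t in range(1, max_turns + 1):
        pred = policy(prompt)
        trajectory.append(pred)
        result = execute(pred)
        if result is not None and result == gold_result:
            first_success = stop_turn = t
            break
        if t < max_turns:  # no feedback is needed after the last turn
            feedback = diagnose(pred, gold_sql)
            prompt = f"{prompt}\n{pred}\n{feedback}"
    return trajectory, first_success, stop_turn


# Toy stand-ins: a scripted policy that fixes its query on the third attempt.
attempts = iter(["SELECT nam FROM t", "SELECT id FROM t", "SELECT name FROM t"])
results = {"SELECT id FROM t": (1, 2), "SELECT name FROM t": ("a", "b")}
trajectory, k_star, K = rollout(
    question="List all names.",
    gold_sql="SELECT name FROM t",
    policy=lambda prompt: next(attempts),
    execute=results.get,  # unknown SQL -> None, i.e. an execution failure
    diagnose=lambda pred, gold: "Check the SELECT clause.",
)
```

In this toy run the rollout stops at the third turn, so $k^{*}=K=3$; a rollout that never succeeds returns $k^{*}=\varnothing$ (here `None`) and $K=T$.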

## Appendix D ODT Construction and Scoring

Implementation Overview. We implement the Oracle-guided Diagnostic Tree (ODT) module as a deterministic structural feedback component. In our implementation, SQL queries are parsed with JSQLParser 4.6 and converted into recursive structural profile trees. Given a predicted SQL $y$ and the gold SQL $y^{*}$, ODT extracts clause-level structural profiles, computes a structural similarity score, and generates diagnostic tags for next-turn feedback. The diagnostic feedback is used only during multi-turn RL training. At evaluation time, the model performs standard single-turn inference without ODT feedback.

SQL Parsing and Failure Handling. For each generated SQL, ODT first parses the query into an Abstract Syntax Tree (AST). Before parsing, we apply lightweight normalization, including removing redundant quotation marks, standardizing whitespace, lower-casing identifiers, and stripping trailing semicolons. If the predicted SQL cannot be parsed, ODT does not produce fine-grained structural tags. Instead, it returns a generic feedback message asking the model to first fix syntax and identifier errors. In this case, the structural similarity is set to zero, and the reward relies on lexical alignment and execution-status signals.

ODT Construction. Each parsed SQL is converted into a recursive ODT profile tree. Each node corresponds to a structural unit, including `ROOT`, `SELECT`, `CTE`, `SET_OP`, or `SUBQUERY`. A `SELECT` node contains a local structural profile. A `CTE` node represents a common table expression, a `SET_OP` node represents set operations such as `UNION`, `INTERSECT`, or `EXCEPT`, and a `SUBQUERY` node represents a nested query. Subqueries appearing in `FROM`, `WHERE`, `HAVING`, or SELECT expressions are recursively detected and attached as child nodes. This representation allows ODT to compare both flat clause-level structures and nested SQL compositions.

Local Structural Profile. For each `SELECT` node, ODT extracts a local structural profile containing selected columns, involved tables, join signatures, filtering predicates, grouping columns, having predicates, ordering signatures, DISTINCT usage, aggregation usage, and structural counts. Concretely, the profile contains:

$$p=(\mathbf{c},\mathbf{b},\mathcal{T},\mathcal{S},\mathcal{W},\mathcal{J},\mathcal{G},\mathcal{H},\mathcal{O}),$$

where $\mathbf{c}$ denotes count features such as JOIN count and SELECT-item count, and $\mathbf{b}$ denotes Boolean indicators for WHERE, GROUP BY, HAVING, ORDER BY, LIMIT, aggregation, and DISTINCT. The set $\mathcal{T}$ contains normalized table names, $\mathcal{S}$ contains SELECT projection signatures, $\mathcal{W}$ contains normalized WHERE predicates, $\mathcal{J}$ contains JOIN-condition signatures, $\mathcal{G}$ contains GROUP BY column signatures, $\mathcal{H}$ contains HAVING predicates, and $\mathcal{O}$ contains ORDER BY signatures.
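One way to picture the tree and the profile tuple is as a pair of small data classes. The sketch below is an illustrative assumption, not the authors' implementation (which builds on JSQLParser): the class names `SelectProfile` and `OdtNode`, the dictionary keys, and the signature strings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

NODE_KINDS = {"ROOT", "SELECT", "CTE", "SET_OP", "SUBQUERY"}


@dataclass
class SelectProfile:
    """Local profile p = (c, b, T, S, W, J, G, H, O) of a SELECT node."""
    counts: dict = field(default_factory=dict)       # c, e.g. {"join": 1}
    flags: dict = field(default_factory=dict)        # b, e.g. {"distinct": False}
    tables: Set[str] = field(default_factory=set)    # T
    select: Set[str] = field(default_factory=set)    # S
    where: Set[str] = field(default_factory=set)     # W
    joins: Set[str] = field(default_factory=set)     # J
    group_by: Set[str] = field(default_factory=set)  # G
    having: Set[str] = field(default_factory=set)    # H
    order_by: Set[str] = field(default_factory=set)  # O


@dataclass
class OdtNode:
    kind: str                                # one of NODE_KINDS
    profile: Optional[SelectProfile] = None  # only SELECT nodes carry one
    name: Optional[str] = None               # CTE name, if any
    children: List["OdtNode"] = field(default_factory=list)

    def __post_init__(self):
        if self.kind not in NODE_KINDS:
            raise ValueError(f"unknown node kind: {self.kind}")


# SELECT name FROM singer WHERE age > (SELECT AVG(age) FROM singer)
inner = OdtNode("SUBQUERY", children=[OdtNode("SELECT", SelectProfile(
    counts={"join": 0, "select_items": 1}, flags={"aggregation": True},
    tables={"singer"}, select={"avg(age)"}))])
outer = OdtNode("SELECT", SelectProfile(
    counts={"join": 0, "select_items": 1}, flags={"where": True},
    tables={"singer"}, select={"name"}, where={"age > <subquery>"}),
    children=[inner])
tree = OdtNode("ROOT", children=[outer])
```

The nested `SUBQUERY` child is what lets a comparison distinguish this query from a flat one with the same outer clauses.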

Feature Normalization. To reduce sensitivity to surface-level SQL variation, ODT normalizes structural features before comparison. Table names and column references are lower-cased, and quotation marks are removed. String and numeric literals are replaced with placeholders. Common predicate operators, including `=`, `<>`, `!=`, `>=`, `<=`, `>`, `<`, `LIKE`, `IN`, `BETWEEN`, `IS NULL`, and `IS NOT NULL`, are mapped into canonical predicate signatures. For equality predicates where both sides are identifiers, the operands are sorted lexicographically, so `a=b` and `b=a` are treated as equivalent. Conjunctive predicates are split by `AND` and stored as sets of normalized atomic predicates, making the order of independent conditions irrelevant. Aggregation aliases are ignored during structural comparison; for example, `AVG(price) AS avg_price` and `AVG(price) AS average_price` are treated as equivalent if the aggregation function and argument match.
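The predicate normalization described above can be approximated with a few regular expressions. The sketch below is a simplified, string-level assumption (the actual module works on a parsed AST): the helper names `normalize_atom` and `normalize_where` and the `?` placeholder are hypothetical, and `BETWEEN ... AND ...` and `OR` are deliberately not handled.

```python
import re

# Single-quoted strings and standalone numbers are treated as literals.
_LITERAL = re.compile(r"'[^']*'|\b\d+(?:\.\d+)?\b")
_IDENT_EQ = re.compile(r"^([a-z_][\w.]*)\s*=\s*([a-z_][\w.]*)$")


def normalize_atom(pred):
    """Canonical signature for one atomic predicate (simplified)."""
    # Mask literals first so their case and content never matter.
    pred = _LITERAL.sub("?", pred)
    pred = pred.replace('"', "").replace("`", "")
    pred = re.sub(r"\s+", " ", pred).strip().lower()
    pred = pred.replace("!=", "<>")  # one canonical spelling for inequality
    m = _IDENT_EQ.match(pred)
    if m:  # identifier = identifier: sort operands so a=b and b=a coincide
        left, right = sorted(m.groups())
        return f"{left} = {right}"
    return pred


def normalize_where(clause):
    """Split a conjunction on AND and return a set of normalized atoms."""
    masked = _LITERAL.sub("?", clause)  # protects literals containing "and"
    atoms = re.split(r"\s+and\s+", masked, flags=re.IGNORECASE)
    return {normalize_atom(a) for a in atoms}


a = normalize_where("T1.id = T2.singer_id AND age > 30")
b = normalize_where("AGE > 18 and t2.singer_id = t1.id")
```

Both clauses reduce to the same set of two signatures, so they compare as structurally identical even though operand order, condition order, letter case, and the numeric literal differ.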

Structural Scoring. Given the predicted ODT and the gold ODT, we recursively compute node-level similarity. For each `SELECT` node, the local similarity is computed as a weighted average over feature-level similarities, with the feature weights listed in Table [5](https://arxiv.org/html/2606.06825#A4.T5):

$$s_{\mathrm{local}}(p,g)=\frac{\sum_{f\in\mathcal{F}}w_{f}\cdot\mathrm{sim}_{f}(p_{f},g_{f})}{\sum_{f\in\mathcal{F}}w_{f}}.$$

For set-valued features, we use Jaccard similarity:

$$\mathrm{Jaccard}(A,B)=\frac{|A\cap B|}{|A\cup B|}.$$

If both sets are empty, the similarity is set to $1.0$; if only one side is empty, it is set to $0.0$. For count-valued features, including JOIN count and SELECT-item count, we use:

$$\mathrm{NumSim}(a,b)=1.0-\min\left(1.0,\frac{|a-b|}{\max(1,b)}\right).$$

For Boolean features such as DISTINCT usage, the similarity is $1.0$ if the two sides match and $0.0$ otherwise.
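
The feature-level similarities and their weighted average translate directly into code. The sketch below follows the three definitions above; the feature names and weight values in the usage note are placeholders, since the actual weights are those of Table 5.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity with the empty-set conventions stated in the text."""
    if not a and not b:
        return 1.0
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def num_sim(a: float, b: float) -> float:
    """Count similarity: 1 - min(1, |a - b| / max(1, b))."""
    return 1.0 - min(1.0, abs(a - b) / max(1, b))

def bool_sim(a: bool, b: bool) -> float:
    """Boolean similarity: 1.0 on a match, 0.0 otherwise."""
    return 1.0 if a == b else 0.0

def local_similarity(sims: dict, weights: dict) -> float:
    """Weighted average of feature-level similarities (s_local)."""
    total_weight = sum(weights[f] for f in sims)
    return sum(weights[f] * sims[f] for f in sims) / total_weight
```

For example, with hypothetical weights `{"tables": 2.0, "where": 1.0}` and similarities `{"tables": 1.0, "where": 0.0}`, `local_similarity` returns 2/3: a perfect table match cannot fully compensate for a completely wrong WHERE clause.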

Table 5: Feature weights used in ODT structural scoring.

Recursive Child Matching. The final score of an ODT node combines local similarity and child-subtree similarity:

$$s_{\mathrm{node}}=\alpha\cdot s_{\mathrm{local}}+(1-\alpha)\cdot s_{\mathrm{child}}.$$

We set $\alpha=0.70$ for `SELECT` nodes. For non-`SELECT` nodes, which do not contain local profiles, the score is determined by child matching. To compute $s_{\mathrm{child}}$, ODT matches predicted and gold children with compatible node types. For CTE nodes, the CTE names are matched case-insensitively. For each gold child, the scorer selects the highest-scoring unmatched predicted child. The child score is normalized by the maximum number of children on the two sides:

$$s_{\mathrm{child}}=\frac{\sum_{(u,v)\in\mathcal{M}}s_{\mathrm{node}}(u,v)}{\max(|\mathcal{C}_{\mathrm{pred}}|,|\mathcal{C}_{\mathrm{gold}}|)},$$

where $\mathcal{M}$ is the set of matched child pairs. Unmatched gold children indicate missing nested structures, while unmatched predicted children indicate redundant subqueries or CTEs. The final structural similarity is the root score:

$$\mathcal{F}_{\mathrm{struct}}(y,y^{*})=s_{\mathrm{root}}.$$
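
A compact sketch of the recursive scoring may help make the greedy matching concrete. Several things here are assumptions for illustration: the `Node` class, the use of a table-set Jaccard as a stand-in for the full local similarity, the omission of CTE-name matching, and the convention that two nodes without children have a child score of 1.0 (the text does not state this case).

```python
from dataclasses import dataclass, field

ALPHA = 0.70  # weight of local similarity for SELECT nodes (value from the text)

@dataclass
class Node:
    kind: str                         # e.g. "SELECT" or "CTE" (illustrative labels)
    tables: frozenset = frozenset()   # stand-in for the full local profile
    children: list = field(default_factory=list)

def local_sim(p: Node, g: Node) -> float:
    """Stand-in for s_local: Jaccard over table names only."""
    if not p.tables and not g.tables:
        return 1.0
    union = p.tables | g.tables
    return len(p.tables & g.tables) / len(union)

def child_score(pred_children: list, gold_children: list) -> float:
    """Greedy matching: each gold child takes the best unmatched predicted child."""
    if not pred_children and not gold_children:
        return 1.0  # assumption: no nested structure on either side counts as a match
    remaining = list(pred_children)
    total = 0.0
    for g in gold_children:
        scored = [(node_score(p, g), i) for i, p in enumerate(remaining) if p.kind == g.kind]
        if not scored:
            continue  # unmatched gold child: missing nested structure
        best, idx = max(scored)
        total += best
        remaining.pop(idx)
    # Leftover predicted children (redundant subqueries) lower the score via the max().
    return total / max(len(pred_children), len(gold_children))

def node_score(p: Node, g: Node) -> float:
    s_child = child_score(p.children, g.children)
    if g.kind == "SELECT":
        return ALPHA * local_sim(p, g) + (1 - ALPHA) * s_child
    return s_child  # non-SELECT nodes carry no local profile
```

With this sketch, a prediction that reproduces the outer query exactly but omits a gold subquery scores 0.7: the local part is perfect and the child part is zero.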

Diagnostic Tag Generation. In addition to the continuous structural similarity score, ODT produces discrete diagnostic tags for feedback construction. For each diagnostic dimension, we compare the corresponding feature similarity against a predefined threshold, as listed in Table [6](https://arxiv.org/html/2606.06825#A4.T6). If the similarity falls below the threshold, the corresponding error tag is emitted. Table [7](https://arxiv.org/html/2606.06825#A4.T7) summarizes the diagnostic tags and their meanings. These tags are then verbalized into concise natural-language feedback and appended to the next-turn prompt. The raw structural similarity score and feature-level similarities are not exposed to the model; they are used only for reward computation.
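
The thresholding step can be illustrated as follows. The tag names are taken from the case study in Appendix G, but the threshold values (all 0.5 here), the feature keys, and the description for `JOIN_KEY_MISMATCH` are placeholders invented for this sketch; the actual values are those of Tables 6 and 7.

```python
# Hypothetical thresholds; the paper's actual values are listed in its Table 6.
THRESHOLDS = {"select": 0.5, "tables": 0.5, "join": 0.5, "group_by": 0.5}

# Tag names follow the case study; the JOIN description is illustrative wording.
TAGS = {
    "select": ("SELECT_ERROR", "SELECT projection differs from reference"),
    "tables": ("FROM_OR_JOIN_TABLE_MISMATCH", "FROM/JOIN tables differ from expected"),
    "join": ("JOIN_KEY_MISMATCH", "JOIN keys differ from reference"),
    "group_by": ("GROUP_BY_ERROR", "GROUP BY columns are not aligned with reference"),
}

def diagnose(feature_sims: dict) -> list:
    """Emit a (code, description) pair for every feature below its threshold."""
    return [TAGS[f] for f in TAGS if feature_sims.get(f, 1.0) < THRESHOLDS[f]]

def verbalize(tags: list) -> str:
    """Build the feedback message; similarity scores are deliberately left out."""
    codes = ", ".join(code for code, _ in tags)
    descriptions = ". ".join(desc for _, desc in tags)
    return f"Feedback: your previous SQL has structural issues: {codes}. {descriptions}."
```

Keeping `diagnose` and `verbalize` separate mirrors the split described above: the numeric similarities feed the reward, while only the discrete tags and their descriptions reach the policy.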

Table 6: Mismatch thresholds for generating ODT diagnostic tags.

Table 7: Diagnostic tags emitted by the ODT module.

Usage During RL Rollouts. For each unsuccessful non-final rollout turn, ODT compares the current prediction $y^{(t)}$ with the gold SQL $y^{*}$ and returns verbalized structural feedback $f^{(t)}$. The next-turn input is constructed as:

$$x^{(t+1)}=(q,S,y^{(1)},f^{(1)},\ldots,y^{(t)},f^{(t)}).$$

ODT is a fixed, non-differentiable component of the training environment. It does not participate in back-propagation, and gradients are computed only through the policy model over generated tokens.

## Appendix E Additional Analysis

### E.1 Training Dynamics

To better understand the impact of our progressive reward formulation on optimization dynamics, we analyze the training curves of reward and response length. Figure [4](https://arxiv.org/html/2606.06825#A5.F4) compares our full multi-turn progressive reward with a single-turn one-shot reward baseline. The single-turn baseline generates one SQL prediction per rollout and assigns a score of $2.0$ if the generated SQL matches the gold execution result, $0.5$ if the SQL is executable, and $0.5$ if the output follows the required format. Unlike our method, this baseline does not use multi-turn refinement, ODT-based structural feedback, or trajectory-level improvement.
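
For concreteness, the one-shot baseline reward could look like the sketch below. It assumes the three components are additive (so a fully correct response scores 3.0), that format compliance means the `<think>`/`<sql>` structure from Appendix F, and that execution results are compared as order-insensitive row multisets; none of these details are spelled out in the text.

```python
import re
import sqlite3

_FORMAT = re.compile(r"\s*<think>.*?</think>\s*<sql>(.*?)</sql>\s*", re.S)

def run(conn: sqlite3.Connection, sql: str):
    """Execute SQL and return rows in a canonical order, or None on failure."""
    try:
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except (sqlite3.Error, sqlite3.Warning):
        return None

def one_shot_reward(output: str, gold_sql: str, conn: sqlite3.Connection) -> float:
    """Format (0.5) + executability (0.5) + execution match (2.0), assumed additive."""
    m = _FORMAT.fullmatch(output)
    if m is None:
        return 0.0
    reward = 0.5                      # required output format is respected
    pred_rows = run(conn, m.group(1))
    if pred_rows is None:
        return reward
    reward += 0.5                     # predicted SQL executes without error
    if pred_rows == run(conn, gold_sql):
        reward += 2.0                 # execution result matches the gold SQL
    return reward
```

The reward depends only on the single final state, which is the property the progressive reward is designed to move beyond.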

Reward Convergence. As shown in the left panel of Figure [4](https://arxiv.org/html/2606.06825#A5.F4), both the single-turn one-shot baseline and our progressive reward improve steadily during training and reach comparable reward levels. This indicates that our multi-turn objective does not make optimization harder despite introducing ODT-based feedback and trajectory-level reward components. Compared with the single-turn baseline, our reward provides additional structural and lexical alignment signals that guide the policy toward refinement-oriented behavior rather than only endpoint correctness.

Different Response-Length Dynamics. The right panel of Figure [4](https://arxiv.org/html/2606.06825#A5.F4) shows that the two methods lead to clearly different response-length behaviors. The single-turn one-shot baseline quickly converges to shorter responses, since it only rewards final correctness, executability, and format compliance. In contrast, our method maintains longer responses during training, which is consistent with the multi-turn refinement setting where the model needs to reason about previous SQL attempts and diagnostic feedback. Although longer responses are not directly rewarded, this suggests that the policy learns to utilize the additional context and feedback rather than collapsing to a minimal endpoint-oriented generation strategy.

Overall, these trends show that the proposed progressive reward achieves stable optimization while inducing training dynamics better aligned with multi\-turn SQL correction\.

![Refer to caption](https://arxiv.org/html/2606.06825v1/x5.png)

Figure 4: Training dynamics of reward and response length. The single-turn one-shot reward baseline assigns rewards based on execution correctness, executability, and format compliance, while our method further incorporates multi-turn refinement, ODT-based structural feedback, and trajectory-level progress. Both methods show stable reward improvement, but our progressive reward leads to different response-length dynamics that are more consistent with multi-turn SQL refinement.

### E.2 Effect of Per-turn Decay

Figure [3(a)](https://arxiv.org/html/2606.06825#S5.F3.sf1) analyzes the effect of removing the per-turn decay term by setting $\gamma=1.0$. Although the accumulated reward continues to increase, the average number of interaction turns first decreases and then rebounds to a high level. This suggests that removing per-turn decay makes multi-turn RL optimization unstable: the policy is no longer encouraged to produce correct SQL early, and instead tends to rely on longer correction trajectories with more ODT feedback.

This instability also explains the large performance drop in Table [3](https://arxiv.org/html/2606.06825#S5.T3). Without per-turn decay, late corrections receive the same accuracy reward as early corrections, which weakens the pressure to improve the first-attempt SQL. As a result, the learned policy becomes poorly aligned with the standard single-turn inference setting used at evaluation time. The per-turn decay term therefore acts as an important regularizer for multi-turn RL training, encouraging earlier successful executions and stabilizing the transfer from multi-turn training to single-turn inference.
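
The role of $\gamma$ can be seen in a toy calculation. The exact form of the decay term is defined in the main text and is not reproduced in this appendix, so the sketch below assumes the simplest multiplicative form, in which the accuracy reward is scaled by $\gamma^{t-1}$ for the first turn $t$ that executes correctly. The function name, the `base` argument, and the value 0.8 are illustrative.

```python
def decayed_accuracy_reward(first_correct_turn, gamma, base=1.0):
    """Accuracy reward scaled by gamma ** (t - 1); 0.0 if no turn was correct.

    Assumed form for illustration only; see the main text for the actual reward.
    """
    if first_correct_turn is None:
        return 0.0
    return base * gamma ** (first_correct_turn - 1)

# With gamma = 1.0, a fix at turn 4 earns as much as a correct first attempt.
no_decay = [decayed_accuracy_reward(t, 1.0) for t in (1, 2, 3, 4)]
# With gamma < 1, earlier correctness is strictly preferred.
with_decay = [decayed_accuracy_reward(t, 0.8) for t in (1, 2, 3, 4)]
```

Under this assumed form, `no_decay` is constant across turns while `with_decay` shrinks geometrically, which is the pressure toward first-attempt correctness that the ablation removes.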

### E.3 Maximum Interaction Turns

Figure [3(b)](https://arxiv.org/html/2606.06825#S5.F3.sf2) shows the effect of varying the maximum interaction budget $T$. Increasing $T$ from 1 to 4 consistently improves execution accuracy on both BIRD Dev and Spider Dev, suggesting that additional turns provide useful opportunities for structural correction. However, further increasing $T$ causes performance degradation. We attribute this to context accumulation and over-correction: longer trajectories introduce more historical SQL attempts and feedback messages, which may distract the model or make it deviate from the original question. Based on these observations, we set $T=4$ in all main experiments.

Table 8:ODT\-based structural mismatch rates for single\-turn greedy inference on Spider Dev\. The ODT diagnostic module is used only as an offline evaluator and is not exposed to the model during inference\. These rates measure clause\-level structural mismatches rather than execution failures, and thus are not directly complementary to EX\. Arrows indicate absolute changes compared with the corresponding same\-scale base model; lower values are better\. Avg\. Err\. averages SELECT, JOIN, WHERE, and GROUP mismatch rates\.
### E.4 Single-turn Inference Behavior after Multi-turn RL

During training, our framework exposes the policy to multi\-turn ODT\-based diagnostic feedback, while during evaluation we deliberately use standard single\-turn inference without ODT feedback or iterative refinement\. This setting ensures fair comparison with existing Text\-to\-SQL baselines, but also raises an important question: how can a model trained in a multi\-turn debugging environment benefit at single\-turn test time?

We hypothesize that multi\-turn RL does not merely teach the model to react to external feedback, but also helps the policy internalize structural debugging patterns\. In other words, after optimization with progressive rewards, the model may learn to anticipate common SQL construction errors before producing the final query\. Such internalization should be reflected in fewer structural errors in the first generated SQL, even when no ODT feedback is provided at inference time\.

To verify this hypothesis, we apply the ODT diagnostic module only as an offline evaluator for single\-turn greedy predictions on Spider Dev\. Specifically, after each model generates its SQL, we compare the predicted SQL with the gold SQL using ODT and map fine\-grained diagnostic tags into four coarse structural categories: SELECT, JOIN, WHERE, and GROUP errors\. Importantly, this analysis is performed after generation; the diagnostic tags are never exposed to the model during evaluation\.

As shown in Table [8](https://arxiv.org/html/2606.06825#A5.T8), RL-trained models consistently reduce the average structural error rate under single-turn inference on Spider Dev. For the 7B model, the average error rate decreases from 34.6% to 29.0%, with especially clear reductions in JOIN and GROUP errors. For the 14B model, the average error rate decreases from 37.1% to 26.1%, mainly due to large reductions in SELECT, WHERE, and GROUP errors. Although the JOIN error rate of the 14B model increases, the overall structural error rate still drops substantially. These results suggest that the benefit of multi-turn ODT-based training is not limited to interactive correction: the learned policy also improves first-attempt SQL construction without relying on test-time oracle feedback.
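
The offline analysis described in this subsection amounts to a simple aggregation over per-example tag lists. In the sketch below, the tag-to-category mapping is an assumption (in particular, `WHERE_ERROR` is a hypothetical tag name), as is the convention that an example counts as mismatched in a category if at least one of its tags maps to that category.

```python
# Assumed mapping from fine-grained tags to the four coarse categories.
COARSE = {
    "SELECT_ERROR": "SELECT",
    "JOIN_KEY_MISMATCH": "JOIN",
    "FROM_OR_JOIN_TABLE_MISMATCH": "JOIN",
    "WHERE_ERROR": "WHERE",          # hypothetical tag name
    "GROUP_BY_ERROR": "GROUP",
}
CATEGORIES = ("SELECT", "JOIN", "WHERE", "GROUP")

def mismatch_rates(tags_per_example: list) -> dict:
    """Percentage of examples with at least one tag in each coarse category."""
    n = len(tags_per_example)
    rates = {}
    for cat in CATEGORIES:
        hits = sum(any(COARSE.get(t) == cat for t in tags) for tags in tags_per_example)
        rates[cat] = 100.0 * hits / n
    # Avg. Err. is the mean of the four category rates, as in the Table 8 caption.
    rates["Avg. Err."] = sum(rates[c] for c in CATEGORIES) / len(CATEGORIES)
    return rates
```

Because the four rates are averaged rather than unioned, an example with both a SELECT and a GROUP mismatch contributes to two categories, which is one reason these rates are not complementary to execution accuracy.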

### E.5 Decoding Strategy Analysis by Difficulty

To provide a fine\-grained view of the inference behavior of our RL\-trained models, we report their performance across SQL query difficulty levels under two decoding strategies\. Specifically, we evaluate Progress\-SQL\-7B and Progress\-SQL\-14B on BIRD Dev, Spider Dev, Spider Test, and three robustness variants: Spider\-DK, Spider\-Realistic, and Spider\-Syn\. For each benchmark, we include both Greedy Decoding and Majority Voting \(Vote@8\)\.

Overall Observation. The results show that decoding strategy can affect performance differently across benchmarks and difficulty subsets. Vote@8 often changes the aggregate score compared with greedy decoding, but the effect is not uniform across all settings. In some hard or extra-hard subsets, majority voting may select a suboptimal candidate when generated SQLs differ in subtle structural details. Therefore, this analysis is intended to provide a more detailed view of decoding behavior rather than to claim that one decoding strategy consistently dominates the other.
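
The voting procedure itself is not detailed in this appendix. One common way to implement majority voting for Text-to-SQL, shown here purely as an assumption, is to execute every sampled candidate, group candidates by execution result, and return a query from the largest group.

```python
import sqlite3

def majority_vote(candidates: list, conn: sqlite3.Connection) -> str:
    """Return a candidate from the largest group of identical execution results."""
    groups = {}
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
        except (sqlite3.Error, sqlite3.Warning):
            continue  # non-executable candidates do not vote
        key = repr(sorted(rows, key=repr))  # order-insensitive, hashable result key
        groups.setdefault(key, []).append(sql)
    if not groups:
        return candidates[0]  # fallback when nothing executes
    # max() keeps the earliest group on ties because dicts preserve insertion order.
    return max(groups.values(), key=len)[0]
```

This also illustrates the failure mode noted above: if several samples share the same subtle structural error, they agree on a wrong result and outvote a correct minority candidate.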

BIRD Dev. Table [9](https://arxiv.org/html/2606.06825#A5.T9) reports the BIRD Dev results. We report both greedy decoding and Vote@8 results for Progress-SQL-7B and Progress-SQL-14B, showing how performance varies across simple, moderate, and challenging examples.

Table 9: Execution accuracy breakdown by difficulty on BIRD Dev for our RL-trained models under greedy decoding and Vote@8.

Spider Dev and Spider Test. Table [10](https://arxiv.org/html/2606.06825#A5.T10) reports the results on Spider Dev and Spider Test. We include both greedy decoding and Vote@8 results to show how decoding choices affect different difficulty subsets. The aggregate scores are generally close between the two decoding strategies, while the difficulty-level breakdown reveals non-uniform changes across easy, medium, hard, and extra-hard examples. In particular, Vote@8 does not consistently improve hard or extra-hard subsets, suggesting that majority voting may select a suboptimal candidate when generated SQLs differ in subtle structural details.

Table 10: Execution accuracy (EX%) and test-suite accuracy (TS%) breakdown by difficulty on Spider Dev and Spider Test for our RL-trained models under greedy decoding and Vote@8. Spider Test reports EX only.

Robustness Variants. Tables [11](https://arxiv.org/html/2606.06825#A5.T11) and [12](https://arxiv.org/html/2606.06825#A5.T12) report difficulty-level results on Spider-DK, Spider-Realistic, and Spider-Syn. These results further show that the effect of decoding strategy varies across robustness settings and difficulty levels. Therefore, we present both greedy and Vote@8 results to provide a more complete view of inference behavior under different perturbation scenarios.

Table 11: Execution accuracy (EX%) breakdown by difficulty on Spider-DK, Spider-Realistic, and Spider-Syn for our RL-trained models under greedy decoding and Vote@8.

Table 12: Test-suite accuracy (TS%) breakdown by difficulty on Spider-Realistic and Spider-Syn for our RL-trained models under greedy decoding and Vote@8. Spider-DK is omitted because test-suite accuracy is unavailable.
### E.6 Generalization Across RL Algorithms

To verify that our proposed multi-turn progressive reward $\mathcal{R}(Y)$ is algorithm-agnostic, we evaluate it under two distinct RL training paradigms: GRPO (Group Relative Policy Optimization) (Shao et al., [2024](https://arxiv.org/html/2606.06825#bib.bib22)), our primary algorithm, and GSPO (Group Sequence Policy Optimization) (Zheng et al., [2025](https://arxiv.org/html/2606.06825#bib.bib12)). For a controlled comparison, both algorithms are initialized from the same Qwen2.5-Coder-7B-Instruct base model without any task-specific supervised fine-tuning, and are optimized using identical multi-turn reward signals.

For a given input query $x$, both algorithms sample a group of $G$ responses $\{y_{i}\}_{i=1}^{G}$ from the old policy $\pi_{\theta_{\text{old}}}(\cdot|x)$, and compute the group-based advantage estimate $\hat{A}_{i}$ using our multi-turn reward $\mathcal{R}(y_{i})$:

$$\hat{A}_{i}=\frac{\mathcal{R}(y_{i})-\text{mean}(\{\mathcal{R}(y_{i})\}_{i=1}^{G})}{\text{std}(\{\mathcal{R}(y_{i})\}_{i=1}^{G})}$$
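
The group-relative advantage is a per-group standardization of rewards. The sketch below follows the formula above, with two assumptions that the formula leaves open: the population standard deviation is used, and a small `eps` is added to the denominator so that a group with identical rewards yields zero advantages instead of a division by zero.

```python
import statistics

def group_advantages(rewards: list, eps: float = 1e-8) -> list:
    """Standardize rewards within one sampled group of G responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std over the group
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the advantage depends only on a response's reward relative to its group, a dense reward that separates partially correct trajectories gives non-zero learning signal even when no response in the group is fully correct.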

GRPO Formulation. GRPO optimizes the policy at the token level. It computes the advantage for each token $y_{i,t}$ as $\hat{A}_{i,t}=\hat{A}_{i}$, and applies clipping to the token-level importance ratio $w_{i,t}(\theta)=\frac{\pi_{\theta}(y_{i,t}|x,y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t}|x,y_{i,<t})}$:

$$\mathcal{J}_{\text{GRPO}}(\theta)=\mathbb{E}_{x\sim\mathcal{D},\,\{y_{i}\}_{i=1}^{G}\sim\pi_{\theta_{\text{old}}}(\cdot|x)}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|y_{i}|}\sum_{t=1}^{|y_{i}|}\min\Big(w_{i,t}(\theta)\hat{A}_{i,t},\ \text{clip}\big(w_{i,t}(\theta),1-\varepsilon,1+\varepsilon\big)\hat{A}_{i,t}\Big)\right]$$
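
As a numerical illustration of the token-level objective, the following sketch evaluates the bracketed term for a sampled group from per-token log-probabilities. It is plain Python without automatic differentiation, so it only computes the objective value; the default `eps=0.2` is a placeholder and not a value reported in this appendix.

```python
import math

def clip(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))

def grpo_term(logp_new: list, logp_old: list, advantage: float, eps: float = 0.2) -> float:
    """Length-normalized clipped surrogate for one response (token-level ratios)."""
    total = 0.0
    for ln, lo in zip(logp_new, logp_old):
        w = math.exp(ln - lo)  # token-level importance ratio w_{i,t}
        total += min(w * advantage, clip(w, 1 - eps, 1 + eps) * advantage)
    return total / len(logp_new)

def grpo_objective(group: list, eps: float = 0.2) -> float:
    """Average over a group of (logp_new, logp_old, advantage) triples."""
    return sum(grpo_term(n, o, a, eps) for n, o, a in group) / len(group)
```

With a token ratio of 2 and `eps=0.2`, a positive advantage of 1 is capped at 1.2, while a negative advantage of -1 is left at -2: the `min` makes the clipping one-sided, so the penalty for moving toward a bad response is never softened.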

GSPO Formulation. In contrast, GSPO applies clipping to entire responses instead of individual tokens to better match sequence-level rewards. To reduce variance and control numerical range, it defines a length-normalized sequence importance ratio $s_{i}(\theta)$:

$$s_{i}(\theta)=\left(\frac{\pi_{\theta}(y_{i}|x)}{\pi_{\theta_{\text{old}}}(y_{i}|x)}\right)^{\frac{1}{|y_{i}|}}=\exp\left(\frac{1}{|y_{i}|}\sum_{t=1}^{|y_{i}|}\log\frac{\pi_{\theta}(y_{i,t}|x,y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t}|x,y_{i,<t})}\right)$$

The sequence-level optimization objective is then formulated as:

$$\mathcal{J}_{\text{GSPO}}(\theta)=\mathbb{E}_{x\sim\mathcal{D},\,\{y_{i}\}_{i=1}^{G}\sim\pi_{\theta_{\text{old}}}(\cdot|x)}\left[\frac{1}{G}\sum_{i=1}^{G}\min\Big(s_{i}(\theta)\hat{A}_{i},\ \text{clip}\big(s_{i}(\theta),1-\varepsilon,1+\varepsilon\big)\hat{A}_{i}\Big)\right]$$
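
The sequence-level ratio is the geometric mean of the token-level ratios, which the following sketch makes explicit. As with the GRPO sketch, this only evaluates the objective numerically, and the default `eps=0.2` is a placeholder rather than a reported hyperparameter.

```python
import math

def clip(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))

def gspo_ratio(logp_new: list, logp_old: list) -> float:
    """Length-normalized sequence ratio s_i: exp of the mean token log-ratio."""
    mean_log_ratio = sum(n - o for n, o in zip(logp_new, logp_old)) / len(logp_new)
    return math.exp(mean_log_ratio)

def gspo_term(logp_new: list, logp_old: list, advantage: float, eps: float = 0.2) -> float:
    """Clipped surrogate for one response, with clipping applied once per sequence."""
    s = gspo_ratio(logp_new, logp_old)
    return min(s * advantage, clip(s, 1 - eps, 1 + eps) * advantage)

def gspo_objective(group: list, eps: float = 0.2) -> float:
    """Average over a group of (logp_new, logp_old, advantage) triples."""
    return sum(gspo_term(n, o, a, eps) for n, o, a in group) / len(group)
```

A response whose token ratios are 2 and 0.5 has a sequence ratio of 1 and is not clipped at all under GSPO, whereas GRPO would clip each of the two tokens individually; this is the practical difference between the two formulations.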

As shown in Table [13](https://arxiv.org/html/2606.06825#A5.T13), both GRPO and GSPO achieve strong performance under the same multi-turn progressive reward formulation, suggesting that our reward design is not tied to a specific RL optimizer. GRPO performs slightly better on BIRD Dev and Spider Dev TS, while GSPO obtains a marginally higher Spider Dev EX under Vote@8. These results indicate that the proposed progressive reward can transfer across different policy optimization algorithms, with only minor variations in final performance.

Table 13: Generalization of our multi-turn progressive reward $\mathcal{R}(Y)$ across RL algorithms. Both GRPO and GSPO are initialized from Qwen2.5-Coder-7B-Instruct and trained with identical reward signals.

### E.7 Generalization to Qwen3-8B

To further validate the generalizability of our framework, we apply the same training pipeline to Qwen3-8B (Yang et al., [2025](https://arxiv.org/html/2606.06825#bib.bib28)) without architecture-specific modification.

Table 14: Generalization results on Qwen3-8B. Subscript values denote absolute gains over the corresponding Qwen3-8B baseline.

As shown in Table [14](https://arxiv.org/html/2606.06825#A5.T14), our method consistently improves Qwen3-8B on both BIRD Dev and Spider Dev under greedy decoding and Vote@8. These results demonstrate that the proposed reward design is not tied to the Qwen2.5-Coder architecture. Instead, it generalizes effectively to Qwen3-8B, suggesting that our structural feedback and progressive reward formulation can benefit stronger base models as well. The consistent gains across both EX and TS suggest that our training objective improves not only final execution correctness, but also the robustness of the generated SQL programs.

## Appendix F Prompt Templates

The following templates are used across all training and rollout phases. Placeholders are denoted in monospace format: `{db_engine}` is filled with `SQLite`; `{db_details}` contains the DDL schema of the target database; `{question}` is the natural-language query.

System Prompt

Task Overview: You are a data science expert. The user message contains the database schema and the natural language question. Your task is to understand the schema and generate a valid SQL query to answer the question.

Database Engine: `{db_engine}`

Instructions:

- Dialect Strictness: Ensure the generated SQL strictly conforms to the specific syntax and functions of the provided `{db_engine}`.
- Precision: Only output the exact information asked in the question. Do not include extra columns in the `SELECT` clause.
- Ambiguity Prevention: Always use short table aliases (e.g., `t1`, `t2`) and explicitly qualify all column names with their respective table aliases, especially when `JOIN`s are involved.
- Robustness: Consider edge cases such as handling `NULL` values appropriately and string matching case conventions where applicable.
- Thinking Process: Before generating the final SQL query, step-by-step analyze the tables needed, the `JOIN` conditions, and the filtering criteria. Enclose your reasoning strictly within `<think>` and `</think>` tags.

Output Format: Please strictly follow this exact output structure. Do not output any other text or markdown outside of these tags:

`<think>` Your step-by-step reasoning here. `</think>` `<sql>` SELECT ... `</sql>`
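
A rollout environment has to extract the SQL from responses that follow this output structure. The parser below is a minimal sketch of how that could be done; the function name and the strict rejection of any text outside the tags are assumptions based on the "Output Format" instruction, not a description of the authors' code.

```python
import re

_RESPONSE = re.compile(r"\s*<think>(.*?)</think>\s*<sql>(.*?)</sql>\s*", re.S)

def parse_response(text: str):
    """Return (reasoning, sql) if the response follows the format, else None."""
    m = _RESPONSE.fullmatch(text)
    if m is None:
        return None  # missing tags, wrong order, or text outside the tags
    reasoning, sql = m.group(1).strip(), m.group(2).strip()
    # Reject repeated blocks, which the lazy groups could otherwise swallow.
    if "<think>" in reasoning or "<sql>" in sql or not sql:
        return None
    return reasoning, sql
```

Returning `None` for malformed responses gives the environment a single place to decide how a format violation affects the reward.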

User Prompt (Round 1)

Database Schema: `{db_details}` This schema describes the database's structure, including tables, columns, primary keys, foreign keys, and any relevant relationships or constraints.

Question: `{question}`

User Prompt (Round 2+, ODT Feedback Injection)

Feedback: your previous SQL has structural issues: `{error_codes}`, `{error_descriptions}`. Revise your previous answer using these issues. Revise your previous answer for SQLite. Put step-by-step reasoning inside `<think>...</think>`, then output exactly one corrected SQLite query inside `<sql>...</sql>`. Do not output any text outside those tags.

Note: When the ODT service fails to parse the predicted SQL, or when the gold SQL cannot be fully parsed and no cached structural profile is available, we inject a generic fallback message instead of clause-level diagnostic feedback.

## Appendix G Case Study: Progressive Refinement with Structural Feedback

To qualitatively demonstrate the efficacy of our ODT\-based dense feedback, we present a complete three\-turn debugging trajectory\. This case highlights how clause\-level feedback helps the model correct aggregation granularity, projection targets, and redundant join structures during iterative SQL refinement\.

Question:

> Highly\-rated suppliers refer to those who have received a rating of 4 or higher from customers\. What are the names and average prices of contact lenses for highly\-rated suppliers who have sold at least one lens?

Gold SQL: `SELECT s.supplier_name, AVG(cl.price) AS average_price FROM contact_lenses cl JOIN suppliers s ON s.supplier_id = cl.supplier_id JOIN feedback f ON cl.lens_id = f.lens_id WHERE f.rating >= 4 GROUP BY s.supplier_name HAVING COUNT(cl.lens_id) > 0;`

Debugging Trajectory

Round 1 [× EX-Wrong]

Analysis: The agent fundamentally misunderstands the required granularity. It attempts to group by `lens_name` instead of `supplier_name`, and over-complicates the query by unnecessarily joining the `sales` and `order_items` tables to check for sales.

Model Output

Chain-of-Thought (Extracted): To answer this question, we need to identify contact lenses from suppliers who have received a rating of 4 or higher… I will join contact_lenses with suppliers, feedback, sales, and order_items. Finally, I will group by lens name and calculate the average price. [snip]

Predicted SQL: `SELECT t1.lens_name, AVG(t1.price) AS average_price FROM contact_lenses t1 JOIN suppliers t2 ON t1.supplier_id = t2.supplier_id JOIN feedback t3 ON t1.lens_id = t3.lens_id JOIN order_items t4 ON t1.lens_id = t4.lens_id WHERE t3.rating >= 4 GROUP BY t1.lens_name HAVING COUNT(DISTINCT t4.order_item_id) > 0;`

ODT Feedback

> Feedback: your previous SQL has structural issues: JOIN_KEY_MISMATCH, FROM_OR_JOIN_TABLE_MISMATCH, SELECT_ERROR, GROUP_BY_ERROR. [...] SELECT projection differs from reference. GROUP BY columns are not aligned with reference, and redundant join structures are introduced.

Round 2 [× EX-Wrong]

Analysis: Guided by the dense feedback, the agent successfully corrects the projection (`SELECT`) and grouping (`GROUP BY`) targets to `supplier_name`. However, it stubbornly retains the redundant `sales` and `order_items` tables, leading to a severe table mismatch warning.

Model Output

Chain-of-Thought (Extracted): Based on the feedback, there are several issues… The selected column and grouping key should be changed to the supplier name. I will revise the query accordingly, but I still keep the transaction-table joins to check whether a lens has been sold. [snip]

Predicted SQL: `SELECT s.supplier_name, AVG(cl.price) AS average_price FROM contact_lenses cl JOIN suppliers s ON cl.supplier_id = s.supplier_id JOIN feedback f ON cl.lens_id = f.lens_id JOIN order_items oi ON cl.lens_id = oi.lens_id JOIN sales sa ON oi.order_item_id = sa.order_id WHERE f.rating >= 4 GROUP BY s.supplier_name;`

ODT Feedback

> Feedback: your previous SQL has structural issues: JOIN_KEY_MISMATCH, FROM_OR_JOIN_TABLE_MISMATCH. [...] FROM/JOIN tables differ from expected, and the query introduces redundant transaction-table joins.

Round 3 [✓ EX-Match]

Analysis: Acknowledging the specific `FROM_OR_JOIN_TABLE_MISMATCH` error, the agent finally drops the hallucinated transaction tables. It simplifies the query to the core tables and recovers the correct aggregation granularity with `GROUP BY s.supplier_name` and `HAVING COUNT(cl.lens_id) > 0`, achieving an exact execution match.

Model Output

Chain-of-Thought (Extracted): Based on the new feedback… FROM_OR_JOIN_TABLE_MISMATCH: The FROM and JOIN clauses in the previous query included tables that are not necessary. I will only join contact_lenses with suppliers and feedback. Then group by supplier name and calculate the average price. [snip]

Predicted SQL: `SELECT s.supplier_name, AVG(cl.price) AS average_price FROM contact_lenses cl JOIN suppliers s ON cl.supplier_id = s.supplier_id JOIN feedback f ON cl.lens_id = f.lens_id WHERE f.rating >= 4 GROUP BY s.supplier_name HAVING COUNT(cl.lens_id) > 0;`
