Integrating Reasoning and Generalization in Text-to-SQL via Self-Enhanced Fine-Tuning
Summary
This paper proposes CoTE-SQL, a self-enhanced fine-tuning framework for text-to-SQL that integrates self-reasoning traces, structured chain-of-thought prompting, and execution feedback to achieve state-of-the-art performance on Spider and Bird benchmarks.
Cached at: 06/16/26, 11:47 AM
# Integrating Reasoning and Generalization in Text-to-SQL via Self-Enhanced Fine-Tuning
Source: [https://arxiv.org/html/2606.15598](https://arxiv.org/html/2606.15598)
###### Abstract
Text\-to\-SQL aims to translate natural language questions into executable SQL queries over structured databases, enabling non\-expert users to access data intuitively\. While recent advances in large language models \(LLMs\) have shown promise in this task, existing LLM\-based approaches often struggle to strike a balance between strong reasoning capabilities and robust generalization\. To address these limitations, we propose CoTE\-SQL, which enhances LLM\-based text\-to\-SQL generation with three key innovations: \(i\) self\-enhanced reasoning traces distilled from LLMs without human annotation, \(ii\) structured chain\-of\-thought \(CoT\) prompting with modular decomposition and example retrieval, and \(iii\) error\-aware revision based on SQL execution feedback\. Extensive experiments on the Spider and Bird benchmarks demonstrate that CoTE\-SQL achieves new state\-of\-the\-art performance among methods built on open\-source LLMs with comparable model sizes on Bird \(53\.39% EX / 59\.02 VES\) and strong results on Spider \(79\.60% EX / 77\.19 VES\), with especially significant gains on complex queries\. These results highlight the effectiveness of combining self\-enhancement, structured reasoning, and execution\-time feedback within an LLM\-based framework for text\-to\-SQL\.
## 1\. Introduction
As modern enterprise databases continue to expand in both scale and complexity, querying structured data remains a significant barrier for non\-expert users\. The traditional requirement of writing Structured Query Language \(SQL\) hinders broad access to relational databases, especially for users without programming expertise \(Shi et al\., [2024](https://arxiv.org/html/2606.15598#bib.bib29); Ren et al\., [2024](https://arxiv.org/html/2606.15598#bib.bib30); Zhao et al\., [2024](https://arxiv.org/html/2606.15598#bib.bib42)\)\. To address this usability gap, text\-to\-SQL systems have emerged as a promising solution, focusing on the semantic conversion of information needs expressed in natural language into structured database queries \(Pourreza and Rafiei, [2024](https://arxiv.org/html/2606.15598#bib.bib7); Xie et al\., [2024](https://arxiv.org/html/2606.15598#bib.bib13); Pourreza and Rafiei, [2023](https://arxiv.org/html/2606.15598#bib.bib9); Liu et al\., [2024](https://arxiv.org/html/2606.15598#bib.bib35); Fan et al\., [2024c](https://arxiv.org/html/2606.15598#bib.bib36), [2023](https://arxiv.org/html/2606.15598#bib.bib33); Zheng et al\., [2023](https://arxiv.org/html/2606.15598#bib.bib65); Gong and Sun, [2024](https://arxiv.org/html/2606.15598#bib.bib66)\)\. This paradigm not only democratizes data retrieval but also enables new applications in data analytics and business intelligence\. Despite steady progress, accurately capturing user intent from natural language and generating complex, correct SQL remains a core challenge\. Recent advances in large language models \(LLMs\) have significantly reshaped the landscape of text\-to\-SQL research\.
With their strong natural language understanding and reasoning abilities, LLMs have shown promise in bridging the gap between natural language and formal database queries \(Hong et al\., [2024b](https://arxiv.org/html/2606.15598#bib.bib23); Zhang et al\., [2024b](https://arxiv.org/html/2606.15598#bib.bib3), [2025](https://arxiv.org/html/2606.15598#bib.bib31); Li et al\., [2025](https://arxiv.org/html/2606.15598#bib.bib32); Huang et al\., [2024](https://arxiv.org/html/2606.15598#bib.bib46); Luo et al\., [2024](https://arxiv.org/html/2606.15598#bib.bib47); Liu et al\., [2025](https://arxiv.org/html/2606.15598#bib.bib40)\)\. As illustrated in Figure [1](https://arxiv.org/html/2606.15598#S1.F1), current LLM\-based text\-to\-SQL approaches generally fall into two main paradigms: in\-context learning and fine\-tuning\.
Figure 1\. Limitations of existing text\-to\-SQL approaches\. \(i\) In\-context learning methods can be categorized into two types\. The first, referred to as shallow prompting, demonstrates SQL generation via examples but fails to fully exploit the reasoning capabilities of LLMs\. The second, CoT\-based prompting, builds on this with CoT reasoning, but often suffers from poor generalization and struggles with complex SQL tasks due to limited reasoning diversity\. \(ii\) Fine\-tuning\-based methods are prone to logical errors, constrained mainly by the quantity and quality of annotated data\. Moreover, these approaches often lack explicit supervision over intermediate reasoning steps, resulting in suboptimal performance on complex queries\.
In\-context learning\-based methods enhance the zero\-shot or few\-shot capabilities of LLMs by injecting task\-specific examples into the prompt without modifying model parameters \(Tai et al\., [2023](https://arxiv.org/html/2606.15598#bib.bib5); Zhang et al\., [2023](https://arxiv.org/html/2606.15598#bib.bib6); Shen and Kejriwal, [2024](https://arxiv.org/html/2606.15598#bib.bib25)\)\. However, these methods often underperform because they do not fully exploit the reasoning ability of LLMs\. Therefore, some works incorporate chain\-of\-thought \(CoT\) prompting to elicit intermediate reasoning steps \(Xie et al\., [2024](https://arxiv.org/html/2606.15598#bib.bib13); Wang et al\., [2024](https://arxiv.org/html/2606.15598#bib.bib20); Chen et al\., [2025](https://arxiv.org/html/2606.15598#bib.bib41)\)\. However, the reliance on static, handcrafted examples or prompt templates limits the generalization ability, especially when handling complex queries or out\-of\-distribution schemas\.
Fine\-tuning\-based approaches, in contrast, adapt LLMs to the text\-to\-SQL task through supervised learning on large collections of annotated \(question, SQL\) pairs \(Hu et al\., [2022](https://arxiv.org/html/2606.15598#bib.bib28); Fu et al\., [2023](https://arxiv.org/html/2606.15598#bib.bib17); Hong et al\., [2024a](https://arxiv.org/html/2606.15598#bib.bib14); Yang et al\., [2024b](https://arxiv.org/html/2606.15598#bib.bib15)\)\. By integrating database schema information and leveraging synthetic data augmentation, these models aim for stronger domain alignment and robustness \(Pourreza and Rafiei, [2024](https://arxiv.org/html/2606.15598#bib.bib7); Wang et al\., [2025](https://arxiv.org/html/2606.15598#bib.bib12)\)\. However, most fine\-tuning methods lack explicit supervision over intermediate reasoning processes due to the scarcity of high\-quality CoT\-style annotations in this domain\. As a result, fine\-tuned models often struggle with multi\-step or compositional queries where step\-by\-step logical reasoning is crucial\. Although some studies fine\-tune models with distilled CoT annotations \(He et al\., [2025](https://arxiv.org/html/2606.15598#bib.bib63); Rossiello et al\., [2025](https://arxiv.org/html/2606.15598#bib.bib64)\), they often depend on strong data generators or impose rigid reasoning formats, limiting flexibility\. Recent RL\-based methods \(Pourreza et al\., [2025](https://arxiv.org/html/2606.15598#bib.bib60); Sheng and Xu, [2025](https://arxiv.org/html/2606.15598#bib.bib61)\) improve text\-to\-SQL reasoning but require complex reward design and incur high training costs and instability\.
To address the limitations of both in\-context prompting and fine\-tuning paradigms, our work seeks to bridge the gap between reasoning capability and generalization ability in text\-to\-SQL systems\.
However, achieving this goal introduces several key challenges, as detailed in Section [3\.1](https://arxiv.org/html/2606.15598#S3.SS1)\. First, the lack of explicit supervision over intermediate reasoning steps in most public datasets prevents models from learning interpretable, step\-by\-step derivations, which are critical for complex SQL generation\. Second, existing models often struggle to generalize across various database schemas, due to the ambiguity of natural language and the variability of schema representations\. Third, real\-world applications impose strict correctness and reliability requirements: models must avoid spurious or hallucinated reasoning, and ideally possess the ability to self\-correct without extensive manual intervention\.
To tackle these challenges, we propose CoTE\-SQL, a unified framework that enhances reasoning, generalization, and robustness in text\-to\-SQL systems with three key components \(illustrated in Figure [2](https://arxiv.org/html/2606.15598#S3.F2)\)\. First, an iterative self\-enhanced fine\-tuning framework where the model generates and verifies its own reasoning traces, progressively refining its intermediate reasoning capabilities\. Second, a structured CoT prompting design that guides the model through modular subtasks such as schema selection and SQL generation, augmented with retrieval\-based exemplar prompts to enhance in\-context learning and domain generalization\. Third, an error\-aware revision module that leverages execution feedback to enable self\-debugging and correction of faulty SQL queries, boosting robustness in practical deployments\.
We conduct extensive experiments on two standard benchmarks \(Spider and Bird\) to evaluate CoTE\-SQL against state\-of\-the\-art text\-to\-SQL methods\. Results show that CoTE\-SQL achieves new SOTA performance on Bird \(53\.39% EX / 59\.02 VES\) and competitive results on Spider \(79\.60% EX / 77\.19 VES\), outperforming 15\+ baselines, including fine\-tuning \(e\.g\., MAC\-SQL and DTS\-SQL\) and prompting methods \(e\.g\., DAC and DIN\-SQL\)\. Notably, CoTE\-SQL demonstrates consistent improvements across all difficulty levels, with particularly significant gains on the most challenging queries \(7\.2% EX improvement on Spider’s ‘Extra Hard’ category\)\. Comprehensive ablation studies validate the contribution of each module\. Human evaluation further confirms CoTE\-SQL’s superiority in generating complete \(93/100\), structurally sound \(83/100\), and logically consistent \(86/100\) SQL queries\.
In summary, we make the following key contributions:
- We identify and address a gap in existing LLM\-based text\-to\-SQL systems between reasoning capability and generalization, tackling challenges related to reasoning supervision, schema diversity, and error correction\.
- We propose CoTE\-SQL, a novel framework that integrates iterative self\-enhanced fine\-tuning with structured CoT prompting and error\-aware revision, enabling the scalable acquisition of intermediate reasoning skills and the reliable generation of SQL\.
- Extensive experiments on multiple challenging benchmarks demonstrate that CoTE\-SQL achieves new state\-of\-the\-art performance, significantly outperforming existing fine\-tuning and prompting methods across all difficulty levels\.
## 2\. Preliminaries
### 2\.1\. LLM\-based Text\-to\-SQL
The LLM\-based text\-to\-SQL approach translates natural language queries into SQL statements\. Given a query $Q$ and a database schema $S$ with tables $T=\{t_1,\dots,t_{|T|}\}$, each table $t_i$ has columns $C_i=\{c_1^{t_i},\dots,c_{|C_i|}^{t_i}\}$\. The task is to generate an executable SQL query $Y$ that answers $Q$, modeled by estimating the conditional probability of $Y$ given the prompt $\mathcal{P}=(Q,S)$:
$$P_M(Y\mid\mathcal{P})=\prod_{i=1}^{|Y|}P_M(Y_i\mid Y_{<i};\mathcal{P}), \tag{1}$$
where $P_M(Y_i\mid Y_{<i};\mathcal{P})$ is the probability of generating token $Y_i$ given previous tokens $Y_{<i}$ and context $\mathcal{P}$\.
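As a small numerical illustration of the factorization in Eq. (1), the sketch below scores a sequence by summing per-token log-probabilities, which equals the log of the product. The probability values are made up for illustration and do not come from any model.

```python
import math

def sequence_log_prob(token_probs):
    """Log of the product in Eq. (1): sum of per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)

# Hypothetical values of P_M(Y_i | Y_<i; P) for a four-token SQL output.
token_probs = [0.9, 0.8, 0.95, 0.7]
log_p = sequence_log_prob(token_probs)
seq_prob = math.exp(log_p)  # probability of the whole sequence
```

Working in log space avoids numerical underflow for long SQL outputs, which is why the training loss later in the paper is also written as a sum of log terms.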
### 2\.2\. CoT Prompting for Text\-to\-SQL
CoT prompting has been explored to improve LLM performance on text\-to\-SQL by encouraging intermediate reasoning before SQL generation \(Tai et al\., [2023](https://arxiv.org/html/2606.15598#bib.bib5)\)\. For a task $T$ and query $Q$, the model first derives a reasoning trace $R_{\text{task}}$:
$$R_{\text{task}}=\arg\max P(R\mid T,Q). \tag{2}$$
The final SQL query $y$ is then produced conditioned on both the inputs and $R_{\text{task}}$\.
## 3\. Overview
### 3\.1\. Problem Challenges
Existing works on text\-to\-SQL fall into two extremes: either fine\-tuning models without explicit reasoning supervision or using CoT\-style prompting without grounding in domain\-specific learning or strong generalization\. To bridge this gap, we aim to develop a fine\-tuning approach that explicitly enhances reasoning capabilities and improves generalization across diverse database schemas\. This introduces several key challenges:
1. Lack of Reasoning\-Specific Supervision\. High\-quality intermediate reasoning traces are rarely available in public text\-to\-SQL datasets\. Training solely on final SQL outputs limits the model’s ability to learn structured, interpretable reasoning paths\.
2. Domain Generalization & Schema Sensitivity\. LLMs often fail to generalize across various database schemas, suffering from unclear correlations and misinterpreting schema elements due to ambiguous or under\-specified natural language queries\.
3. High Correctness Requirements in Applications\. In practical deployments, LLMs must avoid generating incorrect or hallucinated reasoning paths\. Effective systems must detect, revise, and learn from their errors without relying on manual supervision\.
Figure 2\. Overview of CoTE\-SQL\. The upper part illustrates the self\-enhanced fine\-tuning approach, while the lower part shows the error\-aware revision during inference\. Schema selection and SQL generation are used in both the fine\-tuning and inference phases\.
### 3\.2\. Core Insights
To address these challenges, our design is driven by three key insights:
1. Self\-Enhanced Reasoning from LLMs\. Manually constructing high\-quality reasoning traces is expensive and non\-scalable\. Inspired by recent self\-training techniques, we propose a self\-enhanced fine\-tuning framework that extracts latent reasoning traces from LLMs themselves\. By aligning these traces with gold answers and filtering for correctness, we construct high\-fidelity supervision without expert annotations\.
2. Structured CoT Prompt for Maximizing Reasoning Ability\. We aim to design a flexible yet structured CoT prompting framework that decomposes the text\-to\-SQL task into modular reasoning stages, such as schema linking and SQL planning\. This structured decomposition enhances interpretability while maintaining generation flexibility\. Furthermore, we augment prompts with retrieval\-based exemplars from a curated set of text\-SQL pairs, improving in\-context learning and adaptation to unseen schemas\.
3. Error\-Driven Reasoning Revision\. Instead of discarding failed outputs, we treat SQL execution errors as weak supervision signals\. By pairing error messages with reflective CoT prompts, we guide the model to analyze its own failure points and iteratively revise the SQL, a form of self\-debugging driven by runtime feedback\.
Figure 3\. The proposed iterative self\-enhanced fine\-tuning framework\. Note that in practical use, CoTE\-SQL evaluates the correctness of the selected schema and the generated SQL separately\. If one is incorrect, only the corresponding module needs to be retried\. We plot them together for simplicity\.
### 3\.3\. Key Designs
Based on the insights above, we propose CoTE\-SQL, whose workflow is illustrated in Figure [2](https://arxiv.org/html/2606.15598#S3.F2)\. It comprises the following key design components:
1\. Iterative Self\-Enhanced Fine\-tuning Framework \(§[4\.1](https://arxiv.org/html/2606.15598#S4.SS1)\)\. We propose a multi\-round, self\-enhanced training framework where the LLM acts not only as a reasoning engine but also as a generator of its own intermediate reasoning traces\. At each iteration, the model is prompted to produce reasoning steps that align with the gold SQL answers\. These outputs are filtered based on execution correctness or logical alignment, and only verified traces are retained as training data\. Incorrect outputs are augmented with hints and reprocessed to improve coverage\. Notably, the reasoning traces used in this framework are generated using our structured CoT design \(§[4\.2](https://arxiv.org/html/2606.15598#S4.SS2)\), ensuring structure and interpretability\. This iterative self\-supervision pipeline enables the model to gradually acquire robust, domain\-adapted reasoning skills, without relying on external labels or teacher models\.
2\. Structured CoT with Modular Reasoning and Retrieval Cues \(§[4\.2](https://arxiv.org/html/2606.15598#S4.SS2)\)\. To reliably generate training\-quality reasoning traces, we design a structured CoT prompting strategy that divides the text\-to\-SQL task into modular subtasks\. Each module, such as schema selection or SQL generation, uses role\-specific instructions, localized guidance, and annotated demonstrations\. In contrast to rigid template\-based methods, our design supports free\-form, self\-explanatory reasoning within a scaffolded structure\. To further enhance adaptability, we incorporate retrieval\-augmented in\-context learning by selecting soft exemplar pairs based on question similarity\. These structured, retrieval\-aware CoT traces not only enhance generation accuracy but also provide the foundation for self\-improving fine\-tuning\.
3\. Error\-Aware Revision with Execution\-Guided Debugging \(§[4\.3](https://arxiv.org/html/2606.15598#S4.SS3)\)\. To meet real\-world accuracy standards, we introduce a SQL correction module that transforms runtime execution errors into learning signals\. When a generated SQL query fails to execute, the system pairs the returned error message with a specialized CoT prompt that instructs the LLM to analyze the failure, identify the error source, and revise the SQL accordingly\. This process mimics self\-debugging: the model reasons through the error, revises both syntax and logic, and re\-validates the output\. Common error types, such as invalid operators, ambiguous fields, or syntax violations, are handled through detailed, stage\-wise correction instructions\. These corrected outputs not only enhance robustness during inference but are also recycled into the training loop to reinforce reasoning over time\.
Note that the structured CoT prompt serves as the base for both fine\-tuning and inference\. Specifically, it provides the prompting design used in the self\-enhanced fine\-tuning phase to generate interpretable reasoning traces with selected schema and generated SQL\. Also, it guides the error\-aware revision process during inference\. By unifying prompting strategies across both phases, this design ensures consistency, modularity, and transferability of reasoning behaviors\.
## 4\. System Design
### 4\.1\. Iterative Self\-Enhanced Fine\-tuning Framework
Pre\-trained LLMs lack the domain\-specific reasoning required for accurate text\-to\-SQL generation in CoTE\-SQL\. Although fine\-tuning offers performance gains, it is fundamentally limited by the absence of intermediate reasoning traces as supervision\. To fully unlock the generalizable reasoning potential of LLMs, we introduce a framework that enables the model to iteratively generate and validate its own reasoning traces as high\-quality supervision\.
As illustrated in Figure [3](https://arxiv.org/html/2606.15598#S3.F3), the proposed framework performs multiple rounds of self\-enhanced fine\-tuning\. At each iteration, the LLM produces CoT reasoning aligned with gold SQL answers, guided by a structured CoT prompt \(§[4\.2](https://arxiv.org/html/2606.15598#S4.SS2)\)\. Generated traces are filtered using task\-specific correctness criteria: schema selection outputs are validated via exact entity alignment matching in JSON format, while SQL generation outputs are verified through execution equivalence with the gold queries\.
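The two correctness criteria can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a SQLite database, treats execution equivalence as equality of result multisets (ignoring row order, which would be wrong for queries where `ORDER BY` matters), and treats alignment matching as order-insensitive equality of the JSON entries. The function names are hypothetical.

```python
import json
import sqlite3
from collections import Counter

def alignment_match(pred_json, gold_json):
    """Exact match on schema alignments, ignoring the order of list entries."""
    def canon(s):
        return sorted(json.dumps(item, sort_keys=True) for item in json.loads(s))
    try:
        return canon(pred_json) == canon(gold_json)
    except (json.JSONDecodeError, TypeError):
        return False  # malformed model output counts as incorrect

def execution_equivalent(conn, pred_sql, gold_sql):
    """True if the predicted query runs and returns the same multiset of rows."""
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute is incorrect
    gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(pred_rows) == Counter(gold_rows)

# Toy database for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE singer (Singer_ID INTEGER PRIMARY KEY, Name TEXT, Country TEXT);
INSERT INTO singer VALUES (1, 'A', 'France'), (2, 'B', 'France'), (3, 'C', 'Japan');
""")
gold = "SELECT Country, COUNT(*) FROM singer GROUP BY Country"
pred = "SELECT s.Country, COUNT(s.Singer_ID) FROM singer AS s GROUP BY s.Country"
bad = "SELECT Country FROM singer"
```

Here `pred` differs textually from `gold` but yields the same rows, so it would be accepted, while `bad` would be rejected.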
Formally, let $D^{gen}=\{(x_i^{gen},y_i)\}_{i=1}^{N}$ denote the generation dataset, where $x_i^{gen}$ is the CoT prompt input and $y_i$ the corresponding gold answer\. At iteration $k$, the current model $M_{k-1}$ produces output $o_i$ from $x_i^{gen}$, from which we extract the reasoning trace $r_i$ and predicted answer $\hat{y}_i$\. If $\hat{y}_i$ satisfies the correctness indicator $\mathbb{1}(y_i,\hat{y}_i)=1$, the pair $(x_i^{gen},o_i)$ is incorporated into the fine\-tuning dataset $D^{train}$\. Otherwise, the input is augmented with a corrective hint $h$ to form $x_i^{gen_h}$, and the model is prompted again to generate $o_i$\. This retry mechanism continues up to a maximum number of attempts to improve coverage and robustness\.
To mitigate distribution shift between training and inference, pairs involving hinted inputs $(x_i^{gen_h},o_i)$ are sampled with a smaller probability \(e\.g\., 0\.1\), as such hints are unavailable at test time\. For samples that fail to generate correct reasoning after all retries, we add $(x_i^{gen_h},y_i)$ to $D^{train}$, ensuring the model learns from difficult cases and enhancing data diversity\.
Unlike conventional self\-training approaches that fine\-tune the model from the previous iteration, we fine\-tune the original pre\-trained LLM $M_0$ with the accumulated $D^{train}$ at each iteration, thus preventing error amplification\. The objective optimized during fine\-tuning is the standard autoregressive token\-level cross\-entropy loss:
$$\mathcal{L}=\mathbb{E}_{(x,y)\sim D^{train}}\left[-\sum_{t=1}^{T}\log P(y_t\mid y_{<t},x)\right], \tag{3}$$
where $T$ denotes the sequence length, and $(x,y)$ a training pair\.
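To make Eq. (3) concrete, the following sketch computes the token-level negative log-likelihood from raw logits in plain Python. The logits and vocabulary size are illustrative assumptions; in practice a deep learning framework would compute this loss over batches of tokenized sequences.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def sequence_nll(step_logits, target_ids):
    """-sum_t log P(y_t | y_<t, x) for one training pair (the bracket in Eq. 3)."""
    return -sum(log_softmax(logits)[y] for logits, y in zip(step_logits, target_ids))

def batch_loss(batch):
    """Sample mean over training pairs, estimating the expectation in Eq. (3)."""
    return sum(sequence_nll(logits, ids) for logits, ids in batch) / len(batch)

# Hypothetical example: 3 target tokens, vocabulary of 4, uniform logits.
uniform_nll = sequence_nll([[0.0, 0.0, 0.0, 0.0]] * 3, [0, 1, 2])
```

With uniform logits every token has probability 1/4, so the loss for three tokens is 3·log 4, a useful sanity check for any implementation.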
Iterating this process for $K$ rounds yields progressively stronger models $\{M_k\}_{k=1}^{K}$ with enhanced domain\-adapted reasoning capabilities, benefiting from higher\-quality datasets in each round\. This iterative self\-enhancement strategy not only alleviates the reliance on external supervision but also enables effective adaptation to new database schemas or domains, as detailed in Algorithm [1](https://arxiv.org/html/2606.15598#alg1)\.
Algorithm 1 Iterative Self\-Enhanced Fine\-tuning
Input: Pre\-trained LLM $M_0$, generation dataset $D^{gen}=\{(x_i^{gen},y_i)\}_{i=1}^{N}$, number of iterations $K$\.
Output: Fine\-tuned LLM $M_K$\.
1: Initialize $M\leftarrow M_0$, $D^{train}\leftarrow\emptyset$
2: for $k=1$ to $K$ do
3:   for each $(x_i^{gen},y_i)$ in $D^{gen}$ do
4:     Generate $o_i=M(x_i^{gen})$, extract $(r_i,\hat{y}_i)$
5:     retries $\leftarrow 0$
6:     while retries $<$ max\_retries and $\mathbb{1}(y_i,\hat{y}_i)=0$ do
7:       Augment $x_i^{gen}$ with hint $h$ to get $x_i^{gen_h}$
8:       Generate $o_i=M(x_i^{gen_h})$, extract $(r_i,\hat{y}_i)$
9:       retries $\leftarrow$ retries $+1$
10:     end while
11:     if $\mathbb{1}(y_i,\hat{y}_i)=1$ then
12:       Add $(x_i^{gen},o_i)$ to $D^{train}$ with probability $0.9$
13:       Add $(x_i^{gen_h},o_i)$ to $D^{train}$ with probability $0.1$
14:     else
15:       Add $(x_i^{gen_h},y_i)$ to $D^{train}$
16:     end if
17:   end for
18:   Fine\-tune $M_0$ on $D^{train}$ to obtain $M_k$, and set $M\leftarrow M_k$ so that the next round generates with the current model
19: end for
20: return $M_K$
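The control flow of Algorithm 1 can be sketched in Python as below. This is an illustrative reading, not the authors' code: `base_model`, `fine_tune`, `is_correct`, and `make_hint` are hypothetical callables, the toy model and toy fine-tuning only memorize strings, and steps 12 and 13 are interpreted as one random choice between the plain and the hinted prompt for samples that needed a hint.

```python
import random

def self_enhance(base_model, fine_tune, gen_data, is_correct, make_hint,
                 num_iters=2, max_retries=2, hint_keep_prob=0.1, seed=0):
    """Sketch of Algorithm 1; model(prompt) returns (reasoning, answer)."""
    rng = random.Random(seed)
    model, train = base_model, []
    for _ in range(num_iters):
        for x, y in gen_data:
            reasoning, answer = model(x)
            hinted, retries = False, 0
            while retries < max_retries and not is_correct(y, answer):
                reasoning, answer = model(make_hint(x, y))  # hinted input x^{gen_h}
                hinted, retries = True, retries + 1
            if is_correct(y, answer):
                # Keep the hinted prompt only rarely; hints are absent at test time.
                use_hint = hinted and rng.random() < hint_keep_prob
                train.append((make_hint(x, y) if use_hint else x, (reasoning, answer)))
            else:
                train.append((make_hint(x, y), (None, y)))  # fall back to the gold answer
        model = fine_tune(base_model, train)  # always restart from M_0
    return model, train

def base_model(prompt):
    # Toy M_0: succeeds only when the hint leaks the answer.
    if "HINT:" in prompt:
        return "copied from hint", prompt.split("HINT:")[1].strip()
    return "guess", "SELECT 1"

def fine_tune(base, data):
    # Toy fine-tuning: memorize prompt -> output pairs on top of the base model.
    memory = {prompt: out for prompt, out in data}
    return lambda prompt: memory.get(prompt) or base(prompt)

gen_data = [("Q: count singers", "SELECT COUNT(*) FROM singer")]
final_model, train_set = self_enhance(
    base_model, fine_tune, gen_data,
    is_correct=lambda gold, pred: gold == pred,
    make_hint=lambda x, y: f"{x}\nHINT: {y}",
)
```

In this toy run the first round needs a hint, the verified trace is stored under the plain prompt, and the second round answers the plain prompt directly. The sketch does not deduplicate `train`, which a real pipeline would likely want to do.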
Our iterative self\-enhanced fine\-tuning framework systematically extracts and refines high\-fidelity intermediate reasoning traces from the model itself, thereby overcoming the scarcity of explicit supervision and significantly boosting the interpretability and accuracy of CoTE\-SQL’s text\-to\-SQL reasoning\.
Given the inherent complexity of SQL generation, including handling nested queries, joins, aggregations, and conditionals, we allocate a larger proportion of training data to the SQL generation module compared to schema selection\. Schema selection primarily involves identifying relevant tables and columns, which is a comparatively straightforward process\. This balanced data allocation strategy ensures both modules achieve robust performance\.
### 4\.2\. Structured CoT Prompt Design
To address the challenge of domain generalization and schema sensitivity in text\-to\-SQL tasks, we propose a structured CoT prompt that integrates modular decomposition with retrieval\-based in\-context examples\. Different from rigid template\-based prompting, our design introduces a flexible reasoning interface that preserves the LLM’s generative freedom while enforcing structure\. By segmenting the reasoning process into interpretable subtasks, including schema selection \(§[4\.2\.2](https://arxiv.org/html/2606.15598#S4.SS2.SSS2)\) and SQL generation \(§[4\.2\.3](https://arxiv.org/html/2606.15598#S4.SS2.SSS3)\), and augmenting prompts with examples tailored to both the schema and the query, our method ensures robust adaptation to unseen databases and natural language variations\. This design not only improves inference\-time generation but also yields fine\-tuning quality reasoning traces for a self\-improving training loop\.
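The retrieval of in-context examples by question similarity could look like the sketch below. The paper does not specify the similarity measure in this section, so token-level Jaccard similarity is used here purely as a stand-in; an embedding-based retriever would fit the same interface. The exemplar pool is invented for illustration.

```python
import re

def tokens(text):
    """Lowercased alphanumeric tokens of a question."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity between the token sets of two questions."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def retrieve_exemplars(question, pool, k=2):
    """Return the k (question, sql) pairs whose questions are most similar."""
    return sorted(pool, key=lambda ex: jaccard(question, ex[0]), reverse=True)[:k]

# Hypothetical curated pool of text-SQL pairs.
pool = [
    ("How many singers are there?", "SELECT COUNT(*) FROM singer"),
    ("List all concert names.", "SELECT concert_Name FROM concert"),
    ("How many singers are from each country?",
     "SELECT Country, COUNT(*) FROM singer GROUP BY Country"),
]
top = retrieve_exemplars("How many singers come from each country?", pool, k=2)
```

The retrieved pairs would then be placed in the few-shot slot of the prompt template.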
#### 4\.2\.1\. CoT Prompt Template
At the heart of our framework is a unified prompt design that enables stepwise reasoning across diverse modules \(the full prompt templates can be found in our source code\)\. Each prompt follows a consistent template comprising: \(i\) Role specification to contextualize the LLM’s function \(e\.g\., schema linker or SQL planner\); \(ii\) Reasoning instructions to clarify the expected format and process; \(iii\) Few\-shot examples drawn from curated demonstrations aligned with question similarity; and \(iv\) Problem input appended with the cue phrase “Let’s think step by step” to activate CoT reasoning\.
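The four-part template can be assembled with a small helper such as the one below. The section labels ("Role:", "Instructions:", and so on) and the argument names are assumptions made for this sketch; the actual wording lives in the authors' prompt templates, which are not reproduced here.

```python
def build_prompt(role, instructions, exemplars, schema, question):
    """Assemble role, instructions, few-shot examples, and the problem input."""
    parts = [f"Role: {role}", f"Instructions: {instructions}"]
    for i, (ex_question, ex_output) in enumerate(exemplars, start=1):
        parts.append(f"Example {i}\nQuestion: {ex_question}\nOutput: {ex_output}")
    # The cue phrase comes last so that the model continues with its reasoning.
    parts.append(f"Schemas: {schema}\nQuestion: {question}\nLet's think step by step.")
    return "\n\n".join(parts)

prompt = build_prompt(
    role="You are a SQL planner.",
    instructions="Reason step by step, then output one SQL query.",
    exemplars=[("How many singers are there?", "SELECT COUNT(*) FROM singer")],
    schema='CREATE TABLE "singer" ("Singer_ID" int, "Name" text, "Country" text)',
    question="How many singers are from each country?",
)
```

Because the same builder serves schema selection, SQL generation, and debugging, only `role` and `instructions` need to change between modules.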
In contrast to prior work that constrains reasoning within rigid formats \(Xie et al\., [2024](https://arxiv.org/html/2606.15598#bib.bib13); Pourreza and Rafiei, [2023](https://arxiv.org/html/2606.15598#bib.bib9); Zhang et al\., [2023](https://arxiv.org/html/2606.15598#bib.bib6)\), we allow free\-form yet guided reasoning by carefully decoupling demonstration format from execution logic\. This approach empowers LLMs to internalize task structures while expressing reasoning naturally\.
#### 4\.2\.2\. Schema Selection via Entity Alignment
Schema selection is crucial in mitigating the schema sensitivity of LLMs\. Given a natural language query, only a subset of the database schema is typically relevant\. Including irrelevant tables or columns can dilute the LLM’s attention and lead to erroneous reasoning in downstream modules\.
To address this, we design a schema grounding module that explicitly links natural language entities to database elements\. As illustrated in Figure [4](https://arxiv.org/html/2606.15598#S4.F4), this is operationalized as a CoT task with two substeps: \(i\) Entity Identification: extracting tokens or phrases referencing data concepts in the query, and \(ii\) Entity Linking: aligning them with relevant tables and columns in the schema\.
Figure 4\. Reasoning instruction for schema selection\.
Unlike surface\-level linking, our prompt explicitly models the compositionality and ambiguity of entity references\. For instance, in the query “Show the name and release year of the song by the youngest singer,” a naïve LLM may incorrectly associate “name” with the singer rather than the song\. Our prompt includes detailed annotations and localized guidance \(e\.g\., reminders about referential cues such as “who”, “when”, etc\.\) to reduce such errors\. An example of the schema selection process is shown in Figure [5](https://arxiv.org/html/2606.15598#S4.F5)\.
Input:
*Schemas*: CREATE TABLE "singer" \("Singer\_ID" int, "Name" text, "Country" text, "Song\_Name" text, "Song\_release\_year" text, "Age" int, "Is\_male" bool, PRIMARY KEY \("Singer\_ID"\)\); …
*Question*: How many singers are from each country?
LLM Output:
To solve this problem, let’s break it down step by step\.
1\. Identify the entity in the question: The entity in the question is “singers” and “country”\.
2\. Link the entity to the most relevant table or column: The entity “singers” is related to the table “singer”\. The entity “country” is related to the column “Country” in the table “singer”\. …
Based on the above analysis, the correct alignments are: \[\{"token": "singers", "schema": "singer", "type": "tbl"\}, \{"token": "country", "schema": "singer\.Country", "type": "col"\}\]
Figure 5\. Demonstration of the schema selection\.
Each alignment output is structured as a list of mappings, including the original token, the corresponding schema element, and the entity type \(table or column\)\. These structured outputs not only guide schema pruning but also serve as pseudo\-labels for fine\-tuning\. As a result, we reduce schema noise while enhancing generalization to unseen databases\.
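A hedged sketch of how such alignment output might drive schema pruning is shown below. The JSON keys (`token`, `schema`, `type`) follow the example in Figure 5, while the dictionary representation of the schema and the `prune_schema` helper are assumptions for illustration.

```python
import json

def prune_schema(schema, alignments_json):
    """Keep only the tables and columns referenced by the alignment list.

    `schema` maps table name -> list of column names (hypothetical format).
    """
    keep = {}
    for item in json.loads(alignments_json):
        if item["type"] == "tbl":
            keep.setdefault(item["schema"], set())
        elif item["type"] == "col":
            table, column = item["schema"].split(".", 1)
            keep.setdefault(table, set()).add(column)
    # Preserve the original column order and drop references to unknown tables.
    return {t: [c for c in schema[t] if c in cols]
            for t, cols in keep.items() if t in schema}

schema = {
    "singer": ["Singer_ID", "Name", "Country", "Song_Name",
               "Song_release_year", "Age", "Is_male"],
    "concert": ["concert_ID", "concert_Name", "Year"],
}
alignments = json.dumps([
    {"token": "singers", "schema": "singer", "type": "tbl"},
    {"token": "country", "schema": "singer.Country", "type": "col"},
])
pruned = prune_schema(schema, alignments)
```

In this sketch a table that is matched only at table level keeps an empty column list; a production pipeline would probably retain at least its key columns so that joins remain expressible.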
#### 4\.2\.3\.Progressive SQL Generation with CoT Planning
The final module in our pipeline is responsible for SQL generation\. Rather than generating queries in a single step, we adopt a progressive CoT strategy that mimics human\-like planning\. The prompt guides the LLM through stages such as identifying the intent, selecting relevant operations \(e\.g\.,SELECT,GROUP BY\), and determining compositional constructs \(e\.g\., joins, subqueries\)\.
This progressive decoding framework supports adaptive reasoning: Forsimple queries, the LLM may bypass redundant steps after evaluating task simplicity\. Forcomplex queries, the LLM incrementally decomposes the task, preserving contextual consistency without requiring additional prompt orchestration\. Note that the simplicity evaluation is implicit, as it relies on the model itself to determine, during the reasoning process, whether further steps are necessary\.
To enhance the LLM’s structured SQL generation capability, we incorporate the progressive reasoning process into the self‑enhanced fine‑tuning framework \(Section[4\.1](https://arxiv.org/html/2606.15598#S4.SS1)\)\. In terms of data preparation, we collect reasoning\-SQL pairs and apply filtering based on by SQL correctness to curate high\-quality traces for fine\-tuning\. To align reasoning and generation, we provide role\-specific prompt annotations \(Figure[6](https://arxiv.org/html/2606.15598#S4.F6)\) and few\-shot examples\. The LLM outputs both the SQL statement and an accompanying stepwise rationale \(as illustrated in Figure[7](https://arxiv.org/html/2606.15598#S4.F7)\)\. This improves execution accuracy and interpretability, providing a transparent path from question to query\.
Figure 6\. Reasoning instruction for the SQL generation\.

Input:

*Schemas*: CREATE TABLE “singer” \(“Singer\_ID” int, “Name” text, “Country” text, PRIMARY KEY\(“Singer\_ID”\);\) …

*Question*: How many singers are from each country?

LLM Output:

\`\`\`sql SELECT singer\.country , COUNT \( \* \) FROM singer GROUP BY singer\.country \`\`\`

Here’s the step\-by\-step reasoning:

1\. Identify the necessary tables and columns: We need to find the number of singers from each country\. The relevant table is ‘singer’, and we need the ‘country’ column\.

2\. Check for aggregate functions: We need to count the number of singers from each country, so we use the ‘COUNT\(\*\)’ aggregate function\.

3\. Consider using GROUP BY: Since we want to group the results by country, we use the ‘GROUP BY’ clause\.

…

Figure 7\. Demonstration of the SQL generation\.
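The correctness filter applied to reasoning-SQL pairs during data preparation can be sketched as follows. This is a minimal illustration, assuming that "correct" means the candidate query executes and returns the same rows as the gold query on a SQLite database. The function names and the record keys (`reasoning`, `pred_sql`, `gold_sql`) are hypothetical and are not taken from the authors' pipeline.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Execute sql and return its rows as a multiset, or None if execution fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def filter_traces(conn, traces):
    """Keep traces whose predicted SQL returns the same rows as the gold SQL.

    Each trace is a dict with the hypothetical keys
    'reasoning', 'pred_sql', and 'gold_sql'.
    """
    kept = []
    for trace in traces:
        gold = run_query(conn, trace["gold_sql"])
        pred = run_query(conn, trace["pred_sql"])
        if gold is not None and pred == gold:
            kept.append(trace)
    return kept


# Toy database loosely modeled on the singer example in Figure 7.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE singer (Singer_ID int PRIMARY KEY, Name text, Country text);
    INSERT INTO singer VALUES (1, 'A', 'France'), (2, 'B', 'France'), (3, 'C', 'Japan');
    """
)
gold_sql = "SELECT Country, COUNT(*) FROM singer GROUP BY Country"
traces = [
    {"reasoning": "group by country",
     "pred_sql": "SELECT singer.country, COUNT(*) FROM singer GROUP BY singer.country",
     "gold_sql": gold_sql},
    {"reasoning": "count all",
     "pred_sql": "SELECT COUNT(*) FROM singer",
     "gold_sql": gold_sql},
    {"reasoning": "wrong column",
     "pred_sql": "SELECT nation FROM singer",
     "gold_sql": gold_sql},
]
kept = filter_traces(conn, traces)
```

Only the first trace survives: the second executes but returns different rows, and the third fails to execute. Comparing results as multisets is a choice made for this sketch; a real filter might also need to handle row order for queries with `ORDER BY`.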
### 4\.3\.Error\-Aware Revision
Despite the strong reasoning capabilities of LLMs, their generated SQL statements often contain structural or semantic flaws that hinder execution correctness\. Empirical analysis reveals four dominant failure modes: \(i\) invalid table\-column relationships, \(ii\) syntax violations such as misquoted identifiers, \(iii\) misuse of unsupported operators or functions, and \(iv\) field ambiguity due to alias conflicts or missing table qualifiers\. These errors not only prevent successful execution but also obscure the model’s reasoning trajectory\.
To address this challenge, we propose an error\-aware revision mechanism that transforms runtime errors into weak supervision signals for iterative SQL revision\. Specifically, when a generated SQL query fails to execute, we extract the returned error message and integrate it into a CoT prompt tailored for debugging\. This prompt explicitly instructs the model to \(i\) analyze the error feedback, \(ii\) identify the likely fault location, and \(iii\) propose a minimal, logically consistent fix\.
This mechanism enables self\-reflective reasoning, where the model learns to revise both syntax and semantics based on runtime feedback\. The correction process is structured in multiple stages, beginning with error diagnosis \(e\.g\., resolving naming mismatches or unsupported syntax\) and followed by logical reassessment of the modified query\. Figure [8](https://arxiv.org/html/2606.15598#S4.F8) illustrates our debugging instruction template, designed to ensure that each step is traceable and grounded in the original error context\. A concrete example is shown in Figure [9](https://arxiv.org/html/2606.15598#S4.F9), where an initial query fails due to a casing mismatch in column names \(i\.e\., pet\_type vs\. PetType\)\. The model, guided by our CoT prompt, correctly localizes the issue and generates a revised query that is both syntactically valid and semantically faithful to the original intent\.
Figure 8\. Reasoning instruction for the error\-aware revision\.

Input:

*Schemas*: CREATE TABLE Pets \(PetID INTEGER, PetType VARCHAR\(20\), pet\_age INTEGER, weight REAL, PRIMARY KEY\(PetID\);\) …

*Question*: What is the type and weight of the youngest pet?

*Wrong sql*: SELECT pets\.pet\_type, pets\.weight FROM pets ORDER BY pets\.pet\_age ASC LIMIT 1

*Error information*: Error executing SQL “SELECT pets\.pet\_type, pets\.weight FROM pets ORDER BY pets\.pet\_age ASC LIMIT 1”: “no such column: pets\.pet\_type”

LLM Output:

\`\`\`sql SELECT pets\.PetType, pets\.weight FROM pets ORDER BY pets\.pet\_age ASC LIMIT 1; \`\`\`

\#\#\# Reasoning:

1\. \*\*Analyze the Error\*\*: The error message indicates that there is no such column as ‘pet\_type’ in the ‘pets’ table\. However, the correct column name is ‘PetType’ \(with a capital ‘T’\)\.

…

Figure 9\. Demonstration of the error\-aware revision\.

This revision module not only enhances robustness at inference time but also produces corrected samples that can be recycled into the training pipeline, enabling the model to internalize patterns of failure and repair\. Over time, this improves generalization in complex, error\-prone SQL generation scenarios\.
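The execute, diagnose, and revise cycle described in this subsection can be sketched as a small loop. This is an illustration under stated assumptions rather than the authors' implementation: `revise` is a hypothetical callable standing in for the LLM, the prompt wording and the `max_rounds` limit are invented for the example, and SQLite is assumed as the execution engine.

```python
import sqlite3


def build_revision_prompt(schema, question, wrong_sql, error):
    """Assemble a debugging prompt from the failed query and its error message."""
    return (
        f"Schemas: {schema}\n"
        f"Question: {question}\n"
        f"Wrong sql: {wrong_sql}\n"
        f"Error information: {error}\n"
        "Analyze the error, locate the fault, and propose a minimal fix."
    )


def revise_until_executable(conn, schema, question, sql, revise, max_rounds=3):
    """Execute sql; on failure, feed the error back to `revise` and retry.

    `revise` is a hypothetical callable (prompt -> new SQL) standing in for the LLM.
    Returns (final_sql, rows), or (final_sql, None) if every attempt fails.
    """
    for round_idx in range(max_rounds + 1):
        try:
            return sql, conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            if round_idx == max_rounds:
                break
            sql = revise(build_revision_prompt(schema, question, sql, str(exc)))
    return sql, None


# Toy database mirroring the Pets example in Figure 9.
schema = ("CREATE TABLE Pets (PetID INTEGER PRIMARY KEY, PetType VARCHAR(20), "
          "pet_age INTEGER, weight REAL)")
conn = sqlite3.connect(":memory:")
conn.execute(schema)
conn.executemany("INSERT INTO Pets VALUES (?, ?, ?, ?)",
                 [(1, "cat", 3, 12.0), (2, "dog", 1, 9.3)])


def stub_reviser(prompt):
    """Stand-in for the model: repairs the single column name used in this demo."""
    wrong_sql = prompt.split("Wrong sql: ")[1].split("\n")[0]
    return wrong_sql.replace("pet_type", "PetType")


wrong = "SELECT pets.pet_type, pets.weight FROM pets ORDER BY pets.pet_age ASC LIMIT 1"
fixed_sql, rows = revise_until_executable(
    conn, schema, "What is the type and weight of the youngest pet?", wrong, stub_reviser
)
```

The loop only reacts to execution errors, so a query that runs but answers the wrong question passes through unchanged. Pairs of `wrong` and `fixed_sql` collected this way are the kind of corrected samples the text describes recycling into training.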
## 5\.Experiments
We conduct a comprehensive evaluation of the proposed CoTE\-SQL framework for text\-to\-SQL to address the following questions:
RQ1: Overall Performance\. How does CoTE\-SQL compare against state\-of\-the\-art baselines?
RQ2: Ablation Study\. What is the individual and collective contribution of each CoTE\-SQL module?
RQ3: Parameter Sensitivity\. How do hyperparameters affect CoTE\-SQL’s effectiveness?
RQ4: Computation Efficiency\. What are the token efficiency and inference latency to produce SQL queries?
RQ5: Human Evaluation\. How do human evaluators assess the quality of the generated SQL queries?
### 5\.1\.Experimental Setup
#### 5\.1\.1\.Datasets
We evaluate CoTE\-SQL on two standard text\-to\-SQL benchmarks: Spider \(Yu et al\., [2018](https://arxiv.org/html/2606.15598#bib.bib4)\) and Bird \(Li et al\., [2023](https://arxiv.org/html/2606.15598#bib.bib8)\)\.
- Spider: This large\-scale cross\-domain benchmark contains 10,181 questions \(including 1,034 validation samples\) paired with 5,693 unique complex SQL queries across 200 databases\. Each database features multiple tables spanning 138 diverse domains\. Spider is particularly designed to assess a model’s schema generalization capabilities in generating syntactically and semantically correct SQL queries\.
- Bird: Comprising 12,751 question\-SQL pairs from 95 real\-world databases, this benchmark covers 37\+ specialized domains including healthcare, education, and blockchain\. Bird introduces additional challenges by requiring systems to handle substantial database contents during SQL parsing, better reflecting practical deployment scenarios\.
Table 1\. Comparison of CoTE\-SQL and baselines on the Spider and Bird datasets by Execution Accuracy \(EX\) and Valid Efficiency Score \(VES\)\. “Direct” denotes the zero\-shot prompt baseline\. Best and second\-best results are bolded and underlined, respectively\. The GPT\-4 results on Spider and Bird are taken from \(Gao et al\., [2024](https://arxiv.org/html/2606.15598#bib.bib22)\) and \(Wang et al\., [2025](https://arxiv.org/html/2606.15598#bib.bib12)\), respectively\. As full\-parameter training exceeds our hardware limits, MAC\-SQL results are adopted from \(Wang et al\., [2025](https://arxiv.org/html/2606.15598#bib.bib12)\)\. All others are from our implementations\.

| Category | Method | Model | EX \(Spider\) | VES \(Spider\) | EX \(Bird\) | VES \(Bird\) |
| --- | --- | --- | --- | --- | --- | --- |
| Direct | | GPT\-4 | 72\.30 | \- | 46\.35 | 49\.77 |
| | | Qwen2\.5\-7B | 75\.30 | 71\.67 | 45\.44 | 48\.37 |
| | | Llama\-3\.1\-8B | 56\.00 | 53\.43 | 41\.20 | 46\.58 |
| | | DeepSeek\-R1\-Qwen\-7B \(Guo et al\., [2025](https://arxiv.org/html/2606.15598#bib.bib58)\) | 39\.10 | 12\.18 | 6\.17 | 7\.05 |
| | | DeepSeek\-R1\-Llama\-8B \(Guo et al\., [2025](https://arxiv.org/html/2606.15598#bib.bib58)\) | 41\.50 | 18\.93 | 15\.12 | 15\.98 |
| Shallow prompt\-based methods | PURPLE | Qwen2\.5\-7B | 81\.90 | 80\.97 | 36\.31 | 42\.33 |
| | | Llama\-3\.1\-8B | 70\.60 | 73\.03 | 28\.49 | \- |
| | MetaSQL | Qwen2\.5\-7B | 68\.60 | 64\.98 | \- | \- |
| | | Llama\-3\.1\-8B | 65\.40 | 62\.18 | \- | \- |
| | DAIL\-SQL | Qwen2\.5\-7B | 72\.40 | 67\.71 | 39\.90 | 41\.59 |
| | | Llama\-3\.1\-8B | 74\.80 | 35\.05 | 41\.00 | 43\.82 |
| | DAC | Qwen2\.5\-7B | 76\.80 | 72\.95 | 46\.54 | 48\.25 |
| | | Llama\-3\.1\-8B | 75\.70 | 71\.93 | 45\.57 | 48\.32 |
| CoT\-based methods | DEA\-SQL | Qwen2\.5\-7B | 52\.10 | 49\.52 | 13\.17 | 14\.93 |
| | | Llama\-3\.1\-8B | 19\.00 | 16\.76 | 8\.41 | 8\.35 |
| | DIN\-SQL | Qwen2\.5\-7B | 41\.00 | 40\.01 | 39\.96 | 41\.41 |
| | | Llama\-3\.1\-8B | 55\.30 | 55\.48 | 38\.72 | 40\.65 |
| Fine\-tuned\-based methods | MAC\-SQL | SQL\-Llama \(7B\) \(Wang et al\., [2025](https://arxiv.org/html/2606.15598#bib.bib12)\) | 76\.25 | \- | 43\.94 | 57\.36 |
| | DTS\-SQL | Qwen2\.5\-7B \(Yang et al\., [2024a](https://arxiv.org/html/2606.15598#bib.bib26)\) | 73\.50 | 68\.96 | 42\.18 | 44\.43 |
| | | Llama\-3\.1\-8B \(Dubey et al\., [2024](https://arxiv.org/html/2606.15598#bib.bib19)\) | 71\.00 | 66\.93 | 41\.92 | 43\.59 |
| | CoTE\-SQL \(ours\) | Qwen2\.5\-7B | 77\.80 | 74\.68 | 53\.39 | 59\.02 |
| | | Llama\-3\.1\-8B | 79\.60 | 77\.19 | 49\.54 | 55\.25 |
#### 5\.1\.2\.Evaluation Metrics
We use two evaluation metrics: Execution Accuracy \(EX\) and Valid Efficiency Score \(VES\), following the official evaluation scripts of Spider and Bird with minor adaptations\.
- Execution Accuracy \(EX\) \(Yu et al\., [2018](https://arxiv.org/html/2606.15598#bib.bib4)\): Measures the proportion of cases where the execution result of the predicted SQL query matches that of the ground truth\. This metric reflects the correctness of model outputs in terms of execution semantics\.
- Valid Efficiency Score \(VES\) \(Li et al\., [2023](https://arxiv.org/html/2606.15598#bib.bib8)\): Assesses both the correctness and efficiency of valid SQL queries generated by the model\. A predicted SQL is considered valid if its execution result aligns with that of the ground\-truth query\. The efficiency aspect captures factors such as execution time, throughput, memory consumption, or a composite of these metrics\.
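The two metrics can be illustrated with a short sketch. It assumes that efficiency is measured by execution time and that each correct prediction is weighted by the square root of the gold-to-predicted time ratio, which is our reading of the Bird definition and should be checked against the official script. The timings below are made-up inputs, and comparing results as multisets is a simplification of the official comparison logic.

```python
import math
import sqlite3
from collections import Counter


def results_match(conn, pred_sql, gold_sql):
    """True when the prediction executes and returns the same rows as the gold query."""
    try:
        pred = Counter(conn.execute(pred_sql).fetchall())
    except sqlite3.Error:
        return False
    return pred == Counter(conn.execute(gold_sql).fetchall())


def execution_accuracy(matches):
    """EX: percentage of examples whose predicted result matches the gold result."""
    return 100.0 * sum(matches) / len(matches)


def valid_efficiency_score(matches, gold_times, pred_times):
    """VES: each correct prediction contributes sqrt(gold_time / pred_time)."""
    total = 0.0
    for ok, t_gold, t_pred in zip(matches, gold_times, pred_times):
        if ok:
            total += math.sqrt(t_gold / t_pred)
    return 100.0 * total / len(matches)


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE singer (Singer_ID int PRIMARY KEY, Name text, Country text);
    INSERT INTO singer VALUES (1, 'A', 'France'), (2, 'B', 'France'), (3, 'C', 'Japan');
    """
)
gold = "SELECT Country, COUNT(*) FROM singer GROUP BY Country"
matches = [
    results_match(conn, "SELECT country, COUNT(*) FROM singer GROUP BY country", gold),
    results_match(conn, "SELECT COUNT(*) FROM singer", gold),   # runs, wrong rows
    results_match(conn, "SELECT nation FROM singer", gold),     # fails to execute
    results_match(conn, gold, gold),
]
# Hypothetical execution times in milliseconds (gold vs. predicted).
gold_times = [1.0, 1.0, 1.0, 4.0]
pred_times = [4.0, 1.0, 1.0, 1.0]
ex = execution_accuracy(matches)
ves = valid_efficiency_score(matches, gold_times, pred_times)
```

In this toy case two of four predictions are correct, so EX is 50.0. The first correct prediction is four times slower than gold (weight 0.5) and the second is four times faster (weight 2.0), which gives a VES of 62.5. This shows how VES can exceed EX when predicted queries run faster than the reference.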
#### 5\.1\.3\.Baselines
We compare our approach with several state\-of\-the\-art text\-to\-SQL baselines, which fall into two main categories: \(i\) fine\-tuning\-based methods \(Pourreza and Rafiei, [2024](https://arxiv.org/html/2606.15598#bib.bib7); Wang et al\., [2025](https://arxiv.org/html/2606.15598#bib.bib12)\), and \(ii\) prompt\-based methods \(Pourreza and Rafiei, [2023](https://arxiv.org/html/2606.15598#bib.bib9); Gao et al\., [2024](https://arxiv.org/html/2606.15598#bib.bib22); Wang et al\., [2024](https://arxiv.org/html/2606.15598#bib.bib20); Xie et al\., [2024](https://arxiv.org/html/2606.15598#bib.bib13); Fan et al\., [2024b](https://arxiv.org/html/2606.15598#bib.bib34); Ren et al\., [2024](https://arxiv.org/html/2606.15598#bib.bib30)\)\. For fine\-tuning\-based methods, we either reproduce the results using their official training procedures or cite the results reported in the original papers when reproduction is infeasible\. For instance, MAC\-SQL \(Wang et al\., [2025](https://arxiv.org/html/2606.15598#bib.bib12)\) requires full\-parameter fine\-tuning, which could not be executed due to out\-of\-memory issues on our hardware\. In such cases, we directly report the performance from the original work\. All other methods in this category were trained and evaluated by us\. For prompt\-based methods, we report the zero\-shot performance of both open\-source and closed\-source LLMs, as well as results from adapting open\-source models to existing text\-to\-SQL frameworks\. GPT\-4 results are taken from prior work \(Wang et al\., [2025](https://arxiv.org/html/2606.15598#bib.bib12)\), while all other results are produced through our experiments\. The baselines evaluated are as follows:
- DTS\-SQL \(Pourreza and Rafiei, [2024](https://arxiv.org/html/2606.15598#bib.bib7)\): A two\-stage fine\-tuning method that decomposes the task to improve the LLMs’ performance\.
- MAC\-SQL \(Wang et al\., [2025](https://arxiv.org/html/2606.15598#bib.bib12)\): A full\-parameter fine\-tuned agent framework for complex, multi\-step reasoning in text\-to\-SQL tasks\.
- DIN\-SQL \(Pourreza and Rafiei, [2023](https://arxiv.org/html/2606.15598#bib.bib9)\): A decomposition\-based method that divides complex tasks into sub\-problems using contextual prompts\.
- MetaSQL \(Fan et al\., [2024b](https://arxiv.org/html/2606.15598#bib.bib34)\): Converts queries into metadata for guiding SQL generation through constrained prompting\. Evaluated only on the Spider dataset due to incompatibility with Bird\.
- PURPLE \(Ren et al\., [2024](https://arxiv.org/html/2606.15598#bib.bib30)\): Uses pre\-trained models to retrieve few\-shots with key logical structures for effective SQL generation\.
- DAIL\-SQL \(Gao et al\., [2024](https://arxiv.org/html/2606.15598#bib.bib22)\): Assesses various demonstration selection strategies to enhance few\-shot text\-to\-SQL performance\.
- DAC \(Wang et al\., [2024](https://arxiv.org/html/2606.15598#bib.bib20)\): Introduces decomposed automatic correction mechanisms to refine SQL decoding\.
- DEA\-SQL \(Xie et al\., [2024](https://arxiv.org/html/2606.15598#bib.bib13)\): Employs query decomposition workflows to improve LLMs’ attention and reasoning on complex queries\.
#### 5\.1\.4\.Implementation Details of CoTE\-SQL
We employ the Llama\-3\.1\-8B model, aligned via instruction tuning \(Dubey et al\., [2024](https://arxiv.org/html/2606.15598#bib.bib19)\), as the base model\. Both the Schema Selection and SQL Generation modules follow a 3\-shot inference strategy\. Demonstration examples are retrieved using the BM25 algorithm \(Robertson and Zaragoza, [2009](https://arxiv.org/html/2606.15598#bib.bib27)\), which selects the most relevant samples from a candidate pool for each user query\. For the Spider dataset, we utilize the entity\-linking annotations provided by \(Liu et al\., [2021](https://arxiv.org/html/2606.15598#bib.bib21)\) as the retrieval corpus\. As the Bird dataset lacks such annotations, the Schema Selection module is trained exclusively on the Spider dataset\.
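To make the demonstration retrieval step concrete, the sketch below implements a small Okapi BM25 scorer and picks the top three candidates for a question. It is an illustration only: the tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the candidate pool are assumptions made here, and the paper does not state which BM25 implementation or settings were used.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification chosen for this sketch)."""
    return text.lower().replace("?", " ").split()


class BM25:
    """Okapi BM25 over a small in-memory pool of candidate questions."""

    def __init__(self, corpus, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [Counter(tokenize(doc)) for doc in corpus]
        self.lengths = [sum(doc.values()) for doc in self.docs]
        self.avg_len = sum(self.lengths) / len(self.docs)
        # Document frequency: number of candidates containing each term.
        doc_freq = Counter(term for doc in self.docs for term in doc)
        n = len(self.docs)
        # Smoothed IDF that stays positive even for very common terms.
        self.idf = {t: math.log(1 + (n - df + 0.5) / (df + 0.5))
                    for t, df in doc_freq.items()}

    def score(self, query_tokens, index):
        doc, length = self.docs[index], self.lengths[index]
        norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
        return sum(
            self.idf[t] * doc[t] * (self.k1 + 1) / (doc[t] + norm)
            for t in query_tokens
            if t in doc
        )

    def top_k(self, query, k=3):
        """Return indices of the k highest-scoring candidates."""
        tokens = tokenize(query)
        ranked = sorted(range(len(self.docs)),
                        key=lambda i: self.score(tokens, i), reverse=True)
        return ranked[:k]


# Hypothetical candidate pool of training questions.
pool = [
    "How many singers are from each country?",
    "What is the average age of all pets?",
    "List the names of singers from France.",
    "How many concerts are held in each stadium?",
    "Show the weight of the youngest pet",
]
retriever = BM25(pool)
demo_ids = retriever.top_k("How many students are from each country?", k=3)
```

Because BM25 ranks by lexical overlap, a retrieved question can share many words with the input while requiring a different SQL structure. The parameter sensitivity analysis later in the paper reports exactly this effect for the single-example setting.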
For fine\-tuning, we apply Low\-Rank Adaptation \(LoRA\) \(Hu et al\., [2022](https://arxiv.org/html/2606.15598#bib.bib28)\) to the base model using 3,000 training examples randomly sampled from the Spider and Bird training sets \(1,500 each\)\. Training is conducted with the Transformers library \(v4\.45\.2\) on a single NVIDIA A100\-SXM4\-80GB GPU\. The LoRA configuration uses a rank of 16 and an alpha value of 16\. We train with bfloat16 precision and a learning rate of $5\times 10^{-5}$ for 5 epochs, using a batch size of 1 and gradient accumulation over 4 steps\. Total training time is approximately 3 GPU hours\. Inference is performed using consistent generation settings across all modules: a temperature of 0\.2, a maximum sequence length of 2048 tokens, top\-p set to 1\.0, and a single output per query\. All experiments use a fixed random seed of 42 for reproducibility\.
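The reported settings imply a few derived quantities that are easy to check by hand. The sketch below records the hyperparameters in a plain dataclass and computes them. It assumes that all 3,000 examples are used in every epoch on a single device, and the class and field names are illustrative rather than taken from the authors' code.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as reported in the text; field names are illustrative."""
    num_examples: int = 3000
    epochs: int = 5
    batch_size: int = 1
    grad_accum_steps: int = 4
    learning_rate: float = 5e-5
    lora_rank: int = 16
    lora_alpha: int = 16
    seed: int = 42

    @property
    def effective_batch_size(self):
        # Examples contributing to each optimizer update.
        return self.batch_size * self.grad_accum_steps

    @property
    def optimizer_steps(self):
        # Updates per epoch, rounded up, times the number of epochs.
        steps_per_epoch = math.ceil(self.num_examples / self.effective_batch_size)
        return steps_per_epoch * self.epochs

    @property
    def lora_scaling(self):
        # LoRA scales the low-rank update by alpha / rank.
        return self.lora_alpha / self.lora_rank


config = FineTuneConfig()
summary = (config.effective_batch_size, config.optimizer_steps, config.lora_scaling)
```

Under these assumptions, each update sees 4 examples, training takes 3,750 optimizer steps, and the LoRA update is applied with a scaling factor of 1.0 because rank and alpha are equal.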
### 5\.2\.Overall Performance \(RQ1\)
We conduct a comprehensive comparison with existing baselines, and the results are reported in Table [1](https://arxiv.org/html/2606.15598#S5.T1)\. On the Bird dataset, CoTE\-SQL achieves state\-of\-the\-art performance, surpassing all baselines by 6\.85–47\.22 percentage points in EX and by 1\.66–51\.97 in VES\. On the Spider dataset, CoTE\-SQL outperforms all baselines except PURPLE with Qwen2\.5\-7B, achieving improvements of 2\.8–60\.6 percentage points in EX and 4\.24–65\.01 in VES over the rest\. PURPLE yields the best results with Qwen2\.5\-7B on the Spider dataset, as it relies on a skeleton prediction model carefully fine\-tuned for this dataset\. These results show that CoTE\-SQL generalizes well across datasets with varying schema complexity and domain coverage\. We further analyze the experimental results and summarize several key observations below\.
Figure 10\. Comparison of EX performance across different difficulty levels for two datasets: \(a\) Spider dataset; \(b\) Bird dataset\.

Compared with direct inference using LLMs, CoTE\-SQL decomposes the SQL generation process into fine\-grained sub\-tasks, allowing the model to focus on specific aspects of the task, leading to more accurate and efficient SQL generation\. This structured generation pipeline significantly improves overall performance\. While methods like DIN\-SQL and DEA\-SQL also adopt task decomposition strategies, their designs are often overly fragmented and heavily rely on template\-based CoT prompts that supply predefined, concrete reasoning steps in the few\-shot examples\. These prompts are challenging for LLMs, especially smaller ones, to interpret and generalize effectively, which results in degraded performance\. In fact, under small\-scale LLM settings, these methods can underperform even simple direct inference with large\-scale LLMs, e\.g\., GPT\-4\. In contrast, CoTE\-SQL’s flexible CoT prompting strategy enables adaptive reasoning paths, resulting in strong generalization across queries with varying structural and reasoning complexity\.
Notably, although DeepSeek\-R1\-Qwen\-7B and DeepSeek\-R1\-Llama\-8B inherit some reasoning capabilities from the teacher model DeepSeek\-R1, they still perform poorly on text\-to\-SQL tasks\. We find that their generated CoT reasoning is often disorganized and frequently fails to produce complete SQL statements\. This can be attributed to the lack of high\-quality, domain\-specific CoT data during the distillation process, particularly in the context of the highly specialized text\-to\-SQL task\. These results further validate the effectiveness of CoTE\-SQL in enhancing the reasoning capability and generation quality of small\-scale LLMs\.
In contrast to end\-to\-end fine\-tuning methods like DTS\-SQL and MAC\-SQL, CoTE\-SQL adopts a self\-enhanced reasoning approach, where the LLM is lightly fine\-tuned on fundamental\-principle data that it generates itself\. This improves the model’s understanding of the text\-to\-SQL task and boosts its reasoning ability\. It is worth emphasizing that, while MAC\-SQL relies on full\-parameter fine\-tuning, CoTE\-SQL achieves superior performance with only partial parameter updates, demonstrating both effectiveness and parameter efficiency\.
Furthermore, we profile performance across different difficulty levels, as shown in Figure [10](https://arxiv.org/html/2606.15598#S5.F10)\. On the Spider dataset, CoTE\-SQL achieves consistent improvements over the strongest baseline across medium, hard, extra, and all difficulty levels, with gains of 1\.7%, 8\.7%, 12%, and 3\.9%, respectively\. On the Bird dataset, the improvements are 3\.25%, 4\.96%, 5\.51%, and 3\.97% on simple, moderate, challenging, and all levels, respectively\. These consistent improvements across difficulty levels demonstrate that CoTE\-SQL generalizes well to queries of varying complexity across diverse domains\. Notably, the most significant improvements are observed on the most challenging levels\. This is primarily due to the structured reasoning guided by CoT prompts, which help the model construct more transparent logical chains, particularly beneficial in complex scenarios, thus significantly improving the quality and accuracy of generated SQL statements\.
Table 2\. Migration results of DAC and PURPLE to newer and stronger models\. CoTE\-SQL’s results are included for comparison\. Best and second\-best results are bolded and underlined\.

| Method | Model | EX \(Spider\) | VES \(Spider\) | EX \(Bird\) | VES \(Bird\) |
| --- | --- | --- | --- | --- | --- |
| DAC | Qwen3\.5\-9B | 75\.70 | 73\.48 | 51\.31 | 59\.08 |
| | IQuest\-Coder\-V1\-7B | 71\.60 | 69\.72 | 36\.77 | 43\.33 |
| PURPLE | Qwen3\.5\-9B | 77\.40 | 76\.76 | 34\.49 | 39\.19 |
| | IQuest\-Coder\-V1\-7B | 50\.00 | 50\.68 | 14\.54 | 18\.01 |
| CoTE\-SQL \(ours\) | Qwen2\.5\-7B | 77\.80 | 74\.68 | 53\.39 | 59\.02 |
| | Llama\-3\.1\-8B | 79\.60 | 77\.19 | 49\.54 | 55\.25 |
The baseline methods in Table [1](https://arxiv.org/html/2606.15598#S5.T1) are built on open\-source models that were strong at the time\. A natural question is whether CoTE\-SQL would remain effective if these methods were migrated to stronger models with improved logical and coding abilities\. To investigate this, we re\-evaluate the two best baselines, DAC and PURPLE, by migrating them to Qwen3\.5\-9B and IQuest\-Coder\-V1\-7B \(Yang et al\., [2026](https://arxiv.org/html/2606.15598#bib.bib67)\)\. As shown in Table [2](https://arxiv.org/html/2606.15598#S5.T2), PURPLE exhibits a performance decline after migration\. For the Bird dataset, DAC improves after migrating to Qwen3\.5\-9B, with VES slightly surpassing CoTE\-SQL \(by 0\.06\) due to the model’s preference for more execution\-efficient SQL\. However, CoTE\-SQL still outperforms it on the core EX metric\. This indicates that CoTE\-SQL, through self\-enhanced fine\-tuning and structured reasoning, possesses a fundamental advantage that persists across model iterations\.
### 5\.3\.Ablation Study \(RQ2\)
We conduct an ablation study by individually removing the Schema Selection, SQL Correction, and Self\-Enhanced Reasoning modules from CoTE\-SQL\. Table [3](https://arxiv.org/html/2606.15598#S5.T3) reports the results of these variants on both the Spider and Bird datasets\. CoTE\-SQL consistently outperforms all its ablated variants, demonstrating the effectiveness of each module\. Specifically, the Self\-Enhanced Reasoning module enhances the reasoning ability by fine\-tuning on self\-generated CoT data, leading to improvements of 4\.30% on Spider and 1\.30% on Bird\. The Schema Selection module filters out irrelevant schema elements, allowing the model to focus on key components and CoT instructions, resulting in gains of 2\.00% on Spider and 1\.63% on Bird\. Lastly, the SQL Correction module reduces syntax and logical errors in generated SQL, improving execution accuracy by 1\.60% on Spider and 5\.67% on Bird\.
Due to the differing characteristics of the datasets, the impact of each module varies\. The Spider dataset emphasizes generalization across diverse schemas, where improvements in reasoning capabilities yield larger gains\. Conversely, the Bird dataset features substantially larger databases, where syntax errors and incorrect table\-field associations are more frequent, thus highlighting the importance of SQL correction\. Overall, the ablation study confirms that all modules contribute significantly to the strong performance of CoTE\-SQL\.
Table 3\. Ablation results\. Numbers in parentheses denote the decrease in EX performance when individual components are ablated from CoTE\-SQL\.

| Method | EX \(Spider\) | EX \(Bird\) |
| --- | --- | --- |
| w/o Self\-Enhanced Fine\-Tuning | 75\.30 \(\-4\.30\) | 48\.24 \(\-1\.30\) |
| w/o Schema Selection | 77\.60 \(\-2\.00\) | 47\.91 \(\-1\.63\) |
| w/o SQL Correction | 78\.00 \(\-1\.60\) | 43\.87 \(\-5\.67\) |
| CoTE\-SQL | 79\.60 | 49\.54 |

We further profile the ablation results across different difficulty levels\. Table [5](https://arxiv.org/html/2606.15598#S5.T5) shows that removing the Schema Selection module causes an EX drop of up to 7\.20% on the Extra difficulty level of the Spider dataset\. This is attributable to the increased complexity and abundance of irrelevant schema elements in harder SQL tasks, which complicates reasoning and disperses model attention when schema filtering is absent\. Meanwhile, removing the SQL Correction module leads to significant performance declines \(5\.39%–8\.27%\) at all difficulty levels on the Bird dataset, as shown in Table [4](https://arxiv.org/html/2606.15598#S5.T4), underscoring its critical role in maintaining system effectiveness\.
Table 4\. EX performance across questions with different difficulty levels in the ablation study on the Bird dataset\.

| Method | Simple | Moderate | Challenging | All |
| --- | --- | --- | --- | --- |
| w/o Self\-Enhanced Fine\-Tuning | 57\.30 \(\-0\.54\) | 36\.42 \(\-1\.51\) | 28\.28 \(\-5\.51\) | 48\.24 \(\-1\.30\) |
| w/o Schema Selection | 56\.54 \(\-1\.30\) | 36\.85 \(\-1\.08\) | 28\.28 \(\-5\.51\) | 47\.91 \(\-1\.63\) |
| w/o Correction | 52\.43 \(\-5\.41\) | 32\.54 \(\-5\.39\) | 25\.52 \(\-8\.27\) | 43\.87 \(\-5\.67\) |
| CoTE\-SQL | 57\.84 | 37\.93 | 33\.79 | 49\.54 |

Table 5\. EX performance across questions with different difficulty levels in the ablation study on the Spider dataset\.

| Method | Easy | Medium | Hard | Extra | All |
| --- | --- | --- | --- | --- | --- |
| w/o Self\-Enhanced Fine\-Tuning | 87\.10 \(\-3\.60\) | 80\.90 \(\-5\.60\) | 64\.90 \(\-1\.80\) | 53\.60 \(\-4\.20\) | 75\.30 \(\-4\.30\) |
| w/o Schema Selection | 89\.50 \(\-1\.20\) | 86\.10 \(\-0\.40\) | 64\.40 \(\-2\.30\) | 50\.60 \(\-7\.20\) | 77\.60 \(\-2\.00\) |
| w/o Correction | 89\.10 \(\-1\.60\) | 84\.50 \(\-2\.00\) | 64\.90 \(\-1\.80\) | 57\.80 \(0\.00\) | 78\.00 \(\-1\.60\) |
| CoTE\-SQL | 90\.70 | 86\.50 | 66\.70 | 57\.80 | 79\.60 |

Figure 11\. Impact of the number of few\-shot samples: \(a\) Spider dataset; \(b\) Bird dataset\.

Figure 12\. Impact of training data ratio between schema selection and SQL generation during self\-enhanced fine\-tuning: \(a\) Spider dataset; \(b\) Bird dataset\.
### 5\.4\.Parameter Sensitivity \(RQ3\)
We investigate the sensitivity of CoTE\-SQL to two key parameters: the number of few\-shot examples $k$ and the training data ratio between the schema selection and SQL generation tasks\. As shown in Figure [11](https://arxiv.org/html/2606.15598#S5.F11), the best performance is achieved when $k=3$, indicating that a moderate number of few\-shot examples enhances open\-source LLMs’ understanding of the text\-to\-SQL task\. When $k$ is too small, the selected examples may fail to capture the logical structure of the current SQL query, providing limited guidance\. Conversely, when $k$ is too large, the excessive context introduces redundancy and distracts the model’s attention, thereby degrading reasoning performance\.
Surprisingly, performance at $k=1$ is even lower than at $k=0$\. A closer analysis of 100 samples under the $k=1$ setting reveals that in approximately 70% of cases, the selected example, while lexically similar to the input question, follows a different SQL logic\. This leads the model to anchor on an irrelevant pattern, thus impairing generalization\.
Figure [12](https://arxiv.org/html/2606.15598#S5.F12) presents results under varying training data ratios\. A 1:2 ratio between schema selection and SQL generation yields the best overall performance\. This reflects the relative difficulty of SQL generation, which requires producing syntactically correct and semantically meaningful queries that may involve joins, nested queries, and complex conditions, necessitating more extensive supervision compared to schema selection\.
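One way to realize a fixed task ratio when assembling fine-tuning data is sketched below. The function name, the total budget, and the example pools are hypothetical. Whether the authors apply the ratio to a fixed budget in this way is an assumption made here for illustration.

```python
import random


def mix_by_ratio(schema_examples, sql_examples, ratio=(1, 2), total=3000, seed=42):
    """Sample a training mix whose task counts follow ratio = (schema, sql)."""
    schema_share, sql_share = ratio
    n_schema = total * schema_share // (schema_share + sql_share)
    n_sql = total - n_schema
    rng = random.Random(seed)
    # Sample without replacement from each task pool, then interleave.
    mixed = rng.sample(schema_examples, n_schema) + rng.sample(sql_examples, n_sql)
    rng.shuffle(mixed)
    return mixed


# Hypothetical pools: each item is (task_name, example_id).
schema_pool = [("schema_selection", i) for i in range(2000)]
sql_pool = [("sql_generation", i) for i in range(3000)]
mixed = mix_by_ratio(schema_pool, sql_pool)
n_schema = sum(1 for task, _ in mixed if task == "schema_selection")
n_sql = len(mixed) - n_schema
```

With a 1:2 ratio and a budget of 3,000, the mix contains 1,000 schema selection examples and 2,000 SQL generation examples. Changing `ratio` reproduces the other settings compared in Figure 12 under the same budget assumption.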
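Table 6\. Token efficiency of different methods on Spider\.

| Method | Avg\. Token Num | Avg\. Time | EX |
| --- | --- | --- | --- |
| DTS\-SQL | 1081 | 7\.7s | 71\.0 |
| DEA\-SQL | 12,038 | 100\.2s | 19\.0 |
| DIN\-SQL | 10,453 | 16\.2s | 55\.3 |
| DAIL\-SQL | 932 | 2\.4s | 74\.8 |
| DAC | 11,741 | 7\.4s | 75\.7 |
| CoTE\-SQL | 2,242 | 10\.9s | 79\.6 |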
### 5\.5\.Computation Efficiency \(RQ4\)
We evaluate average token count and inference time on 10 randomly sampled Spider queries\. As shown in Table [6](https://arxiv.org/html/2606.15598#S5.T6), CoTE\-SQL uses significantly fewer tokens than multi\-stage baselines \(DEA\-SQL, DIN\-SQL, DAC\), which incur redundant prompting and fragmented reasoning\. For instance, DEA\-SQL averages 12,038 tokens and over 100 seconds per query with 19\.0% EX, whereas CoTE\-SQL requires only 2,242 tokens and 10\.9 seconds, achieving 79\.6% EX\. This efficiency arises from our structured CoT prompt, which unifies reasoning and generation in a single\-stage pipeline, avoiding redundant decomposition and repeated LLM calls\.
When compared to DAIL\-SQL and DTS\-SQL, CoTE\-SQL achieves superior accuracy \(79\.6% vs\. 74\.8% and 71\.0%, respectively\), despite a moderate increase in token usage and inference time\. Notably, DAIL\-SQL achieves the lowest average token count \(932\) and inference time \(2\.4s\), but this comes at the cost of reduced reasoning capacity, leading to suboptimal EX\. Our method introduces minimal overhead by incorporating lightweight intermediate reasoning steps, which are crucial for handling complex and compositional SQL structures\.
Moreover, while DAC achieves comparable execution accuracy \(75\.7%\) with relatively fast inference \(7\.4s\), it requires over 11,000 tokens on average, suggesting that it trades off efficiency in token space for runtime gains via aggressive parallelization or optimization strategies\. In contrast, CoTE\-SQL offers a balanced trade\-off between computational cost and predictive accuracy, making it more suitable for deployment in real\-time systems with constrained budgets\.
In summary, CoTE\-SQL achieves the best accuracy\-efficiency trade\-off among the compared multi\-stage and end\-to\-end approaches\. This supports our compact, interpretable reasoning design, which enhances both efficiency and semantic correctness\.
### 5\.6\.Human Evaluation \(RQ5\)
To assess generated SQL quality, we randomly sampled 100 Spider questions and compared outputs from CoTE\-SQL and the strongest baseline, DAC\. We recruited three skilled volunteers with SQL knowledge and provided them with detailed evaluation criteria and background information\. Evaluators saw each question, its gold SQL, and two anonymized outputs \(order shuffled\)\. Evaluation used three criteria:
- Completeness \(CP\): Does the SQL include all necessary and relevant tables and fields required to answer the question?
- Structural Soundness \(SS\): Does the SQL follow a clean and efficient structure with minimal redundancy?
- Logical Consistency \(LC\): Which SQL better captures the question’s logic, e\.g\., computation or condition logic?

The preferred SQL received 1 point per criterion; points were aggregated over the 100 samples and averaged among volunteers to obtain the final results\.
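The scoring rule (one point per criterion, summed over samples and averaged across evaluators) can be sketched as follows. The data layout is hypothetical. In particular, allowing both outputs to earn the point on the same sample is an assumption made here, since the text does not say how ties were handled.

```python
from statistics import mean

CRITERIA = ("CP", "SS", "LC")


def aggregate_scores(judgments, systems):
    """Sum points per evaluator, then average across evaluators.

    judgments[e][s][c] is the set of system names that evaluator e awarded
    a point on sample s for criterion c (hypothetical layout).
    """
    result = {}
    for system in systems:
        result[system] = {}
        for criterion in CRITERIA:
            per_evaluator = [
                sum(1 for sample in evaluator if system in sample[criterion])
                for evaluator in judgments
            ]
            result[system][criterion] = mean(per_evaluator)
    return result


# Two hypothetical evaluators judging three samples each.
judgments = [
    [
        {"CP": {"CoTE-SQL"}, "SS": {"CoTE-SQL"}, "LC": {"CoTE-SQL"}},
        {"CP": {"CoTE-SQL", "DAC"}, "SS": {"DAC"}, "LC": {"CoTE-SQL"}},
        {"CP": {"DAC"}, "SS": {"CoTE-SQL"}, "LC": {"DAC"}},
    ],
    [
        {"CP": {"CoTE-SQL"}, "SS": {"DAC"}, "LC": {"CoTE-SQL"}},
        {"CP": {"CoTE-SQL", "DAC"}, "SS": {"DAC"}, "LC": {"CoTE-SQL"}},
        {"CP": {"CoTE-SQL"}, "SS": {"CoTE-SQL"}, "LC": {"DAC"}},
    ],
]
scores = aggregate_scores(judgments, ["CoTE-SQL", "DAC"])
```

With 100 samples and three evaluators, the same procedure yields per-criterion scores on a 0 to 100 scale, which is the scale used for the human evaluation results.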
As shown in Table [7](https://arxiv.org/html/2606.15598#S5.T7), CoTE\-SQL outperforms DAC on all three criteria: 93 \(CP\), 83 \(SS\), and 86 \(LC\), exceeding DAC by 8, 3, and 7 points\. EX and VES metrics are also reported, further validating our method\.
Overall, these results suggest more reliable handling of diverse and complex SQL queries\.
We present a case study in Figure [13](https://arxiv.org/html/2606.15598#S5.F13) comparing SQL outputs from CoTE\-SQL and DAC \(both Llama\-3\.1\-8B\) for a specific question\. CoTE\-SQL correctly identifies target entities, applies joins, and produces a query equivalent to the ground truth\. DAC omits the country condition and fails to capture necessary relationships, reflecting limitations in its reasoning and schema understanding\. This example highlights CoTE\-SQL’s advantage across all three criteria\.
Table 7\. Comparison of human preference scores and quantitative metrics\. The first three columns show human evaluations; the last two show corresponding quantitative metrics\.

| Method | CP | SS | LC | EX | VES |
| --- | --- | --- | --- | --- | --- |
| DAC | 85 | 80 | 79 | 71\.00 | 67\.59 |
| CoTE\-SQL | 93 | 83 | 86 | 75\.00 | 73\.61 |

Question: What are the first names of the students who live in Haiti permanently or have the cell phone number 09700166582?

Ground\-Truth SQL: SELECT t1\.first\_name FROM students AS t1 JOIN addresses AS t2 ON t1\.permanent\_address\_id = t2\.address\_id WHERE t2\.country = 'Haiti' OR t1\.cell\_mobile\_number = '09700166582'

CoTE\-SQL \(LLaMA\-3\.1\-8B\): SELECT students\.first\_name FROM students WHERE students\.permanent\_address\_id IN \(SELECT address\_id FROM addresses WHERE addresses\.country = 'haiti' \) OR students\.cell\_mobile\_number = '09700166582'

DAC \(LLaMA\-3\.1\-8B\): SELECT s\.first\_name FROM Students s WHERE s\.permanent\_address\_id = 5 OR s\.cell\_mobile\_number = '09700166582'

Figure 13\. Case study for the human evaluation\.
## 6\.Related Work
### 6\.1\.Prompt\-based Reasoning for Text\-to\-SQL
Recent advances in text\-to\-SQL have been driven by LLMs, enabling executable SQL generation from natural language \(Zhai et al\., [2025](https://arxiv.org/html/2606.15598#bib.bib44); Shen and Kejriwal, [2024](https://arxiv.org/html/2606.15598#bib.bib25)\)\. CoT prompting \(Wei et al\., [2022](https://arxiv.org/html/2606.15598#bib.bib11)\) effectively enhances LLM reasoning by decomposing complex queries into intermediate steps using few\-shot exemplars\. Recent works explore structured prompting to improve SQL generation: DIN\-SQL \(Pourreza and Rafiei, [2023](https://arxiv.org/html/2606.15598#bib.bib9)\) decomposes queries via contextual guidance; ACT\-SQL \(Zhang et al\., [2023](https://arxiv.org/html/2606.15598#bib.bib6)\) automates CoT example generation, reducing annotation costs; Tai et al\. \([2023](https://arxiv.org/html/2606.15598#bib.bib5)\) study CoT styles and propose decomposition\-based prompting to mitigate error propagation; Xie et al\. \([2024](https://arxiv.org/html/2606.15598#bib.bib13)\) divide SQL generation into specialized reasoning modules, advancing CoT performance\. However, these methods encode reasoning in fixed examples or templates, limiting generalization to complex, compositional queries\.
### 6\.2\.Fine\-tuning Strategies for Text\-to\-SQL
Fine\-tuning LLMs on supervised text\-to\-SQL datasets complements prompting\-based methods \(Hu et al\., [2022](https://arxiv.org/html/2606.15598#bib.bib28)\) by training on input\-query pairs, improving performance on domain\-specific or conversational tasks \(Sarker et al\., [2024](https://arxiv.org/html/2606.15598#bib.bib43); Shen et al\., [2024](https://arxiv.org/html/2606.15598#bib.bib45); Wang et al\., [2025](https://arxiv.org/html/2606.15598#bib.bib12)\)\. MIGA \(Fu et al\., [2023](https://arxiv.org/html/2606.15598#bib.bib17)\) combines multi\-task pre\-training with SQL perturbations to enhance robustness\. FinSQL \(Zhang et al\., [2024a](https://arxiv.org/html/2606.15598#bib.bib18)\) uses ChatGPT to synthesize domain\-specific datasets for domain\-adaptive fine\-tuning\. Hong et al\. \([2024a](https://arxiv.org/html/2606.15598#bib.bib14)\) inject schema\-level knowledge, while Yang et al\. \([2024b](https://arxiv.org/html/2606.15598#bib.bib15)\) leverage synthetic and erroneous SQL examples to narrow the gap between open and proprietary LLMs\. Despite their success, most fine\-tuning approaches neglect CoT\-style reasoning due to limited step\-by\-step annotations, underperforming on complex or multi\-hop SQL queries\. Rossiello et al\. \([2025](https://arxiv.org/html/2606.15598#bib.bib64)\) distill CoT annotations from a 70B to an 8B model\. STaR\-SQL \(He et al\., [2025](https://arxiv.org/html/2606.15598#bib.bib63)\) combines few\-shot prompting with a verifier, with gains mainly from the verifier\. Reasoning\-SQL \(Pourreza et al\., [2025](https://arxiv.org/html/2606.15598#bib.bib60)\) applies partial rewards, and CSC\-SQL \(Sheng and Xu, [2025](https://arxiv.org/html/2606.15598#bib.bib61)\) separates generation and correction stages optimized via GRPO; however, RL\-based methods remain costly, complex, and unstable\.
### 6\.3\.Other Explorations in Text\-to\-SQL
Several studies aim to lower the cost, latency, and accessibility barriers of LLM\-based solutions via architectural and training innovations \(Wang et al\.,[2024](https://arxiv.org/html/2606.15598#bib.bib20); Xie et al\.,[2024](https://arxiv.org/html/2606.15598#bib.bib13); Wei et al\.,[2022](https://arxiv.org/html/2606.15598#bib.bib11); Gao et al\.,[2024](https://arxiv.org/html/2606.15598#bib.bib22); Chen et al\.,[2025](https://arxiv.org/html/2606.15598#bib.bib41)\)\. MSC\-SQL \(Gorti et al\.,[2024](https://arxiv.org/html/2606.15598#bib.bib53)\) selects optimal SQL predictions from small LLMs using multi\-sample scoring\. DTS\-SQL \(Pourreza and Rafiei,[2024](https://arxiv.org/html/2606.15598#bib.bib7)\) applies two\-stage fine\-tuning for schema linking and SQL generation\. Karki et al\. \([2025](https://arxiv.org/html/2606.15598#bib.bib54)\) show that 0\.5B–3\.8B models with parameter\-efficient fine\-tuning can match 7B models\. Fan et al\. \([2024a](https://arxiv.org/html/2606.15598#bib.bib55)\) delegate entity linking and sketch generation to small LLMs, using large LLMs for completion\. SPS\-SQL \(Yan et al\.,[2025](https://arxiv.org/html/2606.15598#bib.bib56)\) pre\-synthesizes schema\-aware SQL sketches to improve accuracy\. Oliveira et al\. \([2024](https://arxiv.org/html/2606.15598#bib.bib57)\) enhance small LLMs with retrieval\-augmented generation \(RAG\)\. While these methods improve surface\-level generation or alignment, few tackle reasoning limitations\. In contrast, we propose a CoT\-enhanced framework for small\-scale models that bridges the performance gap through modularized reasoning while maintaining generalization\.
## 7\.Conclusion
In this paper, we presented CoTE\-SQL, a framework for enhancing reasoning, generalization, and correctness in LLM\-based text\-to\-SQL generation\. By leveraging self\-enhanced reasoning traces, structured CoT prompting, and error\-aware revision, CoTE\-SQL addresses key limitations in existing approaches without relying on expensive human annotations\. Extensive evaluations on Spider and Bird demonstrate that our method achieves state\-of\-the\-art performance on Bird among methods built on open\-source LLMs of comparable size, along with strong results on Spider, and that it particularly excels on complex and high\-difficulty queries\. Looking forward, we believe the ideas explored in CoTE\-SQL offer a broader paradigm for combining self\-enhancement, modular reasoning, and feedback\-driven correction in LLMs, opening up promising directions for a wide range of complex, multi\-step decision\-making tasks in natural language interfaces\.
## References
- B\. Chen, S\. Shi, Y\. Luo, B\. Xu, R\. Cai, and Z\. Hao \(2025\)Track\-SQL: enhancing generative language models with dual\-extractive modules for schema and context tracking in multi\-turn text\-to\-SQL\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 10690–10708\.Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p2.1),[§6\.3](https://arxiv.org/html/2606.15598#S6.SS3.p1.1)\.
- A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Yang, A\. Fan,et al\.\(2024\)The llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.Cited by:[§5\.1\.4](https://arxiv.org/html/2606.15598#S5.SS1.SSS4.p1.1),[Table 1](https://arxiv.org/html/2606.15598#S5.T1.7.1.21.1)\.
- J\. Fan, Z\. Gu, S\. Zhang, Y\. Zhang, Z\. Chen, L\. Cao, G\. Li, S\. Madden, X\. Du, and N\. Tang \(2024a\)Combining small language models and large language models for zero\-shot NL2SQL\.Proceedings of the VLDB Endowment17\(11\),pp\. 2750–2763\.Cited by:[§6\.3](https://arxiv.org/html/2606.15598#S6.SS3.p1.1)\.
- Y\. Fan, Z\. He, T\. Ren, D\. Guo, L\. Chen, R\. Zhu, G\. Chen, Y\. Jing, K\. Zhang, and X\. S\. Wang \(2023\)Gar: a generate\-and\-rank approach for natural language to SQL translation\.In2023 IEEE 39th International Conference on Data Engineering \(ICDE\),pp\. 110–122\.Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p1.1)\.
- Y\. Fan, Z\. He, T\. Ren, C\. Huang, Y\. Jing, K\. Zhang, and X\. S\. Wang \(2024b\)Metasql: a generate\-then\-rank framework for natural language to SQL translation\.In2024 IEEE 40th International Conference on Data Engineering \(ICDE\),pp\. 1765–1778\.Cited by:[4th item](https://arxiv.org/html/2606.15598#S5.I3.i4.p1.1),[§5\.1\.3](https://arxiv.org/html/2606.15598#S5.SS1.SSS3.p1.1)\.
- Y\. Fan, T\. Ren, C\. Huang, Z\. He, and X\. S\. Wang \(2024c\)Grounding natural language to SQL translation with data\-based self\-explanations\.arXiv preprint arXiv:2411\.02948\.Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p1.1)\.
- Y\. Fu, W\. Ou, Z\. Yu, and Y\. Lin \(2023\)MIGA: a unified multi\-task generation framework for conversational text\-to\-sql\.Proceedings of the AAAI Conference on Artificial Intelligence37\(11\),pp\. 12790–12798\.Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p3.1),[§6\.2](https://arxiv.org/html/2606.15598#S6.SS2.p1.1)\.
- D\. Gao, H\. Wang, Y\. Li, X\. Sun, Y\. Qian, B\. Ding, and J\. Zhou \(2024\)Text\-to\-SQL empowered by large language models: A benchmark evaluation\.Proceedings of the VLDB Endowment17\(5\),pp\. 1132–1145\.Cited by:[6th item](https://arxiv.org/html/2606.15598#S5.I3.i6.p1.1),[§5\.1\.3](https://arxiv.org/html/2606.15598#S5.SS1.SSS3.p1.1),[Table 1](https://arxiv.org/html/2606.15598#S5.T1),[Table 1](https://arxiv.org/html/2606.15598#S5.T1.6.2),[§6\.3](https://arxiv.org/html/2606.15598#S6.SS3.p1.1)\.
- Z\. Gong and Y\. Sun \(2024\)Graph reasoning enhanced language models for text\-to\-sql\.InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval,pp\. 2447–2451\.Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p1.1)\.
- S\. K\. Gorti, I\. Gofman, Z\. Liu, J\. Wu, N\. Vouitsis, G\. Yu, J\. C\. Cresswell, and R\. Hosseinzadeh \(2024\)MSC\-SQL: multi\-sample critiquing small language models for text\-to\-SQL translation\.arXiv preprint arXiv:2410\.12916\.Cited by:[§6\.3](https://arxiv.org/html/2606.15598#S6.SS3.p1.1)\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, R\. Zhang, R\. Xu, Q\. Zhu, S\. Ma, P\. Wang, X\. Bi,et al\.\(2025\)Deepseek\-r1: incentivizing reasoning capability in llms via reinforcement learning\.arXiv preprint arXiv:2501\.12948\.Cited by:[Table 1](https://arxiv.org/html/2606.15598#S5.T1.7.1.5.2),[Table 1](https://arxiv.org/html/2606.15598#S5.T1.7.1.6.2)\.
- M\. He, Y\. Shen, W\. Zhang, Q\. Peng, J\. Wang, and W\. Lu \(2025\)STaR\-SQL: self\-taught reasoner for text\-to\-SQL\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 24365–24375\.Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p3.1),[§6\.2](https://arxiv.org/html/2606.15598#S6.SS2.p1.1)\.
- Z\. Hong, Z\. Yuan, H\. Chen, Q\. Zhang, F\. Huang, and X\. Huang \(2024a\)Knowledge\-to\-SQL: enhancing SQL generation with data expert LLM\.InFindings of the Association for Computational Linguistics: ACL 2024,pp\. 10997–11008\.Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p3.1),[§6\.2](https://arxiv.org/html/2606.15598#S6.SS2.p1.1)\.
- Z\. Hong, Z\. Yuan, Q\. Zhang, H\. Chen, J\. Dong, F\. Huang, and X\. Huang \(2024b\)Next\-generation database interfaces: a survey of LLM\-based Text\-to\-SQL\.arXiv preprint arXiv:2406\.08426\.Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p1.1)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\)LoRA: low\-rank adaptation of large language models\.InThe Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25\-29, 2022,Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p3.1),[§5\.1\.4](https://arxiv.org/html/2606.15598#S5.SS1.SSS4.p2.1),[§6\.2](https://arxiv.org/html/2606.15598#S6.SS2.p1.1)\.
- Z\. Huang, S\. Zhang, K\. Liu, and E\. Wu \(2024\)Data\-centric text\-to\-SQL with large language models\.InNeurIPS 2024 Third Table Representation Learning Workshop,Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p1.1)\.
- S\. Karki, P\. Karki, B\. L\. Shrestha, and T\. N\. Jha \(2025\)Smaller large language models for text\-to\-SQL: performance analysis and optimal performance\.In2025 International Conference on Inventive Computation Technologies \(ICICT\),pp\. 1–7\.Cited by:[§6\.3](https://arxiv.org/html/2606.15598#S6.SS3.p1.1)\.
- J\. Li, B\. Hui, G\. Qu, J\. Yang, B\. Li, B\. Li, B\. Wang, B\. Qin, R\. Geng, N\. Huo, X\. Zhou, M\. Chenhao, G\. Li, K\. Chang, F\. Huang, R\. Cheng, and Y\. Li \(2023\)Can LLM already serve as a database interface? a big bench for large\-scale database grounded text\-to\-SQLs\.InAdvances in Neural Information Processing Systems,Vol\.36,pp\. 42330–42357\.Cited by:[2nd item](https://arxiv.org/html/2606.15598#S5.I2.i2.p1.1.1),[§5\.1\.1](https://arxiv.org/html/2606.15598#S5.SS1.SSS1.p1.1)\.
- X\. Li, Q\. Cai, Y\. Shu, C\. Guo, and B\. Yang \(2025\)AID\-SQL: adaptive in\-context learning of text\-to\-SQL with difficulty\-aware instruction and retrieval\-augmented generation\.In2025 IEEE 41st International Conference on Data Engineering \(ICDE\),pp\. 3945–3957\.Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p1.1)\.
- Q\. Liu, D\. Yang, J\. Zhang, J\. Guo, B\. Zhou, and J\. Lou \(2021\)Awakening latent grounding from pretrained language models for semantic parsing\.InFindings of the Association for Computational Linguistics: ACL\-IJCNLP 2021,pp\. 1174–1189\.Cited by:[§5\.1\.4](https://arxiv.org/html/2606.15598#S5.SS1.SSS4.p1.1)\.
- X\. Liu, S\. Shen, B\. Li, P\. Ma, R\. Jiang, Y\. Zhang, J\. Fan, G\. Li, N\. Tang, and Y\. Luo \(2024\)A survey of NL2SQL with large language models: where are we, and where are we going?\.arXiv preprint arXiv:2408\.05109\.Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p1.1)\.
- X\. Liu, S\. Shen, B\. Li, N\. Tang, and Y\. Luo \(2025\)Nl2sql\-bugs: a benchmark for detecting semantic errors in NL2SQL translation\.arXiv preprint arXiv:2503\.11984\.Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p1.1)\.
- R\. Luo, L\. Wang, B\. Lin, Z\. Lin, and Y\. Yang \(2024\)PTD\-SQL: partitioning and targeted drilling with LLMs in text\-to\-SQL\.arXiv preprint arXiv:2409\.14082\.Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p1.1)\.
- A\. Oliveira, E\. Nascimento, J\. Pinheiro, C\. V\. S\. Avila, G\. Coelho, L\. Feijó, Y\. Izquierdo, G\. García, L\. A\. P\. P\. Leme, M\. Lemos,et al\.\(2024\)Small, medium, and large language models for text\-to\-SQL\.InInternational Conference on Conceptual Modeling,pp\. 276–294\.Cited by:[§6\.3](https://arxiv.org/html/2606.15598#S6.SS3.p1.1)\.
- M\. Pourreza and D\. Rafiei \(2023\)DIN\-sql: decomposed in\-context learning of text\-to\-sql with self\-correction\.InAdvances in Neural Information Processing Systems,Vol\.36,pp\. 36339–36348\.Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p1.1),[§4\.2\.1](https://arxiv.org/html/2606.15598#S4.SS2.SSS1.p2.1),[3rd item](https://arxiv.org/html/2606.15598#S5.I3.i3.p1.1),[§5\.1\.3](https://arxiv.org/html/2606.15598#S5.SS1.SSS3.p1.1),[§6\.1](https://arxiv.org/html/2606.15598#S6.SS1.p1.1)\.
- M\. Pourreza and D\. Rafiei \(2024\)DTS\-SQL: decomposed text\-to\-SQL with small large language models\.InFindings of the Association for Computational Linguistics: EMNLP 2024,pp\. 8212–8220\.Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p1.1),[§1](https://arxiv.org/html/2606.15598#S1.p3.1),[1st item](https://arxiv.org/html/2606.15598#S5.I3.i1.p1.1),[§5\.1\.3](https://arxiv.org/html/2606.15598#S5.SS1.SSS3.p1.1),[§6\.3](https://arxiv.org/html/2606.15598#S6.SS3.p1.1)\.
- M\. Pourreza, S\. Talaei, R\. Sun, X\. Wan, H\. Li, A\. Mirhoseini, A\. Saberi, S\. Arik,et al\.\(2025\)Reasoning\-sql: reinforcement learning with sql tailored partial rewards for reasoning\-enhanced text\-to\-sql\.arXiv preprint arXiv:2503\.23157\.Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p3.1),[§6\.2](https://arxiv.org/html/2606.15598#S6.SS2.p1.1)\.
- T\. Ren, Y\. Fan, Z\. He, R\. Huang, J\. Dai, C\. Huang, Y\. Jing, K\. Zhang, Y\. Yang, and X\. S\. Wang \(2024\)Purple: making a large language model a better SQL writer\.In2024 IEEE 40th International Conference on Data Engineering \(ICDE\),pp\. 15–28\.Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p1.1),[5th item](https://arxiv.org/html/2606.15598#S5.I3.i5.p1.1),[§5\.1\.3](https://arxiv.org/html/2606.15598#S5.SS1.SSS3.p1.1)\.
- S\. Robertson and H\. Zaragoza \(2009\)The probabilistic relevance framework: bm25 and beyond\.Found\. Trends Inf\. Retr\.3\(4\),pp\. 333–389\.Cited by:[§5\.1\.4](https://arxiv.org/html/2606.15598#S5.SS1.SSS4.p1.1)\.
- G\. Rossiello, N\. Pham, M\. Glass, J\. Lee, and D\. Subramanian \(2025\)Rationalization models for text\-to\-sql\.arXiv preprint arXiv:2502\.06759\.Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p3.1),[§6\.2](https://arxiv.org/html/2606.15598#S6.SS2.p1.1)\.
- S\. Sarker, X\. Dong, X\. Li, and L\. Qian \(2024\)Enhancing LLM fine\-tuning for text\-to\-SQLs by SQL quality measurement\.arXiv preprint arXiv:2410\.01869\.Cited by:[§6\.2](https://arxiv.org/html/2606.15598#S6.SS2.p1.1)\.
- K\. Shen and M\. Kejriwal \(2024\)SelECT\-SQL: self\-correcting ensemble chain\-of\-thought for text\-to\-SQL\.arXiv preprint arXiv:2409\.10007\.Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p2.1),[§6\.1](https://arxiv.org/html/2606.15598#S6.SS1.p1.1)\.
- Z\. Shen, P\. Vougiouklis, C\. Diao, K\. Vyas, Y\. Ji, and J\. Z\. Pan \(2024\)Improving retrieval\-augmented text\-to\-SQL with AST\-based ranking and schema pruning\.arXiv preprint arXiv:2407\.03227\.Cited by:[§6\.2](https://arxiv.org/html/2606.15598#S6.SS2.p1.1)\.
- L\. Sheng and S\. Xu \(2025\)CSC\-sql: corrective self\-consistency in text\-to\-sql via reinforcement learning\.arXiv preprint arXiv:2505\.13271\.Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p3.1),[§6\.2](https://arxiv.org/html/2606.15598#S6.SS2.p1.1)\.
- L\. Shi, Z\. Tang, N\. Zhang, X\. Zhang, and Z\. Yang \(2024\)A survey on employing large language models for text\-to\-SQL tasks\.ACM Computing Surveys\.Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p1.1)\.
- C\. Tai, Z\. Chen, T\. Zhang, X\. Deng, and H\. Sun \(2023\)Exploring chain of thought style prompting for text\-to\-SQL\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,pp\. 5376–5393\.Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.15598#S2.SS2.p1.3),[§6\.1](https://arxiv.org/html/2606.15598#S6.SS1.p1.1)\.
- B\. Wang, C\. Ren, J\. Yang, X\. Liang, J\. Bai, L\. Chai, Z\. Yan, Q\. Zhang, D\. Yin, X\. Sun, and Z\. Li \(2025\)MAC\-SQL: a multi\-agent collaborative framework for text\-to\-SQL\.InProceedings of the 31st International Conference on Computational Linguistics,pp\. 540–557\.Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p3.1),[2nd item](https://arxiv.org/html/2606.15598#S5.I3.i2.p1.1),[§5\.1\.3](https://arxiv.org/html/2606.15598#S5.SS1.SSS3.p1.1),[Table 1](https://arxiv.org/html/2606.15598#S5.T1),[Table 1](https://arxiv.org/html/2606.15598#S5.T1.6.2),[Table 1](https://arxiv.org/html/2606.15598#S5.T1.7.1.19.3),[§6\.2](https://arxiv.org/html/2606.15598#S6.SS2.p1.1)\.
- D\. Wang, L\. Dou, X\. Zhang, Q\. Zhu, and W\. Che \(2024\)DAC: decomposed automation correction for text\-to\-SQL\.arXiv preprint arXiv:2408\.08779\.Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p2.1),[7th item](https://arxiv.org/html/2606.15598#S5.I3.i7.p1.1),[§5\.1\.3](https://arxiv.org/html/2606.15598#S5.SS1.SSS3.p1.1),[§6\.3](https://arxiv.org/html/2606.15598#S6.SS3.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. Chi, Q\. V\. Le, and D\. Zhou \(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.InAdvances in Neural Information Processing Systems,Vol\.35,pp\. 24824–24837\.Cited by:[§6\.1](https://arxiv.org/html/2606.15598#S6.SS1.p1.1),[§6\.3](https://arxiv.org/html/2606.15598#S6.SS3.p1.1)\.
- Y\. Xie, X\. Jin, T\. Xie, M\. Matrixmxlin, L\. Chen, C\. Yu, C\. Lei, C\. Zhuo, B\. Hu, and Z\. Li \(2024\)Decomposition for enhancing attention: improving LLM\-based text\-to\-SQL through workflow paradigm\.InFindings of the Association for Computational Linguistics: ACL 2024,pp\. 10796–10816\.Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p1.1),[§1](https://arxiv.org/html/2606.15598#S1.p2.1),[§4\.2\.1](https://arxiv.org/html/2606.15598#S4.SS2.SSS1.p2.1),[8th item](https://arxiv.org/html/2606.15598#S5.I3.i8.p1.1),[§5\.1\.3](https://arxiv.org/html/2606.15598#S5.SS1.SSS3.p1.1),[§6\.1](https://arxiv.org/html/2606.15598#S6.SS1.p1.1),[§6\.3](https://arxiv.org/html/2606.15598#S6.SS3.p1.1)\.
- L\. Yan, Q\. Wan, C\. Liu, S\. Duan, P\. Han, and Y\. Xu \(2025\)SPS\-SQL: enhancing text\-to\-SQL generation on small\-scale LLMs with pre\-synthesized queries\.Pattern Recognition Letters\.Cited by:[§6\.3](https://arxiv.org/html/2606.15598#S6.SS3.p1.1)\.
- A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei,et al\.\(2024a\)Qwen2\.5 technical report\.arXiv preprint arXiv:2412\.15115\.Cited by:[Table 1](https://arxiv.org/html/2606.15598#S5.T1.7.1.20.2)\.
- J\. Yang, W\. Zhang, S\. Guo, Z\. Ye, L\. Jing, S\. Liu, Y\. Li, J\. Wu, C\. Liu, X\. Ma,et al\.\(2026\)IQuest\-coder\-v1 technical report\.arXiv preprint arXiv:2603\.16733\.Cited by:[§5\.2](https://arxiv.org/html/2606.15598#S5.SS2.p6.1)\.
- J\. Yang, B\. Hui, M\. Yang, J\. Yang, J\. Lin, and C\. Zhou \(2024b\)Synthesizing text\-to\-SQL data from weak and strong LLMs\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 7864–7875\.Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p3.1),[§6\.2](https://arxiv.org/html/2606.15598#S6.SS2.p1.1)\.
- T\. Yu, R\. Zhang, K\. Yang, M\. Yasunaga, D\. Wang, Z\. Li, J\. Ma, I\. Li, Q\. Yao, S\. Roman, Z\. Zhang, and D\. Radev \(2018\)Spider: a large\-scale human\-labeled dataset for complex and cross\-domain semantic parsing and text\-to\-SQL task\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,pp\. 3911–3921\.Cited by:[1st item](https://arxiv.org/html/2606.15598#S5.I2.i1.p1.1.1),[§5\.1\.1](https://arxiv.org/html/2606.15598#S5.SS1.SSS1.p1.1)\.
- B\. Zhai, C\. Xu, Y\. He, and Z\. Yao \(2025\)ExCoT: optimizing reasoning for text\-to\-SQL with execution feedback\.arXiv preprint arXiv:2503\.19988\.Cited by:[§6\.1](https://arxiv.org/html/2606.15598#S6.SS1.p1.1)\.
- C\. Zhang, Y\. Mao, Y\. Fan, Y\. Mi, Y\. Gao, L\. Chen, D\. Lou, and J\. Lin \(2024a\)FinSQL: model\-agnostic llms\-based text\-to\-sql framework for financial analysis\.InCompanion of the 2024 International Conference on Management of Data, SIGMOD/PODS 2024, Santiago, Chile, June 9\-15, 2024,pp\. 93–105\.Cited by:[§6\.2](https://arxiv.org/html/2606.15598#S6.SS2.p1.1)\.
- H\. Zhang, R\. Cao, L\. Chen, H\. Xu, and K\. Yu \(2023\)ACT\-SQL: in\-context learning for text\-to\-SQL with automatically\-generated chain\-of\-thought\.InFindings of the Association for Computational Linguistics: EMNLP 2023,pp\. 3501–3532\.Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p2.1),[§4\.2\.1](https://arxiv.org/html/2606.15598#S4.SS2.SSS1.p2.1),[§6\.1](https://arxiv.org/html/2606.15598#S6.SS1.p1.1)\.
- M\. Zhang, K\. Ma, L\. Xu, K\. Zhang, Y\. Peng, and R\. Jin \(2025\)CLEAR: a parser\-independent disambiguation framework for NL2SQL\.In2025 IEEE 41st International Conference on Data Engineering \(ICDE\),pp\. 3302–3315\.Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p1.1)\.
- W\. Zhang, Y\. Wang, Y\. Song, V\. J\. Wei, Y\. Tian, Y\. Qi, J\. H\. Chan, R\. C\. Wong, and H\. Yang \(2024b\)Natural language interfaces for tabular data querying and visualization: a survey\.IEEE Transactions on Knowledge and Data Engineering36\(11\),pp\. 6699–6718\.Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p1.1)\.
- F\. Zhao, S\. Deep, F\. Psallidas, A\. Floratou, D\. Agrawal, and A\. E\. Abbadi \(2024\)Sphinteract: resolving ambiguities in NL2SQL through user interaction\.Proceedings of the VLDB Endowment18\(4\),pp\. 1145–1158\.Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p1.1)\.
- B\. Zheng, L\. Bi, R\. Xi, L\. Chen, Y\. Gao, X\. Zhou, and C\. S\. Jensen \(2023\)RHB\-net: a relation\-aware historical bridging network for text2sql auto\-completion\.InProceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval,pp\. 1458–1467\.Cited by:[§1](https://arxiv.org/html/2606.15598#S1.p1.1)\.