Knowledge Distillation for Low-Resource Open-source Text-to-SQL Model

arXiv cs.CL Papers

Summary

This paper proposes a knowledge-aware Text-to-SQL framework that uses knowledge distillation to improve performance in low-resource settings by constructing task-specific knowledge bases and generating synthetic training data. Experiments on seven benchmarks show substantial improvements, especially for open-source models.

arXiv:2605.22843v1 Announce Type: new Abstract: Text-to-SQL converts natural language questions into executable SQL queries, enabling non-technical users to access relational databases for analytics and intelligent data services. In real-world scenarios, performance is often constrained by low-resource settings, where high-quality annotated <question, SQL\> pairs are scarce, particularly for domain-specific databases. Additional challenges include opaque schema definitions, abbreviations, and implicit business logic that are not explicitly encoded in the schema. Existing data synthesis and prompting techniques improve coverage but often fail to produce task-specific, semantically grounded examples aligned with database constraints. To address these challenges, we propose a knowledge-aware Text-to-SQL framework that constructs a task-specific knowledge base including schema semantics, abbreviations, business logic, and query patterns, and injects them into both training and inference. This framework generates diverse, contextually grounded synthetic training data and enhances inference through targeted knowledge retrieval. Experiments on seven benchmarks, covering both general and domain-specific datasets, demonstrate that our approach substantially improves the performance of open-source and closed-source large language models in Text-to-SQL tasks, especially in low-resource domain-specific settings, enhancing generalization, robustness, and adaptability.

Cached at: 05/25/26, 08:54 AM

# Knowledge Distillation for Low-Resource Open-source Text-to-SQL Model
Source: [https://arxiv.org/html/2605.22843](https://arxiv.org/html/2605.22843)
Tianhao Qiu, Shenzhen University, Shenzhen, China, 2310275033@email\.szu\.edu\.cn; Xiaojun Chen, Shenzhen University, Shenzhen, China, xjchen@szu\.edu\.cn

###### Abstract

Text\-to\-SQL converts natural language questions into executable SQL queries, enabling non\-technical users to access relational databases for analytics and intelligent data services\. In real\-world scenarios, performance is often constrained by low\-resource settings, where high\-quality annotated <question, SQL\> pairs are scarce, particularly for domain\-specific databases\. Additional challenges include opaque schema definitions, abbreviations, and implicit business logic that are not explicitly encoded in the schema\. Existing data synthesis and prompting techniques improve coverage but often fail to produce task\-specific, semantically grounded examples aligned with database constraints\. To address these challenges, we propose a knowledge\-aware Text\-to\-SQL framework that constructs a task\-specific knowledge base including schema semantics, abbreviations, business logic, and query patterns, and injects them into both training and inference\. This framework generates diverse, contextually grounded synthetic training data and enhances inference through targeted knowledge retrieval\. Experiments on seven benchmarks, covering both general and domain\-specific datasets, demonstrate that our approach substantially improves the performance of open\-source and closed\-source large language models in Text\-to\-SQL tasks, especially in low\-resource domain\-specific settings, enhancing generalization, robustness, and adaptability\.


## 1 Introduction

Text\-to\-SQL is a foundational task in natural language processing that translates natural language questions into executable SQL queries\. By serving as a bridge between non\-technical users and relational databases, it enables intuitive and scalable access to structured data, powering applications such as business analytics, intelligent data services, and reporting\. However, accurately mapping user intent to SQL while adhering to the strict syntactic and semantic constraints of relational schemas remains a core challenge Qin et al\. \([2022](https://arxiv.org/html/2605.22843#bib.bib41)\); Katsogiannis\-Meimarakis and Koutrika \([2023](https://arxiv.org/html/2605.22843#bib.bib52)\)\.

A key challenge in real\-world Text\-to\-SQL is low\-resource settings, where only a limited number of labeled <question, SQL\> pairs are available for a given task, especially for open\-source models, which cannot leverage proprietary data due to privacy constraints\. Recent work has sought to mitigate this limitation through data synthesis strategies\. Rule\-based methods using grammars and templates Yu et al\. \([2018a](https://arxiv.org/html/2605.22843#bib.bib119)\); Wu et al\. \([2021](https://arxiv.org/html/2605.22843#bib.bib114)\) and LLM\-based approaches leveraging prompt engineering Li et al\. \([2024a](https://arxiv.org/html/2605.22843#bib.bib97)\); Yang et al\. \([2024](https://arxiv.org/html/2605.22843#bib.bib98)\); Li et al\. \([2025](https://arxiv.org/html/2605.22843#bib.bib123)\) have expanded training coverage\. However, these approaches primarily generate general\-purpose samples and often fail to produce task\-specific, semantically grounded examples, resulting in poor alignment with real\-world database constraints Pourreza and Rafiei \([2023](https://arxiv.org/html/2605.22843#bib.bib81)\); Wang et al\. \([2022](https://arxiv.org/html/2605.22843#bib.bib76)\)\.

To address this, we propose a framework to distill structured, task\-specific knowledge from closed\-source LLMs into open\-source models\. The framework captures domain terminology and SQL query patterns that encode semantic relationships between questions and schema elements\. This knowledge is then leveraged to synthesize high\-quality, grounded training examples for fine\-tuning and to provide task\-relevant context during inference, enhancing the model’s reasoning over complex queries\. By transferring the implicit understanding of closed\-source models to open\-source ones, our framework enables more accurate, context\-aware, and executable SQL generation, even in low\-resource settings\.

Our main contributions are as follows:

1. Structured Knowledge Construction: We develop a systematic approach to construct task\-specific knowledge, including schema knowledge, domain terminology, and SQL query patterns\. This includes algorithms for extracting domain terms and building the SQL Pattern Graph, which captures recurring relationships between question types and SQL skeletons\.
2. Knowledge\-Aware Training and Inference: Leveraging the constructed knowledge, we synthesize diverse and semantically accurate <question, SQL\> pairs with LLMs for fine\-tuning, and retrieve relevant schema, domain, and query pattern knowledge at inference to guide reasoning\. This unified approach improves generalization, context\-awareness, and the accuracy of SQL generation in low\-resource and domain\-specific settings\.
3. Extensive Evaluation: We conduct comprehensive experiments across seven benchmarks spanning general and domain\-specific datasets\. Results demonstrate that our framework consistently enhances the performance of both open\-source and closed\-source LLMs, highlighting the value of structured knowledge for generalization, interpretability, and adaptability in real\-world Text\-to\-SQL tasks\.

## 2 Related Work

### 2\.1 Text\-to\-SQL

Early Text\-to\-SQL solutions were mainly rule\- or template\-driven Li and Jagadish \([2014](https://arxiv.org/html/2605.22843#bib.bib77)\); Mahmud et al\. \([2015](https://arxiv.org/html/2605.22843#bib.bib78)\), relying on handcrafted rules or SQL templates to convert natural language into queries\. While effective for simple scenarios, they struggled to scale to complex, multi\-domain settings due to rigidity and labor\-intensive template design\. Benchmark datasets such as WikiSQL Zhong et al\. \([2017](https://arxiv.org/html/2605.22843#bib.bib47)\), Spider Yu et al\. \([2018b](https://arxiv.org/html/2605.22843#bib.bib26)\), KaggleDBQA Lee et al\. \([2021](https://arxiv.org/html/2605.22843#bib.bib28)\), and BIRD Li et al\. \([2023c](https://arxiv.org/html/2605.22843#bib.bib14)\) later enabled more realistic, multi\-table, and cross\-domain research\.

With the rise of deep learning, Text\-to\-SQL has been reframed as a sequence\-to\-sequence problem\. Encoder\-decoder architectures Cai et al\. \([2018](https://arxiv.org/html/2605.22843#bib.bib64)\); Popescu et al\. \([2022](https://arxiv.org/html/2605.22843#bib.bib46)\); Qi et al\. \([2022](https://arxiv.org/html/2605.22843#bib.bib72)\), enhanced by attention mechanisms Liu et al\. \([2023b](https://arxiv.org/html/2605.22843#bib.bib45)\), graph\-based schema representations Xu et al\. \([2018](https://arxiv.org/html/2605.22843#bib.bib42)\); Li et al\. \([2023b](https://arxiv.org/html/2605.22843#bib.bib35)\); Zheng et al\. \([2022](https://arxiv.org/html/2605.22843#bib.bib61)\); Wang et al\. \([2020](https://arxiv.org/html/2605.22843#bib.bib60)\), and syntax\-aware decoding Guo et al\. \([2019](https://arxiv.org/html/2605.22843#bib.bib65)\); Scholak et al\. \([2021](https://arxiv.org/html/2605.22843#bib.bib71)\); Li et al\. \([2023a](https://arxiv.org/html/2605.22843#bib.bib91)\); Wang et al\. \([2022](https://arxiv.org/html/2605.22843#bib.bib76)\), have become dominant\. Tabular language models like TaBERT Yin et al\. \([2020](https://arxiv.org/html/2605.22843#bib.bib79)\) further support joint modeling of text and schema\. Despite these advances, training such models remains expensive, and domain adaptation is challenging\.

Recently, large language models \(LLMs\) such as GPT OpenAI \([2023b](https://arxiv.org/html/2605.22843#bib.bib5), [a](https://arxiv.org/html/2605.22843#bib.bib6)\) and LLaMA Touvron et al\. \([2023a](https://arxiv.org/html/2605.22843#bib.bib10), [b](https://arxiv.org/html/2605.22843#bib.bib7)\) have demonstrated remarkable Text\-to\-SQL capabilities\. Three main paradigms have emerged: supervised fine\-tuning \(SFT\) Sun et al\. \([2023](https://arxiv.org/html/2605.22843#bib.bib21)\), which updates model parameters using labeled <question, SQL\> pairs; in\-context learning \(ICL\) Dong et al\. \([2023](https://arxiv.org/html/2605.22843#bib.bib40)\); Nan et al\. \([2023](https://arxiv.org/html/2605.22843#bib.bib19)\); Liu et al\. \([2023a](https://arxiv.org/html/2605.22843#bib.bib13)\); Gao et al\. \([2023](https://arxiv.org/html/2605.22843#bib.bib2)\), which relies on carefully designed prompts without modifying model parameters; and reinforcement learning \(RL\) Shao et al\. \([2024](https://arxiv.org/html/2605.22843#bib.bib143)\); Pourreza et al\. \([2025](https://arxiv.org/html/2605.22843#bib.bib145)\); Ma et al\. \([2025](https://arxiv.org/html/2605.22843#bib.bib147)\), which leverages feedback to directly optimize model behavior, improving robustness and alignment with complex objectives\.

![Refer to caption](https://arxiv.org/html/2605.22843v1/figures/kg_gen.png)

Figure 1: Our proposed knowledge enhancement framework for Text\-to\-SQL tasks\.

### 2\.2 Data Synthesis

Data synthesis methods aim to automatically generate additional <question, SQL\> pairs to enhance training coverage, query diversity, and model robustness\. Early rule\-based approaches include template\-driven generation, which translates manually crafted or database\-derived SQL templates into questions Guo et al\. \([2018](https://arxiv.org/html/2605.22843#bib.bib124)\); Hu et al\. \([2023](https://arxiv.org/html/2605.22843#bib.bib117)\); Li et al\. \([2024a](https://arxiv.org/html/2605.22843#bib.bib97)\); grammar\-based generation, which constructs SQL via ASTs or grammars and converts them to natural language Wu et al\. \([2021](https://arxiv.org/html/2605.22843#bib.bib114)\); Wang et al\. \([2021](https://arxiv.org/html/2605.22843#bib.bib115)\); Zhang et al\. \([2023](https://arxiv.org/html/2605.22843#bib.bib116)\); slot\-filling, which populates reusable templates with schema elements or values Yu et al\. \([2018a](https://arxiv.org/html/2605.22843#bib.bib119)\); Weir et al\. \([2020](https://arxiv.org/html/2605.22843#bib.bib120)\); Yu et al\. \([2021](https://arxiv.org/html/2605.22843#bib.bib121)\); Li et al\. \([2024a](https://arxiv.org/html/2605.22843#bib.bib97)\), though often producing repetitive or unnatural phrasing; and question\-to\-SQL with existing models, which generates questions first and predicts SQL, potentially introducing noise Yang et al\. \([2021](https://arxiv.org/html/2605.22843#bib.bib118)\)\. While rule\-based methods ensure structural correctness, they often struggle with scalability and semantic diversity\. LLM\-based synthesis has increasingly leveraged in\-context learning with SQL templates, control prompts, and curated examples\. For instance, Pourreza et al\. \([2024](https://arxiv.org/html/2605.22843#bib.bib103)\) select SQL templates from Spider to guide generation, Yang et al\. \([2024](https://arxiv.org/html/2605.22843#bib.bib98)\) control SQL difficulty via table counts, and Li et al\. \([2025](https://arxiv.org/html/2605.22843#bib.bib123)\) synthesize diverse databases and systematically generate QA pairs with controlled complexity and language styles\. However, because LLMs generate examples based on general patterns learned from broad pretraining data, they can lack domain grounding for a specific target database\. This may result in syntactic or semantic mismatches with the schema and poor alignment with real\-world constraints Pourreza and Rafiei \([2023](https://arxiv.org/html/2605.22843#bib.bib81)\); Wang et al\. \([2022](https://arxiv.org/html/2605.22843#bib.bib76)\)\.

## 3 Motivation

External knowledge is crucial in complex Text\-to\-SQL tasks, helping models interpret user intent, align with database schemas, and generate valid queries\. We categorize such knowledge into three types: 1\) Schema Knowledge: database structure, including table/column names, value formats, and relationships, enabling accurate schema linking; 2\) Domain Knowledge: task\-specific concepts, terminology, and computation logic, allowing reasoning over derived metrics or expressions; 3\) SQL Query Pattern Graph: a structured representation of canonical SQL templates that capture typical reasoning patterns, modeling the mapping from question intent to SQL logic, including constructs such as subqueries, joins, and aggregations\.

Closed\-source LLMs may inherently capture portions of this knowledge, whereas open\-source models often lack it entirely\. To address this, we propose a unified knowledge distillation framework \(Figure [1](https://arxiv.org/html/2605.22843#S2.F1)\) that constructs, verifies, and applies knowledge through a three\-stage pipeline\. Raw knowledge extracted from schema documentation, question–SQL pairs, and query clusters is first filtered and normalized using a lightweight LLM module combined with expert cross\-validation\. The verified knowledge is organized into four unidirectional tables, including $\mathcal{T}$ for domain terms and the SQL query pattern graph $\mathcal{G}$\. This knowledge base enables: \(i\) Knowledge\-Enhanced In\-Context Learning \(KE\-ICL\), which enriches prompts and reduces ambiguity, and \(ii\) Knowledge\-Enhanced Reinforcement Learning \(KE\-RL\), which generates diverse, schema\-faithful training data to improve model robustness\.

## 4 Knowledge Construction

Our knowledge construction framework for Text\-to\-SQL consists of four stages: \(i\) Schema Knowledge Enrichment, which enhances schema semantics through clarified names and descriptions; \(ii\) Domain Terminology Construction, which maps domain\-specific terms to SQL logic; \(iii\) SQL Query Pattern Graph Building, which builds a graph of query skeleton patterns; and \(iv\) Knowledge Post\-processing, which validates and organizes the knowledge for context\-aware SQL generation\. The details of each stage are described below\.

### 4\.1 Schema Knowledge Enrichment

Schema knowledge can be enriched through a combination of domain experts and large language models \(LLMs\)\. The LLM infers metadata beyond the original schema definitions, generating human\-readable annotations for table and column names, clarifying abbreviations, and interpreting encoded values to enhance semantic transparency\. For instance, in the california\_schools database, a table frpm may be annotated as “Free and Reduced\-Price Meal Program statistics,” capacity as “number of available seats,” and dob as “date of birth\.” Similarly, value\-level mappings such as M/F in gender\_code are interpreted as “Male” and “Female\.” Domain experts can further validate and refine these annotations to ensure accuracy and consistency\. By bridging low\-level schema structures with natural language understanding, this enriched schema knowledge improves the model’s comprehension of database semantics, enabling more accurate and context\-aware SQL generation\.

### 4\.2 Domain Terminology Construction

This stage constructs domain\-specific terminology from database columns, as detailed in Algorithm [1](https://arxiv.org/html/2605.22843#alg1) in Appendix [C](https://arxiv.org/html/2605.22843#A3)\. Each column is first encoded into a semantic embedding and clustered into groups representing related concepts\. Candidate terms are then generated by sampling one term from each of two clusters and combining them with a sampled operator or symbol \($op$\)\. Each candidate is validated by a large language model \(LLM\), which provides a validity label, confidence score, and an optional natural\-language explanation\. Valid terms are collected, and the top $K$ terms are selected based on their confidence scores\. This approach efficiently explores the space of column combinations while ensuring semantic diversity, interpretability, and high\-quality domain terminology\.
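The candidate-generation loop described above can be sketched as follows. This is a minimal illustration and not the paper's Algorithm 1: the column names, the operator pool, and the cluster assignments are hypothetical, the embedding and clustering step is assumed to have been done upstream, and `stub_validator` is a stand-in for the LLM validator.

```python
import random
from itertools import combinations

# Hypothetical input: column -> cluster id. The paper derives clusters from
# semantic embeddings of columns; here the assignment is simply given.
COLUMN_CLUSTERS = {
    "enrollment": 0, "free_meal_count": 0,
    "num_test_takers": 1, "num_ge_1500": 1,
}
OPERATORS = ["/", "-", "+"]  # assumed operator pool


def generate_candidates(column_clusters, operators, n, rng):
    """Sample n candidates: one column from each of two clusters plus an operator."""
    by_cluster = {}
    for col, cid in column_clusters.items():
        by_cluster.setdefault(cid, []).append(col)
    cluster_pairs = list(combinations(sorted(by_cluster), 2))
    candidates = []
    for _ in range(n):
        c1, c2 = rng.choice(cluster_pairs)
        left = rng.choice(by_cluster[c1])
        right = rng.choice(by_cluster[c2])
        candidates.append(f"{left} {rng.choice(operators)} {right}")
    return candidates


def stub_validator(candidate):
    """Stand-in for the LLM validator: returns (is_valid, confidence)."""
    is_ratio = " / " in candidate  # purely illustrative rule
    return is_ratio, (0.9 if is_ratio else 0.2)


def select_top_k(candidates, validator, k):
    """Keep valid candidates and return the k with the highest confidence."""
    scored = []
    for cand in dict.fromkeys(candidates):  # de-duplicate, keep first-seen order
        valid, confidence = validator(cand)
        if valid:
            scored.append((confidence, cand))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [cand for _, cand in scored[:k]]


rng = random.Random(0)
cands = generate_candidates(COLUMN_CLUSTERS, OPERATORS, n=20, rng=rng)
top_terms = select_top_k(cands, stub_validator, k=3)
```

In practice the validator would be an LLM call that also returns the optional explanation; that part is left out because its interface is not specified in this section.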

### 4\.3 SQL Pattern Graph Building

![Refer to caption](https://arxiv.org/html/2605.22843v1/figures/sql_pattern.png)

Figure 2: SQL Pattern Graph Construction

The SQL Pattern Graph captures frequent mappings between question clusters and SQL skeleton clusters \(Figure [2](https://arxiv.org/html/2605.22843#S4.F2)\)\. Let the input sets be questions $\mathcal{Q}=\{q_1,q_2,\dots,q_N\}$ and SQL answers $\mathcal{S}=\{s_1,s_2,\dots,s_N\}$\. In the Question Processing Module, each question $q_i$ is first masked and then represented in a semantic space\. Questions are clustered based on semantic distance, resulting in a set of $k_q$ question clusters $\{\hat{q}_1,\cdots,\hat{q}_{k_q}\}$, where each cluster represents a type of skeleton\. Simultaneously, in the SQL Processing Module, SQL skeletons are extracted from the SQL answers and clustered using TF\-IDF features, resulting in a set of $k_s$ skeleton clusters $\{\hat{s}_1,\cdots,\hat{s}_{k_s}\}$\.

Next, bigram statistics between question clusters and SQL skeleton clusters are computed to estimate the co\-occurrence frequencies $f(\hat{q}_i,\hat{s}_j)$ in the original question\-SQL pairs\. From these, conditional probabilities are derived as $p(\hat{s}_j\mid\hat{q}_i)=\frac{f(\hat{q}_i,\hat{s}_j)}{\sum_{j'}f(\hat{q}_i,\hat{s}_{j'})}$\. Finally, the SQL Pattern Graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ is constructed, where vertices $\mathcal{V}=\{\hat{q}_i,\hat{s}_j\}$ represent clusters and edges $e_{ij}\in\mathcal{E}$ connect question clusters to SQL skeleton clusters with weights corresponding to the conditional probabilities $w(e_{ij})=p(\hat{s}_j\mid\hat{q}_i)$\. This graph effectively encodes recurring patterns between questions and SQL templates, enabling structured generalization for SQL query generation\.
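To make the edge-weight computation concrete, the sketch below counts co-occurrences of question clusters and skeleton clusters and normalizes each question cluster's counts into conditional probabilities. The cluster ids are toy values and the clustering itself is assumed to be done beforehand; the function name is illustrative.

```python
from collections import Counter, defaultdict


def build_pattern_graph(question_clusters, skeleton_clusters):
    """Return edge weights w[(q_hat, s_hat)] = p(s_hat | q_hat).

    question_clusters[i] and skeleton_clusters[i] are the cluster ids of the
    i-th <question, SQL> pair.
    """
    if len(question_clusters) != len(skeleton_clusters):
        raise ValueError("inputs must be aligned pair-by-pair")
    # f(q_hat, s_hat): co-occurrence frequency of each cluster pair.
    pair_counts = Counter(zip(question_clusters, skeleton_clusters))
    # Denominator: sum over j' of f(q_hat, s_hat_j').
    row_totals = defaultdict(int)
    for (q_hat, _), count in pair_counts.items():
        row_totals[q_hat] += count
    return {
        (q_hat, s_hat): count / row_totals[q_hat]
        for (q_hat, s_hat), count in pair_counts.items()
    }


# Toy data: five pairs, two question clusters, two skeleton clusters.
q_ids = ["q0", "q0", "q0", "q1", "q1"]
s_ids = ["s0", "s0", "s1", "s1", "s1"]
edges = build_pattern_graph(q_ids, s_ids)
# edges[("q0", "s0")] is 2/3, edges[("q0", "s1")] is 1/3, edges[("q1", "s1")] is 1.0
```

By construction, the outgoing weights of every question cluster sum to 1, which is what allows them to be read as a distribution over skeleton clusters.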

### 4\.4 Knowledge Post\-processing

Before storage, all extracted knowledge undergoes a two\-stage validation pipeline to ensure both semantic fidelity and execution reliability\. Schema knowledge and domain terminology are evaluated on a per\-database basis, reflecting their database\-specific semantics, while SQL query patterns are validated globally across databases\. Domain terminology and SQL patterns are assessed using a hybrid LLM–human framework with unified scoring criteria\. Two state\-of\-the\-art LLMs, Claude 3\.5 Sonnet and Gemini 1\.5 Pro, independently score each entry on a 1–5 scale along two dimensions: semantic consistency \(whether the natural\-language description faithfully captures the SQL logic\) and SQL validity \(whether the SQL is syntactically correct and executable\)\. Entries receiving scores $\geq 4$ from both LLMs proceed to human validation, where two annotators independently apply the same criteria\. Items with mutual agreement are accepted, while disagreements are resolved by a third expert adjudicator\. This process ensures high precision at scale; on the BIRD benchmark, we construct an average of 20 validated domain terms per database, while SQL query patterns are validated across the full corpus\.
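The two-stage gate can be expressed as a pair of small predicates. This sketch rests on two assumptions that the text does not spell out: the threshold is applied to both scoring dimensions of both LLM judges, and when the two annotators agree their shared verdict (accept or reject) is taken as final. All names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class LLMScore:
    semantic_consistency: int  # 1-5 scale
    sql_validity: int          # 1-5 scale


def passes_llm_gate(score_a, score_b, threshold=4):
    """True when both LLM judges give >= threshold on both dimensions (assumed)."""
    return all(
        value >= threshold
        for score in (score_a, score_b)
        for value in (score.semantic_consistency, score.sql_validity)
    )


def human_decision(annotator_1, annotator_2, adjudicator=None):
    """Return the shared verdict on agreement; otherwise the third expert's verdict."""
    if annotator_1 == annotator_2:
        return annotator_1
    if adjudicator is None:
        raise ValueError("disagreement requires an adjudicator verdict")
    return adjudicator
```

For example, an entry scored (4, 5) and (5, 4) would move on to human review, while one scored (3, 5) and (5, 5) would be filtered out before any annotator sees it.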

Once validated, the knowledge is stored according to its scope and intended usage\. Schema knowledge and domain terminology are maintained on a per\-database basis: schema knowledge augments table and column metadata, while domain terminology is indexed in a database\-specific vector store to enable precise semantic retrieval\. In contrast, SQL query patterns are organized into a global *SQL Query Pattern Graph* $\mathcal{G}$ shared across databases, capturing reusable reasoning structures that are independent of specific schema details\. This graph is extracted from BIRD and OmniSQL Li et al\. \([2025](https://arxiv.org/html/2605.22843#bib.bib123)\) and consists of approximately 50 question clusters and 150 SQL skeleton clusters, with edges encoding conditional associations between question intents and SQL structures\. The graph is stored in a graph database \(e\.g\., Neo4j\) and retrieved during both training and inference to guide structured, cross\-database SQL reasoning\.

## 5 Knowledge\-Enhanced In\-Context Learning

We propose a knowledge\-enhanced in\-context learning \(KE\-ICL\) approach that constructs a composite prompt integrating structural, semantic, and contextual cues\. The prompt follows a unified template \(Listing [1](https://arxiv.org/html/2605.22843#LST1)\), where the user’s question is inserted verbatim as `${USER_QUESTION}`, and three additional components, `${DATABASE_SCHEMA}`, `${DOMAIN_TERM}`, and `${QUERY_PATTERN}`, provide structured guidance\.

Database Schema and Domain Terms \(`${DATABASE_SCHEMA}`, `${DOMAIN_TERM}`\)\. Both components leverage a single classifier, Knowledge Linker, to predict the relevance of schema elements and domain\-specific terms with respect to the user question\. The classifier is trained on the BIRD dataset Li et al\. \([2024b](https://arxiv.org/html/2605.22843#bib.bib113)\) with a RoBERTa encoder Liu et al\. \([2019](https://arxiv.org/html/2605.22843#bib.bib141)\), and outputs relevance scores for tables, columns, and domain terms\. We select the top\-$k_1$ tables and top\-$k_2$ columns for schema linking, and the top\-$k_3$ terms for domain knowledge\.
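As a rough illustration of how the top-k selections could feed the prompt, the sketch below fills the four placeholders with Python's `string.Template`, whose `${NAME}` syntax happens to match the placeholder notation. The prompt layout is hypothetical (the paper's Listing 1 is not reproduced here), the relevance scores are made-up stand-ins for Knowledge Linker outputs, and restricting columns to the selected tables is an assumption.

```python
from string import Template

# Hypothetical prompt layout, not the paper's Listing 1.
PROMPT = Template(
    "### Schema\n${DATABASE_SCHEMA}\n"
    "### Domain terms\n${DOMAIN_TERM}\n"
    "### Similar query patterns\n${QUERY_PATTERN}\n"
    "### Question\n${USER_QUESTION}\n"
    "### SQL\n"
)


def top_k(scores, k):
    """Return the k highest-scoring keys (ties broken by name for determinism)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]


def build_prompt(question, table_scores, column_scores, term_scores,
                 patterns, k1=2, k2=4, k3=2):
    tables = top_k(table_scores, k1)
    # Assumption: only columns of the selected tables are eligible.
    eligible = {c: s for c, s in column_scores.items()
                if c.split(".")[0] in tables}
    columns = top_k(eligible, k2)
    terms = top_k(term_scores, k3)
    return PROMPT.substitute(
        DATABASE_SCHEMA="\n".join(columns),
        DOMAIN_TERM="\n".join(terms),
        QUERY_PATTERN="\n".join(patterns),
        USER_QUESTION=question,
    )


# Made-up relevance scores standing in for classifier outputs.
prompt = build_prompt(
    "Which county has the highest eligible free rate?",
    table_scores={"frpm": 0.9, "schools": 0.7, "satscores": 0.1},
    column_scores={"frpm.enrollment": 0.8, "schools.county": 0.6,
                   "satscores.avg_math": 0.95},
    term_scores={"eligible free rate = free_meal_count / enrollment": 0.9},
    patterns=["SELECT _ FROM _ ORDER BY _ DESC LIMIT _"],
)
```

Note that `satscores.avg_math` is dropped despite its high column score because its table did not make the top-$k_1$ cut; whether the actual system filters this way is not stated in this section.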

SQL Query Pattern \(`${QUERY_PATTERN}`\)\. To facilitate analogical reasoning and pattern transfer, we retrieve the top\-$k_4$ SQL skeletons from the SQL query pattern graph $\mathcal{G}$\. Given a query $q$, we first identify the two most similar question clusters, $\hat{q}_1$ and $\hat{q}_2$, and compute the conditional probability of each SQL skeleton cluster $\hat{s}_j$ as $p(\hat{s}_j\mid q)=p(\hat{q}_1)\cdot p(\hat{s}_j\mid\hat{q}_1)+p(\hat{q}_2)\cdot p(\hat{s}_j\mid\hat{q}_2)$, where $p(\hat{q}_i)$ is proportional to the similarity between $q$ and $\hat{q}_i$\. The top\-$k_4$ skeletons are then sampled based on these probabilities using weighted random sampling, selecting the most relevant patterns to include as in\-context examples for guiding SQL generation\.
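The mixture over the two nearest question clusters and the subsequent weighted draw can be sketched as below. Two details are assumptions: the cluster weights are normalized over the two selected clusters only, and sampling is done without replacement so that the same skeleton cluster is not returned twice. Similarities are assumed to be positive.

```python
import random


def skeleton_distribution(similarities, edges):
    """Mix p(s_hat | q_hat) over the two question clusters most similar to q.

    similarities: {question_cluster: similarity to q}, assumed positive.
    edges: {(question_cluster, skeleton_cluster): p(skeleton | question_cluster)}.
    """
    top2 = sorted(similarities, key=similarities.get, reverse=True)[:2]
    total = sum(similarities[c] for c in top2)
    dist = {}
    for q_hat in top2:
        p_q = similarities[q_hat] / total  # p(q_hat) proportional to similarity
        for (src, s_hat), p_s in edges.items():
            if src == q_hat:
                dist[s_hat] = dist.get(s_hat, 0.0) + p_q * p_s
    return dist


def sample_skeletons(dist, k, rng):
    """Weighted random sampling of up to k skeleton clusters, without replacement."""
    pool = dict(dist)
    chosen = []
    while pool and len(chosen) < k:
        names = list(pool)
        pick = rng.choices(names, weights=[pool[n] for n in names], k=1)[0]
        chosen.append(pick)
        del pool[pick]
    return chosen


# Toy graph: q0 -> {s0: 2/3, s1: 1/3}, q1 -> {s1: 1.0}.
toy_edges = {("q0", "s0"): 2 / 3, ("q0", "s1"): 1 / 3, ("q1", "s1"): 1.0}
toy_sims = {"q0": 0.6, "q1": 0.2, "q2": 0.1}
dist = skeleton_distribution(toy_sims, toy_edges)
picked = sample_skeletons(dist, k=2, rng=random.Random(0))
```

With these toy numbers, $p(\hat{q}_0)=0.75$ and $p(\hat{q}_1)=0.25$, so both skeleton clusters end up with probability 0.5.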

![Refer to caption](https://arxiv.org/html/2605.22843v1/figures/incontext_learning.png)

Figure 3: Knowledge\-Enhanced In\-Context Learning\.

## 6 Knowledge\-Enhanced Reinforcement Learning

Existing open\-source LLMs often struggle with Text\-to\-SQL tasks, particularly in incorporating domain\-specific knowledge, limiting performance on specialized databases\. We propose Knowledge\-Enhanced Reinforcement Learning \(KE\-RL\), which leverages schema and domain knowledge to generate diverse, accurate, and contextually grounded question–SQL pairs for LLM training\. As shown in Figure [3](https://arxiv.org/html/2605.22843#S5.F3), the pipeline consists of four stages: SQL template generation, knowledge\-aware Q–SQL pair generation, data augmentation, and GRPO\-based fine\-tuning\.

SQL Template Generation: We construct abstract SQL skeletons capturing high\-level query structure \(e\.g\., SELECT, WHERE, GROUP BY\) while masking schema\-specific elements\. The skeleton pool combines 440 extracted and 100 manually designed skeletons, covering common and complex patterns\. Each skeleton is expanded into executable SQL templates using LLM\-guided synthesis with three difficulty levels: Easy, Medium, Hard\. Templates are retained adaptively based on the number of tables involved, yielding 3,854 validated SQL templates\.

Knowledge\-Aware Question–SQL Pair Generation: Templates are sampled based on structural diversity and similarity to representative queries $\mathcal{R}$ from the SQL query pattern graph $\mathcal{G}$:

$$N=\left\lceil\frac{\rho}{1-\rho}\times\frac{M}{16}\right\rceil,\quad p_i=\begin{cases}\frac{S_i^{\alpha}}{\sum_j S_j^{\alpha}}, & T\neq 0\\ \frac{1}{N}, & T=0\end{cases}$$

where $S_i$ is the average cosine similarity between template $t_i$ and representative queries, $\alpha$ controls the sampling bias \($\alpha>0$ favors templates similar to known queries, $\alpha=0$ yields uniform sampling, and $\alpha<0$ encourages selection of structurally diverse templates\), and $\rho$ determines the synthetic\-to\-real data ratio\.
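A short numerical sketch of the two formulas may help. It implements only the $T\neq 0$ branch of $p_i$; $M$ and $T$ are taken as given because this section does not define them, and the similarity values are made up.

```python
import math


def num_templates(rho, m):
    """N = ceil(rho / (1 - rho) * M / 16), for 0 <= rho < 1."""
    return math.ceil(rho / (1.0 - rho) * m / 16)


def sampling_probabilities(similarities, alpha):
    """p_i = S_i^alpha / sum_j S_j^alpha, for positive similarities S_i."""
    weights = [s ** alpha for s in similarities]
    total = sum(weights)
    return [w / total for w in weights]


sims = [0.2, 0.4, 0.8]                           # made-up S_i values
uniform = sampling_probabilities(sims, 0.0)      # every weight is 1
similar = sampling_probabilities(sims, 1.0)      # favors the largest S_i
diverse = sampling_probabilities(sims, -1.0)     # favors the smallest S_i
n = num_templates(rho=0.5, m=160)                # 1.0 * 160 / 16 = 10
```

Raising $S_i$ to the power $\alpha$ is what produces the three regimes described in the text: the exponent vanishes at $\alpha=0$, preserves the similarity ranking for $\alpha>0$, and inverts it for $\alpha<0$.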

For each sampled template, a knowledge\-enriched prompt incorporates both the database schema and the top\-$k$ relevant domain knowledge entries\. An LLM generates the SQL statement and corresponding natural language question, which are validated for syntactic correctness, schema consistency, and semantic alignment\.

Data Augmentation: To enhance diversity, each validated pair undergoes SQL rewriting \(three semantically equivalent variants per original SQL\) and question rephrasing \(three alternative phrasings per SQL variant\), producing 16 Q–SQL pairs per example\. Chain\-of\-thought explanations are minimally adapted to reflect edits\.
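One reading that reproduces the count of 16 is that the original SQL plus its three rewrites (four variants) are each paired with the original question plus three rephrasings (four questions), giving 4 × 4 pairs. The sketch below follows that reading, which is an assumption; the two rewriting callables are placeholders for what would be LLM calls.

```python
def augment(sql, question, rewrite_sql, rephrase_question, n_variants=3):
    """Cross (original + rewritten SQL) with (original + rephrased questions)."""
    sql_variants = [sql] + [rewrite_sql(sql, i) for i in range(n_variants)]
    pairs = []
    for variant in sql_variants:
        questions = [question] + [
            rephrase_question(question, variant, i) for i in range(n_variants)
        ]
        pairs.extend((q, variant) for q in questions)
    return pairs


# Placeholder rewriters that only tag their input; real ones would call an LLM.
def tag_sql(sql, i):
    return f"{sql} /* rewrite {i} */"


def tag_question(question, sql, i):
    return f"{question} (phrasing {i})"


pairs = augment("SELECT 1", "What is one?", tag_sql, tag_question)
```

The same factor of 16 appears in the denominator of the template-count formula, which is consistent with each sampled template expanding into 16 training pairs, though the text does not state this link explicitly.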

GRPO Training: The synthesized and augmented dataset is used to fine\-tune the LLM with GRPO, guided by a knowledge\-aware, execution\-driven reward:

$$R_i=\begin{cases}1, & \text{if the SQL execution matches the ground truth;}\\ 0.5, & \text{if the SQL respects knowledge constraints;}\\ 0.1, & \text{if the SQL is executable;}\\ 0, & \text{otherwise.}\end{cases}$$

The intermediate reward for partial alignment encourages the model to respect relevant knowledge, such as schema and domain\-term constraints, even if the execution result is not fully correct\. This knowledge\-aware reward guides GRPO to generate SQL queries that are syntactically valid, semantically accurate, and consistent with both the database schema and associated domain knowledge, promoting robust reasoning over structured data\.
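The tiered reward can be sketched against an in-memory SQLite database. Several choices here are assumptions rather than details from the paper: results are compared as unordered row collections, the 0.5 tier is only reachable by executable SQL, and the knowledge check is a caller-supplied predicate (`uses_free_meal_rate` is a made-up example, as are the table and its values).

```python
import sqlite3


def run_sql(conn, sql):
    """Return result rows in a canonical order, or None if execution fails."""
    try:
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except (sqlite3.Error, sqlite3.Warning):
        return None


def reward(conn, predicted_sql, gold_sql, respects_knowledge):
    """Tiered reward: 1 / 0.5 / 0.1 / 0, checked from strictest to weakest."""
    predicted = run_sql(conn, predicted_sql)
    if predicted is None:
        return 0.0  # not executable
    if predicted == run_sql(conn, gold_sql):
        return 1.0  # execution result matches the ground truth
    if respects_knowledge(predicted_sql):
        return 0.5  # wrong result, but consistent with the knowledge check
    return 0.1      # executable only


# Toy database (hypothetical table and values).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schools (name TEXT, enrollment INTEGER, free_meals INTEGER)")
conn.executemany("INSERT INTO schools VALUES (?, ?, ?)",
                 [("A", 100, 60), ("B", 100, 20)])

GOLD = "SELECT name FROM schools WHERE free_meals * 1.0 / enrollment > 0.5"


def uses_free_meal_rate(sql):
    """Illustrative constraint: the query references both columns of the term."""
    return "free_meals" in sql and "enrollment" in sql
```

A string-containment check is far cruder than whatever constraint checking the actual system uses; it is only meant to show where such a predicate plugs into the reward.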

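## 7 Experiment

### 7\.1 Main Result

Table 1: Execution accuracy on seven benchmarks\. Column groups: Standard \(Spider\-dev, Spider\-test, BIRD\-dev\), Robustness \(Spider\-DK, Spider\-Syn, Spider\-Realistic\), Domain\-Specific \(EHRSQL, ScienceBenchmark\)\.

| LLM | Method | Spider-dev | Spider-test | BIRD-dev | Spider-DK | Spider-Syn | Spider-Realistic | EHRSQL | ScienceBenchmark | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | DAIL-SQL (ICL) | 72.45 | 85.03 | 62.18 | 64.91 | 63.44 | 75.51 | 40.02 | 53.62 | 64.65 |
| GPT-4o | CodeS (ICL) | 73.88 | 86.16 | 64.41 | 65.98 | 65.86 | 76.77 | 41.27 | 54.84 | 66.15 |
| GPT-4o | KE-ICL | 74.17 | 86.40 | 65.25 | 68.41 | 66.05 | 78.14 | 47.12 | 59.53 | 68.13 |
| Gemini-Pro-1.5 | DAIL-SQL (ICL) | 78.91 | 87.20 | 66.30 | 73.02 | 69.50 | 78.30 | 51.00 | 50.20 | 69.30 |
| Gemini-Pro-1.5 | CodeS (ICL) | 80.17 | 88.31 | 67.54 | 74.20 | 71.17 | 79.72 | 52.28 | 51.54 | 70.62 |
| Gemini-Pro-1.5 | KE-ICL | 80.27 | 88.68 | 67.80 | 75.32 | 71.76 | 80.11 | 55.46 | 53.17 | 71.57 |
| Deepseek-Coder-7B | DAIL-SQL (ICL) | 61.44 | 70.90 | 38.90 | 63.95 | 53.13 | 56.02 | 13.41 | 28.88 | 48.33 |
| Deepseek-Coder-7B | CodeS (ICL) | 63.53 | 72.38 | 40.29 | 65.34 | 55.18 | 57.13 | 14.88 | 29.76 | 49.81 |
| Deepseek-Coder-7B | KE-ICL | 67.89 | 75.08 | 45.82 | 68.10 | 64.00 | 66.20 | 23.51 | 34.44 | 55.62 |
| Deepseek-Coder-7B | SQL-GEN (RL) | 74.05 | 79.73 | 49.87 | 70.30 | 59.65 | 62.72 | 28.30 | 34.62 | 57.41 |
| Deepseek-Coder-7B | Omni (RL) | 78.90 | 82.78 | 52.70 | 72.47 | 62.27 | 64.85 | 31.46 | 39.29 | 60.47 |
| Deepseek-Coder-7B | KE-RL | 80.49 | 84.76 | 58.74 | 76.58 | 66.04 | 68.73 | 45.64 | 48.32 | 66.16 |
| Granite-3.1-8B | DAIL-SQL (ICL) | 57.01 | 67.20 | 35.44 | 46.92 | 43.01 | 46.10 | 12.30 | 31.89 | 42.48 |
| Granite-3.1-8B | CodeS (ICL) | 59.15 | 68.56 | 37.16 | 48.17 | 44.63 | 47.43 | 13.59 | 33.11 | 43.98 |
| Granite-3.1-8B | KE-ICL | 61.22 | 70.24 | 41.78 | 52.27 | 47.78 | 50.13 | 20.63 | 36.12 | 47.52 |
| Granite-3.1-8B | SQL-GEN (RL) | 69.32 | 77.91 | 52.23 | 60.12 | 55.44 | 57.01 | 33.10 | 46.88 | 56.50 |
| Granite-3.1-8B | Omni (RL) | 77.14 | 82.34 | 49.98 | 56.97 | 52.21 | 55.10 | 30.70 | 41.13 | 55.69 |
| Granite-3.1-8B | KE-RL | 78.01 | 82.48 | 57.48 | 62.71 | 58.09 | 60.93 | 41.48 | 49.16 | 61.29 |
| Qwen2.5-Coder-7B | DAIL-SQL (ICL) | 74.05 | 84.44 | 54.70 | 70.43 | 65.43 | 63.71 | 19.84 | 38.22 | 58.85 |
| Qwen2.5-Coder-7B | CodeS (ICL) | 75.73 | 85.61 | 56.13 | 71.29 | 64.77 | 64.80 | 21.03 | 39.79 | 59.89 |
| Qwen2.5-Coder-7B | KE-ICL | 77.85 | 85.98 | 56.06 | 74.02 | 66.98 | 71.45 | 33.43 | 43.14 | 63.61 |
| Qwen2.5-Coder-7B | SQL-GEN (RL) | 77.56 | 85.32 | 57.92 | 73.90 | 67.80 | 70.35 | 34.51 | 44.12 | 63.94 |
| Qwen2.5-Coder-7B | Omni (RL) | 78.43 | 85.14 | 58.60 | 76.52 | 69.17 | 72.43 | 38.00 | 46.48 | 65.60 |
| Qwen2.5-Coder-7B | KE-RL | 82.34 | 87.97 | 65.82 | 80.97 | 74.83 | 77.58 | 47.93 | 56.50 | 71.74 |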

Experiment details are provided in Appendix [A](https://arxiv.org/html/2605.22843#A1)\. As shown in Table [1](https://arxiv.org/html/2605.22843#S7.T1), our knowledge\-enhanced methods, KE\-ICL and KE\-RL, achieve the highest average accuracy among ICL and RL methods, respectively, for every LLM, and lead on every individual benchmark except one: on BIRD\-dev with Qwen2\.5\-Coder\-7B, KE\-ICL \(56\.06\) trails CodeS \(56\.13\)\. KE\-ICL surpasses the second\-best ICL baseline \(CodeS\) by \+3\.2% on average across five LLMs, with the largest improvements on Deepseek\-Coder\-7B \(\+5\.81%\), Qwen2\.5\-Coder\-7B\-Instruct \(\+3\.72%\), and Granite\-3\.1\-8B \(\+3\.54%\), and smaller yet consistent gains on GPT\-4o \(\+1\.98%\) and Gemini\-Pro\-1\.5 \(\+0\.95%\)\. These results highlight the effectiveness of leveraging schema components, domain\-specific expressions, and representative query exemplars for inference\-time reasoning\. KE\-RL further strengthens performance, outperforming the Omni RL baseline by \+5\.8% on average across the three open\-source LLMs: \+5\.69% on Deepseek\-Coder\-7B \(\+16\.35% over CodeS\), \+5\.60% on Granite\-3\.1\-8B, and \+6\.14% on Qwen2\.5\-Coder\-7B\-Instruct\. Beyond standard benchmarks, KE\-RL enhances robustness against paraphrasing and ambiguity, and delivers large domain\-specific improvements: with Qwen2\.5\-Coder\-7B, \+26\.90% on EHRSQL and \+16\.71% on ScienceBenchmark compared to CodeS\. These findings indicate that training with structured, knowledge\-informed synthetic data improves syntactic validity, semantic accuracy, and reasoning over both schema and domain knowledge, providing a scalable and cost\-efficient pathway for building high\-performing Text\-to\-SQL systems with open\-source LLMs\.
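The average gains quoted above can be re-derived from the Average column of Table 1. The snippet below copies those averages for CodeS, KE-ICL, Omni, and KE-RL and computes the mean absolute difference across the LLMs that report both methods; only the helper name is new.

```python
# "Average" column of Table 1 (execution accuracy).
AVG = {
    "GPT-4o": {"CodeS": 66.15, "KE-ICL": 68.13},
    "Gemini-Pro-1.5": {"CodeS": 70.62, "KE-ICL": 71.57},
    "Deepseek-Coder-7B": {"CodeS": 49.81, "KE-ICL": 55.62,
                          "Omni": 60.47, "KE-RL": 66.16},
    "Granite-3.1-8B": {"CodeS": 43.98, "KE-ICL": 47.52,
                       "Omni": 55.69, "KE-RL": 61.29},
    "Qwen2.5-Coder-7B": {"CodeS": 59.89, "KE-ICL": 63.61,
                         "Omni": 65.60, "KE-RL": 71.74},
}


def mean_gain(method, baseline):
    """Mean absolute gain of method over baseline across LLMs reporting both."""
    gains = [row[method] - row[baseline] for row in AVG.values()
             if method in row and baseline in row]
    return sum(gains) / len(gains)


icl_gain = mean_gain("KE-ICL", "CodeS")  # five LLMs
rl_gain = mean_gain("KE-RL", "Omni")     # three open-source LLMs
print(f"KE-ICL vs CodeS: +{icl_gain:.2f}")
print(f"KE-RL vs Omni:   +{rl_gain:.2f}")
```

These are differences in execution accuracy, that is, percentage points rather than relative improvements.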

Table 2: Ablation study of knowledge-enhanced prompting with KE-SI on Qwen2.5-Coder-7B-Instruct. "↓" indicates the performance drop compared to the full setting, with absolute deltas in parentheses. Column groups: Standard (Spider-dev, Spider-test, BIRD (dev)), Robustness (Spider-DK, Spider-Syn, Spider-Realistic), Domain-Specific (EHRSQL, ScienceBenchmark).

| Knowledge Setting | Spider-dev | Spider-test | BIRD (dev) | Spider-DK | Spider-Syn | Spider-Realistic | EHRSQL | ScienceBenchmark | Average |
|---|---|---|---|---|---|---|---|---|---|
| ALL Knowledge | 82.34 | 87.97 | 65.82 | 80.97 | 74.83 | 75.58 | 47.93 | 56.50 | 71.74 |
| w/o Enhanced Schema Info | 75.58 ↓ (6.76) | 82.80 ↓ (5.17) | 58.98 ↓ (6.84) | 72.32 ↓ (8.65) | 68.23 ↓ (6.60) | 67.21 ↓ (8.37) | 43.80 ↓ (4.13) | 51.60 ↓ (4.90) | 65.07 ↓ (6.68) |
| w/o Representative Queries | 79.83 ↓ (2.51) | 86.29 ↓ (1.68) | 59.50 ↓ (6.32) | 76.50 ↓ (4.47) | 68.24 ↓ (4.59) | 72.66 ↓ (4.92) | 41.10 ↓ (6.83) | 52.40 ↓ (4.10) | 67.07 ↓ (4.68) |
| w/o Domain Terminology | 78.48 ↓ (3.86) | 85.31 ↓ (2.66) | 61.62 ↓ (4.10) | 74.22 ↓ (6.75) | 70.75 ↓ (4.08) | 69.63 ↓ (5.95) | 45.10 ↓ (2.83) | 53.30 ↓ (3.20) | 67.30 ↓ (4.4) |

### 7.2 Schema Linking Result

We evaluate our proposed two-step schema linking strategy (Section [5](https://arxiv.org/html/2605.22843#S5)) using Schema Linking Recall (SLR) Maamari et al. ([2024](https://arxiv.org/html/2605.22843#bib.bib140)), which measures the proportion of questions for which all required columns are correctly retrieved, a prerequisite for accurate SQL generation. As shown in Table [5](https://arxiv.org/html/2605.22843#A2.T5), two main findings emerge. First, the Step 1 schema classifier consistently improves over the LLM-only baseline across different numbers of retained columns ($k_2$). For example, on BIRD-dev, increasing $k_2$ from 2 to 8 gradually improves SLR, demonstrating that structure-aware filtering effectively ranks the most relevant columns. Even at lower $k_2$ values, Step 1 outperforms the LLM, highlighting the schema classifier's superior ability to select relevant tables and columns. Second, Step 2, which adds term expansion via value-aware retrieval, further increases recall over Step 1 across most $k_2$ settings. Gains are most pronounced at moderate $k_2$ values, while at very high $k_2$, improvements are smaller, suggesting that Step 1 already ranks relevant columns effectively. Overall, our two-step schema linking method consistently outperforms the LLM, providing robust schema linking and enhanced coverage for downstream SQL generation.
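As a minimal sketch of the SLR definition used above (a question counts as a hit only if every required column is retrieved), the function below computes the metric over a toy set. The function name, the `table.column` naming scheme, and the example data are illustrative assumptions, not taken from the paper.

```python
from typing import Iterable, Sequence, Set


def schema_linking_recall(
    required: Sequence[Set[str]],
    retrieved: Sequence[Iterable[str]],
) -> float:
    """Fraction of questions whose required columns are all retrieved."""
    if len(required) != len(retrieved):
        raise ValueError("required and retrieved must have the same length")
    if not required:
        return 0.0
    hits = sum(
        1 for need, got in zip(required, retrieved) if set(need) <= set(got)
    )
    return hits / len(required)


# Hypothetical toy data: three questions, columns written as "table.column".
required = [
    {"patients.age"},
    {"visits.date", "visits.patient_id"},
    {"labs.value"},
]
retrieved = [
    ["patients.age", "patients.name"],
    ["visits.date"],  # misses visits.patient_id, so this question is not a hit
    ["labs.value", "labs.unit"],
]
slr = schema_linking_recall(required, retrieved)
print(f"SLR = {slr:.3f}")
```

Because the criterion is all-or-nothing per question, retrieving extra columns never lowers SLR, which is why recall tends to rise with $k_2$.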

Table 3: Effect of varying the top-$k_3$ retrieved knowledge items on model performance (execution accuracy, %) on the BIRD-dev benchmark across different LLMs.

| Top-$k_3$ | Qwen-2.5-Coder-7B | GPT-4o-mini | GPT-4o | Gemini-1.5 |
|---|---|---|---|---|
| $k_3=0$ | 54.70 | 61.83 | 64.41 | 67.54 |
| $k_3=3$ | 55.10 | 62.51 | 64.50 | 67.30 |
| $k_3=5$ | 56.06 | 63.34 | 65.25 | 67.80 |
| $k_3=7$ | 54.90 | 64.36 | 65.28 | 68.22 |
| $k_3=9$ | 54.20 | 62.56 | 64.43 | 66.36 |

### 7.3 In-Context Learning

Impact of Knowledge Types. Following the setup in Appendix [A](https://arxiv.org/html/2605.22843#A1), we assess the contribution of different knowledge types under the KE-SI setting using the Qwen2.5-Coder-7B-Instruct model. As shown in Table [2](https://arxiv.org/html/2605.22843#S7.T2), each injected component (Enhanced Schema Info, Representative Queries, and Domain Terminology) plays a distinct and complementary role in enhancing in-context learning. Enhanced Schema Info proves to be the most critical. Its removal leads to the largest performance declines, particularly on Spider-DK ($-8.65\%$), Spider-Realistic ($-8.37\%$), and ScienceBenchmark ($-4.90\%$), underscoring its importance in establishing structural alignment for accurate SQL generation. Representative Queries also contribute substantially, especially on domain-specific benchmarks such as EHRSQL ($-6.83\%$) and ScienceBenchmark ($-4.10\%$). While Domain Terminology has a relatively smaller impact, it remains important for grounding semantic understanding in real applications. Its removal causes notable drops on Spider-Realistic ($-5.95\%$) and ScienceBenchmark ($-3.20\%$), highlighting its role in entity disambiguation and domain-specific reasoning. Overall, the removal of any individual component results in consistent performance degradation across benchmarks, emphasizing the necessity of holistic knowledge injection for robust in-context learning.

Impact of Knowledge Quantity. In this experiment, we examine the impact of injecting domain knowledge (DK) entries and representative questions (RQ) by varying $k_3$, while keeping the number of selected tables fixed at $k_1=5$ and the number of selected columns per table fixed at $k_2=12$. As shown in Table [3](https://arxiv.org/html/2605.22843#S7.T3), three key findings emerge. First, adding a moderate amount of knowledge ($k_3=5$ or $k_3=7$) improves performance over the $k_3=0$ baseline for every model, validating its usefulness for semantic grounding. For example, GPT-4o improves from 64.41% at $k_3=0$ to 65.25% at $k_3=5$. Second, performance gains peak at moderate levels ($k_3=5$ or $7$), whereas larger injections ($k_3=9$) often degrade accuracy, indicating that excessive context may introduce noise. Third, the impact varies across models: GPT-4o and Gemini-1.5 achieve their best results at $k_3=7$, while Qwen-2.5-Coder-7B peaks at $k_3=5$ and otherwise shows only marginal improvements. Overall, controlled knowledge injection enhances robustness, but excessive information can hinder performance.
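The per-model optimum can be read off Table 3 mechanically. The short sketch below copies the table values and reports, for each model, the best $k_3$ and its gain over the $k_3=0$ baseline; the helper name `best_k3` and the dictionary layout are illustrative choices, not part of the paper.

```python
# Execution accuracy (%) on BIRD-dev, copied from Table 3.
TABLE_3 = {
    "Qwen-2.5-Coder-7B": {0: 54.70, 3: 55.10, 5: 56.06, 7: 54.90, 9: 54.20},
    "GPT-4o-mini": {0: 61.83, 3: 62.51, 5: 63.34, 7: 64.36, 9: 62.56},
    "GPT-4o": {0: 64.41, 3: 64.50, 5: 65.25, 7: 65.28, 9: 64.43},
    "Gemini-1.5": {0: 67.54, 3: 67.30, 5: 67.80, 7: 68.22, 9: 66.36},
}


def best_k3(scores):
    """Return (k3, gain over the k3=0 baseline) for the best-scoring k3."""
    k = max(scores, key=scores.get)
    return k, round(scores[k] - scores[0], 2)


summary = {model: best_k3(scores) for model, scores in TABLE_3.items()}
for model, (k, gain) in summary.items():
    print(f"{model}: best k3={k}, gain={gain:+.2f}")
```

In practice $k_3$ would be tuned on held-out data for the target model rather than fixed globally, since the optimum differs between models.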

### 7.4 Reinforcement Learning

Effect of Synthetic Data Ratio. The synthetic-to-real data ratio $\rho$ controls the proportion of generated samples relative to human-annotated ones. We study how varying $\rho$ influences model performance while keeping the total training size fixed at 5,000 instances. As shown in Figure [4](https://arxiv.org/html/2605.22843#S7.F4), performance exhibits an inverted U-shaped trend: mixing real and synthetic data consistently outperforms using either alone. For domain-specific benchmarks (EHRSQL and ScienceBenchmark), optimal performance arises when 20–40% of the data is synthetic. In this range, synthetic data broadens coverage while real annotations anchor domain-specific logic. In contrast, for standard and robustness benchmarks (Spider-dev and Spider-DK), higher synthetic ratios (40–80%) are more effective, as synthetic data enriches query diversity and strengthens generalization to unseen structures. When $\rho=1.0$, no real data remains, and template sampling reduces to uniform selection across all templates without representative query guidance, disrupting query pattern preferences and leading to sharp performance drops across all datasets. Overall, these results underscore a key trade-off: *with limited annotation budgets, neither relying solely on synthetic data nor exclusively on real data is optimal. Instead, carefully tuning $\rho$ enables practitioners to maximize performance under realistic resource constraints*.
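A fixed-size mix at a given $\rho$ can be sketched as follows, reading $\rho$ as the synthetic share of the total (consistent with "20–40% of the data is synthetic" and with $\rho=1.0$ leaving no real data). The function name, the seeded sampling, and the placeholder pools are assumptions made for illustration; the paper does not describe how individual examples are drawn.

```python
import random


def mix_training_set(real, synthetic, rho, total=5000, seed=0):
    """Sample a fixed-size training set with a fraction rho of synthetic items."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    n_syn = round(rho * total)
    n_real = total - n_syn
    if n_syn > len(synthetic) or n_real > len(real):
        raise ValueError("not enough examples to reach the requested mix")
    rng = random.Random(seed)
    mixed = rng.sample(synthetic, n_syn) + rng.sample(real, n_real)
    rng.shuffle(mixed)
    return mixed


# Hypothetical placeholder pools; real entries would be <question, SQL> pairs.
real_pool = [("real", i) for i in range(5000)]
syn_pool = [("syn", i) for i in range(5000)]
train = mix_training_set(real_pool, syn_pool, rho=0.4)
n_syn = sum(1 for source, _ in train if source == "syn")
print(len(train), n_syn)
```

Holding `total` constant isolates the effect of the mix itself from the effect of simply having more training data.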

![Refer to caption](https://arxiv.org/html/2605.22843v1/figures/syn_rate.png)

Figure 4: Impact of varying synthetic data ratio $\rho$ on execution accuracy, with total training size fixed at 5,000.

![Refer to caption](https://arxiv.org/html/2605.22843v1/figures/alpha_rate.png)

Figure 5: Effect of the sampling bias parameter $\alpha$ on model execution accuracy.

Effect of Template Sampling Strategy in Data Synthesis. In our data synthesis pipeline, the hyperparameter $\alpha$ controls the balance between favoring query patterns in the training set and exploring novel query patterns. As shown in Figure [5](https://arxiv.org/html/2605.22843#S7.F5), the effect of $\alpha$ is task-dependent. For domain-specific benchmarks (EHRSQL and ScienceBenchmark), accuracy improves steadily as $\alpha$ increases, peaking at $\alpha=10$. This suggests that emphasizing templates closer to known query patterns is essential for capturing domain-specific logic. In contrast, for standard and robustness benchmarks (Spider-dev and Spider-DK), performance is highest when $\alpha<0$, indicating that encouraging structural diversity enhances cross-domain generalization. Overall, these results reveal a trade-off between relevance-driven sampling, which strengthens specialization, and diversity-driven sampling, which improves robustness. This underscores the importance of adapting $\alpha$ to different benchmark settings.
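The qualitative role of $\alpha$ can be illustrated with a softmax-style weighting. This is an assumed functional form chosen for illustration, not the sampling rule defined in the paper: each template gets a hypothetical similarity score to the training-set query patterns, and its sampling probability is proportional to $\exp(\alpha \cdot \text{similarity})$. Under this assumption, $\alpha>0$ favors familiar patterns, $\alpha<0$ favors novel ones, and $\alpha=0$ is uniform.

```python
import math
import random


def template_probabilities(similarity, alpha):
    """Softmax over alpha * similarity; alpha = 0 gives a uniform distribution."""
    logits = {t: alpha * s for t, s in similarity.items()}
    peak = max(logits.values())  # subtract the max for numerical stability
    weights = {t: math.exp(v - peak) for t, v in logits.items()}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}


def sample_templates(similarity, alpha, n, seed=0):
    """Draw n template names according to the alpha-biased distribution."""
    probs = template_probabilities(similarity, alpha)
    names = list(probs)
    rng = random.Random(seed)
    return rng.choices(names, weights=[probs[t] for t in names], k=n)


# Hypothetical similarity of each SQL skeleton to training-set query patterns.
similarity = {
    "single_table_filter": 0.9,
    "join_group_by": 0.5,
    "nested_subquery": 0.1,
}
for alpha in (-10, 0, 10):
    probs = template_probabilities(similarity, alpha)
    print(alpha, {t: round(p, 3) for t, p in probs.items()})
```

With these toy scores, $\alpha=10$ concentrates mass on the most familiar skeleton and $\alpha=-10$ on the least familiar one, mirroring the relevance-versus-diversity trade-off described above.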

## 8 Conclusion

We propose a unified knowledge\-aware Text\-to\-SQL framework that incorporates task\-specific domain knowledge—including schema semantics, abbreviations, and business logic—into both training and inference\. By generating diverse, contextually grounded synthetic data and performing targeted knowledge retrieval over SQL skeletons and representative query patterns, the framework enhances reasoning and semantic grounding, addressing challenges of data scarcity and structural complexity\. Experiments on seven benchmarks demonstrate substantial performance gains across open\- and closed\-source LLMs, improving generalization, interpretability, and adaptability\. Future work will focus on more advanced knowledge retrieval strategies that reason over skeleton structures and multi\-hop dependencies\.

## 9 Limitations

Despite its effectiveness, the framework has several limitations\. First, building and maintaining a high\-quality knowledge base requires significant effort and domain expertise, which can limit scalability and increase costs, particularly for large or frequently evolving databases\. Second, outputs generated by LLMs may still exhibit hallucinations or inconsistencies with domain constraints, which can reduce reliability\. Third, databases with very large schemas or complex table relationships can strain knowledge retrieval and reduce the effectiveness of template coverage, limiting scalability and efficiency\.

## References

- An encoder-decoder framework translating natural language to database queries. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pp. 3977–3983. Cited by: [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p2.1).
- X. Chen, T. Wang, T. Qiu, J. Qin, and M. Yang (2024) Open-sql framework: enhancing text-to-sql on open-source large language models. External Links: 2405.06674, [Link](https://arxiv.org/abs/2405.06674). Cited by: [Table 5](https://arxiv.org/html/2605.22843#A2.T5).
- X. Deng and et al. (2020) Spider-dk: evaluating text-to-sql models with implicit domain knowledge. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pp. 456–467. Cited by: [Appendix A](https://arxiv.org/html/2605.22843#A1.p1.1).
- Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, L. Li, and Z. Sui (2023) A survey for in-context learning. CoRR abs/2301.00234. Cited by: [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p3.1).
- Z. Gan and et al. (2020) Spider-syn: synonym-augmented robustness testing for text-to-sql. arXiv preprint arXiv:2005.02345. External Links: [Link](https://arxiv.org/abs/2005.02345). Cited by: [Appendix A](https://arxiv.org/html/2605.22843#A1.p1.1).
- D. Gao, H. Wang, Y. Li, X. Sun, Y. Qian, B. Ding, and J. Zhou (2023) Text-to-sql empowered by large language models: a benchmark evaluation. CoRR abs/2308.15363. Cited by: [Appendix A](https://arxiv.org/html/2605.22843#A1.p2.1), [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p3.1).
- D. Guo, Y. Sun, D. Tang, N. Duan, J. Yin, H. Chi, J. Cao, P. Chen, and M. Zhou (2018) Question generation from SQL queries improves neural semantic parsing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp. 1597–1607. External Links: [Document](https://dx.doi.org/10.18653/V1/D18-1188). Cited by: [§2.2](https://arxiv.org/html/2605.22843#S2.SS2.p1.1).
- D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, and et al. (2024) DeepSeek-coder: when the large language model meets programming - the rise of code intelligence. CoRR abs/2401.14196. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2401.14196), 2401.14196. Cited by: [Appendix A](https://arxiv.org/html/2605.22843#A1.p2.1).
- J. Guo, Z. Zhan, Y. Gao, Y. Xiao, J. Lou, T. Liu, and D. Zhang (2019) Towards complex text-to-sql in cross-domain database with intermediate representation. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pp. 4524–4535. Cited by: [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p2.1).
- Y. Hu, Y. Zhao, J. Jiang, W. Lan, H. Zhu, A. Chauhan, A. H. Li, L. Pan, J. Wang, C. Hang, S. Zhang, J. Guo, and et al. (2023) Importance of synthesizing high-quality data for text-to-sql parsing. In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pp. 1327–1343. External Links: [Document](https://dx.doi.org/10.18653/V1/2023.FINDINGS-ACL.86). Cited by: [§2.2](https://arxiv.org/html/2605.22843#S2.SS2.p1.1).
- B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Dang, A. Yang, and et al. (2024) Qwen2.5-coder technical report. CoRR abs/2409.12186. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2409.12186), 2409.12186. Cited by: [Appendix A](https://arxiv.org/html/2605.22843#A1.p2.1).
- A. Johnson and B. Lee (2023) ScienceBenchmark: a diverse query set for interdisciplinary text-to-sql evaluation. Journal of Artificial Intelligence Research 81, pp. 123–145. Cited by: [Appendix A](https://arxiv.org/html/2605.22843#A1.p1.1).
- G. Katsogiannis-Meimarakis and G. Koutrika (2023) A survey on deep learning approaches for text-to-sql. VLDB J. 32(4), pp. 905–936. Cited by: [§1](https://arxiv.org/html/2605.22843#S1.p1.1).
- C. Lee, O. Polozov, and M. Richardson (2021) KaggleDBQA: realistic evaluation of text-to-sql parsers. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pp. 2261–2273. Cited by: [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p1.1).
- B. Li and et al. (2020) Spider-realistic: enhancing text-to-sql robustness with realistic modifications. Proceedings of the 2020 ACL Workshop on Natural Language Processing, pp. 89–102. Cited by: [Appendix A](https://arxiv.org/html/2605.22843#A1.p1.1).
- F. Li and H. V. Jagadish (2014) Constructing an interactive natural language interface for relational databases. Proceedings of the VLDB Endowment, pp. 73–84. Cited by: [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p1.1).
- H. Li, S. Wu, X. Zhang, X. Huang, J. Zhang, F. Jiang, S. Wang, T. Zhang, J. Chen, R. Shi, H. Chen, and C. Li (2025) OmniSQL: synthesizing high-quality text-to-sql data at scale. External Links: 2503.02240, [Link](https://arxiv.org/abs/2503.02240). Cited by: [Appendix A](https://arxiv.org/html/2605.22843#A1.p2.1), [§1](https://arxiv.org/html/2605.22843#S1.p2.1), [§2.2](https://arxiv.org/html/2605.22843#S2.SS2.p1.1), [§4.4](https://arxiv.org/html/2605.22843#S4.SS4.p2.1).
- H. Li, J. Zhang, C. Li, and H. Chen (2023a) RESDSQL: decoupling schema linking and skeleton parsing for text-to-sql. AAAI-23, pp. 13067–13075. Cited by: [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p2.1).
- H. Li, J. Zhang, H. Liu, J. Fan, X. Zhang, J. Zhu, R. Wei, H. Pan, C. Li, and H. Chen (2024a) Codes: towards building open-source language models for text-to-sql. Proceedings of the ACM on Management of Data 2(3), pp. 1–28. Cited by: [Appendix A](https://arxiv.org/html/2605.22843#A1.p2.1), [§1](https://arxiv.org/html/2605.22843#S1.p2.1), [§2.2](https://arxiv.org/html/2605.22843#S2.SS2.p1.1).
- J. Li, B. Hui, R. Cheng, B. Qin, C. Ma, N. Huo, F. Huang, W. Du, L. Si, and Y. Li (2023b) Graphix-t5: mixing pre-trained transformers with graph-aware layers for text-to-sql parsing. In 37th AAAI Conference on Artificial Intelligence, pp. 13076–13084. Cited by: [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p2.1).
- J. Li, B. Hui, G. Qu, B. Li, J. Yang, B. Li, B. Wang, B. Qin, R. Cao, R. Geng, et al. (2023c) Can LLM already serve as A database interface? A big bench for large-scale database grounded text-to-sqls. CoRR abs/2305.03111. Cited by: [Appendix A](https://arxiv.org/html/2605.22843#A1.p1.1), [Appendix A](https://arxiv.org/html/2605.22843#A1.p3.1), [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p1.1).
- J. Li, B. Hui, G. Qu, J. Yang, B. Li, B. Li, B. Wang, B. Qin, R. Geng, N. Huo, et al. (2024b) Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. Advances in Neural Information Processing Systems 36. Cited by: [§5](https://arxiv.org/html/2605.22843#S5.p2.3).
- A. Liu, X. Hu, L. Wen, and P. S. Yu (2023a) A comprehensive evaluation of chatgpt’s zero-shot text-to-sql capability. CoRR abs/2303.13547. Cited by: [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p3.1).
- H. Liu, Y. Shi, J. Zhang, X. Wang, H. Li, and F. Kong (2023b) Multi-hop relational graph attention network for text-to-sql parsing. In International Joint Conference on Neural Networks, pp. 1–8. Cited by: [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p2.1).
- Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized bert pretraining approach. External Links: 1907.11692, [Link](https://arxiv.org/abs/1907.11692). Cited by: [§5](https://arxiv.org/html/2605.22843#S5.p2.3).
- P. Ma, X. Zhuang, C. Xu, X. Jiang, R. Chen, and J. Guo (2025) SQL-R1: training natural language to SQL reasoning model by reinforcement learning. arXiv preprint arXiv:2504.08600. Cited by: [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p3.1).
- K. Maamari, F. Abubaker, D. Jaroslawicz, and A. Mhedhbi (2024) The death of schema linking? text-to-sql in the age of well-reasoned language models. External Links: 2408.07702, [Link](https://arxiv.org/abs/2408.07702). Cited by: [§7.2](https://arxiv.org/html/2605.22843#S7.SS2.p1.6).
- T. Mahmud, K. M. Azharul Hasan, M. Ahmed, and T. H. C. Chak (2015) A rule based approach for nlp based query processing. In 2015 2nd International Conference on Electrical Information and Communication Technologies (EICT). Cited by: [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p1.1).
- M. Mishra, M. Stallone, G. Zhang, Y. Shen, A. Prasad, A. M. Soria, M. Merler, P. Selvam, S. Surendran, S. Singh, M. Sethi, X. Dang, and et al. (2024) Granite code models: A family of open foundation models for code intelligence. CoRR abs/2405.04324. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2405.04324), 2405.04324. Cited by: [Appendix A](https://arxiv.org/html/2605.22843#A1.p2.1).
- L. Nan, Y. Zhao, W. Zou, N. Ri, J. Tae, E. Zhang, A. Cohan, and D. Radev (2023) Enhancing few-shot text-to-sql capabilities of large language models: A study on prompt design strategies. CoRR abs/2305.12586. Cited by: [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p3.1).
- OpenAI (2023a) GPT-4 technical report. CoRR abs/2303.08774. Cited by: [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p3.1).
- OpenAI (2023b) Introducing chatgpt. Note: [https://openai.com/blog/chatgpt](https://openai.com/blog/chatgpt), last accessed on 2023-07-24. Cited by: [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p3.1).
- OpenAI (2024) Hello gpt-4o. Note: [https://openai.com/index/hello-gpt-4o/](https://openai.com/index/hello-gpt-4o/). Cited by: [Appendix A](https://arxiv.org/html/2605.22843#A1.p2.1).
- O. Popescu, I. Manotas, N. P. A. Vo, H. Yeo, E. Khorashani, and V. Sheinin (2022) Addressing limitations of encoder-decoder based approach to text-to-sql. In Proceedings of the 29th International Conference on Computational Linguistics, pp. 1593–1603. Cited by: [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p2.1).
- M. Pourreza and D. Rafiei (2023) DIN-sql: decomposed in-context learning of text-to-sql with self-correction. External Links: 2304.11015. Cited by: [§1](https://arxiv.org/html/2605.22843#S1.p2.1), [§2.2](https://arxiv.org/html/2605.22843#S2.SS2.p1.1).
- M. Pourreza, R. Sun, H. Li, L. Miculicich, T. Pfister, and S. O. Arik (2024) Sql-gen: bridging the dialect gap for text-to-sql via synthetic data and model merging. arXiv preprint arXiv:2408.12733. Cited by: [Appendix A](https://arxiv.org/html/2605.22843#A1.p2.1), [§2.2](https://arxiv.org/html/2605.22843#S2.SS2.p1.1).
- M. Pourreza, S. Talaei, R. Sun, X. Wan, H. Li, A. Mirhoseini, A. Saberi, S. Arik, et al. (2025) Reasoning-SQL: reinforcement learning with SQL tailored partial rewards for reasoning-enhanced text-to-SQL. arXiv preprint arXiv:2503.23157. Cited by: [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p3.1).
- J. Qi, J. Tang, Z. He, X. Wan, Y. Cheng, C. Zhou, X. Wang, Q. Zhang, and Z. Lin (2022) RASAT: integrating relational structures into pretrained seq2seq model for text-to-sql. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 3215–3229. Cited by: [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p2.1).
- B. Qin, B. Hui, L. Wang, M. Yang, J. Li, B. Li, R. Geng, R. Cao, J. Sun, L. Si, F. Huang, and Y. Li (2022) A survey on text-to-sql parsing: concepts, methods, and future directions. External Links: 2208.13629. Cited by: [§1](https://arxiv.org/html/2605.22843#S1.p1.1).
- S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He (2020) ZeRO: memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–16. External Links: [Document](https://dx.doi.org/10.1109/SC41405.2020.00024). Cited by: [Appendix A](https://arxiv.org/html/2605.22843#A1.p7.9).
- T. Scholak, N. Schucher, and D. Bahdanau (2021) PICARD: parsing incrementally for constrained auto-regressive decoding from language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 9895–9901. Cited by: [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p2.1).
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y.K. Li, Y. Wu, and D. Guo (2024) DeepSeekMath: pushing the limits of mathematical reasoning in open language models. External Links: 2402.03300, [Document](https://dx.doi.org/10.48550/arXiv.2402.03300). Cited by: [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p3.1).
- R. Sun, S. Ö. Arik, H. Nakhost, H. Dai, R. Sinha, P. Yin, and T. Pfister (2023) SQL-palm: improved large language model adaptation for text-to-sql. CoRR abs/2306.00739. Cited by: [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p3.1).
- G. Team, P. Georgiev, V. I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang, et al. (2024) Gemini 1.5: unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530. Cited by: [Appendix A](https://arxiv.org/html/2605.22843#A1.p2.1).
- H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample (2023a) LLaMA: open and efficient foundation language models. CoRR abs/2302.13971. Cited by: [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p3.1).
- H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, S. Koura, M. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. Michael, S. Ranjan, S. Xiaoqing, E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom (2023b) LLAMA2: open foundation and fine-tuned chat models. CoRR. Cited by: [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p3.1).
- Z. Wan and et al. (2023) EHRSQL: a practical text-to-sql benchmark for electronic health records. arXiv preprint arXiv:2301.03462. External Links: [Link](https://arxiv.org/abs/2301.03462). Cited by: [Appendix A](https://arxiv.org/html/2605.22843#A1.p1.1).
- B. Wang, R. Shin, X. Liu, O. Polozov, and M. Richardson (2020) RAT-SQL: relation-aware schema encoding and linking for text-to-sql parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7567–7578. Cited by: [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p2.1).
- B. Wang, W. Yin, X. V. Lin, and C. Xiong (2021) Learning to synthesize data for semantic parsing. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pp. 2760–2766. External Links: [Document](https://dx.doi.org/10.18653/V1/2021.NAACL-MAIN.220). Cited by: [§2.2](https://arxiv.org/html/2605.22843#S2.SS2.p1.1).
- L. Wang, B. Qin, B. Hui, B. Li, M. Yang, B. Wang, B. Li, J. Sun, F. Huang, L. Si, and Y. Li (2022) Proton: probing schema linking information from pre-trained language models for text-to-sql parsing. In The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 1889–1898. Cited by: [§1](https://arxiv.org/html/2605.22843#S1.p2.1), [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p2.1), [§2.2](https://arxiv.org/html/2605.22843#S2.SS2.p1.1).
- N. Weir, P. A. Utama, A. Galakatos, A. Crotty, A. Ilkhechi, S. Ramaswamy, R. Bhushan, N. Geisler, B. Hättasch, S. Eger, U. Çetintemel, and C. Binnig (2020) DBPal: A fully pluggable NL2SQL training pipeline. In Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference \[Portland, OR, USA\], June 14-19, 2020, pp. 2347–2361. Cited by: [§2.2](https://arxiv.org/html/2605.22843#S2.SS2.p1.1).
- K. Wu, L. Wang, Z. Li, A. Zhang, X. Xiao, H. Wu, M. Zhang, and H. Wang (2021) Data augmentation with hierarchical sql-to-question generation for cross-domain text-to-sql parsing. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pp. 8974–8983. External Links: [Document](https://dx.doi.org/10.18653/V1/2021.EMNLP-MAIN.707). Cited by: [§1](https://arxiv.org/html/2605.22843#S1.p2.1), [§2.2](https://arxiv.org/html/2605.22843#S2.SS2.p1.1).
- K. Xu, L. Wu, Z. Wang, Y. Feng, and V. Sheinin (2018) SQL-to-text generation with graph-to-sequence model. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 931–936. Cited by: [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p2.1).
- J. Yang, B. Hui, M. Yang, J. Yang, J. Lin, and C. Zhou (2024) Synthesizing text-to-sql data from weak and strong llms. arXiv preprint arXiv:2408.03256. Cited by: [§1](https://arxiv.org/html/2605.22843#S1.p2.1), [§2.2](https://arxiv.org/html/2605.22843#S2.SS2.p1.1).
- W. Yang, P. Xu, and Y. Cao (2021) Hierarchical neural data synthesis for semantic parsing. CoRR abs/2112.02212. External Links: 2112.02212. Cited by: [§2.2](https://arxiv.org/html/2605.22843#S2.SS2.p1.1).
- P. Yin, G. Neubig, W. Yih, and S. Riedel (2020) TaBERT: pretraining for joint understanding of textual and tabular data. Cornell University - arXiv. Cited by: [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p2.1).
- T. Yu, C. Wu, X. V. Lin, B. Wang, Y. C. Tan, X. Yang, D. R. Radev, R. Socher, and C. Xiong (2021) GraPPa: grammar-augmented pre-training for table semantic parsing. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. Cited by: [§2.2](https://arxiv.org/html/2605.22843#S2.SS2.p1.1).
- T. Yu, M. Yasunaga, K. Yang, R. Zhang, D. Wang, Z. Li, and D. R. Radev (2018a) SyntaxSQLNet: syntax tree networks for complex and cross-domain text-to-sql task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp. 1653–1663. External Links: [Document](https://dx.doi.org/10.18653/V1/D18-1193). Cited by: [§1](https://arxiv.org/html/2605.22843#S1.p2.1), [§2.2](https://arxiv.org/html/2605.22843#S2.SS2.p1.1).
- T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, Z. Zhang, and D. R. Radev (2018b) Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3911–3921. Cited by: [Appendix A](https://arxiv.org/html/2605.22843#A1.p1.1), [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p1.1).
- Y. Zhang, J. Deriu, G. Katsogiannis-Meimarakis, C. Kosten, G. Koutrika, and K. Stockinger (2023) ScienceBenchmark: A complex real-world benchmark for evaluating natural language to SQL systems. Proc. VLDB Endow. 17(4), pp. 685–698. External Links: [Document](https://dx.doi.org/10.14778/3636218.3636225). Cited by: [§2.2](https://arxiv.org/html/2605.22843#S2.SS2.p1.1).
- Y. Zheng, H. Wang, B. Dong, X. Wang, and C. Li (2022) HIE-SQL: history information enhanced network for context-dependent text-to-sql semantic parsing. In Findings of the Association for Computational Linguistics, pp. 2997–3007. Cited by: [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p2.1).
- V. Zhong, C. Xiong, and R. Socher (2017) Seq2SQL: generating structured queries from natural language using reinforcement learning. CoRR abs/1709.00103. Cited by: [§2.1](https://arxiv.org/html/2605.22843#S2.SS1.p1.1).

## Appendix A Experiment Setup

Benchmarks. We use three distinct benchmark sets to assess our proposed method. (1) Standard Benchmarks: We use the BIRD dataset Li et al. ([2023c](https://arxiv.org/html/2605.22843#bib.bib14)) (BIRD-dev split, 1,534 examples) and Spider Yu et al. ([2018b](https://arxiv.org/html/2605.22843#bib.bib26)) (Spider-dev: 1,034; Spider-test: 2,147). These benchmarks evaluate structured query generation over cross-domain databases. (2) Robustness Benchmarks: We employ Spider-DK Deng and et al. ([2020](https://arxiv.org/html/2605.22843#bib.bib127)), Spider-Syn Gan and et al. ([2020](https://arxiv.org/html/2605.22843#bib.bib128)), and Spider-Realistic Li and et al. ([2020](https://arxiv.org/html/2605.22843#bib.bib129)) to assess model robustness. These datasets cover domain-specific reasoning, column name paraphrasing, and realistic query variations, containing 535, 1,034, and 508 queries, respectively. (3) Domain-Specific Benchmarks: We adopt EHRSQL Wan and et al. ([2023](https://arxiv.org/html/2605.22843#bib.bib125)) and ScienceBenchmark Johnson and Lee ([2023](https://arxiv.org/html/2605.22843#bib.bib126)) to evaluate performance in specialized domains. EHRSQL consists of 1,008 clinical queries, while ScienceBenchmark includes 299 queries across disciplines such as policy, astronomy, and oncology.

Baselines. We compare our approach with a diverse set of models and enhancement strategies. For ICL-based baselines, we evaluate Knowledge-Enhanced In-Context Learning (KE-ICL) against prompt-based models such as DAIL-SQL Gao et al. ([2023](https://arxiv.org/html/2605.22843#bib.bib2)) and CodeS Li et al. ([2024a](https://arxiv.org/html/2605.22843#bib.bib97)), optimized for lightweight inference via single-pass prompting. This evaluation covers both commercial and open-source Large Language Models (LLMs): GPT-4o OpenAI ([2024](https://arxiv.org/html/2605.22843#bib.bib131)), Gemini-Pro-1.5 Team et al. ([2024](https://arxiv.org/html/2605.22843#bib.bib130)), Deepseek-Coder-7B-Instruct-v1.5 Guo et al. ([2024](https://arxiv.org/html/2605.22843#bib.bib132)), Qwen2.5-Coder-7B-Instruct Hui et al. ([2024](https://arxiv.org/html/2605.22843#bib.bib133)), and Granite-3.1-8B-Instruct Mishra et al. ([2024](https://arxiv.org/html/2605.22843#bib.bib134)). For RL-based baselines, we compare Knowledge-Enhanced Reinforcement Learning (KE-RL) with SQL-GEN Pourreza et al. ([2024](https://arxiv.org/html/2605.22843#bib.bib103)) and OmniSQL Li et al. ([2025](https://arxiv.org/html/2605.22843#bib.bib123)), two leading data synthesis frameworks for text-to-SQL learning. This comparison, as well as fine-tuning, is conducted exclusively on the open-source LLMs: Deepseek-Coder-7B-Instruct-v1.5, Qwen2.5-Coder-7B-Instruct, and Granite-3.1-8B-Instruct. All RL methods use the CodeS prompting strategy during training and inference.

Metrics. We use execution accuracy (EX) Li et al. ([2023c](https://arxiv.org/html/2605.22843#bib.bib14)), which measures the correctness of query results and enables fair, database-agnostic comparisons.
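As a rough illustration of how execution accuracy can be computed, the sketch below runs predicted and gold queries against the same SQLite database and counts matching results. It is a minimal sketch, not the official BIRD evaluator: the helper names, the multiset comparison of rows, and the toy `circuits` row (including the value `marina_bay`) are assumptions made for the example.

```python
import sqlite3
from collections import Counter

def run_query(conn, sql):
    """Return the result rows as a multiset, or None if the query fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None

def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) SQL pairs whose execution results match."""
    hits = 0
    for predicted_sql, gold_sql in pairs:
        gold = run_query(conn, gold_sql)
        predicted = run_query(conn, predicted_sql)
        if gold is not None and predicted == gold:
            hits += 1
    return hits / len(pairs)

# Toy database; the row below is illustrative, not taken from BIRD.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE circuits (name TEXT, circuitRef TEXT);
    INSERT INTO circuits VALUES ('Marina Bay Street Circuit', 'marina_bay');
""")

gold_sql = "SELECT circuitRef FROM circuits WHERE name = 'Marina Bay Street Circuit'"
pairs = [
    ("SELECT circuitRef FROM circuits WHERE name LIKE 'Marina%'", gold_sql),  # same result
    ("SELECT name FROM circuits", gold_sql),                                  # wrong column
    ("SELECT missing FROM circuits", gold_sql),                               # not executable
]
ex = execution_accuracy(conn, pairs)
print(f"EX = {ex:.3f}")
```

Because only results are compared, a prediction that differs textually from the gold query (the first pair) still counts as correct, which is what makes the metric independent of SQL surface form.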

Implementation Details. All experiments are conducted on 8 NVIDIA A100 80GB GPUs.

Domain Terminology Construction. Columns from the database are first encoded into semantic embeddings using RoBERTa. The columns are then clustered into groups based on semantic similarity, with the number of clusters $M$ chosen to maximize the silhouette coefficient. Candidate terms are generated by sampling one column from each of two clusters and combining them with a random operator. Each candidate term is validated by a large language model chosen at random from GPT-4.0 (GPT-4-turbo), Claude 3.5, Deepseek v2, and Qwen 1.5. The top $K=150$ terms are selected based on confidence scores, and the process continues until $N_{\text{target}}=300$ valid terms are generated.

Training Data Synthesis. For standard benchmarks (Spider and BIRD) and their robustness variants, we fine-tune on the BIRD training set augmented via KE-RL, with a synthetic data ratio $\rho=0.6$ and strong negative sampling bias ($\alpha=-10$) to encourage structurally diverse SQL templates. This yields $\sim$14,706 new question–SQL pairs, combined with 9,821 original examples for a total of $\sim$24,527. For domain-specific benchmarks (EHRSQL, ScienceBenchmark), we preprocess 5,000 examples (EHRSQL) and augment ScienceBenchmark from 300 to 5,000, then generate 7,500 synthetic examples each ($\rho=0.6$, $\alpha=10$), resulting in 12,500 examples per dataset. Baselines use similar-scale synthetic datasets (SQL-GEN and OmniSQL: 13,000 for BIRD, 7,500 for EHRSQL/ScienceBenchmark).
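The dataset sizes above are consistent with reading $\rho$ as the share of synthetic examples in the final training set. That reading is an assumption (the text does not define $\rho$ formally), and the helper names below are hypothetical; the sketch only checks the arithmetic.

```python
def synthetic_count(n_original: int, rho: float) -> int:
    """Synthetic examples needed so that they form a fraction rho of the total."""
    if not 0 <= rho < 1:
        raise ValueError("rho must be in [0, 1)")
    return round(n_original * rho / (1 - rho))

def synthetic_share(n_original: int, n_synthetic: int) -> float:
    """Fraction of the combined dataset that is synthetic."""
    return n_synthetic / (n_original + n_synthetic)

# Domain-specific setting: 5,000 originals with rho = 0.6.
domain_synthetic = synthetic_count(5000, 0.6)

# BIRD setting: the reported counts give a share close to 0.6.
bird_share = synthetic_share(9821, 14706)

print(domain_synthetic, round(bird_share, 4))
```

Under this reading the domain-specific figure of 7,500 follows exactly, while the BIRD figure of roughly 14,706 corresponds to a share slightly below 0.6, in line with the approximate counts the text reports.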

Reinforcement Learning. We fine-tune using GRPO in a prompt-completion format. Prompts use knowledge injection with top-5 tables ($k_1=5$), top-6 columns per table ($k_2=6$), and top-3 relevant QA examples ($k_3=3$), truncating `${DATABASE_SCHEMA}` to fit a 2,048-token context. GRPO settings: temperature 0.8, total batch 256 (16 rollouts), update batch 128, KL penalty $\beta=0.001$, clip ratio $\epsilon=0.2$. LoRA (rank $r=8$) is applied to the $q_{\text{proj}}$, $k_{\text{proj}}$, and $v_{\text{proj}}$ layers, with pretrained weights frozen. Training uses per-device batch size 1, gradient accumulation 2, bf16 precision, under DeepSpeed ZeRO-3 Rajbhandari et al. ([2020](https://arxiv.org/html/2605.22843#bib.bib142)).
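To make the GRPO settings concrete, the sketch below shows the two pieces they parameterize: advantages normalized within a group of rollouts for one prompt, and the clipped surrogate term controlled by $\epsilon$. This is a minimal sketch of the general GRPO recipe, not the training code used here; the binary reward (1 for a correct execution result, 0 otherwise) is an assumption, and the KL penalty term weighted by $\beta$ is omitted.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's rollout group (zero mean, unit scale)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_objective(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate for a single token or sample."""
    clipped_ratio = min(max(ratio, 1 - clip_eps), 1 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# One prompt with 16 rollouts, 4 of which produced a correct SQL result (assumed reward).
rewards = [1.0] * 4 + [0.0] * 12
advantages = group_relative_advantages(rewards)

print(round(advantages[0], 3), round(advantages[-1], 3))
print(clipped_objective(ratio=1.5, advantage=1.0))
```

Correct rollouts receive a positive advantage and incorrect ones a negative advantage, without a learned value function; the clip keeps any single update from moving the policy ratio further than $1 \pm \epsilon$ in the direction that improves the objective.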

Inference. The same knowledge-injection strategy retrieves top-5 tables ($k_1=5$), top-12 columns per table ($k_2=12$), and top-5 QA examples ($k_3=5$), within 4,096 tokens. Self-consistent decoding generates 8 SQL candidates per question via nucleus sampling (top-$p$ with $p=0.95$). Executable candidates are evaluated against the target database, selecting the query returning the most results, with ties broken by shortest execution time.
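The selection step can be sketched as follows. The sketch reads "the most results" as "the execution result shared by the most candidates", which is the usual self-consistency vote; this is an interpretation rather than something the text states, and the function name and toy `schools` table are assumptions for the example.

```python
import sqlite3
import time
from collections import defaultdict

def select_by_self_consistency(conn, candidates):
    """Vote over execution results; break ties by the fastest query."""
    groups = defaultdict(list)  # result signature -> [(elapsed seconds, sql), ...]
    for sql in candidates:
        start = time.perf_counter()
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # non-executable candidates are dropped
        elapsed = time.perf_counter() - start
        signature = tuple(sorted(map(repr, rows)))  # order-insensitive result key
        groups[signature].append((elapsed, sql))
    if not groups:
        return None
    # Prefer the result with the most votes, then the one reached fastest.
    best = max(groups.values(), key=lambda g: (len(g), -min(e for e, _ in g)))
    return min(best)[1]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE schools (Virtual TEXT, Website TEXT);
    INSERT INTO schools VALUES ('P', 'a.example'), ('F', 'b.example'), ('P', 'c.example');
""")
candidates = [
    "SELECT Website FROM schools WHERE Virtual = 'P'",
    "SELECT Website FROM schools WHERE Virtual = 'P' ORDER BY Website DESC",
    "SELECT Website FROM schools WHERE Virtual = 'partial'",  # runs, but returns nothing
    "SELECT Site FROM schools",                               # fails: no such column
]
chosen = select_by_self_consistency(conn, candidates)
print(chosen)
```

Under the literal alternative reading (prefer the candidate that returns the most rows), only the `key` passed to `max` would change.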

## Appendix B Lack of Knowledge in Wrong Cases

Table 4: Illustrative examples of SQL correction by GPT-4o through the injection of three types of knowledge.

| Example | Question | Incorrect SQL generated without external knowledge | Injected knowledge | Corrected SQL generated with knowledge injection |
| --- | --- | --- | --- | --- |
| 1. Schema Knowledge: Column Description | What's the reference name of Marina Bay Street Circuit? (BIRD-SQL) | `SELECT name FROM circuits ...` | "reference name" → `circuits.circuitRef` | `SELECT circuitRef FROM circuits ...` |
| 2. Schema Knowledge: Abbreviations | What are the websites for all the partially virtual chartered schools in San Joaquin? (BIRD-SQL) | `... WHERE virtual = 'partial'` | "partially virtual" → `Virtual = 'P'` | `... WHERE Virtual = 'P'` |
| 3. Domain Knowledge: Specialized Terms | What is the complete address of the school with the lowest excellence rate? (BIRD-SQL) | `SELECT address ... ORDER BY excellence_rate ASC` | "complete address" → `CONCAT(...)`; "excellence rate" → `NumGE1500 / NumTstTakr` | `SELECT CONCAT(...) ... ORDER BY (NumGE1500 / NumTstTakr) ASC` |
| 4. Representative Queries: Relevant Query | What's the cost to get glucocorticoids - methylprednisolone? (EHRSQL) | `SELECT cost FROM treatment ...` | Q: how much does it cost for a hemothorax diagnosis? A: `select distinct cost.cost from cost where cost.eventtype = 'diagnosis' and cost.eventid in ...` | `SELECT cost FROM cost WHERE eventid IN (SELECT treatmentid FROM treatment ...)` |

Table 5: Schema Linking Recall (%) results by Qwen2.5-Coder-7B with three schema linking approaches: the LLM method directly uses Qwen2.5-Coder-7B-Instruct to identify required tables and columns without any schema-specific preprocessing, as in Chen et al. ([2024](https://arxiv.org/html/2605.22843#bib.bib1)); Step 1 performs schema classifier training and relevant schema prediction; and Step 2 further enhances Step 1 with term expansion via value-aware retrieval, as detailed in Section [5](https://arxiv.org/html/2605.22843#S5). Column groups: Standard (Spider-dev, Spider-test, BIRD (dev)), Robustness (Spider-DK, Spider-Syn, Spider-Realistic), and Domain-Specific (EHRSQL, ScienceBenchmark).

| Method | Retained columns per table ($k_2$) | Spider-dev | Spider-test | BIRD (dev) | Spider-DK | Spider-Syn | Spider-Realistic | EHRSQL | ScienceBenchmark |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLM | - | 90.61 | 91.29 | 85.79 | 89.53 | 88.20 | 87.20 | 96.03 | 90.63 |
| Step 1 Only | 4 | 97.87 | 98.37 | 93.87 | 98.13 | 95.84 | 94.68 | 92.64 | 98.40 |
| Step 1 Only | 8 | 99.41 | 99.61 | 98.04 | 100.0 | 99.41 | 99.21 | 97.99 | 99.30 |
| Step 1 Only | 12 | 99.61 | 99.76 | 98.63 | 100.0 | 99.61 | 99.61 | 98.66 | 99.70 |
| Step 1 Only | 16 | 99.61 | 99.76 | 99.08 | 100.0 | 99.61 | 99.61 | 98.66 | 99.70 |
| Steps 1 & 2 | 4 | 97.87 | 98.45 | 95.76 | 98.13 | 95.93 | 94.88 | 93.31 | 99.10 |
| Steps 1 & 2 | 8 | 99.41 | 99.67 | 98.95 | 100.0 | 99.41 | 99.21 | 98.32 | 99.43 |
| Steps 1 & 2 | 12 | 99.61 | 99.80 | 99.34 | 100.0 | 99.61 | 99.61 | 99.00 | 99.70 |
| Steps 1 & 2 | 16 | 99.61 | 99.80 | 99.61 | 100.0 | 99.61 | 99.61 | 99.00 | 99.70 |
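The trade-off behind the $k_2$ column of Table 5 can be illustrated with a small sketch: keeping more columns per table raises the chance that every column the gold query needs survives pruning. The recall definition below (column-level, per question) is one plausible reading of the metric, and the relevance scores and helper names are invented for the example.

```python
def keep_top_k(scored_columns, k2):
    """Keep the k2 highest-scoring columns of each table.

    scored_columns maps table name -> list of (column name, relevance score).
    """
    kept = set()
    for table, columns in scored_columns.items():
        top = sorted(columns, key=lambda c: c[1], reverse=True)[:k2]
        kept.update((table, name) for name, _ in top)
    return kept

def schema_linking_recall(gold_columns, retained_columns):
    """Share of gold (table, column) pairs that survive schema pruning."""
    gold = set(gold_columns)
    if not gold:
        return 1.0
    return len(gold & set(retained_columns)) / len(gold)

# Hypothetical classifier scores for the "excellence rate" question in Table 4.
scores = {
    "satscores": [("NumGE1500", 0.9), ("cname", 0.7), ("dname", 0.6),
                  ("sname", 0.5), ("NumTstTakr", 0.4)],
}
gold = {("satscores", "NumGE1500"), ("satscores", "NumTstTakr")}

recall_small = schema_linking_recall(gold, keep_top_k(scores, k2=2))
recall_large = schema_linking_recall(gold, keep_top_k(scores, k2=5))
print(recall_small, recall_large)
```

In this toy case a low-scoring but necessary column is lost at $k_2=2$ and recovered at $k_2=5$, mirroring the pattern in Table 5 where recall rises with $k_2$ and then saturates.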

## Appendix C Domain Terminology Construction Algorithm

Algorithm 1 Domain Terminology Construction with Term Combination

1. Input: Column set $C$ from a database, number of clusters $M$, top-$K$, target number of terms $N_{\text{target}}$
2. Output: Top $K$ validated terms with explanations
3. Compute embeddings $\mathbf{e}_c = \text{Encoder}(c)$ for each $c \in C$
4. Cluster columns into semantic groups $\mathcal{K} = \{K_1, \dots, K_M\}$
5. Initialize candidate term set $\mathcal{T} = \{(K_i, \text{null}) \mid K_i \in \mathcal{K}\}$
6. Initialize counter $n_{\text{valid}} = 0$
7. while $n_{\text{valid}} < N_{\text{target}}$ do
8. for each pair of terms $(t_i, t_j) \in \mathcal{T}, i \neq j$ do
9. Sample an operator (e.g., $+, -, *, /$) as $op$
10. Form combined candidate term $t = \text{Combine}(t_i, op, t_j)$
11. Validate $t$ using LLM: $(y_t, s_t, e_t) = \text{LLM\_Review}(t)$
12. if $y_t$ is valid then
13. Add $(t, e_t)$ to $\mathcal{T}$
14. Increment $n_{\text{valid}} = n_{\text{valid}} + 1$
15. end if
16. end for
17. Remove any candidate terms with fewer than 2 columns
18. end while
19. Rank $\mathcal{T}$ by confidence $s_t$ and return the top $K$ terms with explanations
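A simplified Python sketch of the combine-and-validate loop is given below. It follows the prose description (one column sampled from each of two clusters, joined by a random operator) rather than reproducing Algorithm 1 line by line: clustering is assumed to have been done already, `toy_review` is a stand-in for the LLM reviewer, and `max_rounds` is an added safeguard against a loop that never reaches the target. All names are hypothetical.

```python
import itertools
import random

OPERATORS = ["+", "-", "*", "/"]

def build_terms(clusters, review, n_target, top_k, seed=0, max_rounds=50):
    """clusters: list of column-name lists.
    review(expr) -> (is_valid, confidence, explanation)."""
    rng = random.Random(seed)
    accepted = {}  # expression -> (confidence, explanation)
    for _ in range(max_rounds):  # bounded stand-in for the while-loop
        if len(accepted) >= n_target:
            break
        for group_a, group_b in itertools.combinations(clusters, 2):
            expr = f"{rng.choice(group_a)} {rng.choice(OPERATORS)} {rng.choice(group_b)}"
            if expr in accepted:
                continue
            is_valid, confidence, explanation = review(expr)
            if is_valid:
                accepted[expr] = (confidence, explanation)
    # Rank validated terms by reviewer confidence and keep the top K.
    ranked = sorted(accepted.items(), key=lambda item: item[1][0], reverse=True)
    return [(expr, explanation) for expr, (_, explanation) in ranked[:top_k]]

def toy_review(expr):
    """Stand-in for the LLM reviewer: only ratios and products count as meaningful."""
    if " / " in expr:
        return True, 0.9, f"ratio: {expr}"
    if " * " in expr:
        return True, 0.6, f"product: {expr}"
    return False, 0.0, ""

# Pre-clustered columns (illustrative grouping).
clusters = [["NumGE1500", "NumTstTakr"], ["enroll12"], ["AvgScrMath", "AvgScrRead"]]
terms = build_terms(clusters, toy_review, n_target=2, top_k=2)
for expression, explanation in terms:
    print(expression, "|", explanation)
```

In a real pipeline `review` would call one of the validating LLMs and return its confidence score and explanation; keeping it as a plain callable makes the loop easy to test without model access.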

## Appendix D Prompt for Inference

Listing 1: Example of SQL GENERATION COT PROMPT

You are tasked with generating a SQL query according to a input user request.

Note that

1. If the column name contains special characters such as spaces, please use \` to enclose it.

2. Exactly select the columns that the user wants to select, and do not select other unnesssary columns.

3. Once you need to subquery, please use CTE that starts with the WITH keyword to wrap the subquery and give it a name.

4. The final Answer Query \*\*must\*\* be wrapped in Markdown format using triple backticks and the \`sql\` tag.

5. You must reason step by step using a compositional approach. Your reasoning process should follow a \*\*minimal set of steps\*\* selected from a predefined library of 10 reasoning components (listed below).

\#\#\# Reasoning Components (Choose From):

1. Intent Recognition

2. Disambiguation

3. Temporal Reasoning

4. Keyword Mapping

5. Constraint Extraction

6. Aggregation & Grouping Reasoning

7. Ordering & Limiting

8. Alias & Expression Handling

9. Join Reasoning

10. Nested/Subquery Reasoning

\#\#\# DATABASE SCHEMA

$\{DATABASE\_SCHEMA\}

\#\#\# DOMAIN KNOWLEDGE

$\{DOMAIN\_KG\}

\#\#\# RELEVANT QA PAIRS

$\{QA\_PAIRS\}

\#\#\# QUESTION

$\{USER\_QUESTION\}

Please think step by step:
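The four placeholders in Listing 1 use the same `${NAME}` syntax as Python's `string.Template`, so the final part of the prompt can be filled in as sketched below. This is a minimal sketch: the schema text format, the `term -> meaning` rendering of domain knowledge, and the `build_prompt` helper are assumptions, and only the tail of the prompt is reproduced.

```python
from string import Template

PROMPT_TAIL = Template(
    "### DATABASE SCHEMA\n${DATABASE_SCHEMA}\n"
    "### DOMAIN KNOWLEDGE\n${DOMAIN_KG}\n"
    "### RELEVANT QA PAIRS\n${QA_PAIRS}\n"
    "### QUESTION\n${USER_QUESTION}\n"
    "Please think step by step:"
)

def build_prompt(schema_lines, knowledge, qa_pairs, question, k3=5):
    """Fill the placeholders, keeping at most k3 retrieved QA examples."""
    qa_text = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs[:k3])
    kg_text = "\n".join(f"{term} -> {meaning}" for term, meaning in knowledge.items())
    return PROMPT_TAIL.substitute(
        DATABASE_SCHEMA="\n".join(schema_lines),
        DOMAIN_KG=kg_text,
        QA_PAIRS=qa_text,
        USER_QUESTION=question,
    )

prompt = build_prompt(
    schema_lines=["circuits(circuitId, circuitRef, name, location)"],  # assumed format
    knowledge={"reference name": "circuits.circuitRef"},
    qa_pairs=[("Which circuits are in Spain?",
               "SELECT name FROM circuits WHERE country = 'Spain'")],
    question="What's the reference name of Marina Bay Street Circuit?",
)
print(prompt)
```

`substitute` raises `KeyError` if a placeholder is left unfilled, which is a useful guard against silently sending a prompt with a missing knowledge section.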

## Appendix E Cost Analysis

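Table 6: Token and Time Costs vs. Column Count for Data Generation Methods.

| Category | # Avg. Columns | Token Cost (1k): SQL-Gen | Token Cost (1k): OmniSQL | Token Cost (1k): Ours | Time Cost (s): SQL-Gen | Time Cost (s): OmniSQL | Time Cost (s): Ours |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Few Columns | 15.7 | 581.4 | 3428.3 | 1283.2 | 319.4 | 1182.3 | 714.7 |
| Medium Columns | 33.8 | 1324.1 | 7421.6 | 2168.6 | 397.5 | 1524.6 | 1163.1 |
| Many Columns | 101.5 | 3823.7 | 19723.8 | 5676.3 | 531.9 | 1914.5 | 2365.4 |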

In this section, we analyze the token consumption and time costs of our knowledge-aware Text-to-SQL framework, which is essential for evaluating its scalability and efficiency given its reliance on Large Language Models (LLMs). We conducted an experiment on the BIRD Benchmark training set, using the Gemini 1.5 Pro model for data generation. The databases were divided into three groups based on their number of columns, using quartiles to split the dataset into three equal parts: Few columns ($<24$ columns), Medium columns ($24 \leq \text{columns} \leq 48$), and Many columns ($>48$ columns). For each group, we randomly selected five databases and synthesized 300 samples per database.

Table [6](https://arxiv.org/html/2605.22843#A5.T6) compares token and time costs for SQL-Gen, OmniSQL, and our method (Ours) on databases with varying column counts in the BIRD Benchmark. SQL-Gen is the most efficient in both token and time costs, but its data quality is lower than that of OmniSQL and our method, leading to poorer model performance. OmniSQL incurs the highest token costs, along with high time costs, due to its self-consistency technique, but generates better-quality data. Our method has higher time costs on larger databases, exceeding OmniSQL's in the Many Columns group (2,365.4 s vs. 1,914.5 s), due to the knowledge synthesis step, which handles Schema Knowledge. Despite higher costs, both OmniSQL and our method produce better results than SQL-Gen.
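The relative costs discussed above can be read directly off Table 6. The short sketch below recomputes them; the numbers are copied from the table, while the dictionary layout and the `relative_cost` helper are only illustrative.

```python
# Values copied from Table 6: (token cost in thousands, time cost in seconds).
TABLE_6 = {
    "Few Columns": {
        "SQL-Gen": (581.4, 319.4), "OmniSQL": (3428.3, 1182.3), "Ours": (1283.2, 714.7)},
    "Medium Columns": {
        "SQL-Gen": (1324.1, 397.5), "OmniSQL": (7421.6, 1524.6), "Ours": (2168.6, 1163.1)},
    "Many Columns": {
        "SQL-Gen": (3823.7, 531.9), "OmniSQL": (19723.8, 1914.5), "Ours": (5676.3, 2365.4)},
}

def relative_cost(group, method, baseline):
    """Token and time cost of `method` as a multiple of `baseline` for one group."""
    tokens, seconds = TABLE_6[group][method]
    base_tokens, base_seconds = TABLE_6[group][baseline]
    return tokens / base_tokens, seconds / base_seconds

for group in TABLE_6:
    token_ratio, time_ratio = relative_cost(group, "Ours", "OmniSQL")
    print(f"{group}: tokens x{token_ratio:.2f}, time x{time_ratio:.2f} vs. OmniSQL")

few_tokens, few_time = relative_cost("Few Columns", "Ours", "OmniSQL")
many_tokens, many_time = relative_cost("Many Columns", "Ours", "OmniSQL")
```

The ratios show our method using well under half of OmniSQL's tokens in every group, while its time cost moves from below OmniSQL's on small databases to above it on the largest ones.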

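Table 7: Token and Time Consumption for Different Phases in the Knowledge-Aware Data Synthesis on the BIRD Benchmark using the Gemini 1.5 Pro model.

| Phase | Task Name | Input Tokens | Output Tokens | Time (s) |
| --- | --- | --- | --- | --- |
| Knowledge Construction | Schema Knowledge | 413.2 | 44.9 | 709.7 |
| Knowledge Construction | Domain Knowledge | 283.8 | 65.8 | 370.6 |
| Knowledge Construction | Knowledge Checking | 73.5 | 12.6 | 76.5 |
| Knowledge Construction | Total | 770.5 | 123.3 | 1,158.8 |
| Data Synthesis | Question-SQL Gen | 296.3 | 97.5 | 195.9 |
| Data Synthesis | Data Correction Checking | 368.2 | 17.5 | 76.5 |
| Data Synthesis | Data Augment | 1168.3 | 209.1 | 650.5 |
| Data Synthesis | Total | 1824.8 | 324.1 | 922.8 |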

Table [7](https://arxiv.org/html/2605.22843#A5.T7) summarizes the token and time consumption across the Knowledge Construction and Data Synthesis phases of the knowledge-aware pipeline. The Knowledge Construction phase incurs relatively low token costs, with the Schema Knowledge and Domain Knowledge tasks being its most resource-intensive steps. In contrast, the Data Synthesis phase, particularly the Data Augmentation step, exhibits the highest token usage (1,168.3 input tokens) together with a substantial time cost (650.5 seconds), making it the main bottleneck in token consumption. These results suggest that Knowledge Construction could be further optimized by leveraging pre-existing contextual knowledge to reduce overhead, while improvements in Data Synthesis, especially the augmentation process, would yield the greatest efficiency gains. Overall, knowledge construction is relatively lightweight in token terms, whereas data augmentation remains the primary challenge for scalability and computational efficiency in knowledge-aware frameworks.
