Residual Skill Optimization for Text-to-SQL Ensembles

arXiv cs.CL 05/22/26, 04:00 AM Papers
text-to-sql ensembles residual-skill-optimization agentic-systems llm database natural-language-query
Summary
DivSkill-SQL is a residual skill optimization framework that builds complementary agentic Text-to-SQL ensembles without model fine-tuning, improving selected accuracy by up to +11.1 points on Spider2-Lite by targeting examples that current ensembles fail on.
arXiv:2605.21792v1 Announce Type: new Abstract: Text-to-SQL ensembles improve over single-candidate generation by drawing multiple SQL candidates and selecting one, but their effectiveness is bounded by Pass@K, the probability that at least one of K candidates is correct. Existing methods source diversity heuristically through stochastic decoding or prompt variants, leaving candidate sets dominated by correlated failures. We present DivSkill-SQL, a residual skill optimization framework that builds complementary agentic Text-to-SQL ensembles without model fine-tuning: each new skill is optimized on examples the current skill ensemble fails on, provably targeting its marginal contribution to Pass@K. On Spider2-Lite, DivSkill-SQL improves selected accuracy by up to +11.1 points on Snowflake and +8.3 on BigQuery over the strongest ensemble baseline, with consistent gains across two base models (Opus-4.6 and GPT-5.4). Skills optimized on a single dialect transfer without retraining across dialects (Snowflake, BigQuery, SQLite) and to a different task formulation, such as BIRD-Critic (+2.6 pts). Error diagnostics show up to 3x fewer hallucinated schema references and function calls, indicating that gains come from genuinely reliable complementary skills rather than surface-form variation.
# Residual Skill Optimization for Text-to-SQL Ensembles
Source: [https://arxiv.org/html/2605.21792](https://arxiv.org/html/2605.21792)
Jiongli Zhu¹\*†, Haoquan Guan¹\*†, Parjanya Prajakta Prashant¹\*†, Nikki Lijing Kuang², Seyedeh Baharan Khatami¹†, Canwen Xu², Xiaodong Yu², Yingyu Lin¹, Zhewei Yao², Yuxiong He²‡, Babak Salimi¹†‡

¹University of California, San Diego; ²Snowflake AI Research

###### Abstract

Text-to-SQL ensembles improve over single-candidate generation by drawing multiple SQL candidates and selecting one, but their effectiveness is bounded by Pass@K, the probability that at least one of $K$ candidates is correct. Existing methods source diversity heuristically through stochastic decoding or prompt variants, leaving candidate sets dominated by correlated failures. We present DivSkill-SQL, a residual skill optimization framework that builds complementary agentic Text-to-SQL ensembles without model fine-tuning: each new skill is optimized on examples the current skill ensemble fails on, provably targeting its marginal contribution to Pass@K. On Spider2-Lite, DivSkill-SQL improves selected accuracy by up to $+11.1$ points on Snowflake and $+8.3$ on BigQuery over the strongest ensemble baseline, with consistent gains across two base models (Opus-4.6 and GPT-5.4). Skills optimized on a single dialect transfer without retraining across dialects (Snowflake, BigQuery, SQLite) and to a different task formulation, such as BIRD-Critic ($+2.6$ pts). Error diagnostics show up to $3\times$ fewer hallucinated schema references and function calls, indicating that gains come from genuinely reliable complementary skills rather than surface-form variation.

\* Equal contribution. † Work done while working at Snowflake AI Research. ‡ Co-senior authors.

## 1 Introduction

Text\-to\-SQL translates natural language questions into executable SQL queries, making relational databases accessible without SQL expertise\. While large language models \(LLMs\) achieve strong results on standard benchmarks\[[44](https://arxiv.org/html/2605.21792#bib.bib13),[31](https://arxiv.org/html/2605.21792#bib.bib22)\], real\-world queries, which involve large schemas, dialect\-specific syntax, and multi\-step query logic, remain difficult\. This has motivated agentic Text\-to\-SQL systems that inspect schemas, execute intermediate queries, observe feedback, and iteratively repair errors\[[35](https://arxiv.org/html/2605.21792#bib.bib24),[9](https://arxiv.org/html/2605.21792#bib.bib23)\]\.

Yet even with multi-step interaction, no single agentic execution reliably solves every query, which has motivated ensembling as a complementary axis of robustness \[[29](https://arxiv.org/html/2605.21792#bib.bib27),[41](https://arxiv.org/html/2605.21792#bib.bib26),[22](https://arxiv.org/html/2605.21792#bib.bib25),[9](https://arxiv.org/html/2605.21792#bib.bib23)\]: rather than committing to a single SQL generation, ensembles generate multiple SQL candidates and select a final answer. The potential of an ensemble is bounded above by Pass@K, the probability that at least one of $K$ candidates is correct \[[5](https://arxiv.org/html/2605.21792#bib.bib31)\]: no selector, however good, can recover an answer that was never generated. Pass@K therefore captures what the generation stage is responsible for, separately from selection.

Existing Text\-to\-SQL ensembles produce candidate diversity by combining stochastic decoding, hand\-designed prompt or workflow variants, and in some cases multiple fine\-tuned generators\[[29](https://arxiv.org/html/2605.21792#bib.bib27),[22](https://arxiv.org/html/2605.21792#bib.bib25)\]\. Whether the resulting candidates recover different failure cases, however, is left to chance: nothing in the procedure pushes new candidates to solve examples that earlier ones miss\. In agentic systems this is especially fragile: randomness in early planning propagates through long reasoning trajectories, so high\-temperature sampling produces noisier variants of the same path rather than complementary solutions, and introduces unstable reasoning, spurious joins, and dialect errors that degrade the candidate pool\[[32](https://arxiv.org/html/2605.21792#bib.bib21)\]\. Pass@K accordingly plateaus, with little gain from additional candidates\.

To address this, we introduce DivSkill-SQL, a residual skill optimization framework that makes candidate complementarity an explicit optimization target instead of leaving it to heuristics or chance. A *skill* is a high-level instruction file that controls the agent's decomposition style, schema-exploration policy, drafting strategy, and repair logic; DivSkill-SQL operates entirely at this prompt level, with no model fine-tuning. Starting from a base skill, DivSkill-SQL evaluates the agent on the training set, identifies the unresolved examples, and uses reflective prompt optimization \[[1](https://arxiv.org/html/2605.21792#bib.bib42)\] to refine a new skill specifically on this residual; it repeats round by round until a pool of $K$ complementary skills has been learned. A new skill need not be globally better than its predecessors; it is useful precisely when it recovers examples they miss. At inference, DivSkill-SQL runs each learned skill and selects a final SQL from the resulting candidate set via pairwise comparison \[[29](https://arxiv.org/html/2605.21792#bib.bib27)\]. Each skill is trained to cover what the others miss, so the ensemble is complementary by construction rather than by chance.

![Refer to caption](https://arxiv.org/html/2605.21792v1/x1.png)

Figure 1: System diagram of DivSkill-SQL. The left panel shows skill construction: starting from diverse strategy prompts, the system repeatedly identifies unsolved questions and refines the next skill toward those remaining cases. The right panel shows test-time execution: multiple skill-guided agents solve the same Text-to-SQL problem through different interaction patterns, producing SQL candidates that are then compared to choose the final output.

We evaluate DivSkill-SQL on recent Text-to-SQL benchmarks across four dialects: Spider2-Lite over SQLite, Snowflake, and BigQuery, and BIRD-Critic over PostgreSQL. DivSkill-SQL improves selected accuracy on Spider2-Lite by up to $+11.1$ points on Snowflake and $+8.3$ on BigQuery over the strongest ensemble baseline, with consistent gains across two base models (Opus-4.6 and GPT-5.4). Skills optimized on standard Text-to-SQL in a single dialect (Snowflake) transfer without retraining to BigQuery and SQLite. Moreover, skills optimized for standard Text-to-SQL generalize to the debugging-style generation setting of BIRD-Critic, where DivSkill-SQL improves accuracy by $+2.6$. Error diagnostics show up to $3\times$ fewer hallucinated schema references and unsupported-function calls, and tool-use trajectory analysis shows that DivSkill-SQL reduces redundancy in agent behavior by $19\%$–$28\%$, producing more diverse schema-inspection, decomposition, drafting, execution, and repair patterns than repeated runs of the same agent.

Our contributions are as follows:

- We formulate Text-to-SQL ensembling as residual Pass@K optimization over agent skills, where the goal is to learn complementary behaviors that cover different failure modes.
- We propose DivSkill-SQL, a residual skill optimization framework that improves candidate-set coverage without model fine-tuning or high-temperature sampling.
- We show that optimizing each new skill on unresolved examples directly targets its marginal contribution to Pass@K.
- We evaluate DivSkill-SQL across recent Text-to-SQL benchmarks and four SQL dialects, showing improved accuracy and reduced redundancy in agent trajectories.

## 2 Related Works

#### Pass@K Optimization

Pass@K measures whether a model produces at least one correct solution among $K$ samples \[[5](https://arxiv.org/html/2605.21792#bib.bib31)\], and has recently been adopted as an optimization target. Yue et al. \[[45](https://arxiv.org/html/2605.21792#bib.bib35)\] show that RL with verifiable rewards improves Pass@1 but not Pass@K: training reduces output diversity and leaves hard problems outside the base model's support unsolved. This motivates directly optimizing for diversity, which several works pursue through policy optimization \[[6](https://arxiv.org/html/2605.21792#bib.bib33),[36](https://arxiv.org/html/2605.21792#bib.bib32),[43](https://arxiv.org/html/2605.21792#bib.bib36)\]. However, these methods all require parameter updates, limiting them to open-weight models and to improving a single model's output distribution. Our method instead learns $K$ complementary skills without parameter updates, improving Pass@K through ensemble construction for both open and closed models.

#### Skill and Prompt Optimization

Prompts strongly influence LLM behavior \[[3](https://arxiv.org/html/2605.21792#bib.bib38)\], motivating automatic optimization methods that improve prompts from task feedback rather than manual design. Early work optimizes discrete trigger tokens or textual prompts \[[33](https://arxiv.org/html/2605.21792#bib.bib39)\], while recent methods use LLMs to propose, evaluate, and revise prompts iteratively \[[52](https://arxiv.org/html/2605.21792#bib.bib40),[40](https://arxiv.org/html/2605.21792#bib.bib41),[1](https://arxiv.org/html/2605.21792#bib.bib42)\]. Boosted prompt ensembles \[[28](https://arxiv.org/html/2605.21792#bib.bib20)\] also optimize prompt ensembles by adding few-shot prompts that target examples on which the current ensemble is uncertain or incorrect. Beyond short prompts and few-shot demonstrations, agent systems increasingly use *skills*: modular instruction files encoding task strategies, constraints, and tool-use policies \[[47](https://arxiv.org/html/2605.21792#bib.bib37)\]. Recent work explores agents that rewrite their skills \[[51](https://arxiv.org/html/2605.21792#bib.bib43)\], jointly optimize skills with model parameters \[[38](https://arxiv.org/html/2605.21792#bib.bib44)\], or evolve and reuse skills over time \[[24](https://arxiv.org/html/2605.21792#bib.bib46),[48](https://arxiv.org/html/2605.21792#bib.bib47),[49](https://arxiv.org/html/2605.21792#bib.bib45)\]. Our work follows this direction but targets ensemble construction: rather than learning one strong skill, we learn complementary skills that address different failure modes and improve the ensemble's Pass@K.

#### Text\-to\-SQL

Early LLM\-based Text\-to\-SQL methods prompt a single model to generate SQL directly\[[10](https://arxiv.org/html/2605.21792#bib.bib51),[34](https://arxiv.org/html/2605.21792#bib.bib50),[17](https://arxiv.org/html/2605.21792#bib.bib3),[18](https://arxiv.org/html/2605.21792#bib.bib2),[4](https://arxiv.org/html/2605.21792#bib.bib1)\]\. More recent systems use agent\-based pipelines that decompose the task into schema pruning, evidence extraction, SQL generation, execution, and refinement\[[30](https://arxiv.org/html/2605.21792#bib.bib48),[12](https://arxiv.org/html/2605.21792#bib.bib49),[37](https://arxiv.org/html/2605.21792#bib.bib52),[35](https://arxiv.org/html/2605.21792#bib.bib24),[39](https://arxiv.org/html/2605.21792#bib.bib12),[8](https://arxiv.org/html/2605.21792#bib.bib11),[50](https://arxiv.org/html/2605.21792#bib.bib4)\]\. Several works further improve performance through ensembling\[[29](https://arxiv.org/html/2605.21792#bib.bib27),[9](https://arxiv.org/html/2605.21792#bib.bib23),[41](https://arxiv.org/html/2605.21792#bib.bib26),[22](https://arxiv.org/html/2605.21792#bib.bib25),[15](https://arxiv.org/html/2605.21792#bib.bib9),[13](https://arxiv.org/html/2605.21792#bib.bib8)\]: CHASE\-SQL generates candidates from diverse prompts and selects among them via a tournament\[[29](https://arxiv.org/html/2605.21792#bib.bib27)\]; MARS\-SQL trains a multi\-agent system with RL\[[41](https://arxiv.org/html/2605.21792#bib.bib26),[23](https://arxiv.org/html/2605.21792#bib.bib7),[46](https://arxiv.org/html/2605.21792#bib.bib6)\]; and XiYan\-SQL fine\-tunes multiple generators to induce diversity\[[22](https://arxiv.org/html/2605.21792#bib.bib25),[42](https://arxiv.org/html/2605.21792#bib.bib5)\]\. In contrast, our method requires no weight modifications or hand\-designed ensembles\. We optimize skills to explicitly construct complementary generators, increasing Pass@K coverage while remaining applicable to both open and closed LLMs\.

## 3 Method

### 3.1 Setup and Notation

#### Text\-to\-SQL with agentic execution\.

A Text-to-SQL instance is a pair $(q, \mathcal{D})$ consisting of a natural-language question $q$ and a relational database $\mathcal{D}$, and the goal is to produce an executable SQL query whose result on $\mathcal{D}$ matches that of a gold reference. We follow the agentic execution paradigm of recent Text-to-SQL systems \[[35](https://arxiv.org/html/2605.21792#bib.bib24),[9](https://arxiv.org/html/2605.21792#bib.bib23)\]: rather than generating SQL in a single forward pass, an *agent* interleaves tool calls, such as inspecting schema, sampling rows, drafting candidate SQL, and repairing errors, over multiple steps before returning a final query.

#### Skills\.

We modulate the agent's behavior through *skills*. A skill $s$ is a high-level instruction file (a system prompt expressed in natural language) that controls the agent's reasoning and tool-use policy: which decomposition style to favor, when to explore the schema versus draft directly, what repair patterns to apply on execution errors, and so on. We write $a_s$ for the agent equipped with skill $s$ and treat $s$ as identified with its prompt $\pi_s$, so that the optimization space $\mathcal{S}$ is the space of natural-language instruction files. Two distinct skills induce two genuinely different agent trajectories on the same input, not merely two stochastic samples of the same trajectory.

###### Example 3.1 (Skill Examples).

We show two simplified skills below that differ not only in wording but in the agent trajectory they encourage: decompose delays final SQL generation until the query logic has been broken into verified subcomponents, whereas direct_coder pushes the agent to draft early and rely on execution feedback for rapid repair.

decompose skill:

Break complex questions into simple subqueries, build bottom-up.

1. PARSE the question into atomic requirements.
2. BUILD each piece as a standalone CTE.
3. COMPOSE CTEs into the final query using WITH ... SELECT.

direct_coder skill:

You are an EFFICIENT SQL writer. Write SQL quickly, test, iterate.

1. Read the question carefully. Identify the core tables, joins, and aggregations.
2. Write your best SQL attempt IMMEDIATELY based on the schema.
3. Execute it. If errors occur, read the error message carefully and fix it.

#### Notation\.

Let $\mathcal{X}$ denote the space of input tasks $(q, \mathcal{D})$ and let $P$ be the underlying task distribution. For a skill $s \in \mathcal{S}$, we write $p_s(x) \in [0,1]$ for the probability that one execution of $a_s$ on input $x \in \mathcal{X}$ produces a correct SQL query (i.e., a query whose execution result matches the gold reference). For a finite training set $D_{\mathrm{train}} \subseteq \mathcal{X}$ and a subset $R \subseteq D_{\mathrm{train}}$, we write

$$\hat{p}_s(R) \;=\; \tfrac{1}{|R|} \sum\nolimits_{x \in R} p_s(x)$$

for the empirical success rate of skill $s$ on $R$. For a collection of $K$ skills $A = \{s_1, \ldots, s_K\}$, the population $\operatorname{Pass@K}$, defined as the probability that at least one of the $K$ corresponding agent executions succeeds, is

$$\operatorname{Pass@K}(A) \;=\; \mathbb{E}_{x \sim P}\!\left[1 - \prod\nolimits_{j=1}^{K} \bigl(1 - p_{s_j}(x)\bigr)\right].$$
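As a minimal sketch of this definition, the Python snippet below estimates Pass@K by replacing the expectation over $P$ with a sample mean over a finite set of examples. The function name `pass_at_k` and the two probability tables are hypothetical and chosen only to contrast correlated skills with complementary ones.

```python
from math import prod


def pass_at_k(success_probs):
    """Sample-mean estimate of Pass@K.

    success_probs[i][j] stands for p_{s_j}(x_i): the probability that one
    run of skill j solves example i. Each example contributes
    1 - prod_j (1 - p_{s_j}(x_i)), and the mean approximates E_{x ~ P}.
    """
    per_example = [1.0 - prod(1.0 - p for p in row) for row in success_probs]
    return sum(per_example) / len(per_example)


# Hypothetical numbers: three examples, two skills.
correlated = [[0.9, 0.9], [0.9, 0.9], [0.0, 0.0]]     # both skills miss example 3
complementary = [[0.9, 0.0], [0.9, 0.0], [0.0, 0.9]]  # skill 2 only covers example 3

print(round(pass_at_k(correlated), 3))     # 0.66
print(round(pass_at_k(complementary), 3))  # 0.9
```

In this toy table the second complementary skill is individually weaker (average success 0.3 instead of 0.6), yet the ensemble covers more examples, which is the effect the residual objective is designed to exploit.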

### 3.2 Residual Skill Optimization

#### The residual principle\.

As demonstrated in [Figure 1](https://arxiv.org/html/2605.21792#S1.F1), we construct an ensemble of $K$ skills sequentially. After selecting skills $s_1, \ldots, s_{j-1}$, we define the residual training set

$$R_{j-1} \;=\; \bigl\{\, x_i \in D_{\mathrm{train}} : a_{s_\ell} \text{ fails on } x_i \;\; \forall \ell < j \,\bigr\},$$

and pick the next skill by maximizing success on this residual:

$$s_j \;\in\; \arg\max\nolimits_{s \in \mathcal{S}} \; \hat{p}_s(R_{j-1}).$$

Later skills are therefore not pushed to be globally better than earlier ones. They are pushed to cover examples the current ensemble misses, which is exactly the marginal contribution of a new skill to $\operatorname{Pass@K}$. This is the mechanism by which the procedure encourages complementary skills and directly targets ensemble coverage rather than average accuracy.
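The marginal-contribution claim can be checked directly from the Pass@K definition. The short derivation below is a sketch: it writes $A_{j-1} = \{s_1, \ldots, s_{j-1}\}$ for the skills chosen so far (a shorthand introduced here for illustration) and applies the Pass@K formula to collections of size $j-1$ and $j$.

```latex
\begin{aligned}
\operatorname{Pass@K}(A_{j-1} \cup \{s\}) - \operatorname{Pass@K}(A_{j-1})
  &= \mathbb{E}_{x \sim P}\Bigl[\prod_{\ell<j}\bigl(1 - p_{s_\ell}(x)\bigr)
     - \bigl(1 - p_s(x)\bigr)\prod_{\ell<j}\bigl(1 - p_{s_\ell}(x)\bigr)\Bigr] \\
  &= \mathbb{E}_{x \sim P}\Bigl[\, p_s(x)\prod_{\ell<j}\bigl(1 - p_{s_\ell}(x)\bigr)\Bigr].
\end{aligned}
```

The product is the probability that every earlier skill fails on $x$, so each example is weighted by how likely it is to remain unsolved. When outcomes are deterministic ($p_{s_\ell}(x) \in \{0, 1\}$), the product reduces to the indicator that $x$ lies in the residual, and the gain is proportional to the success rate of $s$ on the residual, which is the quantity the arg-max selects for.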

#### The DivSkill-SQL algorithm.

The residual arg-max above is an idealized objective: it assumes access to the full training distribution and optimization over the infinite space of natural-language skills. DivSkill-SQL turns this principle into a practical batch-sequential process presented in Algorithm [1](https://arxiv.org/html/2605.21792#alg1). At each of $T$ rounds, DivSkill-SQL draws a fresh batch $B_t \subseteq D_{\mathrm{train}}$ and performs one pass of the residual principle over the seed pool: skills are drawn from $\mathcal{S}_0$ in a randomized order; after each per-skill $\operatorname{SkillOptimizer}$ call, examples newly solved by the updated skill are removed from the residual set; and at the end of the batch, the accepted prompt updates are committed back to the seed pool, so the pool evolves from batch to batch. The algorithm has two main ingredients that make the ideal residual arg-max practical: a finite set of *diverse seed skills* that defines the initial search space, and an *inner-loop optimizer* that refines each seed on the current residual failures.

Algorithm 1 DivSkill-SQL: batch-sequential residual skill optimization.

- 1: Input: training question pool $\mathcal{D}_{\mathrm{train}}$; seed skill pool $\mathcal{S}_0 = \{s_1, \ldots, s_K\}$ with prompts $\{\pi_s^{(0)}\}_{s \in \mathcal{S}_0}$; batch size $b$; number of batches $T$; skill optimizer $\operatorname{SkillOptimizer}(\pi, R)$ that returns a refined prompt.
- 2: $\pi_s \leftarrow \pi_s^{(0)}$ for each $s \in \mathcal{S}_0$ ▷ current seed pool
- 3: for $t = 1, \ldots, T$ do
  - 4: sample a batch $B_t \subseteq \mathcal{D}_{\mathrm{train}}$ with $|B_t| = b$ ▷ draw from question pool
  - 5: $R_{t,0} \leftarrow B_t$ ▷ initial residual
  - 6: choose an ordering $\sigma_t : [K] \to \mathcal{S}_0$ of the seed pool ▷ without replacement
  - 7: for $j = 1, \ldots, K$ do
    - 8: $s \leftarrow \sigma_t(j)$ ▷ draw next seed
    - 9: $\widetilde{\pi}_s \leftarrow \operatorname{SkillOptimizer}(\pi_s; R_{t,j-1})$ ▷ optimize on residual
    - 10: if $\hat{p}_{\widetilde{\pi}_s}(R_{t,j-1}) > \hat{p}_{\pi_s}(R_{t,j-1})$ then
      - 11: $\pi_s \leftarrow \widetilde{\pi}_s$ ▷ accept update on the residual
    - 12: end if
    - 13: $R_{t,j} \leftarrow \{\, x \in R_{t,j-1} : a_s \text{ fails on } x \,\}$ ▷ update residual with new $\pi_s$
  - 14: end for
  - 15: $\pi_s^{(t)} \leftarrow \pi_s$ for each $s \in \mathcal{S}_0$ ▷ commit seed-pool update
- 16: end for
- 17: return $\bigl\{\pi_s^{(T)}\bigr\}_{s \in \mathcal{S}_0}$
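The control flow of Algorithm 1 can be sketched in Python as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: `divskill_train`, `skill_optimizer`, and `solves` are hypothetical names, `solves` is treated as a deterministic check standing in for one agent run, and the early exit on an empty residual is an addition made here to avoid dividing by zero.

```python
import random


def divskill_train(train_pool, seed_prompts, batch_size, num_batches,
                   skill_optimizer, solves, rng=None):
    """Batch-sequential residual skill optimization (sketch).

    skill_optimizer(prompt, residual) -> candidate prompt  (hypothetical)
    solves(prompt, example) -> bool                        (hypothetical)
    """
    rng = rng or random.Random(0)
    prompts = dict(seed_prompts)  # skill name -> current prompt

    def rate(prompt, examples):
        # Empirical success rate of a prompt on a non-empty example list.
        return sum(solves(prompt, x) for x in examples) / len(examples)

    for _ in range(num_batches):
        residual = rng.sample(train_pool, batch_size)  # fresh batch B_t
        order = list(prompts)
        rng.shuffle(order)  # randomized ordering of the seed pool
        for name in order:
            if not residual:
                break  # assumption: stop once the batch is fully covered
            candidate = skill_optimizer(prompts[name], residual)
            # Accept the refinement only if it recovers more of the residual.
            if rate(candidate, residual) > rate(prompts[name], residual):
                prompts[name] = candidate
            # Keep only the examples the (possibly updated) skill still fails on.
            residual = [x for x in residual if not solves(prompts[name], x)]
    return prompts
```

Because each skill is only ever scored against what the skills before it in the ordering left unsolved, an update that merely re-solves already-covered examples earns no credit and is rejected.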

#### Skill seed pool\.

The arg-max in $\arg\max_{s \in \mathcal{S}} \hat{p}_s(R_{j-1})$ is over an infinite, unstructured space of natural-language prompts and is intractable directly. To make the inner optimization tractable, we initialize from a small *seed pool* $\mathcal{S}_0 = \{s_1^{(0)}, \ldots, s_K^{(0)}\}$ of $K$ LLM-assisted, manually curated skills, each encoding a distinct high-level reasoning strategy. Concretely, we first run a generic LLM agent on a subset of the training data, inspect representative and recurring failure modes, and then propose diverse strategy prompts from several perspectives: whether to decompose the question before coding, how much schema and value exploration to perform before drafting SQL, etc. We inspect and keep only strategies with distinct intended trajectories, removing near-duplicates that differ only in wording. [Example 3.1](https://arxiv.org/html/2605.21792#S3.Thmexample1) presents two simple, high-level skills we adopted as seeds, and the comprehensive list of seed skills is provided in Appendix [C](https://arxiv.org/html/2605.21792#A3). Subsequent rounds optimize prompts *starting from* the seed pool rather than from scratch, restricting the effective search space to natural refinements of these high-level strategies.

#### Inner\-loop optimization\.

The inner loop uses an LLM to propose refined skill prompts based on observed agent failures, a standard reflective prompt-optimization step \[[11](https://arxiv.org/html/2605.21792#bib.bib15),[52](https://arxiv.org/html/2605.21792#bib.bib40),[40](https://arxiv.org/html/2605.21792#bib.bib41),[1](https://arxiv.org/html/2605.21792#bib.bib42)\]. At each round, the LLM is shown failure traces from running the current skill prompt $\pi_s$ on the residual set $R_{t,j-1}$, proposes a refined prompt $\widetilde{\pi}_s$ that addresses those failures, and the refinement is accepted if it improves recovery on $R_{t,j-1}$. The agentic setting introduces non-trivial structure: each failure trace contains a full sequence of schema inspections, intermediate query executions, and repair attempts, providing rich diagnostic signal but also requiring the optimizer to attribute blame across multi-step trajectories rather than to a single output. The DivSkill-SQL-specific element is the choice of training subset: by optimizing each skill against the residual rather than the full training set, refinements are biased toward recovering prior-ensemble failures rather than improving average accuracy on already-solved examples.

The following example of the decompose skill evolution illustrates how residual optimization turns diversity in the seed pool into more targeted coverage in the inner-loop optimization.

###### Example 3.2 (Decompose Skill Evolution).

The decompose seed skill in [Example 3.1](https://arxiv.org/html/2605.21792#S3.Thmexample1) already differs from skills such as direct_coder: it encourages the agent to solve a query by breaking it into subproblems rather than drafting SQL immediately and repairing it based on execution feedback. However, the seed remains too high-level to specify what the agent should verify during decomposition. When optimized on the residual set, the prompt is refined precisely around the frequent kinds of errors that other skills leave unresolved. The example below shows the optimized decompose skill.

optimized decompose skill:

Break complex questions into simple subqueries, build bottom-up.

Step 1: PARSE the question into atomic requirements

- What is the output? columns, derived metrics, ratios, counts, sums.
- What is the grain? one row per what? month? patient? group?
- What grouping dimensions are needed? only those that match the desired grain.
- Is there ranking, ordering, limiting, or a ratio/composition calculation?

Step 2: ANCHOR the grain before grouping

- Match GROUP BY columns precisely to the output grain: no more, no less.
- If the question asks for monthly totals, group by month only; do not add route, city, or other columns unless explicitly requested.
- If the question asks for a split, identify the correct column and values that represent that split; do not substitute a loosely related column.

Step 3: BUILD each piece as a standalone CTE

- Add joins one at a time, verifying row counts do not explode.
- For ratio/composition queries, compute totals in one CTE, subgroup counts in another, then JOIN and divide.

Step 4: VERIFY join logic and filter semantics

- Confirm join keys actually link the intended entities; avoid fan-out.
- Confirm filter values match the domain as they appear in the data.

After optimization, it evolves from generic advice to "build CTEs" into a concrete rubric: identify the intended output grain before grouping, match GROUP BY columns to that grain, choose the correct column for requested splits, and compute ratio/composition queries using compatible aggregation levels. Thus, the learned skill is not merely a more detailed version of the seed prompt; its added details are shaped by the residual cases it is meant to cover. The full before/after skill comparison pool is provided in Appendix [C](https://arxiv.org/html/2605.21792#A3).

In practice, we further improve the finite-batch learning procedure through careful reflection-prompt and reward design, rotation of the skill order, and related measures; Appendix [B.2](https://arxiv.org/html/2605.21792#A2.SS2) discusses these implementation practices in detail.

#### Population\-level guarantee\.

Under the population-level objective, residual skill optimization is greedy maximization of the Pass@K coverage objective. Formally, Proposition [A.1](https://arxiv.org/html/2605.21792#A1.Thmproposition1) (stated and proved in Appendix [A.1](https://arxiv.org/html/2605.21792#A1.SS1)) shows that, in the population limit,

$$\operatorname{Pass@K}(\{s_1, \ldots, s_K\}) \;\geq\; (1 - 1/e) \max_{|A| \leq K} \operatorname{Pass@K}(A).$$

Thus, the learned skill bank is guaranteed to achieve at least a constant-factor fraction of the best possible $K$-skill ensemble under the population objective. The intuition is that the marginal value of a new skill lies in its ability to solve problems that the current skill set still fails to address. Residual optimization therefore greedily adds the skill with the largest additional coverage of the remaining failures, yielding the standard approximation guarantee for monotone submodular maximization \[[25](https://arxiv.org/html/2605.21792#bib.bib53)\].
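To make the bound concrete, the sketch below compares greedy residual selection with a brute-force optimum on a toy instance. It assumes deterministic outcomes, so each skill is just the set of examples it solves and Pass@K reduces to set coverage; the finite pool `solved_by` and all function names are hypothetical, and the instance is constructed so that greedy is strictly suboptimal.

```python
from itertools import combinations
from math import e


def coverage(chosen, solved_by, n):
    """Fraction of the n examples solved by at least one chosen skill."""
    covered = set().union(*(solved_by[s] for s in chosen))
    return len(covered) / n


def greedy(solved_by, n, k):
    """Repeatedly add the skill that solves the most residual examples."""
    chosen, residual = [], set(range(n))
    for _ in range(k):
        best = max((s for s in solved_by if s not in chosen),
                   key=lambda s: len(solved_by[s] & residual))
        chosen.append(best)
        residual -= solved_by[best]
    return chosen


def best_subset(solved_by, n, k):
    """Exhaustive search over all k-skill ensembles (tiny pools only)."""
    return max(combinations(solved_by, k),
               key=lambda c: coverage(c, solved_by, n))


# Toy pool over 6 examples: C looks best alone, but A + B cover everything.
solved_by = {"A": {0, 1, 2}, "B": {3, 4, 5}, "C": {0, 1, 3, 4}}
g = coverage(greedy(solved_by, 6, 2), solved_by, 6)
opt = coverage(best_subset(solved_by, 6, 2), solved_by, 6)
print(g >= (1 - 1 / e) * opt)  # True
```

Here greedy reaches 5/6 coverage against an optimum of 1, comfortably above the $(1 - 1/e) \approx 0.632$ floor; the guarantee bounds the worst case and says nothing about how close a particular run gets.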

### 3.3 Inference

At inference time, the $K$ learned skill-conditioned agents run in parallel on each test instance, producing $K$ candidate SQL queries. We select the final query using pairwise candidate comparisons following Pourreza et al. \[[29](https://arxiv.org/html/2605.21792#bib.bib27)\]. Since pairwise comparison scales quadratically in $K$, we first deduplicate candidates by execution output: queries returning identical results on the target database are collapsed into one equivalence class, with one representative retained. This leaves $G \leq K$ candidates and reduces the number of comparisons when multiple skills agree. We then run an exhaustive round-robin over all $\binom{G}{2}$ unordered pairs, rather than sampling pairs, because $G$ is small in practice (footnote 1: with $K = 8$, $G \approx 1.7$ for baselines and $2.6$ for DivSkill-SQL on BIRD-Critic). To mitigate LLM judge position bias, each pair $(i, j)$ is judged twice with swapped presentation order. Each judgment gives one win to the selected candidate, and we return the candidate with the highest win count, breaking ties arbitrarily.
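The selection procedure described above can be sketched as follows. This is a minimal illustration, assuming `run_sql` returns a hashable execution result and `judge` returns whichever of its two arguments an LLM judge prefers; both callables and the name `select_sql` are hypothetical, and ties are broken here by candidate order, which is one arbitrary choice among many.

```python
from itertools import combinations


def select_sql(candidates, run_sql, judge):
    """Pick one SQL string from a non-empty candidate list.

    run_sql(sql) -> hashable execution result           (hypothetical)
    judge(first, second) -> the preferred of the two    (hypothetical)
    """
    # Deduplicate by execution output, keeping the first query seen per result.
    representatives = {}
    for sql in candidates:
        representatives.setdefault(run_sql(sql), sql)
    reps = list(representatives.values())  # G <= K candidates

    # Exhaustive round-robin; each pair is judged in both presentation orders.
    wins = {sql: 0 for sql in reps}
    for a, b in combinations(reps, 2):
        wins[judge(a, b)] += 1  # a shown first
        wins[judge(b, a)] += 1  # b shown first
    return max(reps, key=wins.get)  # ties go to the earliest candidate
```

With $G$ representatives this makes $2\binom{G}{2}$ judge calls, and a judge that always prefers whichever candidate is shown first hands every candidate the same number of wins, so pure position bias cannot decide the outcome.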

## 4 Experiments

In the experiments, we evaluate DivSkill-SQL through three research questions:

- RQ1: End-to-end effectiveness. How does DivSkill-SQL compare with state-of-the-art Text-to-SQL systems on recent complex benchmarks?
- RQ2: Transferability. Do skills optimized on a single dataset or SQL dialect transfer to unseen datasets, alternative dialects, and new task formats?
- RQ3: Behavioral diversity. How does residual skill optimization change the behavior of an agentic Text-to-SQL system, beyond simply changing final SQL strings?

Table 1: Summary of initial seed skills.

### 4.1 Experimental Setup

#### Benchmarks\.

We evaluate DivSkill-SQL on two recent Text-to-SQL benchmarks that provide clean ground-truth annotations and involve complex reasoning. First, we use Spider2-Lite \[[16](https://arxiv.org/html/2605.21792#bib.bib14)\], which tests complex SQL generation over realistic schemas and multiple dialects. We report results on its SQLite, Snowflake, and BigQuery subsets, with 135, 207, and 209 examples respectively. These subsets differ in schema organization, query style, and dialect-specific syntax, allowing us to study both in-domain performance and cross-dialect transfer. Second, we evaluate on BIRD-Critic \[[20](https://arxiv.org/html/2605.21792#bib.bib19)\], a SQL debugging benchmark derived from BIRD \[[19](https://arxiv.org/html/2605.21792#bib.bib18)\]. Each instance provides a natural-language issue description, a buggy SQL query, and database context; the system must diagnose and repair the query rather than generate SQL from scratch. This setting tests whether DivSkill-SQL transfers beyond direct SQL generation to agentic SQL correction. We evaluate on its pure PostgreSQL version.

#### Skill optimization\.

We adopt the state-of-the-art prompt and skill optimization technique GEPA \[[1](https://arxiv.org/html/2605.21792#bib.bib42)\] for skill optimization. For the Spider2-Lite benchmark, we learn skills from only about 200 examples sampled from proprietary data in the Snowflake SQL dialect, and then evaluate the learned skills on Spider2-Lite. For the BIRD-Critic benchmark, we optimize skills on BIRD-mini-dev \[[19](https://arxiv.org/html/2605.21792#bib.bib18)\], which contains 500 standard Text-to-SQL examples.

#### Baselines\.

We compare DivSkill\-SQL against representative open\-source or most relevant Text\-to\-SQL systems, including DIN\-SQL \[[30](https://arxiv.org/html/2605.21792#bib.bib48)\], ReFoRCE \[[9](https://arxiv.org/html/2605.21792#bib.bib23)\], and CHASE\-SQL \[[29](https://arxiv.org/html/2605.21792#bib.bib27)\]\. We use the open\-source implementations of DIN\-SQL and ReFoRCE directly\. Since CHASE\-SQL was originally designed for a workflow\-based pipeline, we adapt it to our agentic setting by retaining its transferable design choices, including schema\-link shuffling, high\-temperature sampling, and pairwise candidate selection, while replacing the fixed, manually designed workflow with our agent architecture\. \(Footnote 2: In our experiments, the workflow\-based solution consistently underperforms the agentic solution unless heavy engineering effort is applied, which led us to build around the agentic approach\.\) By default, we use the Opus\-4\.6 \[[2](https://arxiv.org/html/2605.21792#bib.bib17)\] model; we also report results with GPT\-5\.4 \[[27](https://arxiv.org/html/2605.21792#bib.bib16)\]\. We share the implementation details in [Appendix B](https://arxiv.org/html/2605.21792#A2)\.

#### Metrics\.

We report three metrics\. *Pass@1* measures the execution accuracy of a single generated candidate\. *Pass@8* measures the oracle candidate\-set accuracy: a problem is counted as solved if at least one of the eight generated candidates is correct\. Pass@8 therefore measures the quality and coverage of candidate generation independently of selection\. \(Footnote 3: We compute Pass@8 based on the internal candidates of ensemble\-based methods\. Because ReFoRCE generates a dynamic number of candidates in its workflow, we cannot compute Pass@8 for it\.\) Finally, *selected accuracy* measures the execution accuracy of the single SQL query returned by the selector, through either an LLM judge or majority voting, and is the end\-to\-end performance of the deployed system\. This applies to ensemble\-based methods including DivSkill\-SQL, ReFoRCE, and CHASE\-SQL\.
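As a rough illustration of how the three metrics relate, the sketch below computes them from a boolean correctness matrix \(one row per problem, one column per candidate\)\. The data layout, the function names, and the `selected` index list are assumptions made for this example and are not taken from the paper's implementation; Pass@1 is averaged over candidates here, which is one common convention\.

```python
from typing import List, Sequence


def pass_at_1(correct: Sequence[Sequence[bool]]) -> float:
    """Mean single-candidate accuracy, averaged over candidates and problems."""
    per_problem = [sum(row) / len(row) for row in correct]
    return sum(per_problem) / len(per_problem)


def pass_at_k(correct: Sequence[Sequence[bool]], k: int) -> float:
    """Oracle coverage: a problem counts if any of its first k candidates is correct."""
    return sum(any(row[:k]) for row in correct) / len(correct)


def selected_accuracy(correct: Sequence[Sequence[bool]], selected: List[int]) -> float:
    """Accuracy of the single candidate the selector returned for each problem."""
    return sum(row[i] for row, i in zip(correct, selected)) / len(correct)


# Toy data (hypothetical): 4 problems, 3 candidates each.
correct = [
    [True, False, False],
    [False, False, True],
    [False, False, False],
    [True, True, True],
]
selected = [0, 1, 2, 0]  # hypothetical selector choices, one index per problem

coverage = pass_at_k(correct, 3)                 # 0.75
end_to_end = selected_accuracy(correct, selected)  # 0.5
```

On this toy data the second problem is covered by the pool but the selector picks a wrong candidate, which is the kind of gap between Pass@8 and selected accuracy that the metrics are meant to expose\.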

#### Skill seed pool and choice of $K$\.

Empirically, we observe that the Pass@$k$ curve for various methods nearly saturates beyond $K=8$\. Therefore, unless explicitly specified, we use $K=8$ by default in the experiments\. We initialize the optimization with $K=8$ seed skills \(see seed construction in [Section 3](https://arxiv.org/html/2605.21792#S3)\), each representing a distinct agent behavior family\. Table [1](https://arxiv.org/html/2605.21792#S4.T1) lists all seeds with a one\-line strategy summary; full prompt text is given in Appendix [C](https://arxiv.org/html/2605.21792#A3)\.

### 4\.2 End\-to\-End Performance and Transferability


Table 2: Spider2\-Lite results across SQLite, Snowflake, and BigQuery using \(a\) Opus 4\.6 and \(b\) GPT\-5\.4\.

Table 3: Hallucination diagnostics on Snowflake instances based on invalid\-reference failures\.

Table 4: Structural mismatch analysis on Snowflake instances with reference SQL available\.

[Tables 2\(a\)](https://arxiv.org/html/2605.21792#S4.T2.st1), [2\(b\)](https://arxiv.org/html/2605.21792#S4.T2.st2) and [5](https://arxiv.org/html/2605.21792#S4.T5) report end\-to\-end results on Spider2\-Lite, grouped by dialect, and on BIRD\-Critic\. Overall, DivSkill\-SQL achieves the strongest selected accuracy in almost all settings and consistently outperforms the strongest ensemble baseline, CHASE\-SQL\.

The advantage is most visible on the harder dialects in Table [2\(a\)](https://arxiv.org/html/2605.21792#S4.T2.st1)\. With Opus\-4\.6, DivSkill\-SQL improves selected accuracy over CHASE\-SQL by \+11\.11 on Snowflake and \+8\.29 on BigQuery, where large schemas and dialect\-specific syntax make unstable decoding more costly\. On SQLite, DivSkill\-SQL has slightly lower Pass@8 than CHASE\-SQL \(73\.33 vs\. 76\.30\), but still achieves higher selected accuracy \(64\.44 vs\. 63\.70\)\. This suggests that raw coverage alone is insufficient: SQLite is easier, so stochastic sampling can already cover many cases, but the additional candidates may include plausible wrong queries that are harder for the selector to distinguish \[[32](https://arxiv.org/html/2605.21792#bib.bib21)\]\. The same pattern holds with GPT\-5\.4 in Table [2\(b\)](https://arxiv.org/html/2605.21792#S4.T2.st2), where DivSkill\-SQL improves both Pass@8 and selected accuracy across all three dialects\. Note that ReFoRCE selects its final SQL by majority voting over internal candidates\. With GPT\-5\.4, its selected result underperforms mean Pass@1 on SQLite and Snowflake, showing that majority voting can fail when many candidates converge to the same incorrect result\. See Appendix [D](https://arxiv.org/html/2605.21792#A4) for detailed analysis\.
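The majority\-voting failure mode described above can be reproduced on toy data\. The sketch below is a generic majority vote over execution results, not ReFoRCE's actual selection code; the result strings and the gold answer are hypothetical placeholders\.

```python
from collections import Counter
from typing import List


def majority_vote(results: List[str]) -> int:
    """Return the index of the first candidate whose execution result is the most common one."""
    winner, _ = Counter(results).most_common(1)[0]
    return results.index(winner)


# Hypothetical execution results for 8 candidates, summarized as hashable strings.
# Five candidates converge on the same wrong answer; two reach the gold answer.
results = ["rows=12", "rows=12", "rows=12", "rows=9",
           "rows=12", "rows=7", "rows=12", "rows=9"]
gold = "rows=9"

picked = majority_vote(results)
in_pool = gold in results                  # the pool contains a correct candidate
selected_correct = results[picked] == gold  # but the vote does not return it
```

Here the correct answer is in the pool, so the problem counts toward Pass@8, yet the vote returns the shared incorrect result because correlated failures outnumber the correct candidates\.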

#### Hallucination and Error Analysis\.

[Tables 3](https://arxiv.org/html/2605.21792#S4.T3) and [4](https://arxiv.org/html/2605.21792#S4.T4) provide a more structured error breakdown, evaluated on the Snowflake part of Spider2\-Lite\. Compared with CHASE\-SQL, DivSkill\-SQL produces fewer invalid\-reference failures: the number of pools containing such a candidate drops from 10 to 7, and among solvable pools, the count drops from 6 to 2\. Missing\-function hallucinations also decrease from 6 to 2\. The structural comparison against gold SQL shows a similar pattern: DivSkill\-SQL makes fewer errors involving distinct, window functions, and union structure\. These results indicate that residual skills *improve diversity in a more controlled way*\. Rather than relying on high\-temperature perturbations that can introduce hallucinated references or unstable SQL structures, DivSkill\-SQL induces different agent behaviors while preserving candidate quality\.

#### Transfer across SQL dialects and task settings\.

We next analyze whether skills encode reusable problem\-solving strategies or merely overfit to one dialect or task\.

Recall that on Spider2\-Lite, the skills are optimized only on Snowflake SQL format data, but are applied to all dialects without further optimization\. The end\-to\-end results in [Tables 2\(a\)](https://arxiv.org/html/2605.21792#S4.T2.st1) and [2\(b\)](https://arxiv.org/html/2605.21792#S4.T2.st2) show that DivSkill\-SQL outperforms all baselines on SQLite and BigQuery by a large margin\. This suggests that residual skill optimization does not merely memorize Snowflake\-specific syntax\. Instead, as we will show later in [Section 4\.4](https://arxiv.org/html/2605.21792#S4.SS4), the learned skills capture higher\-level strategies for schema exploration, decomposition, query construction, and error checking that transfer across dialects\.

Table 5: BIRD\-Critic PostgreSQL results\.

Although all Text\-to\-SQL systems, datasets, and benchmarks target accurate SQL generation given a user’s question and a database, their settings vary widely\. For instance, BIRD\-Critic focuses on debugging SQL using feedback and a given buggy SQL query, which is quite different from existing benchmarks\. In this case, finding training data in the same format is challenging\. In this experiment, we explore whether the skills learned on BIRD\-mini\-dev, a regular Text\-to\-SQL dataset with no feedback or buggy SQL, translate well to BIRD\-Critic\. Results in Table [5](https://arxiv.org/html/2605.21792#S4.T5) confirm that the skills transfer: DivSkill\-SQL improves selected accuracy by \+2\.64 points over CHASE\-SQL, with consistent gains in Pass@1 \(\+1\.47\) and Pass@8 \(\+2\.27\)\. The simultaneous Pass@8 improvement indicates that the learned skills broaden the candidate pool, even on a task format \(debugging from feedback\) they were not optimized for\.

![Refer to caption](https://arxiv.org/html/2605.21792v1/x2.png)\(a\)
![Refer to caption](https://arxiv.org/html/2605.21792v1/x3.png)\(b\)

Figure 2: Pass@k comparison between DivSkill\-SQL and its variants on 100\-instance subsets of \(a\) Spider2\-Lite and \(b\) BIRD\-Critic\.

![Refer to caption](https://arxiv.org/html/2605.21792v1/x4.png)

Figure 3: Trajectory comparison between DivSkill\-SQL and repeated runs of the default skill on the Snowflake part of Spider2\-Lite\.

![Refer to caption](https://arxiv.org/html/2605.21792v1/x5.png)

Figure 4: Trajectory comparison between DivSkill\-SQL and repeated runs of the default skill on the BigQuery part of Spider2\-Lite\.

![Refer to caption](https://arxiv.org/html/2605.21792v1/x6.png)

Figure 5: Trajectory comparison between DivSkill\-SQL and repeated runs of the default skill on the SQLite part of Spider2\-Lite\.

### 4\.3 Ablation Studies

To understand the effectiveness of each component of DivSkill\-SQL, including the residual optimization process and the skill pool, we compare DivSkill\-SQL with three variants: *Base*, which repeatedly samples from the original base skill; *Opt\. Base*, which runs GEPA to optimize residuals from only base skills rather than diverse seeds; and *Seeds*, which uses the unoptimized initial skill seeds\. These baselines separate the effect of residual optimization from the effect of simply sampling more times, optimizing one stronger prompt, or using manually diverse initial instructions\.

Figure [2](https://arxiv.org/html/2605.21792#S4.F2) studies how candidate\-set coverage changes as we increase the number of generated candidates\. Repeated sampling from the base skill \(Base\) improves Pass@$k$ only gradually, indicating that independent runs of the same skill tend to make correlated mistakes\. Optimizing a single base skill improves individual quality, but still leaves many residual failures uncovered\. An initial diverse set of seed skills helps compared to Base in some cases \([Figure 2\(b\)](https://arxiv.org/html/2605.21792#S4.F2.sf2)\), but can also degrade performance \([Figure 2\(a\)](https://arxiv.org/html/2605.21792#S4.F2.sf1)\)\.

In contrast, DivSkill\-SQL achieves the strongest Pass@$k$ curve across nearly all values of $k$\. The gain is especially meaningful at small and moderate $k$, where each additional candidate must cover new failure modes to be useful\. As a result, to achieve the same coverage as DivSkill\-SQL’s Pass@8, baselines need 3 to 8 additional passes, making DivSkill\-SQL a more cost\-efficient choice\. This behavior is exactly what residual skill optimization is designed to produce: each later skill is not optimized to be globally better on all questions, but to solve examples that previous skills miss\.
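The residual principle can be illustrated with a small coverage computation\. The sketch below is a simplification under stated assumptions: it greedily picks skills from a fixed, hypothetical pool by how many still\-unsolved examples they cover, standing in for the GEPA optimization step that the paper actually uses to produce each new skill\. The skill names and solved sets are invented for the example\.

```python
from typing import Dict, List, Set


def residual(examples: Set[str], bank: List[str], solved: Dict[str, Set[str]]) -> Set[str]:
    """Examples that no skill currently in the bank solves."""
    covered = set().union(*(solved[s] for s in bank))
    return examples - covered


def greedy_bank(examples: Set[str], solved: Dict[str, Set[str]], k: int) -> List[str]:
    """Grow a bank by repeatedly adding the skill that solves the most residual examples."""
    bank: List[str] = []
    for _ in range(k):
        rest = residual(examples, bank, solved)
        candidates = [s for s in solved if s not in bank]
        if not rest or not candidates:
            break
        bank.append(max(candidates, key=lambda s: len(solved[s] & rest)))
    return bank


# Hypothetical per-skill solved sets on six training examples.
examples = {"q1", "q2", "q3", "q4", "q5", "q6"}
solved = {
    "base": {"q1", "q2", "q3"},
    "near_duplicate": {"q1", "q2"},   # decent alone, but redundant with "base"
    "explorer": {"q4", "q6"},
    "decomposer": {"q5"},             # weak alone, but covers a unique failure
}

bank = greedy_bank(examples, solved, 3)  # ["base", "explorer", "decomposer"]
```

The `near_duplicate` skill solves more examples on its own than `decomposer`, but it adds nothing once `base` is in the bank, so it is never chosen\. This mirrors the point that a later skill is judged by its marginal contribution rather than its standalone accuracy\.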

### 4\.4 How Diverse Skills Change Agent Behavior

To understand why residual skill optimization changes Text\-to\-SQL agent behavior, we analyze agent trajectories rather than only final SQL outputs\. A trajectory records the sequence of high\-level actions taken by an agent, such as schema inspection, exploratory SQL generation, and error repair\. We measure trajectory dissimilarity between two runs using edit distance normalized to $[0,1]$, and define trajectory similarity as $1$ minus this normalized distance\. Lower similarity therefore indicates that two candidates are produced through more distinct reasoning and tool\-use paths\.
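A minimal sketch of this similarity measure is given below\. It assumes Levenshtein distance over action labels and normalization by the length of the longer trajectory; the paper does not state the normalizer in this section, so that choice, along with the action names, is an assumption made for illustration\.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two action sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def trajectory_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """1 minus the edit distance normalized by the longer trajectory length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty trajectories are treated as identical
    return 1.0 - edit_distance(a, b) / longest


# Hypothetical trajectories of high-level actions.
run_a = ["inspect_schema", "explore_sql", "write_sql", "repair"]
run_b = ["inspect_schema", "write_sql", "repair"]
run_c = ["write_sql"]

sim_ab = trajectory_similarity(run_a, run_b)  # 0.75: one action differs
sim_ac = trajectory_similarity(run_a, run_c)  # 0.25: mostly different paths
```

Because the edit distance never exceeds the longer sequence length, the resulting similarity always falls in $[0,1]$\.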

[Figure 3](https://arxiv.org/html/2605.21792#S4.F3) shows a clear contrast between DivSkill\-SQL and repeated sampling\. Repeated runs form a high\-similarity cluster, with most pairwise similarities concentrated around 0\.75–0\.85, suggesting that sampling alone mostly produces variants of the same reasoning path\. In contrast, DivSkill\-SQL spreads trajectories over a much wider similarity range: learned skills such as direct\-coder, template\-first, decomposition, and exploration\-heavy form low\-similarity pairs, indicating that they induce genuinely different agent behaviors rather than merely different SQL surface forms\. This pattern also transfers to the BigQuery and SQLite subsets of Spider2\-Lite \([Figures 4](https://arxiv.org/html/2605.21792#S4.F4) and [5](https://arxiv.org/html/2605.21792#S4.F5)\), where skills learned from Snowflake data are applied without retraining, suggesting that the behavioral changes are not tied to a single SQL dialect\. Overall, the analysis supports the central mechanism of DivSkill\-SQL: learned skills change how the agent approaches the task, thereby producing candidate sets with less\-correlated failure modes and helping explain the stronger Pass@$K$ curve in [Figure 2](https://arxiv.org/html/2605.21792#S4.F2)\.

## 5 Discussion, Limitations, and Conclusions

#### Discussion and limitations\.

DivSkill\-SQL primarily improves candidate generation and does not focus on improving candidate selection\. Across several settings, there remains a substantial gap between Pass@8 and selected accuracy, indicating that the correct SQL is often present in the candidate pool but not selected\. Our pairwise comparison procedure mitigates direct 1\-of\-$K$ selection difficulty and reduces position bias by swapping candidate order, but it still relies on an LLM judge and scales quadratically in the number of surviving candidates\. Future work could combine residual skill optimization with stronger selectors\. Recent advances in language model reasoning demonstrate that verifiers trained on self\-generated correct and incorrect trajectories, often formulated as outcome or process reward models, can significantly enhance test\-time selection among multiple candidates \[[7](https://arxiv.org/html/2605.21792#bib.bib55),[21](https://arxiv.org/html/2605.21792#bib.bib56),[14](https://arxiv.org/html/2605.21792#bib.bib54),[26](https://arxiv.org/html/2605.21792#bib.bib57)\]\. This paradigm is naturally aligned with our framework: DivSkill\-SQL already generates diverse candidate reasoning paths and SQL programs, while execution feedback provides an abundant, automated source of positive and negative supervision to train such selectors without human annotation\.
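To make the quadratic cost and the order\-swapping idea concrete, the sketch below runs a round\-robin comparison in which each pair is judged in both orders and a win is credited only when the verdict is consistent\. This is an assumed, simplified procedure rather than the paper's selector; `toy_judge` is a deterministic stand\-in for an LLM judge, and the candidate queries are hypothetical\.

```python
from itertools import combinations
from typing import Callable, List


def pairwise_select(candidates: List[str], judge: Callable[[str, str], int]) -> int:
    """Round-robin selection; judge(a, b) returns 0 if it prefers its first argument, else 1."""
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        first = judge(candidates[i], candidates[j])    # shown in order (i, j)
        second = judge(candidates[j], candidates[i])   # shown in swapped order (j, i)
        if first == 0 and second == 1:
            wins[i] += 1   # candidate i preferred in both orders
        elif first == 1 and second == 0:
            wins[j] += 1   # candidate j preferred in both orders
        # otherwise the verdict followed position, so neither side gets credit
    return max(range(len(candidates)), key=wins.__getitem__)


calls: List[int] = []  # records one entry per judge call


def toy_judge(a: str, b: str) -> int:
    """Stand-in judge: prefers the query with GROUP BY, else falls back to the first shown."""
    calls.append(1)
    if ("GROUP BY" in a) != ("GROUP BY" in b):
        return 0 if "GROUP BY" in a else 1
    return 0  # position bias when it cannot tell the two apart


candidates = [
    "SELECT a FROM t",
    "SELECT a, COUNT(*) FROM t GROUP BY a",
    "SELECT * FROM t",
]
best = pairwise_select(candidates, toy_judge)
```

With $n$ candidates this makes $n(n-1)$ judge calls, which is where the quadratic scaling comes from; the position\-biased verdict between the first and third candidates is discarded because it flips when the order is swapped\.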

In addition, our evaluation focuses on recent Text\-to\-SQL and SQL\-debugging benchmarks with executable ground truth\. Although the learned skills transfer across SQL dialects and even from standard Text\-to\-SQL training data to BIRD\-Critic, broader deployment settings may introduce additional challenges, including ambiguous user intent, missing schema documentation, and multi\-turn interactive feedback\-based debugging\. Future work might explore how to adapt DivSkill\-SQL to these more complex settings\.

#### Conclusions\.

We introduce DivSkill\-SQL, a residual skill optimization framework for constructing complementary Text\-to\-SQL ensembles\. Rather than relying on hand\-designed prompt or workflow variants, or high\-temperature sampling, DivSkill\-SQL learns a bank of skill\-conditioned agents, where each skill is optimized to recover examples missed by the others\. This directly targets candidate\-set coverage and improves Pass@K while preserving the quality of SQL candidates\. Across Spider2\-Lite and BIRD\-Critic, spanning multiple SQL dialects and task formats, DivSkill\-SQL improves end\-to\-end accuracy over strong ensemble baselines while producing fewer hallucinated outputs and less redundant agent behavior\.

## References

- \[1\]L\. A\. Agrawal, S\. Tan, D\. Soylu, N\. Ziems, R\. Khare, K\. Opsahl\-Ong, A\. Singhvi, H\. Shandilya, M\. J\. Ryan, M\. Jiang,et al\.\(2025\)GEPA: reflective prompt evolution can outperform reinforcement learning\.arXiv preprint arXiv:2507\.19457\.Cited by:[§1](https://arxiv.org/html/2605.21792#S1.p4.1),[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px2.p1.1),[§3\.2](https://arxiv.org/html/2605.21792#S3.SS2.SSS0.Px4.p1.4),[§4\.1](https://arxiv.org/html/2605.21792#S4.SS1.SSS0.Px2.p1.1)\.
- \[2\]Anthropic\(2026\-02\)Introducing Claude Opus 4\.6\.Note:[https://www\.anthropic\.com/news/claude\-opus\-4\-6](https://www.anthropic.com/news/claude-opus-4-6)Cited by:[§4\.1](https://arxiv.org/html/2605.21792#S4.SS1.SSS0.Px3.p1.1)\.
- \[3\]T\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. D\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell,et al\.\(2020\)Language models are few\-shot learners\.Advances in neural information processing systems33,pp\. 1877–1901\.Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px2.p1.1)\.
- \[4\]K\. Chen, Y\. Chen, N\. Koudas, and X\. Yu\(2025\-02\)Reliable text\-to\-sql with adaptive abstention\.Proc\. ACM Manag\. Data3\(1\)\.External Links:[Link](https://doi.org/10.1145/3709719),[Document](https://dx.doi.org/10.1145/3709719)Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px3.p1.1)\.
- \[5\]M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. D\. O\. Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman,et al\.\(2021\)Evaluating large language models trained on code\.arXiv preprint arXiv:2107\.03374\.Cited by:[§1](https://arxiv.org/html/2605.21792#S1.p2.1),[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px1.p1.2)\.
- \[6\]Z\. Chen, X\. Qin, Y\. Wu, Y\. Ling, Q\. Ye, W\. X\. Zhao, and G\. Shi\(2025\)Pass@ k training for adaptively balancing exploration and exploitation of large reasoning models\.arXiv preprint arXiv:2508\.10751\.Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px1.p1.2)\.
- \[7\]K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano,et al\.\(2021\)Training verifiers to solve math word problems\.arXiv preprint arXiv:2110\.14168\.Cited by:[§5](https://arxiv.org/html/2605.21792#S5.SS0.SSS0.Px1.p1.1)\.
- \[8\]Y\. Dai, H\. Yang, M\. Hao, and P\. Chao\(2025\-07\)PARSQL: enhancing text\-to\-SQL through SQL parsing and reasoning\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 661–681\.External Links:[Link](https://aclanthology.org/2025.findings-acl.37/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.37),ISBN 979\-8\-89176\-256\-5Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px3.p1.1)\.
- \[9\]M\. Deng, A\. Ramachandran, C\. Xu, L\. Hu, Z\. Yao, A\. Datta, and H\. Zhang\(2025\)Reforce: a text\-to\-sql agent with self\-refinement, format restriction, and column exploration\.InICLR 2025 Workshop: VerifAI: AI Verification in the Wild,Cited by:[§1](https://arxiv.org/html/2605.21792#S1.p1.1),[§1](https://arxiv.org/html/2605.21792#S1.p2.1),[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px3.p1.1),[§3\.1](https://arxiv.org/html/2605.21792#S3.SS1.SSS0.Px1.p1.4),[§4\.1](https://arxiv.org/html/2605.21792#S4.SS1.SSS0.Px3.p1.1)\.
- \[10\]X\. Dong, C\. Zhang, Y\. Ge, Y\. Mao, Y\. Gao, J\. Lin, D\. Lou,et al\.\(2023\)C3: zero\-shot text\-to\-sql with chatgpt\.arXiv preprint arXiv:2307\.07306\.Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px3.p1.1)\.
- \[11\]C\. Fernando, D\. Banarse, H\. Michalewski, S\. Osindero, and T\. Rocktäschel\(2023\)Promptbreeder: self\-referential self\-improvement via prompt evolution\.arXiv preprint arXiv:2309\.16797\.Cited by:[§3\.2](https://arxiv.org/html/2605.21792#S3.SS2.SSS0.Px4.p1.4)\.
- \[12\]D\. Gao, H\. Wang, Y\. Li, X\. Sun, Y\. Qian, B\. Ding, and J\. Zhou\(2023\)Text\-to\-sql empowered by large language models: a benchmark evaluation\.arXiv preprint arXiv:2308\.15363\.Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px3.p1.1)\.
- \[13\]Y\. Guo, D\. Jin, S\. Ye, S\. Chen, J\. Yang, and X\. Tan\(2025\-07\)SQLForge: synthesizing reliable and diverse data to enhance text\-to\-SQL reasoning in LLMs\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 8441–8452\.External Links:[Link](https://aclanthology.org/2025.findings-acl.443/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.443),ISBN 979\-8\-89176\-256\-5Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px3.p1.1)\.
- \[14\]A\. Hosseini, X\. Yuan, N\. Malkin, A\. Courville, A\. Sordoni, and R\. Agarwal\(2024\)V\-star: training verifiers for self\-taught reasoners\.arXiv preprint arXiv:2402\.06457\.Cited by:[§5](https://arxiv.org/html/2605.21792#S5.SS0.SSS0.Px1.p1.1)\.
- \[15\]D\. Lee, C\. Park, J\. Kim, and H\. Park\(2025\-01\)MCS\-SQL: leveraging multiple prompts and multiple\-choice selection for text\-to\-SQL generation\.InProceedings of the 31st International Conference on Computational Linguistics,O\. Rambow, L\. Wanner, M\. Apidianaki, H\. Al\-Khalifa, B\. D\. Eugenio, and S\. Schockaert \(Eds\.\),Abu Dhabi, UAE,pp\. 337–353\.External Links:[Link](https://aclanthology.org/2025.coling-main.24/)Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px3.p1.1)\.
- \[16\]F\. Lei, J\. Chen, Y\. Ye, R\. Cao, D\. Shin, H\. Su, Z\. Suo, H\. Gao, W\. Hu, P\. Yin,et al\.\(2024\)Spider 2\.0: evaluating language models on real\-world enterprise text\-to\-sql workflows\.arXiv preprint arXiv:2411\.07763\.Cited by:[§4\.1](https://arxiv.org/html/2605.21792#S4.SS1.SSS0.Px1.p1.1)\.
- \[17\]B\. Li, Y\. Luo, C\. Chai, G\. Li, and N\. Tang\(2024\-07\)The dawn of natural language to sql: are we fully ready?\.Proc\. VLDB Endow\.17\(11\),pp\. 3318–3331\.External Links:ISSN 2150\-8097,[Link](https://doi.org/10.14778/3681954.3682003),[Document](https://dx.doi.org/10.14778/3681954.3682003)Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px3.p1.1)\.
- \[18\]H\. Li, J\. Zhang, H\. Liu, J\. Fan, X\. Zhang, J\. Zhu, R\. Wei, H\. Pan, C\. Li, and H\. Chen\(2024\-05\)CodeS: towards building open\-source language models for text\-to\-sql\.Proc\. ACM Manag\. Data2\(3\)\.External Links:[Link](https://doi.org/10.1145/3654930),[Document](https://dx.doi.org/10.1145/3654930)Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px3.p1.1)\.
- \[19\]J\. Li, B\. Hui, G\. Qu, J\. Yang, B\. Li, B\. Li, B\. Wang, B\. Qin, R\. Geng, N\. Huo,et al\.\(2024\)Can llm already serve as a database interface? a big bench for large\-scale database grounded text\-to\-sqls\.Advances in Neural Information Processing Systems36\.Cited by:[§4\.1](https://arxiv.org/html/2605.21792#S4.SS1.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.21792#S4.SS1.SSS0.Px2.p1.1)\.
- \[20\]J\. Li, X\. Li, G\. Qu, P\. Jacobsson, B\. Qin, B\. Hui, S\. Si, N\. Huo, X\. Xu, Y\. Zhang,et al\.\(2025\)Swe\-sql: illuminating llm pathways to solve user sql issues in real\-world applications\.arXiv preprint arXiv:2506\.18951\.Cited by:[§4\.1](https://arxiv.org/html/2605.21792#S4.SS1.SSS0.Px1.p1.1)\.
- \[21\]H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe\(2023\)Let’s verify step by step\.InThe twelfth international conference on learning representations,Cited by:[§5](https://arxiv.org/html/2605.21792#S5.SS0.SSS0.Px1.p1.1)\.
- \[22\]Y\. Liu, Y\. Zhu, Y\. Gao, Z\. Luo, X\. Li, X\. Shi, Y\. Hong, J\. Gao, Y\. Li, B\. Ding,et al\.\(2026\)Xiyan\-sql: a novel multi\-generator framework for text\-to\-sql\.IEEE Transactions on Knowledge and Data Engineering\.Cited by:[§1](https://arxiv.org/html/2605.21792#S1.p2.1),[§1](https://arxiv.org/html/2605.21792#S1.p3.1),[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px3.p1.1)\.
- \[23\]P\. MA, X\. Zhuang, C\. Xu, X\. Jiang, R\. Chen, and J\. Guo\(2026\)SQL\-r1: training natural language to SQL reasoning model by reinforcement learning\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=hgJQcuDwm1)Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px3.p1.1)\.
- \[24\]Z\. Ma, S\. Yang, Y\. Ji, X\. Wang, Y\. Wang, Y\. Hu, T\. Huang, and X\. Chu\(2026\)SkillClaw: let skills evolve collectively with agentic evolver\.arXiv preprint arXiv:2604\.08377\.Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px2.p1.1)\.
- \[25\]G\. L\. Nemhauser, L\. A\. Wolsey, and M\. L\. Fisher\(1978\)An analysis of approximations for maximizing submodular set functions—i\.Mathematical programming14\(1\),pp\. 265–294\.Cited by:[§A\.1](https://arxiv.org/html/2605.21792#A1.SS1.1.p1.1),[§3\.2](https://arxiv.org/html/2605.21792#S3.SS2.SSS0.Px5.p3.1)\.
- \[26\]A\. Ni, S\. Iyer, D\. Radev, V\. Stoyanov, W\. Yih, S\. Wang, and X\. V\. Lin\(2023\)Lever: learning to verify language\-to\-code generation with execution\.InInternational Conference on Machine Learning,pp\. 26106–26128\.Cited by:[§5](https://arxiv.org/html/2605.21792#S5.SS0.SSS0.Px1.p1.1)\.
- \[27\]OpenAI\(2026\-03\)Introducing GPT\-5\.4\.Note:[https://openai\.com/index/introducing\-gpt\-5\-4/](https://openai.com/index/introducing-gpt-5-4/)Cited by:[§4\.1](https://arxiv.org/html/2605.21792#S4.SS1.SSS0.Px3.p1.1)\.
- \[28\]S\. Pitis, M\. R\. Zhang, A\. Wang, and J\. Ba\(2023\)Boosted prompt ensembles for large language models\.arXiv preprint arXiv:2304\.05970\.Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px2.p1.1)\.
- \[29\]M\. Pourreza, H\. Li, R\. Sun, Y\. Chung, S\. Talaei, G\. T\. Kakkar, Y\. Gan, A\. Saberi, F\. Ozcan, and S\. O\. Arik\(2024\)Chase\-sql: multi\-path reasoning and preference optimized candidate selection in text\-to\-sql\.arXiv preprint arXiv:2410\.01943\.Cited by:[§1](https://arxiv.org/html/2605.21792#S1.p2.1),[§1](https://arxiv.org/html/2605.21792#S1.p3.1),[§1](https://arxiv.org/html/2605.21792#S1.p4.1),[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px3.p1.1),[§3\.3](https://arxiv.org/html/2605.21792#S3.SS3.p1.7),[§4\.1](https://arxiv.org/html/2605.21792#S4.SS1.SSS0.Px3.p1.1)\.
- \[30\]M\. Pourreza and D\. Rafiei\(2023\)Din\-sql: decomposed in\-context learning of text\-to\-sql with self\-correction\.Advances in neural information processing systems36,pp\. 36339–36348\.Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2605.21792#S4.SS1.SSS0.Px3.p1.1)\.
- \[31\]M\. Pourreza and D\. Rafiei\(2023\)Evaluating cross\-domain text\-to\-sql models and benchmarks\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,pp\. 1601–1611\.Cited by:[§1](https://arxiv.org/html/2605.21792#S1.p1.1)\.
- \[32\]M\. Pourreza, S\. Talaei, R\. Sun, X\. Wan, H\. Li, A\. Mirhoseini, A\. Saberi, S\. Arik,et al\.\(2025\)Reasoning\-sql: reinforcement learning with sql tailored partial rewards for reasoning\-enhanced text\-to\-sql\.arXiv preprint arXiv:2503\.23157\.Cited by:[§1](https://arxiv.org/html/2605.21792#S1.p3.1),[§4\.2](https://arxiv.org/html/2605.21792#S4.SS2.p2.1)\.
- \[33\]T\. Shin, Y\. Razeghi, R\. L\. Logan IV, E\. Wallace, and S\. Singh\(2020\)Autoprompt: eliciting knowledge from language models with automatically generated prompts\.InProceedings of the 2020 conference on empirical methods in natural language processing \(EMNLP\),pp\. 4222–4235\.Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px2.p1.1)\.
- \[34\]C\. Tai, Z\. Chen, T\. Zhang, X\. Deng, and H\. Sun\(2023\)Exploring chain of thought style prompting for text\-to\-sql\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,pp\. 5376–5393\.Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px3.p1.1)\.
- \[35\]S\. Talaei, M\. Pourreza, Y\. Chang, A\. Mirhoseini, and A\. Saberi\(2024\)Chess: contextual harnessing for efficient sql synthesis\.arXiv preprint arXiv:2405\.16755\.Cited by:[§1](https://arxiv.org/html/2605.21792#S1.p1.1),[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px3.p1.1),[§3\.1](https://arxiv.org/html/2605.21792#S3.SS1.SSS0.Px1.p1.4)\.
- \[36\]C\. Walder and D\. Karkhanis\(2025\)Pass@ k policy optimization: solving harder reinforcement learning problems\.arXiv preprint arXiv:2505\.15201\.Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px1.p1.2)\.
- \[37\]B\. Wang, C\. Ren, J\. Yang, X\. Liang, J\. Bai, L\. Chai, Z\. Yan, Q\. Zhang, D\. Yin, X\. Sun,et al\.\(2025\)Mac\-sql: a multi\-agent collaborative framework for text\-to\-sql\.InProceedings of the 31st International Conference on Computational Linguistics,pp\. 540–557\.Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px3.p1.1)\.
- \[38\]P\. Xia, J\. Chen, H\. Wang, J\. Liu, K\. Zeng, Y\. Wang, S\. Han, Y\. Zhou, X\. Zhao, H\. Chen,et al\.\(2026\)Skillrl: evolving agents via recursive skill\-augmented reinforcement learning\.arXiv preprint arXiv:2602\.08234\.Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px2.p1.1)\.
- \[39\]Y\. Xie, X\. Jin, T\. Xie, M\. Lin, L\. Chen, C\. Yu, L\. Cheng, C\. Zhuo, B\. Hu, and Z\. Li\(2024\-08\)Decomposition for enhancing attention: improving LLM\-based text\-to\-SQL through workflow paradigm\.InFindings of the Association for Computational Linguistics: ACL 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 10796–10816\.External Links:[Link](https://aclanthology.org/2024.findings-acl.641/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.641)Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px3.p1.1)\.
- \[40\]C\. Yang, X\. Wang, Y\. Lu, H\. Liu, Q\. V\. Le, D\. Zhou, and X\. Chen\(2023\)Large language models as optimizers\.InThe Twelfth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px2.p1.1),[§3\.2](https://arxiv.org/html/2605.21792#S3.SS2.SSS0.Px4.p1.4)\.
- \[41\]H\. Yang, J\. Zhang, Z\. He, and Y\. R\. Fung\(2025\)MARS\-sql: a multi\-agent reinforcement learning framework for text\-to\-sql\.arXiv preprint arXiv:2511\.01008\.Cited by:[§1](https://arxiv.org/html/2605.21792#S1.p2.1),[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px3.p1.1)\.
- \[42\]J\. Yang, B\. Hui, M\. Yang, J\. Yang, J\. Lin, and C\. Zhou\(2024\-08\)Synthesizing text\-to\-SQL data from weak and strong LLMs\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 7864–7875\.External Links:[Link](https://aclanthology.org/2024.acl-long.425/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.425)Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px3.p1.1)\.
- \[43\]J\. Yao, R\. Cheng, X\. Wu, J\. Wu, and K\. C\. Tan\(2025\)Diversity\-aware policy optimization for large language model reasoning\.arXiv preprint arXiv:2505\.23433\.Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px1.p1.2)\.
- \[44\]T\. Yu, R\. Zhang, K\. Yang, M\. Yasunaga, D\. Wang, Z\. Li, J\. Ma, I\. Li, Q\. Yao, S\. Roman,et al\.\(2018\)Spider: a large\-scale human\-labeled dataset for complex and cross\-domain semantic parsing and text\-to\-sql task\.InProceedings of the 2018 conference on empirical methods in natural language processing,pp\. 3911–3921\.Cited by:[§1](https://arxiv.org/html/2605.21792#S1.p1.1)\.
- \[45\]Y\. Yue, Z\. Chen, R\. Lu, A\. Zhao, Z\. Wang, S\. Song, and G\. Huang\(2025\)Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?\.arXiv preprint arXiv:2504\.13837\.Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px1.p1.2)\.
- \[46\]B\. Zhai, C\. Xu, Y\. He, and Z\. Yao\(2025\-07\)Optimizing reasoning for text\-to\-SQL with execution feedback\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 19206–19218\.External Links:[Link](https://aclanthology.org/2025.findings-acl.982/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.982),ISBN 979\-8\-89176\-256\-5Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px3.p1.1)\.
- \[47\]B\. Zhang, K\. Lazuka, and M\. Murag\(2026\)Equipping agents for the real world with agent skills, October 2025\.URL https://www\.anthropic\.com/engineering/equipping\-agents\-for\-the\-real\-world\-with\-agent\-skills\. Accessed,pp\. 01–28\.Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px2.p1.1)\.
- \[48\]H\. Zhang, S\. Fan, H\. P\. Zou, Y\. Chen, Z\. Wang, J\. Zhou, C\. Li, W\. Huang, Y\. Yao, K\. Zheng,et al\.\(2026\)EvoSkills: self\-evolving agent skills via co\-evolutionary verification\.arXiv preprint arXiv:2604\.01687\.Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px2.p1.1)\.
- \[49\]H\. Zhang, Q\. Long, J\. Bao, T\. Feng, W\. Zhang, H\. Yue, and W\. Wang\(2026\)MemSkill: learning and evolving memory skills for self\-evolving agents\.arXiv preprint arXiv:2602\.02474\.Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px2.p1.1)\.
- \[50\]Q\. Zhang, H\. Chen, J\. Dong, S\. Chen, F\. Huang, and X\. Huang\(2025\)Structure\-guided large language models for text\-to\-SQL generation\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=gT8JSEFqaS)Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px3.p1.1)\.
- \[51\]H\. Zhou, S\. Guo, A\. Liu, Z\. Yu, Z\. Gong, B\. Zhao, Z\. Chen, M\. Zhang, Y\. Chen, J\. Li,et al\.\(2026\)Memento\-skills: let agents design agents\.arXiv preprint arXiv:2603\.18743\.Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px2.p1.1)\.
- \[52\]Y\. Zhou, A\. I\. Muresanu, Z\. Han, K\. Paster, S\. Pitis, H\. Chan, and J\. Ba\(2022\)Large language models are human\-level prompt engineers\.InThe eleventh international conference on learning representations,Cited by:[§2](https://arxiv.org/html/2605.21792#S2.SS0.SSS0.Px2.p1.1),[§3\.2](https://arxiv.org/html/2605.21792#S3.SS2.SSS0.Px4.p1.4)\.

## Appendix A Proofs

### A\.1 Proof of Proposition [A\.1](https://arxiv.org/html/2605.21792#A1.Thmproposition1)

We analyze the population\-level version of residual skill optimization\. In this setting, the next skill is chosen to maximize its expected contribution on the residual failure mass of the current skill bank\. The finite\-batch procedure in Algorithm [1](https://arxiv.org/html/2605.21792#alg1) can be viewed as an empirical approximation to this objective; deriving finite\-sample guarantees would require additional assumptions on sample size and generalization of the learned skills\.

###### Proposition A\.1 \(Residual skill optimization approximates optimal Pass@K\)\.

Let $\mathcal{S}$ be a fixed skill family, and let $p_{s}(x)\in[0,1]$ denote the probability that one execution of skill $s$ solves input $x$. For any skill bank $A\subseteq\mathcal{S}$, define its population Pass@K objective as

$$F(A)=\mathbb{E}_{x\sim P}\left[1-\prod_{s\in A}(1-p_{s}(x))\right].$$

Starting from $A_{0}=\emptyset$, suppose that at each round $j=1,\ldots,K$, residual skill optimization selects

$$s_{j}\in\arg\max_{s\in\mathcal{S}}\mathbb{E}_{x\sim P}\left[p_{s}(x)\prod_{s^{\prime}\in A_{j-1}}(1-p_{s^{\prime}}(x))\right],$$

where $A_{j-1}=\{s_{1},\ldots,s_{j-1}\}$, and sets $A_{j}=A_{j-1}\cup\{s_{j}\}$. Let

$$A^{\star}\in\arg\max_{A\subseteq\mathcal{S},\ |A|\leq K}F(A)$$

be the optimal size-$K$ skill bank. Then

$$F(A_{K})\geq(1-1/e)F(A^{\star}).$$

Equivalently,

$$\operatorname{Pass@K}(\{s_{1},\ldots,s_{K}\})\geq(1-1/e)\max_{|A|\leq K}\operatorname{Pass@K}(A).$$

###### Proof\.

The proof follows the standard greedy analysis for monotone submodular maximization under a cardinality constraint\[[25](https://arxiv.org/html/2605.21792#bib.bib53)\]; we first verify that the Pass@K objective in our setting is indeed monotone submodular\.

We begin with monotonicity. For a skill bank $A\subseteq\mathcal{S}$, recall that

$$F(A)=\mathbb{E}_{x\sim P}\left[1-\prod_{s\in A}(1-p_{s}(x))\right].$$

For any skill $s\notin A$, the marginal gain of adding $s$ is

$$\Delta(s\mid A)=F(A\cup\{s\})-F(A).$$

Expanding the definition of $F$, we obtain

$$\begin{aligned}\Delta(s\mid A)&=\mathbb{E}_{x\sim P}\left[1-(1-p_{s}(x))\prod_{s^{\prime}\in A}(1-p_{s^{\prime}}(x))\right]-\mathbb{E}_{x\sim P}\left[1-\prod_{s^{\prime}\in A}(1-p_{s^{\prime}}(x))\right]\\&=\mathbb{E}_{x\sim P}\left[p_{s}(x)\prod_{s^{\prime}\in A}(1-p_{s^{\prime}}(x))\right].\end{aligned}$$

Since $p_{s}(x)\in[0,1]$, this marginal gain is nonnegative. Hence $F$ is monotone.

Next, let $A\subseteq B\subseteq\mathcal{S}$. Since each factor $1-p_{s^{\prime}}(x)\in[0,1]$, we have

$$\prod_{s^{\prime}\in B}(1-p_{s^{\prime}}(x))\leq\prod_{s^{\prime}\in A}(1-p_{s^{\prime}}(x)).$$

Therefore,

$$\Delta(s\mid B)=\mathbb{E}_{x\sim P}\left[p_{s}(x)\prod_{s^{\prime}\in B}(1-p_{s^{\prime}}(x))\right]\leq\mathbb{E}_{x\sim P}\left[p_{s}(x)\prod_{s^{\prime}\in A}(1-p_{s^{\prime}}(x))\right]=\Delta(s\mid A).$$

Thus $F$ satisfies diminishing marginal returns and is submodular.

The residual arg-max rule is exactly greedy maximization of this objective, because the marginal gain of adding skill $s$ to the current bank $A_{j-1}$ is

$$\Delta(s\mid A_{j-1})=\mathbb{E}_{x\sim P}\left[p_{s}(x)\prod_{s^{\prime}\in A_{j-1}}(1-p_{s^{\prime}}(x))\right].$$

Thus choosing $s_{j}$ by residual maximization is the same as choosing the skill with the largest marginal increase in $F$.

Let $A^{\star}\in\arg\max_{|A|\leq K}F(A)$ be an optimal size-$K$ skill bank. Since $F$ is monotone,

$$F(A^{\star})-F(A_{j})\leq F(A_{j}\cup A^{\star})-F(A_{j}).$$

By submodularity,

$$F(A_{j}\cup A^{\star})-F(A_{j})\leq\sum_{s\in A^{\star}}\Delta(s\mid A_{j}).$$

Since $|A^{\star}|\leq K$, at least one skill $s\in A^{\star}$ has marginal gain at least

$$\frac{F(A^{\star})-F(A_{j})}{K}.$$

Greedy chooses the skill with the largest marginal gain, so

$$F(A_{j+1})-F(A_{j})\geq\frac{F(A^{\star})-F(A_{j})}{K}.$$

Equivalently,

$$F(A^{\star})-F(A_{j+1})\leq\left(1-\frac{1}{K}\right)\left(F(A^{\star})-F(A_{j})\right).$$

Applying this recurrence for $K$ rounds, starting from $F(A_{0})=F(\emptyset)=0$, gives

$$F(A^{\star})-F(A_{K})\leq\left(1-\frac{1}{K}\right)^{K}F(A^{\star}).$$

Hence, using $\left(1-\frac{1}{K}\right)^{K}\leq 1/e$,

$$F(A_{K})\geq\left(1-\left(1-\frac{1}{K}\right)^{K}\right)F(A^{\star})\geq(1-1/e)F(A^{\star}).$$

This proves the claim. ∎
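As a numerical illustration of the guarantee, the following minimal sketch runs the residual arg-max rule on a small synthetic table of success probabilities and compares the result with a brute-force optimum. The skill names, the random probabilities, and all helper functions are hypothetical; the skill family is a finite toy set and the expectation over $x\sim P$ is replaced by an average over 20 synthetic inputs.

```python
import itertools
import math
import random


def pass_at_k(bank, p):
    """F(A): average over inputs of 1 - prod_{s in A} (1 - p[s][x])."""
    n_inputs = len(next(iter(p.values())))
    total = 0.0
    for x in range(n_inputs):
        miss = 1.0
        for s in bank:
            miss *= 1.0 - p[s][x]
        total += 1.0 - miss
    return total / n_inputs


def residual_gain(s, bank, p):
    """Marginal gain: average of p_s(x) times the residual failure mass of the bank."""
    n_inputs = len(p[s])
    total = 0.0
    for x in range(n_inputs):
        residual = 1.0
        for t in bank:
            residual *= 1.0 - p[t][x]
        total += p[s][x] * residual
    return total / n_inputs


def greedy_bank(p, k):
    """Residual arg-max rule: repeatedly add the skill with the largest residual gain."""
    bank = []
    for _ in range(k):
        candidates = [s for s in p if s not in bank]
        bank.append(max(candidates, key=lambda s: residual_gain(s, bank, p)))
    return bank


def best_bank(p, k):
    """Brute-force optimum; size exactly k suffices because F is monotone."""
    return max(itertools.combinations(p, k), key=lambda a: pass_at_k(a, p))


rng = random.Random(0)
# Toy data (assumption): 6 hypothetical skills, 20 synthetic inputs.
p = {f"skill_{i}": [rng.random() for _ in range(20)] for i in range(6)}
K = 3

greedy = greedy_bank(p, K)
f_greedy = pass_at_k(greedy, p)
f_opt = pass_at_k(best_bank(p, K), p)
bound = (1 - 1 / math.e) * f_opt
print(greedy, round(f_greedy, 4), round(f_opt, 4), round(bound, 4))
```

On instances this small the greedy bank is often at or near the optimum; the $(1-1/e)$ factor is a worst-case guarantee, not a typical gap.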

## Appendix B Implementation Details and Best Empirical Practices

### B\.1 Implementation Details

#### Machine setup\.

All experiments were run on a single Apple Silicon MacBook Pro \(Apple M\-series CPU, 10\+ cores; 32 GB unified memory; macOS 14\)\. Since our pipeline issues all language\-model calls to hosted inference APIs, no local GPU is required: the workstation only orchestrates prompting, parsing, and result aggregation\.

#### LLM backbone and inference settings\.

We evaluate DivSkill\-SQL with two LLM backbones: Opus\-4\.6 and GPT\-5\.4\. Both are used in non\-reasoning mode with a decoding temperature of 0\.2, a canonical choice for coding tasks, and a maximum completion budget of 64000 tokens\. The low temperature is a deliberate design choice: since DivSkill\-SQL achieves diversity through learned skills rather than stochastic decoding, a low temperature preserves the reasoning stability and precision needed for complex SQL generation\. Each agent run is allowed up to 12 reasoning turns and 20 SQL executions\.

#### CHASE\-SQL adaptation\.

CHASE\-SQL was originally designed for a workflow\-based Text\-to\-SQL pipeline\. To produce a fair comparison in our agentic setting, we adapt its three transferable design choices to our agent architecture:

- *Schema\-link shuffling\.* We permute the column ordering presented to the agent across different candidates, matching CHASE\-SQL’s schema perturbation strategy\.
- *High\-temperature decoding\.* We set the agent’s decoding temperature to 1\.0 for CHASE\-SQL candidates, reproducing the stochastic variation that CHASE\-SQL uses to induce candidate diversity\.
- *Pairwise candidate selection\.* We use the same LLM\-based pairwise comparison selector for both CHASE\-SQL and DivSkill\-SQL, ensuring the selection mechanism is held constant across methods\.
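Schema-link shuffling can be pictured with a short sketch. This is only an illustration of permuting column order per candidate; the function name, the dictionary schema representation, and the seeding scheme are assumptions and do not come from the CHASE-SQL or DivSkill-SQL code.

```python
import random


def shuffled_schema_views(schema, num_candidates, seed=0):
    """Return one independently column-shuffled copy of `schema` per candidate.

    `schema` maps table name -> list of column names (hypothetical representation).
    The input is left unmodified.
    """
    rng = random.Random(seed)
    views = []
    for _ in range(num_candidates):
        view = {}
        for table, columns in schema.items():
            permuted = list(columns)  # copy so the original order is preserved
            rng.shuffle(permuted)
            view[table] = permuted
        views.append(view)
    return views


# Toy schema (assumption), one view per candidate in a pool of 4.
schema = {"orders": ["id", "region", "amount"], "users": ["id", "name"]}
views = shuffled_schema_views(schema, num_candidates=4, seed=1)
for view in views:
    print(view)
```

Each view carries the same tables and columns, so any difference between candidates is attributable to presentation order rather than schema content.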

#### Candidate selection\.

For both DivSkill\-SQL and CHASE\-SQL, we use an LLM\-based pairwise selector that compares each candidate pair and selects the winner via win\-rate\-based aggregation\. The selector uses the same backbone model as the candidate generator \(either Opus\-4\.6 or GPT\-5\.4\) at a temperature of 0\.2\. Each pairwise comparison receives the question, database schema, and the two candidate SQL queries along with execution previews, and returns a preference judgment\.
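A minimal sketch of win-rate aggregation over all candidate pairs is shown below. The judge is a stand-in callable; in the setup described above it would be an LLM call that also sees the question, schema, and execution previews. The function names and the toy quality scores are assumptions made for illustration.

```python
from itertools import combinations


def select_by_win_rate(candidates, prefers_first):
    """Pick the candidate with the highest pairwise win rate.

    `prefers_first(a, b)` is a hypothetical judge returning True when it
    prefers candidate `a` over candidate `b`. Expects a non-empty list.
    """
    if len(candidates) == 1:
        return candidates[0]
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        if prefers_first(candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    games_per_candidate = len(candidates) - 1  # every candidate meets every other once
    win_rates = [w / games_per_candidate for w in wins]
    best = max(range(len(candidates)), key=lambda i: win_rates[i])
    return candidates[best]


# Stand-in judge: prefers the candidate with the higher made-up quality score.
toy_quality = {"SELECT 1": 0.2, "SELECT 2": 0.9, "SELECT 3": 0.5}


def toy_judge(a, b):
    return toy_quality[a] >= toy_quality[b]


chosen = select_by_win_rate(list(toy_quality), toy_judge)
print(chosen)
```

Ties in win rate are resolved here in favor of the earliest candidate; the source does not specify a tie-breaking rule, so that choice is arbitrary.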

#### GEPA optimization settings\.

Table [6](https://arxiv.org/html/2605.21792#A2.T6) summarizes the GEPA hyperparameters used for skill optimization\.

Table 6: GEPA hyperparameters for BIRD\-mini\-dev skill optimization\. Spider2\-Lite Snowflake training uses the same settings except the training data source\.

#### Agent tools\.

The agent has access to six tools during both training and evaluation:

- execute\_sql: run SQL against the live database and return results or error messages\.
- lookup\_docs: retrieve dialect\-specific documentation \(e\.g\., dialect function/grammar reference\), database metadata, external knowledge, etc\.
- review\_sql: invoke an LLM\-based critic to review the current SQL draft before submission\.
- get\_sql\_pattern: retrieve anonymized SQL patterns for similar query types \(e\.g\., top\-N, running totals\)\.
- get\_sql\_templates: retrieve SQL templates that are masked training instances categorized by query types\.
- submit\_final\_sql: submit the final answer\.

The tool set and their implementations remain fixed across all skills and all experiments\. Skills influence only the strategy text in the system prompt; they cannot add, remove, or modify tools\.
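The separation between fixed tools and swappable strategy text can be sketched as follows. The `AgentConfig` class, `build_agent` helper, and base instruction string are hypothetical names used only to illustrate the constraint; the tool names are the six listed above.

```python
from dataclasses import dataclass

# The six tool names listed above; fixed for every skill and experiment.
TOOLS = (
    "execute_sql",
    "lookup_docs",
    "review_sql",
    "get_sql_pattern",
    "get_sql_templates",
    "submit_final_sql",
)


@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical container: only `system_prompt` varies between skills."""
    system_prompt: str
    tools: tuple = TOOLS


def build_agent(base_instructions, skill_text):
    """Inject the skill's strategy text into the system prompt; tools are untouched."""
    return AgentConfig(system_prompt=f"{base_instructions}\n\n{skill_text}")


base = "You are a Text-to-SQL agent."  # placeholder base instructions
agent_a = build_agent(base, "## Strategy: DIRECT CODING ...")
agent_b = build_agent(base, "## Strategy: DEEP EXPLORATION ...")
print(agent_a.tools == agent_b.tools, agent_a.system_prompt != agent_b.system_prompt)
```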

### B\.2 Practices for Better Skill Learning

We describe five practical lessons that proved important for making reflective skill optimization work reliably on Text\-to\-SQL agents\.

#### Proxy models for optimization\.

Running the full GEPA reflect\-mutate\-evaluate loop on the strongest available model \(e\.g\., Opus 4\.6\) is both expensive and, perhaps counter\-intuitively, less effective\. Stronger models already produce near\-correct trajectories on many training examples, leaving the reflector with subtle failure signals that are hard to attribute to specific strategy gaps\. We find that using a weaker model from the same family as a proxy, e\.g\., Sonnet 4\.6 for the Opus 4\.6 experiments, yields faster and cheaper optimization while producing larger per\-round accuracy gains that push skill evolution more aggressively\. The resulting optimized skills transfer well to the stronger model: the strategy\-level improvements \(e\.g\., “anchor the grain before grouping”\) are model\-agnostic, even though they were discovered from the proxy’s more frequent and more interpretable failures\.

#### Brevity as a tiebreaker\.

During GEPA optimization, multiple candidate skill mutations often achieve the same accuracy gain on the hard batch\. Rather than selecting arbitrarily, we break ties by preferring the shorter prompt\. This acts as a lightweight regularizer: longer prompts tend to accumulate instance\-specific details from the training batch, increasing the risk of overfitting to particular schemas or question patterns\.
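The tie-breaking rule amounts to a two-part sort key, as in this small sketch. The `(prompt_text, accuracy)` tuple layout, the function name, and the example prompts are assumptions; prompt length is measured in characters here, whereas the source does not say which length measure is used.

```python
def pick_mutation(mutations):
    """Choose the mutation with the highest accuracy, preferring the shorter prompt on ties.

    `mutations` is a list of (prompt_text, accuracy) pairs (hypothetical layout).
    """
    return max(mutations, key=lambda m: (m[1], -len(m[0])))


mutations = [
    ("Anchor the grain before grouping; for the flights table group by month.", 0.60),
    ("Anchor the grain before grouping.", 0.60),
    ("Check joins.", 0.55),
]
best_prompt, best_accuracy = pick_mutation(mutations)
print(best_prompt, best_accuracy)
```

The first two candidates tie on accuracy, so the shorter one wins even though it appears later in the list; the shortest prompt overall loses because accuracy is compared first.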

#### Dialect\- and instance\-agnostic reflection\.

We explicitly instruct the reflector model to avoid including any instance\-specific or dialect\-specific details in the optimized strategy text, including concrete table names, column names, dialect\-specific function syntax, or schema patterns observed in the training batch\. The reflector is told that the skill must generalize across unseen databases, SQL dialects, and task formats\. This constraint is important because the skills optimized on training data with Snowflake grammar are later evaluated on SQLite and BigQuery, making dialect\-specific advice actively harmful\.

#### Correctness\-only reward signal\.

We deliberately use binary execution correctness as the sole reward signal for GEPA optimization\. Although finer\-grained intermediate rewards—such as keyword overlap with gold SQL, structural similarity scores, or partial\-credit metrics based on clause matching—might seem more informative, we found them prone to reward hacking in practice\. For example, a keyword\-coverage reward incentivizes skills that instruct the agent to speculatively include as many SQL keywords as possible, inflating the reward without improving correctness\. Similarly, structural similarity rewards can penalize valid alternative query plans that differ from the gold SQL in form but not in semantics\. Binary correctness avoids these pathologies: a skill is rewarded only if the agent’s final SQL produces the correct result on the target database, providing an unambiguous and ungameable training signal\.
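The contrast between binary execution correctness and a keyword-overlap proxy can be reproduced on a toy SQLite database. This is a sketch under stated assumptions: the table, the queries, and both scoring functions are invented for illustration, and row-multiset equality is used as a simple stand-in for whatever result comparison the benchmark evaluators apply.

```python
import sqlite3
from collections import Counter


def execution_reward(conn, predicted_sql, gold_sql):
    """Return 1 if both queries produce the same multiset of rows, else 0."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return 0  # SQL that fails to execute earns no reward
    gold = conn.execute(gold_sql).fetchall()
    return int(Counter(predicted) == Counter(gold))


def keyword_coverage(predicted_sql, gold_sql):
    """Fraction of whitespace-separated gold tokens present in the prediction (a gameable proxy)."""
    gold_tokens = set(gold_sql.upper().split())
    predicted_tokens = set(predicted_sql.upper().split())
    return len(gold_tokens & predicted_tokens) / len(gold_tokens)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 10), ("east", 20), ("west", 5)],
)

gold = "SELECT region, SUM(amount) FROM orders GROUP BY region"
# Semantically equivalent query with a different surface form.
alternative = (
    "SELECT o.region, SUM(o.amount) FROM (SELECT * FROM orders) AS o GROUP BY o.region"
)
# Contains every gold token but adds a filter that changes the answer.
stuffed = "SELECT region, SUM(amount) FROM orders WHERE amount > 15 GROUP BY region"

for name, sql in [("alternative", alternative), ("stuffed", stuffed)]:
    print(name, execution_reward(conn, sql, gold), round(keyword_coverage(sql, gold), 2))
```

The proxy ranks the wrong query above the correct alternative, while the binary reward orders them correctly.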

#### Skill\-order rotation\.

In Algorithm [1](https://arxiv.org/html/2605.21792#alg1), each skill removes its solved examples before the next skill is optimized, so earlier skills in a batch see an easier and broader residual, while later skills are trained on a narrower and harder subset\. To guarantee that each skill occasionally sees the full batch before other skills remove solved examples, we rotate the skill order across batches\. This reduces positional bias and prevents later skills from being systematically specialized only to the hardest tail of failures\. Note that if the number of batches is fewer than the number of skills, we apply a larger stride to the rotation to maintain uniform positional coverage\.
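One way to realize the rotation is a cyclic shift whose offset grows with the batch index. The sketch below is an assumption about the mechanics: the source states only that a larger stride is used when batches are fewer than skills, so the specific stride formula (integer division of skills by batches) and the function names are hypothetical.

```python
def rotation_stride(num_skills, num_batches):
    """Assumed stride: 1 when batches >= skills, otherwise spread the offsets evenly."""
    return max(1, num_skills // num_batches)


def rotated_order(skills, batch_index, stride):
    """Cyclically shift the skill list so a different skill goes first in each batch."""
    offset = (batch_index * stride) % len(skills)
    return skills[offset:] + skills[:offset]


skills = [
    "default", "direct_coder", "decompose", "explore_heavy",
    "conservative", "adversarial_checker", "template_first", "fast_error_repair",
]

# Fewer batches than skills: a stride of 2 spreads the leading position over the list.
stride = rotation_stride(len(skills), num_batches=4)
leaders = [rotated_order(skills, b, stride)[0] for b in range(4)]
print(stride, leaders)
```

With eight batches the stride falls back to 1 and every skill leads exactly once.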

## Appendix C Skill Pool

We list the $K=8$ seed skills used in our experiments, showing both the initial hand\-designed seed prompt and the prompt after residual optimization on Snowflake training data\. Each prompt is injected verbatim into the agent’s system message; the agent’s tool set and control flow remain unchanged\. Optimized prompts are labeled with a round suffix \(e\.g\., \_r1\) indicating the GEPA batch that produced the accepted mutation\.

#### default\.

*Strategy:* the baseline agentic behavior: balanced exploration followed by incremental SQL construction\.

Seed prompt: default

\#\# Strategy

1. EXPLORE first: run queries to understand the data\-\-\-check table structures, column values, data types, join keys, actual string values in the data
2. If unsure about any SQL function’s syntax or behavior, call lookup\_docs BEFORE writing the query
3. For common patterns \(top\-N, running totals, pivots\), call get\_sql\_pattern for a template
4. PLAN your approach based on what you discovered
5. WRITE and TEST your SQL incrementally\-\-\-run it via execute\_sql to check results
6. VERIFY results look reasonable \(right number of rows, right columns, sensible values\)
7. Call review\_sql to get a second opinion before submitting
8. SUBMIT only when confident

Optimized prompt: default

1. EXPLORE the data first \(as in seed\)\.
2. CLARIFY ambiguities\-\-\-identify potential traps: NULLs in key columns, case sensitivity, duplicate rows, date formats, and whether counts should be DISTINCT\.
3. MAP the question to SQL primitives\-\-\-explicitly decide join type \(INNER vs LEFT\), filter placement \(WHERE vs HAVING\), aggregation scope, and NULL handling before coding\.
4. Check templates\-\-\-call get\_sql\_pattern and lookup\_docs \(as in seed\)\.
5. BUILD incrementally\-\-\-write and execute\_sql each CTE or subquery alone \(as in seed\)\.
6. VALIDATE against the question\-\-\-re\-read the question, then check: correct columns returned? correct filter conditions? DISTINCT where needed? NULL\-safe denominators? ordering and limits match?
7. CROSS\-CHECK edge cases\-\-\-run a quick sanity query \(e\.g\., total counts, min/max values, a spot\-check join\) to confirm the final result is not inflated by fanout or deflated by over\-filtering\.
8. REVIEW\-\-\-call review\_sql and address any flagged issues\.
9. SUBMIT only after incremental checks and review pass\.

#### direct\_coder\.

*Strategy:* drafts SQL immediately, refines through execution feedback\.

Seed prompt: direct\_coder

\#\# Strategy: DIRECT CODING

You are an EFFICIENT SQL writer\. Write SQL quickly, test, iterate\.

1. Read the question carefully\. Identify the core tables, joins, and aggregations needed\.
2. Write your best SQL attempt IMMEDIATELY based on the schema\.
3. Execute it\. If errors occur, read the error message carefully and fix\.
4. If the query runs but results look wrong, investigate specific columns/values\.
5. Iterate rapidly\-\-\-each revision should fix one specific issue\.
6. Do NOT over\-explore\. Only investigate columns/values that are directly relevant to errors\.
7. SUBMIT as soon as the query produces reasonable results\.

GEPA added lookup\-table awareness and structured error\-repair guidance\. Key additions over the seed \(new or substantially expanded material in bold\):

Optimized prompt: direct\_coder

\#\# Strategy: DIRECT CODING

1. Read the schema first\. Before writing any SQL, identify ALL tables mentioned or implied by the question\. Pay special attention to lookup/reference/static tables \(e\.g\., category tables, node tables, type tables\) that provide human\-readable names or filter criteria\-\-\-these almost always require a JOIN\.
2. Map question terms to schema columns\. If the question references a name, label, or category, find which table owns that column\. Never filter or select on a column that doesn’t exist in the target table\-\-\-use the correct table via JOIN instead\.
3. Write your best SQL IMMEDIATELY based on the schema\. Use explicit JOIN conditions\. When lookup tables exist, join them rather than filtering on raw IDs or payload strings\.
4. Execute\. Fix errors specifically:
   - Column\-not\-found → find the correct table and JOIN it
   - Wrong results → verify JOIN keys and WHERE filters match actual values
   - Missing rows → check JOIN type \(INNER vs LEFT\)
5. Aggregation and output checks: Verify GROUP BY includes all non\-aggregated SELECT columns; verify ORDER BY, LIMIT, and NULLS LAST where appropriate; confirm output column names match the question\.
6. Iterate rapidly\-\-\-each revision fixes one specific issue\.
7. SUBMIT as soon as the query produces reasonable results\.

Key reminder: Questions involving categories, types, nodes, or classifications almost always require joining to a static/lookup table\. Never assume the needed label lives in the fact table\-\-\-check the schema\.

#### decompose\.

*Strategy:* identifies sub\-questions before composing the final SQL\.

Seed prompt: decompose

\#\# Strategy: DECOMPOSE & CONQUER

Break complex questions into simple subqueries, build bottom\-up\.

1. PARSE the question into atomic requirements: What is being counted/summed/averaged? What are the filter conditions? What are the grouping columns? Is there ranking, ordering, or limiting?
2. BUILD each piece as a standalone CTE: Start with the base data, add joins one at a time verifying row counts, add aggregations verifying results at each step\.
3. COMPOSE CTEs into the final query using WITH\.\.\.SELECT\.
4. Run each CTE individually via execute\_sql to verify intermediate results\.
5. Call review\_sql on the assembled final query before submitting\.
6. SUBMIT only when confident\.

GEPA expanded this into a structured six\-step process with explicit grain\-anchoring and filter\-validation phases\. Key additions \(in bold\):

Optimized prompt: decompose

\#\# Strategy: DECOMPOSE & CONQUER

- Step 1: PARSE the question into atomic requirements\-\-\-explicitly identify: output columns/metrics, grain \(one row per what?\), filter conditions, grouping dimensions, and whether there is ranking, ordering, limiting, or a ratio/composition calculation\.
- Step 2: ANCHOR the grain before grouping\. Match GROUP BY columns precisely to the output grain\-\-\-no more, no less\. If the question asks for monthly totals, group by month only; do not add route, city, or other columns unless explicitly requested\.
- Step 3: BUILD each piece as a standalone CTE\. For ratio/composition queries, compute totals in one CTE, subgroup counts in another, then JOIN and divide\.
- Step 4: VERIFY join logic and filter semantics\. Confirm join keys actually link the intended entities; avoid fan\-out\. Confirm filter values match the domain as they appear in the data, not as paraphrased in the question\.
- Step 5: ASSEMBLE and REVIEW\. Call review\_sql; confirm output columns and grain match the question\.
- Step 6: SUBMIT only when confident\.

#### explore\_heavy\.

*Strategy:* spends extra steps on schema and sample\-row inspection before drafting\.

Seed prompt: explore\_heavy

\#\# Strategy: DEEP EXPLORATION

You are a THOROUGH explorer\. Before writing ANY SQL, deeply understand the data\.

1. CATALOG SCAN: List all schemas and tables\. Check which tables have data \(SELECT COUNT\(\*\)\)\.
2. COLUMN AUDIT: For each relevant table, run DESCRIBE or SHOW COLUMNS\. Check actual column types\.
3. VALUE PROFILING: For key columns, run SELECT DISTINCT to see actual values, formats, ranges\.
4. JOIN DISCOVERY: Test joins between tables with small queries before using them in final SQL\.
5. Only after thorough exploration, write your query\.
6. Test edge cases: What if there are NULLs? What if the join produces duplicates?
7. Call review\_sql before submitting\.
8. SUBMIT only when confident\.

GEPA reorganized the strategy into six explicit phases and added an output\-requirements clarification phase \(Phase 4\) that was absent from the seed\. Key additions \(in bold\):

Optimized prompt: explore\_heavy

\#\# Strategy: DEEP EXPLORATION

- Phase 1\-\-3: Schema Discovery, Value Profiling, Join Validation \(as in seed\)\.
- Phase 4: Clarify Output Requirements Before Writing SQL\. Before coding, explicitly answer:
  - Granularity: one summary row, or one row per dimension?
  - Aggregation: which columns require AVG, SUM, COUNT?
  - Filters: date ranges, status filters, or categorical constraints implied?
  - Extra columns: would adding dimension columns change the granularity?

  If the question asks for summary metrics without specifying a grouping dimension, produce a single aggregated row\-\-\-do NOT return raw row\-level data or add unrequested dimension columns\.
- Phase 5: Write and Validate SQL\-\-\-apply NULLIF in denominators to avoid divide\-by\-zero; use CTEs to separate filtering, joining, and aggregation stages\.
- Phase 6: Submit only when the output granularity, aggregation, and columns exactly match what was requested\.

#### conservative\.

*Strategy:* prefers the simplest faithful query; avoids speculative constructs\.

Seed prompt: conservative

\#\# Strategy: CONSERVATIVE & SAFE

Prefer simple, safe SQL\. Avoid unnecessary complexity\.

1. Start with the SIMPLEST possible query that could answer the question\.
2. Avoid: Complex nested subqueries, window functions unless required, multiple joins when one will do, HAVING when WHERE suffices, correlated subqueries\.
3. Do NOT add any clause the question didn’t ask for: No ROUND unless asked, no COALESCE unless NULLs are a proven problem, no extra WHERE filters or output columns\.
4. When ambiguous, pick the LITERAL interpretation\.
5. Test your query\. If it works and looks reasonable, submit\.
6. SUBMIT once results make sense\. Don’t over\-iterate\.

GEPA added explicit guidance on AND\-vs\-OR filter logic and output granularity matching, two recurring failure modes in the training data\. Key additions \(in bold\):

Optimized prompt: conservative

\(Steps 1\-\-3 retained from seed\.\)

4. Filter logic\-\-\-AND vs OR:
   - AND: both conditions must hold simultaneously on the same row\.
   - OR: either condition suffices\.
   - When filtering across related entities \(e\.g\., origin OR destination\), default to OR unless the question explicitly requires all conditions on the same record\.
5. Output structure\-\-\-match the question’s requested granularity exactly:
   - Identify the grouping dimensions the question asks for before writing GROUP BY\.
   - Do not substitute finer\-grained groupings \(e\.g\., day\) when coarser ones \(e\.g\., month\) are requested\.
   - Column names and aliases should reflect the question’s terminology\.
6. When ambiguous, pick the LITERAL interpretation\. Consult reference templates for grouping, join, and conditional aggregation patterns\.
7. Test and submit once results make sense\.

#### adversarial\_checker\.

*Strategy:* actively stress\-tests the query for edge cases before finalizing\.

Seed prompt: adversarial\_checker

\#\# Strategy: ADVERSARIAL SELF\-CHECK

After writing SQL, actively try to BREAK it before submitting\.

1. Explore and write your initial SQL query\.
2. Execute it and get results\.
3. NOW, CHALLENGE your own query: Does the question ask for X but you computed Y? Could a join be producing duplicates? Are you filtering correctly? For ratios: verify numerator and denominator independently\. For ‘top N’: verify ordering\.
4. Fix any issues you discover\.
5. Call review\_sql for an independent check\.
6. SUBMIT only after surviving both your own and the reviewer’s scrutiny\.

GEPA restructured the strategy into four explicit phases, adding an upfront decomposition step and specific guidance for period\-over\-period comparisons and join\-type selection:

Optimized prompt: adversarial\_checker

- Phase 1\-\-\-Decompose the Question Before Writing: Identify ALL computations required\. If the question involves change/comparison, plan for two separate aggregations and a join or pivot\-\-\-never a single flat filter\. If it involves rates or ratios, plan numerator and denominator explicitly\. Sketch the output shape\.
- Phase 2\-\-\-Write SQL Matching Full Complexity: Use CTEs for multi\-step logic\. Use window functions for rankings\. Use FULL OUTER JOIN when comparing two periods where either side may have no data\. Avoid the trap of “just filter and return raw rows” when computation is required\.
- Phase 3\-\-\-Adversarial Challenge \(expanded from seed\): complexity check, output check, aggregation check, filter check, join type check \(OUTER vs INNER\), ratio/rate check\.
- Phase 4\-\-\-Fix and Validate\. Consult templates for period\-over\-period CTE patterns\.

#### template\_first\.

*Strategy:* anchors query shape from retrieved SQL patterns before coding\.

Seed prompt: template\_first

\#\# Strategy: TEMPLATE\-FIRST PLANNER

Before writing SQL, call get\_sql\_templates and get\_sql\_pattern to anchor the query shape\. Pick the closest template, adapt only the table names, columns, filters, grain, and ordering confirmed from the schema, then execute one focused validation query\. Prefer template\-guided CTE/window/ratio patterns over ad hoc exploration\. Submit after the result shape matches the question\.

GEPA appended a mandatory pre\-submit checklist targeting schema hallucination, a failure mode where the agent references tables or columns it has not verified\. Key addition \(in bold\):

Optimized prompt: template\_first

\(Seed text retained in full\.\)

MANDATORY pre\-submit checklist \(do NOT skip\):

1. Tables exist as named\-\-\-every table in your final SQL was confirmed to exist via execute\_sql\.
2. Columns exist as named\-\-\-every column appeared in actual schema output\.
3. Identifier quoting is correct for the dialect\.
4. Final query has executed at least once and returned non\-empty rows whose shape matches the question\.

Submitting SQL that references non\-existent columns, or that has never been successfully executed, is an automatic failure\.

#### fast\_error\_repair\.

*Strategy:* prioritizes execution\-error recovery over upfront planning\.

Seed prompt: fast\_error\_repair

\#\# Strategy: FAST ERROR REPAIR

Move quickly: inspect only the most relevant tables, write the first plausible SQL early, execute it, and repair from concrete errors or wrong\-shaped results\. Do not exhaustively profile\. Each iteration changes one thing: missing column, join key, filter value, aggregation grain, or dialect syntax\. Submit as soon as the SQL runs and the output shape answers the question\.

GEPA expanded the single\-paragraph seed into a five\-phase process, adding a structured triage table and a value\-sanity\-check phase\. Key additions \(in bold\):

Optimized prompt: fast\_error\_repair

Core loop: write → run → verify values → repair → submit\.

- Phase 1\-\-\-Minimal Table Scan \(as in seed\)\.
- Phase 2\-\-\-First SQL Draft\. Prefer CTEs for multi\-step logic\. Commit one join key and one filter assumption up front\.
- Phase 3\-\-\-Execute & Triage\. Change exactly one thing per iteration:
  - Column not found → alias or rename
  - Join returns 0 rows → relax join key or flip direction
  - Count inflated → add DISTINCT or check fan\-out join
  - Wrong grain → re\-examine GROUP BY columns
- Phase 4\-\-\-Value Sanity Check \(the key addition\): Before submitting: do not only check shape\-\-\-check values\. Are counts suspiciously 0 or implausibly large? Does a percentage exceed 1\.0 or go negative?
- Phase 5\-\-\-Submit when both shape and values are plausible\.

## Appendix D Additional Experimental Results

### D\.1 Variance and Candidate\-Level Breakdowns

[Table 7](https://arxiv.org/html/2605.21792#A4.T7) reports the full Spider2\-Lite Opus\-4\.6 table with standard deviations\. The qualitative conclusion from the main text is unchanged after adding variance: DivSkill\-SQL remains the strongest method on selected accuracy in all three dialects, with the largest gains on the more complex Snowflake and BigQuery settings\.

Table 7: Spider2\-Lite results with Opus 4\.6\.

#### Why ReFoRCE selection can underperform mean candidate quality\.

In the main experiment, we noted that ReFoRCE’s final selected SQL can be worse than the average quality of its individual candidate runs\. [Table 8](https://arxiv.org/html/2605.21792#A4.T8) makes this concrete\. On SQLite, the mean per\-candidate Pass@1 is 64\.44 while the final selected accuracy is only 58\.52; on Snowflake, the corresponding figures are 47\.83 and 41\.06\. This indicates that ReFoRCE often produces correlated candidates that agree on the same wrong answer, so majority voting can amplify a dominant failure mode rather than recover the strongest candidate\. BigQuery is less pathological: selected accuracy \(51\.22\) slightly exceeds the mean candidate accuracy \(48\.19\), but still remains well below the oracle upper bound of 56\.59\. Here, “Max” is exactly Pass@4 because ReFoRCE exposes four candidate runs in this evaluation\. ReFoRCE’s \-\-num\_votes parameter only sets the number of self\-refinement threads per instance, not the number of surviving candidates: each thread is a 5\-step refine loop that may terminate without writing any SQL \(max\-iter exhaustion, empty\-result early\-stop, invalid response, or oversized schema\)\. In our GPT\-5\.4 run the realized candidate count therefore varies from instance to instance, so a Pass@k oracle is ill\-defined and not directly comparable to the Pass@8 results we report for the other methods\. In this table, each Pass@1 value is the accuracy of one complete set of generated candidates, one per question, which exposes ReFoRCE’s intermediate results before majority voting\. We report four such candidate sets to explain why ReFoRCE’s selected accuracy \(Sel\. acc\.\) can fall below this per\-candidate Pass@1\.
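The mechanism by which majority voting falls below mean candidate accuracy can be seen on a four-instance toy example. The answers below are invented for illustration and are unrelated to the numbers in Table 8; `summarize` is a hypothetical helper.

```python
from collections import Counter


def summarize(instances):
    """Return (mean per-candidate accuracy, majority-vote accuracy, oracle accuracy).

    `instances` is a list of (candidate_answers, gold_answer) pairs.
    """
    n = len(instances)
    mean_candidate = sum(
        sum(answer == gold for answer in answers) / len(answers)
        for answers, gold in instances
    ) / n
    selected = sum(
        Counter(answers).most_common(1)[0][0] == gold for answers, gold in instances
    ) / n
    oracle = sum(gold in answers for answers, gold in instances) / n
    return mean_candidate, selected, oracle


# Toy data (assumption): "g" marks the correct answer, "w" a shared wrong answer.
instances = [
    (["g", "g", "g", "g"], "g"),  # all candidates correct
    (["w", "w", "g", "w"], "g"),  # correlated failure outvotes the one correct run
    (["w", "w", "w", "g"], "g"),  # same pattern
    (["g", "g", "g", "w"], "g"),  # majority correct
]
mean_candidate, selected, oracle = summarize(instances)
print(mean_candidate, selected, oracle)
```

When wrong candidates agree with each other, each minority correct run still counts toward the per-candidate mean but contributes nothing to the vote, which is how selected accuracy can end up below the mean while the oracle stays high.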

Table 8: ReFoRCE per\-candidate Pass@1 on Spider2\-Lite for GPT\-5\.4\.

### D\.2 Additional CHASE\-SQL Comparison with GPT\-5\.4

#### Candidate\-pool quality and rankability\.

The main Spider2\-Lite results suggest that DivSkill\-SQL’s advantage over CHASE\-SQL is not only that it contains slightly more correct candidates, but also that its candidate pool is easier to rank\. [Tables 9](https://arxiv.org/html/2605.21792#A4.T9), [10](https://arxiv.org/html/2605.21792#A4.T10), and [11](https://arxiv.org/html/2605.21792#A4.T11) provide supporting evidence on the GPT\-5\.4 Spider2\-Lite evaluation\. Compared with CHASE\-SQL, DivSkill\-SQL improves Pass@1 \(56\.95 vs\. 54\.96\), Pass@8 \(78\.79 vs\. 76\.42\), and selected accuracy \(64\.90 vs\. 60\.69\), while reducing the oracle–selector gap \(13\.89 vs\. 15\.72\)\.
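As a worked check, the oracle–selector gap appears to be Pass@8 minus selected accuracy. This definition and the symbol $\Delta_{\text{gap}}$ are assumptions inferred from the figures above, which the definition reproduces:

```latex
\begin{align*}
\Delta_{\text{gap}} &= \text{Pass@8} - \text{Sel.\ acc.} \\
\Delta_{\text{gap}}^{\text{DivSkill-SQL}} &= 78.79 - 64.90 = 13.89 \\
\Delta_{\text{gap}}^{\text{CHASE-SQL}} &= 76.42 - 60.69 = 15.73
\end{align*}
```

The 0.01 difference from the reported CHASE-SQL gap of 15.72 would be consistent with the gap having been computed from unrounded accuracies.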

The same\-instance head\-to\-head analysis shows 47 DivSkill\-SQL\-only wins versus 24 CHASE\-SQL\-only wins on final selection, and McNemar’s test confirms that the selected\-accuracy gain is statistically significant \(p=0\.0086\)\. The candidate\-density view further shows that DivSkill\-SQL yields fewer dead pools with zero correct candidates, slightly more rich pools with 6–8 correct candidates, and fewer exact duplicate slots\. Together, these results support the claim that residual skill optimization improves both candidate quality and candidate rankability, rather than merely increasing surface\-form variation\.
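The significance test can be sketched from the two win counts alone. The code below is a minimal standard-library implementation of the two-sided exact McNemar test, assuming the 47 and 24 wins are the discordant pairs; the function name `mcnemar_exact` is hypothetical and this is not the authors' evaluation script.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis, each discordant pair favors either method
    with probability 0.5, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    k = min(b, c)
    # Lower-tail probability P(X <= k) for X ~ Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the tail for a two-sided test, capped at 1.
    return min(1.0, 2 * tail)


# 47 DivSkill-SQL-only wins vs. 24 CHASE-SQL-only wins on final selection.
p_value = mcnemar_exact(47, 24)
print(f"p = {p_value:.4f}")
```

Instances where both methods succeed or both fail do not enter the test, which is why only the 71 discordant instances matter. The result should land close to the reported p = 0.0086.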

Table 9: Distribution of the number of correct candidates per instance in the 8\-candidate pool\. The DivSkill\-SQL pool yields fewer dead pools \(0 correct\) and slightly more rich pools \(6–8 correct\)\.

Table 10: Same\-instance head\-to\-head on Spider2\-Lite GPT\-5\.4\. “Ours\-only sel\.” counts instances where the final selected SQL is correct for DivSkill\-SQL but not for CHASE\-SQL\. “Ours\-only oracle” counts instances where the 8\-candidate pool contains at least one correct SQL for DivSkill\-SQL but not for CHASE\-SQL\.

Table 11: Paired same\-instance comparisons on Spider2\-Lite GPT\-5\.4 using the exact McNemar test\. The strongest effect is on final selected accuracy, including the subset where both methods already contain at least one correct candidate somewhere in the pool\. This supports the claim that the low\-temperature DivSkill\-SQL pool is easier to rank, not merely that it has slightly better oracle coverage\.
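The quantities summarized in Tables 9 and 10 can be computed from per-instance correctness records. The sketch below is illustrative only: the helper names, the threshold argument `rich_min`, and all toy inputs are hypothetical, and the real evaluation may treat duplicates differently (for example, after SQL normalization).

```python
from collections import Counter


def pool_density(correct_counts, rich_min=6):
    """Count dead pools (0 correct) and rich pools (>= rich_min correct)."""
    hist = Counter(correct_counts)
    dead = hist[0]
    rich = sum(v for c, v in hist.items() if c >= rich_min)
    return dead, rich


def head_to_head(ours, theirs):
    """Count instances where exactly one of the two methods is correct."""
    ours_only = sum(a and not b for a, b in zip(ours, theirs))
    theirs_only = sum(b and not a for a, b in zip(ours, theirs))
    return ours_only, theirs_only


def duplicate_slots(pool):
    """Number of candidate slots that exactly repeat another SQL string."""
    return len(pool) - len(set(pool))


# Toy inputs: correct candidates per 8-candidate pool, per-instance outcomes.
counts = [0, 8, 6, 3, 0, 7]
dead, rich = pool_density(counts)

ours = [True, True, False, True, False]
theirs = [True, False, False, False, True]
ours_only, theirs_only = head_to_head(ours, theirs)

dups = duplicate_slots(["SELECT 1", "SELECT 1", "SELECT 2"])
print(dead, rich, ours_only, theirs_only, dups)
```

Passing selected-SQL correctness to `head_to_head` corresponds to the "Ours-only sel." column, while passing "pool contains at least one correct SQL" flags corresponds to "Ours-only oracle".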