ProSPy: A Profiling-Driven SQL-Python Agentic Framework for Enterprise Text-to-SQL
Summary
ProSPy is a profiling-driven SQL-Python agentic framework for enterprise text-to-SQL that structures reasoning into four stages: automatic profiling, schema pruning, dialect-agnostic SQL interface, and Python-based analysis. It achieves execution accuracies of 60.15% and 60.51% on Spider 2.0-Lite and Spider 2.0-Snow with Claude-4.5-Opus, outperforming strong baselines.
View Cached Full Text
Cached at: 06/05/26, 08:08 AM
# ProSPy: A Profiling-Driven SQL-Python Agentic Framework for Enterprise Text-to-SQL
Source: [https://arxiv.org/html/2606.05836](https://arxiv.org/html/2606.05836)
Zhaorui Yang1, Huawei Zheng2, Sen Yang2, Yuhui Zhang3, Haoxuan Li1, Zhizhen Yu5, Xuan Yi4, Chen Hou3, DefengXie3, Chao Hu3, Minfeng Zhu5, Dazhen Deng1,2, Haozhe Feng3, Danqing Huang3, Yingcai Wu1, Peng Chen3, Wei Chen1 1State Key Lab of CAD&CG, Zhejiang University 2School of Software Technology, Zhejiang University3Tencent TEG 4School of Mathematical Sciences, Peking University,5Zhejiang University
###### Abstract
Large language models have substantially advanced Text\-to\-SQL systems, yet applying them to enterprise\-scale databases remains challenging\. Real\-world databases often contain large and heterogeneous schemas, incomplete metadata, dialect\-specific SQL syntax, and complex analytical questions that are difficult to solve with a single SQL query\. To address these challenges, we proposeProSPy, aProfiling\-drivenSQL–Python agentic framework for enterprise\-scale Text\-to\-SQL\. ProSPy structures the reasoning process into four stages: it first extracts fine\-grained data evidence through automatic profiling, progressively prunes large schemas into task\-relevant contexts, fetches intermediate views through a dialect\-agnostic SQL interface, and finally performs flexible downstream analysis with Python\. This design combines the efficiency of SQL over large databases with the flexibility of Python\-based analysis, while reducing reliance on unreliable metadata and improving robustness across SQL dialects\. Experiments on Spider 2\.0\-Lite and Spider 2\.0\-Snow show that ProSPy consistently outperforms strong baselines with both open\-source and proprietary models, achieving execution accuracies of 60\.15% and 60\.51% with Claude\-4\.5\-Opus, without majority voting\. Further analysis shows that ProSPy is robust to SQL dialect variations and achieves a favorable trade\-off between schema recall and precision\.
ProSPy: A Profiling\-Driven SQL\-Python Agentic Framework for Enterprise Text\-to\-SQL
Zhaorui Yang1, Huawei Zheng2, Sen Yang2, Yuhui Zhang3, Haoxuan Li1,Zhizhen Yu5, Xuan Yi4, Chen Hou3, DefengXie3, Chao Hu3, Minfeng Zhu5,Dazhen Deng1,2, Haozhe Feng3, Danqing Huang3, Yingcai Wu1, Peng Chen3, Wei Chen11State Key Lab of CAD&CG, Zhejiang University2School of Software Technology, Zhejiang University3Tencent TEG4School of Mathematical Sciences, Peking University,5Zhejiang University
††footnotetext:Work done during Zhaorui Yang’s internship at Data Computing Platform Department, Tencent TEG\.## 1Introduction
Recent advances in large language models \(LLMs\) have substantially improved semantic parsing, making Text\-to\-SQL a promising interface for accessing relational databases through natural language\. On standard benchmarks such as SpiderYuet al\.\([2018](https://arxiv.org/html/2606.05836#bib.bib19)\)and BIRDLiet al\.\([2023b](https://arxiv.org/html/2606.05836#bib.bib9)\), prior work has achieved strong performance through prompting and in\-context learningPourreza and Rafiei \([2023](https://arxiv.org/html/2606.05836#bib.bib11)\); Zhanget al\.\([2023](https://arxiv.org/html/2606.05836#bib.bib12)\); Gaoet al\.\([2024](https://arxiv.org/html/2606.05836#bib.bib13)\), test\-time search and multi\-path reasoningLiet al\.\([2025a](https://arxiv.org/html/2606.05836#bib.bib17)\); Yuanet al\.\([2026](https://arxiv.org/html/2606.05836#bib.bib16)\); Pourrezaet al\.\([2025a](https://arxiv.org/html/2606.05836#bib.bib18)\), as well as post\-training with supervised fine\-tuning or reinforcement learningLiet al\.\([2024](https://arxiv.org/html/2606.05836#bib.bib15),[2023a](https://arxiv.org/html/2606.05836#bib.bib14)\); Pourrezaet al\.\([2025b](https://arxiv.org/html/2606.05836#bib.bib23)\); Liet al\.\([2025b](https://arxiv.org/html/2606.05836#bib.bib22)\)\.
Despite these advances, conventional benchmark settings still differ substantially from enterprise\-scale database environments\. Real\-world schemas are often much larger and more heterogeneous, containing numerous tables, ambiguous naming conventions, and deeply nested or semi\-structured fields\. Recent methods address these challenges through retrieval\-augmented selectionWanget al\.\([2025](https://arxiv.org/html/2606.05836#bib.bib4),[2026](https://arxiv.org/html/2606.05836#bib.bib3)\)and agentic exploration workflowsLeiet al\.\([2025](https://arxiv.org/html/2606.05836#bib.bib1)\); Caoet al\.\([2026](https://arxiv.org/html/2606.05836#bib.bib10)\)\. However, these approaches typically require repeated LLM calls and database probing to identify relevant schema elements and inspect data values, resulting in substantial latency and computational overhead\.
Figure 1:Overview of ProSPy, which combines a dialect\-agnostic DSL for SQL\-based retrieval with Python\-based analysis to support scalable retrieval, shield SQL dialect differences, and perform complex data analysis for enterprise\-scale Text\-to\-SQL\.Figure 2:Illustration of the ProSPy framework\. ProSPy consists of four stages: \(A\) data profiling, \(B\) progressive schema pruning, \(C\) agentic data fetching, and \(D\) Python analysis\. It extracts structured data evidence from raw tables, prunes large schemas into task\-relevant contexts, fetches intermediate views through a dialect\-agnostic DSL\-to\-SQL interface, and performs final analysis with Python interpreter to produce the result CSV\.Beyond schema scale, enterprise databases often suffer from incomplete or unreliable metadata, making schema semantics difficult to interpret\. Real\-world queries also frequently require complex analytical workflows involving multiple intermediate views and downstream statistical computationLeiet al\.\([2025](https://arxiv.org/html/2606.05836#bib.bib1)\)\. Although reinforcement learning can enhance the reasoning capabilities of LLMsGuoet al\.\([2025](https://arxiv.org/html/2606.05836#bib.bib27)\), existing reasoning data are largely concentrated in mathematical and coding domainsMaet al\.\([2026](https://arxiv.org/html/2606.05836#bib.bib29)\), while such data remain scarce for SQL generationYaoet al\.\([2025](https://arxiv.org/html/2606.05836#bib.bib28)\)\. Together with dialect differences across database engines, these factors make direct SQL generation brittle and cause methods that perform well on simpler benchmarks to degrade in enterprise\-scale scenarios\.
To address these challenges, we proposeProSPy, aProfiling\-drivenSQL–Python agentic framework for enterprise\-scale Text\-to\-SQL\. As illustrated in Figure[1](https://arxiv.org/html/2606.05836#S1.F1), ProSPy is motivated by the complementary strengths of SQL and Python\. SQL is well suited for scalable retrieval and lightweight aggregation over large databases, but complex analytical questions are often difficult to express as a single query\. Python, in contrast, provides flexible support for multi\-step analysis, but is inefficient and impractical for directly accessing large\-scale relational databases\. ProSPy therefore decomposes enterprise\-scale data reasoning into SQL\-based retrieval and Python\-based analysis\.
Figure[2](https://arxiv.org/html/2606.05836#S1.F2)presents the overall framework of ProSPy\. To reduce reliance on provided metadata, ProSPy first performs automatic data profiling over raw tables and extracts structured evidence such as type information, statistics, samples, and nested\-field structures\. Based on these profiles, ProSPy progressively prunes large schemas from tables to columns, producing compact task\-relevant schema contexts\. Given the pruned schema, ProSPy performs agentic data fetching to construct intermediate views through a lightweight dialect\-agnostic DSL, which abstracts retrieval intent from executable SQL syntax and is compiled into target SQL dialect\. The views are then exported as CSV files to a Python environment, where the agent performs flexible downstream analysis and produces the final answer\.
We systematically evaluate ProSPy on Spider 2\.0\-Lite and Spider 2\.0\-Snow using a comprehensive set of metrics, including execution accuracy for end\-to\-end performance and SRR, NSR, NSP, and NSF for schema linking quality\. Experiments with both open\-source and proprietary models show that ProSPy substantially outperforms strong baselines, achieving execution accuracies of 60\.15% and 60\.51% on Spider 2\.0\-Lite and Spider 2\.0\-Snow, without relying on test\-time scaling strategies such as majority voting\. Notably, the comparable performance across the two benchmarks suggests that ProSPy is robust to SQL dialect variations, benefiting from its dialect\-agnostic data fetching design\. Further analysis shows that ProSPy achieves a favorable balance between schema recall and precision, enabling compact and task\-relevant schema contexts for downstream reasoning\.
## 2Preliminaries and Related Work
##### Preliminaries\.
The text\-to\-SQL task can be formally defined as follows\. Given a database schemaℰ\\mathcal\{E\}, a natural language question𝒬\\mathcal\{Q\}, and auxiliary documentation𝒦\\mathcal\{K\}, a semantic parserffparameterized byθ\\thetaaims to generate a target SQL querySS:
S=f\(𝒬,ℰ,𝒦∣θ\)\.S=f\(\\mathcal\{Q\},\\mathcal\{E\},\\mathcal\{K\}\\mid\\theta\)\.To evaluate the parser, the generated querySSis executed against the underlying database engine to obtain an execution resultRR, which is then compared with the ground\-truth resultR⋆R^\{\\star\}to determine execution accuracyLeiet al\.\([2025](https://arxiv.org/html/2606.05836#bib.bib1)\); Liet al\.\([2023b](https://arxiv.org/html/2606.05836#bib.bib9)\)\. Formally, given a benchmark withNNexamples, execution accuracy \(EX\) is defined as:
EX=1N∑i=1N𝕀\(Ri=Ri⋆\),\\mathrm\{EX\}=\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\mathbb\{I\}\(R\_\{i\}=R\_\{i\}^\{\\star\}\),where𝕀\(⋅\)\\mathbb\{I\}\(\\cdot\)denotes the indicator function\.
For schema linking evaluation, let𝒯i⋆\\mathcal\{T\}\_\{i\}^\{\\star\}denote the set of gold tables required by theii\-th example, and let𝒯^i\\hat\{\\mathcal\{T\}\}\_\{i\}denote the set of retrieved tables\. Strict recall rate \(SRR\) measures the proportion of examples for which all gold tables are retrieved:
SRR=1N∑i=1N𝕀\(𝒯i⋆⊆𝒯^i\)\.\\mathrm\{SRR\}=\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\mathbb\{I\}\\left\(\\mathcal\{T\}\_\{i\}^\{\\star\}\\subseteq\\hat\{\\mathcal\{T\}\}\_\{i\}\\right\)\.Non\-strict recall \(NSR\), non\-strict precision \(NSP\), and non\-strict F1 \(NSF\) further measure table\-level retrieval quality\. For theii\-th example, they are defined as:
NSRi=\|𝒯^i∩𝒯i⋆\|\|𝒯i⋆\|,NSPi=\|𝒯^i∩𝒯i⋆\|\|𝒯^i\|\.\\mathrm\{NSR\}\_\{i\}=\\frac\{\|\\hat\{\\mathcal\{T\}\}\_\{i\}\\cap\\mathcal\{T\}\_\{i\}^\{\\star\}\|\}\{\|\\mathcal\{T\}\_\{i\}^\{\\star\}\|\},\\quad\\mathrm\{NSP\}\_\{i\}=\\frac\{\|\\hat\{\\mathcal\{T\}\}\_\{i\}\\cap\\mathcal\{T\}\_\{i\}^\{\\star\}\|\}\{\|\\hat\{\\mathcal\{T\}\}\_\{i\}\|\}\.NSFi=2⋅NSRi⋅NSPiNSRi\+NSPi\.\\mathrm\{NSF\}\_\{i\}=\\frac\{2\\cdot\\mathrm\{NSR\}\_\{i\}\\cdot\\mathrm\{NSP\}\_\{i\}\}\{\\mathrm\{NSR\}\_\{i\}\+\\mathrm\{NSP\}\_\{i\}\}\.The final NSR, NSP, and NSF scores are obtained by averaging the corresponding per\-example scores over allNNexamples\.
##### Related Work\.
The strong language understanding and code generation capabilities of large language models \(LLMs\) have motivated extensive research on applying them to text\-to\-SQL parsing\. Early approaches mainly rely on prompt engineering and in\-context learningPourreza and Rafiei \([2023](https://arxiv.org/html/2606.05836#bib.bib11)\); Zhanget al\.\([2023](https://arxiv.org/html/2606.05836#bib.bib12)\); Gaoet al\.\([2024](https://arxiv.org/html/2606.05836#bib.bib13)\), where demonstrations, decomposition strategies, and intermediate reasoning steps are designed to elicit SQL generation capabilities from general\-purpose LLMs\. More recent studies further improve performance through test\-time computation, including Monte Carlo Tree SearchLiet al\.\([2025a](https://arxiv.org/html/2606.05836#bib.bib17)\); Yuanet al\.\([2026](https://arxiv.org/html/2606.05836#bib.bib16)\)and multi\-path reasoning or candidate selection strategiesPourrezaet al\.\([2025a](https://arxiv.org/html/2606.05836#bib.bib18)\)\. These methods have achieved strong results on established benchmarks such as BIRDLiet al\.\([2023b](https://arxiv.org/html/2606.05836#bib.bib9)\)and Spider 1\.0Yuet al\.\([2018](https://arxiv.org/html/2606.05836#bib.bib19)\)\.
Despite their effectiveness, existing methods face substantial challenges when applied to enterprise\-level text\-to\-SQL benchmarks such as Spider 2\.0Leiet al\.\([2025](https://arxiv.org/html/2606.05836#bib.bib1)\)\. Compared with earlier benchmarks, this setting involves significantly larger schemas, massive data volumes, heterogeneous SQL dialects, imprecise metadata, and complex analytical workflows\. These characteristics make it difficult for models to identify the relevant context, construct appropriate intermediate computations, and produce executable SQL queries\. Although supervised fine\-tuningLiet al\.\([2024](https://arxiv.org/html/2606.05836#bib.bib15),[2023a](https://arxiv.org/html/2606.05836#bib.bib14)\)and reinforcement learningPourrezaet al\.\([2025b](https://arxiv.org/html/2606.05836#bib.bib23)\); Liet al\.\([2025b](https://arxiv.org/html/2606.05836#bib.bib22)\)techniques can enhance the reasoning and generation capabilities of text\-to\-SQL models, their practical deployment remains limited by the high cost of data annotation, the diversity of SQL dialects, and the difficulty of covering enterprise\-specific analytical patterns\.
Motivated by the recent success of LLM\-based agents, a growing line of work has explored agentic workflows for complex Text\-to\-SQL tasks\. These methods typically involve two stages:schema linking, which identifies a task\-relevant schemaℰ′⊆ℰ\\mathcal\{E\}^\{\\prime\}\\subseteq\\mathcal\{E\}, andSQL generation, which produces an executable query from the selected schema\. For schema linking, existing methods use LLMs to identify relevant tables and columnsDenget al\.\([2025](https://arxiv.org/html/2606.05836#bib.bib2)\); Caoet al\.\([2024](https://arxiv.org/html/2606.05836#bib.bib6)\), or adopt vector retrieval with iterative query refinementWanget al\.\([2025](https://arxiv.org/html/2606.05836#bib.bib4),[2026](https://arxiv.org/html/2606.05836#bib.bib3)\)\. Concurrent work EviLinkAnonymous \([2026](https://arxiv.org/html/2606.05836#bib.bib30)\)further introduces multi\-path hypothetical schema grounding and estimates uncertainty through voting\. While effective, these methods mainly optimize schema grounding as an intermediate step, whereas ProSPy targets the end\-to\-end problem solving process\.
For SQL generation, many methods rely on test\-time scaling by generating multiple candidate queries, refining them via self\-correctionPourreza and Rafiei \([2023](https://arxiv.org/html/2606.05836#bib.bib11)\), and selecting the final answer through majority votingWanget al\.\([2026](https://arxiv.org/html/2606.05836#bib.bib3)\)or LLM\-based rerankingCaoet al\.\([2024](https://arxiv.org/html/2606.05836#bib.bib6)\); Pourrezaet al\.\([2025a](https://arxiv.org/html/2606.05836#bib.bib18)\)\. Although effective, such strategies introduce substantial computational overhead and latency, limiting their practicality in enterprise\-scale scenarios where repeated model calls and database executions are costly\.
Several recent methods further introduce iterative exploration to cope with imprecise metadata and complex schema structures in large databasesHaoet al\.\([2025](https://arxiv.org/html/2606.05836#bib.bib24)\); Caoet al\.\([2026](https://arxiv.org/html/2606.05836#bib.bib10)\); Wanget al\.\([2025](https://arxiv.org/html/2606.05836#bib.bib4)\)\. These approaches interact with LLMs and database engines to discover schema semantics, inspect data distributions, and refine intermediate decisions\. However, repeated exploration can incur considerable overhead, and the resulting workflow may remain constrained by the expressiveness of SQL generation\. For example, although APEX\-SQLCaoet al\.\([2026](https://arxiv.org/html/2606.05836#bib.bib10)\)performs agentic exploration for SQL construction, its action space is still centered on SQL generation and therefore cannot fully exploit the advanced reasoning and flexible data analysis capabilities of current LLMs\.
## 3Methodology
To address the challenges of enterprise\-level databases, we proposeProSPy, aProfiling\-drivenSQL–Python agentic framework that leverages the complementary strengths of SQL and Python for hybrid computation\. Unlike conventional methods that directly use SQL as the final reasoning interface, ProSPy shifts complex analytical reasoning from SQL generation to the Python coding space, where LLMs can better exploit their capabilities in program synthesis and iterative data analysis\. Specifically, ProSPy consists of four core components: \(1\)data profiling, which extracts fine\-grained column\-level metadata to mitigate metadata inaccuracies; \(2\)progressive schema pruning, which constructs a compact and task\-oriented schema context in a coarse\-to\-fine manner; \(3\)agentic data fetching, which generates multiple intermediate views over the reduced schema via a principled domain\-specific language \(DSL\) designed to abstract away cross\-dialect discrepancies; and \(4\)Python\-based analysis, which retrieves data from the generated views and performs downstream analytical reasoning in Python\. The overall framework is illustrated in Figure[2](https://arxiv.org/html/2606.05836#S1.F2)\. In ProSPy, SQL is used for data retrieval and lightweight aggregation, while Python is responsible for complex analysis, enabling a practical balance between effcacy and efficiency\. We present case studies in Appendix[B](https://arxiv.org/html/2606.05836#A2)\.
### 3\.1Data Profiling
Data profiling aims to provide the agent with fine\-grained metadata about the underlying database content\. Existing methods often acquire such information implicitly through iterative interactions between the LLM and the databaseWanget al\.\([2026](https://arxiv.org/html/2606.05836#bib.bib3)\); Denget al\.\([2025](https://arxiv.org/html/2606.05836#bib.bib2)\); Caoet al\.\([2026](https://arxiv.org/html/2606.05836#bib.bib10)\), leading to substantial exploration overhead\. We observe that these exploration trajectories exhibit recurring patterns: agents first infer column semantics, such as whether a column serves as an identifier, a filtering dimension, or an aggregation metric; then inspect value distributions of relevant columns; and finally parse nested fields, such asVARIANTcolumns in Snowflake, when their internal structures are required\.
Motivated by these observations, we replace ad\-hoc database exploration with an automatic profiling procedure based on templated SQL queries\. Given an arbitrary table name, the profiler retrieves all columns and their declared types, performs implicit type detection for cases such as integer\-encoded timestamps, and categorizes each column into one of five semantic types: metric, dimension, identifier, time, or other special types\. It then collects type\-specific statistics, including value distributions for dimension columns, summary statistics for metric columns, and temporal ranges for time columns, while uniformly recording null ratios and representative sample values\.
For complex nested types, the profiler performs recursive schema inference to expose internal structures, enabling downstream agents to access nested fields without repeated manual exploration\. The resulting structured profiles, together with a lightweight data preview, are serialized and injected into the agent’s context\. Apart from data profiles, we further generate database\-level knowledge for subsequent schema pruning and analysis\. Since profiling requires only table names, it mitigates the dependence on accurate external metadata\. Moreover, each database is profiled only once in parallel, and the resulting profiles are reused across all questions associated with the same database, reducing redundant computation and improving overall efficiency\. More details of data profiling are provided in Appendix[A\.1](https://arxiv.org/html/2606.05836#A1.SS1), and example outputs of the profiling process are in Appendix[B\.1](https://arxiv.org/html/2606.05836#A2.SS1)\.
### 3\.2Progressive Schema Pruning
Existing schema linking methods often prioritize maximizing recall, resulting in a large number of irrelevant tables being retained in the retrieved schema\. In data\-agent settings, however, excessive schema context can enlarge the reasoning space, increase token consumption, and interfere with downstream table selection, column grounding, and analytical reasoning\. Therefore, rather than maximizing recall alone, ProSPy aims to construct a compact yet informative schema context that preserves necessary schema coverage while minimizing irrelevant information\.
To this end, we formulate schema linking as a progressive pruning process\. Letℰ\\mathcal\{E\}denote the original schema andℰ\(t\)\\mathcal\{E\}^\{\(t\)\}the retained schema after thett\-th pruning round, withℰ\(0\)=ℰ\\mathcal\{E\}^\{\(0\)\}=\\mathcal\{E\}\. At each iteration, an LLM\-based pruning operatorPPremoves schema elements that are unlikely to be relevant to the query:
ℰ\(t\+1\)=P\(𝒬,ℰ\(t\)\)\.\\mathcal\{E\}^\{\(t\+1\)\}=P\(\\mathcal\{Q\},\\mathcal\{E\}^\{\(t\)\}\)\.
Specifically, large schemas are first partitioned into batches that fit within the context window\. Following prior workDenget al\.\([2025](https://arxiv.org/html/2606.05836#bib.bib2)\), tables sharing identical column sets are grouped together for compression\. For each batch, the LLM identifies task\-relevant tables and prunes irrelevant candidates, progressively reducing the search space while preserving potentially useful schema elements\. After table\-level pruning, the same procedure is applied to the columns of the retained tables\. Finally, the remaining tables and columns from all batches are merged to form the linked schemaℰ′\\mathcal\{E\}^\{\\prime\}\. More details are provided in Appendix[A\.2](https://arxiv.org/html/2606.05836#A1.SS2)\.
### 3\.3Agentic Data Fetching
Complex queries often require constructing multiple intermediate views over the original database, where these views often contain dependencies\. Based on this observation, we introduce an iterative data fetching paradigm instead of relying on a single SQL query\. Given the linked schema and the target question, the agent first performs aplanningstep to identify the required data and decompose the retrieval objective\. It then conducts a structuredanalysisphase, reasoning over the retrieved schema and previously defined views to determine the next retrieval action and generate view definitions\.
At each iteration, the agent generates multiple view definitions that specify the data required at the current stage\. These view definitions are compiled into executable SQL queries and materialized as intermediate results\. The resulting views are made available in subsequent iterations, allowing the agent to reuse previously defined views and construct nested views when necessary\. This process continues until the accumulated views are sufficient for downstream Python\-based analysis\.
##### SQL Generation via Domain\-Specific Language\.
To improve robustness across heterogeneous database dialects, we use a lightweight domain\-specific language \(DSL\) as an intermediate representation for SQL generation\. Instead of directly producing raw SQL, the agent generates dialect\-agnostic view definitions organized around three core components:dimensions,metrics, andconditions\. These components naturally align with the semantic categories obtained from schema linking and data profiling \(Section[3\.1](https://arxiv.org/html/2606.05836#S3.SS1)\): dimensions specify grouping or selection attributes, metrics define quantitative values for aggregation, and conditions constrain the retrieval scope\. The DSL representations are then transpiled into executable SQL queries for the target database engine\. More details of this process are provided in Appendix[A\.3](https://arxiv.org/html/2606.05836#A1.SS3)\.
### 3\.4Python Analysis
After iterative data fetching, the data from the materialized views are saved as CSV files and provided to the agent equipped with a Python tool\. The agent loads these CSV files and performs the final computation in Python, complementing SQL with flexible multi\-step transformations and customized analytical operations\.
The agent executes the generated code, observes the results, and revises the code when execution errors or invalid outputs occur\. Once the analysis is completed, the final result is written to a CSV file and compared with the ground\-truth answerR⋆R^\{\\star\}for evaluation\. Examples of the whole process are provided in Appendix[B\.2](https://arxiv.org/html/2606.05836#A2.SS2)\.
## 4Experiments
### 4\.1Experimental Setup
##### Dataset\.
We evaluate ProSPy on the Spider 2\.0 benchmarkLeiet al\.\([2025](https://arxiv.org/html/2606.05836#bib.bib1)\), which consists of two subsets: Spider 2\.0\-Lite and Spider 2\.0\-Snow, each containing 547 examples\. Spider 2\.0 targets enterprise\-level text\-to\-SQL scenarios, characterized by large\-scale database schemas, heterogeneous SQL dialects, and complex analytical workflows\. Specifically, Spider 2\.0\-Lite covers Snowflake, BigQuery, and SQLite, thereby requiring robust cross\-dialect reasoning, whereas Spider 2\.0\-Snow is restricted to the Snowflake dialect\. We conduct experiments with two open\-source models, DeepSeek V3\.2Liuet al\.\([2025](https://arxiv.org/html/2606.05836#bib.bib25)\)and GLM\-5Zenget al\.\([2026](https://arxiv.org/html/2606.05836#bib.bib26)\), as well as one proprietary model, Claude\-Opus\-4\.5\.
##### Metrics\.
We report both effectiveness and efficiency metrics, following the definitions in Section[2](https://arxiv.org/html/2606.05836#S2.SS0.SSS0.Px1)\. For effectiveness, we use execution accuracy \(EX\) for end\-to\-end evaluation, and SRR, NSR, NSP, and NSF for table\-level schema linking evaluation, as described in Section[2](https://arxiv.org/html/2606.05836#S2.SS0.SSS0.Px1)\. We additionally report token consumption to assess context efficiency\. For efficiency analysis, Section[5\.1](https://arxiv.org/html/2606.05836#S5.SS1)reports the average execution time and the number of LLM invocation turns\. ProSPy solves each task with a single trial without majority votingDenget al\.\([2025](https://arxiv.org/html/2606.05836#bib.bib2)\); Wanget al\.\([2025](https://arxiv.org/html/2606.05836#bib.bib4)\), and retries are performed only when no valid output is produced\. Hyperparameters are presented in Appendix[C](https://arxiv.org/html/2606.05836#A3)\.
MethodModelSingle TrialLite EXSpider\-AgentDeepSeek\-R1✓13\.71Spider\-AgentClaude\-4\-Sonnet✓27\.79ReForceGPT\-o3✗36\.56ReForceo4\-mini✗31\.99LinkAlignDeepSeek\-R1✗33\.09AutoLinkDeepSeek\-R1✓34\.92RSL\-SQLDeepSeek\-R1✗30\.53RSL\-SQLGPT\-o3✗33\.09ProSpyGLM\-5✓38\.21ProSpyDeepSeek V3\.2✓41\.32ProSpyClaude\-Opus\-4\.5✓60\.15ProSpyClaude \+ DeepSeek✓43\.14ProSpyClaude \+ GLM✓48\.63
Table 1:Execution accuracy \(EX\) on Spider 2\.0\-Lite\. “Single Trial” indicates that each example is solved with a single generation, without test\-time scaling techniques such as majority voting\.
##### Baselines\.
To assess the effectiveness of ProSPy, we compare it with a representative set of competitive baselines, including Spider\-AgentLeiet al\.\([2025](https://arxiv.org/html/2606.05836#bib.bib1)\), ReForceDenget al\.\([2025](https://arxiv.org/html/2606.05836#bib.bib2)\), AutoLinkWanget al\.\([2026](https://arxiv.org/html/2606.05836#bib.bib3)\), DSR\-SQLHaoet al\.\([2025](https://arxiv.org/html/2606.05836#bib.bib24)\), RSL\-SQLCaoet al\.\([2024](https://arxiv.org/html/2606.05836#bib.bib6)\), LinkAlignWanget al\.\([2025](https://arxiv.org/html/2606.05836#bib.bib4)\), and APEX\-SQLCaoet al\.\([2026](https://arxiv.org/html/2606.05836#bib.bib10)\)\. For all baselines, we directly adopt the scores reported in their original papers to ensure a fair and consistent comparison\.
### 4\.2Execution Results
As shown in Tables[1](https://arxiv.org/html/2606.05836#S4.T1)and[2](https://arxiv.org/html/2606.05836#S4.T2), ProSPy consistently outperforms prior approaches across both open\-source and proprietary models\. On Spider 2\.0\-Lite, ProSPy achieves an execution accuracy of60\.15%with Claude\-Opus\-4\.5 and41\.32%with DeepSeek V3\.2, without relying on majority voting strategies used by several previous methods\. In comparison, LinkAlignWanget al\.\([2025](https://arxiv.org/html/2606.05836#bib.bib4)\)achieves 33\.09% with majority voting, while ReForceDenget al\.\([2025](https://arxiv.org/html/2606.05836#bib.bib2)\)attains 36\.56% with GPT\-o3\.
Similarly, on Spider 2\.0\-Snow, ProSPy achieves60\.51%with Claude\-Opus\-4\.5 and40\.77%with DeepSeek V3\.2, outperforming previous single\-trial methods such as AutoLinkWanget al\.\([2026](https://arxiv.org/html/2606.05836#bib.bib3)\)and DSR\-SQLHaoet al\.\([2025](https://arxiv.org/html/2606.05836#bib.bib24)\)\. It also exceeds APEX\-SQLCaoet al\.\([2026](https://arxiv.org/html/2606.05836#bib.bib10)\)by 9\.50 percentage points, despite not using majority voting\. Notably, many existing approaches are evaluated on only one of the two subsets, whereas ProSPy demonstrates strong performance on both Lite and Snow\. This highlights the generalization capability of our framework across different execution environments and evaluation protocols\.
MethodModelSingle TrialSnow EXSpider\-AgentDeepSeek\-R1✓10\.55Spider\-AgentClaude\-4\-Sonnet✓25\.78ReForceGPT\-o3✗35\.83ReForceo4\-mini✗29\.80DSR\-SQLDeepSeek\-R1✓35\.28APEX\-SQLDeepSeek\-R1✗51\.01ProSpyGLM\-5✓40\.04ProSpyDeepSeek V3\.2✓40\.77ProSpyClaude\-Opus\-4\.5✓60\.51ProSpyClaude \+ DeepSeek✓43\.51ProSpyClaude \+ GLM✓51\.74
Table 2:Execution accuracy \(EX\) on Spider 2\.0\-Snow\. “Single Trial” indicates that each example is solved with a single generation, without test\-time scaling techniques such as majority voting\.Moreover, ProSPy exhibits strong stability across the two benchmarks\. Specifically, the performance gap between Lite and Snow is only 0\.36 percentage points for Claude\-Opus\-4\.5 \(60\.15% vs\. 60\.51%\) and 0\.55 percentage points for DeepSeek V3\.2 \(41\.32% vs\. 40\.77%\)\. This consistency suggests that ProSPy is robust to variations in execution environments and SQL dialects\.
In addition, the modular design of ProSPy naturally supports the use of different models at different stages of the pipeline\. Beyond using the same model throughout, we further explore hybrid settings, where we use different models for schema likning and the following stages\. The results show that using stronger models to specific submodules can further improve overall performance\. For example, the Claude \+ GLM configuration achieves 48\.63% on Lite and 51\.74% on Snow, substantially outperforming the singlel GLM\-5 configuration\.
### 4\.3Schema Linking Results
In addition to end\-to\-end execution accuracy, we separately evaluate the progressive schema pruning module of ProSPy\. Table[3](https://arxiv.org/html/2606.05836#S4.T3)reports the table\-level schema linking results\. Since the goal of progressive pruning is to retain task\-relevant schemas while removing redundant ones, we evaluate both schema coverage and retrieval quality through recall\-, precision\-, and F1\-based metrics, together with the number of tokens used during the process\.
Table 3:Table\-level schema linking results on Spider 2\.0\-Snow\. Baselines are reimplemented with the same model for fair comparison\. \#Tokens denote tokens used\.As shown in Table[3](https://arxiv.org/html/2606.05836#S4.T3), existing methods generally achieve strong recall but often suffer from low precision, whereas ProSPy attains a substantially better balance between the two\. For example, RSL\-SQLCaoet al\.\([2024](https://arxiv.org/html/2606.05836#bib.bib6)\)achieves an SRR of 97\.88%, but its NSP is only 30\.53%\. Similarly, APEX\-SQLCaoet al\.\([2026](https://arxiv.org/html/2606.05836#bib.bib10)\)obtains an SRR of 94\.98% while achieving only 53\.67% NSP and consuming 574\.53K schema tokens\. In contrast, ProSPy achieves the highest NSP \(86\.02%\) and NSF \(87\.40%\), largely surpassing APEX\-SQL by 32\.35 and 26\.5 pp respectively with fewer tokens\. These results indicate that progressive pruning effectively filters out irrelevant schemas without substantially sacrificing coverage, producing a more compact and informative schema context\. The corresponding gains in execution accuracy reported in Tables[1](https://arxiv.org/html/2606.05836#S4.T1)and[2](https://arxiv.org/html/2606.05836#S4.T2)further suggest that such high\-quality schema contexts are beneficial for downstream reasoning and SQL generation\.
### 4\.4Ablation Studies
We conduct ablation studies to assess the contribution of each major component in ProSPy\. Experiments are performed on a randomly sampled subset of 110 examples from Spider 2\.0\-Snow\. As shown in Table[4](https://arxiv.org/html/2606.05836#S4.T4), removing the data profiling module and relying only on the official metadata decreases EX by 6\.4 pp, demonstrating the importance of fine\-grained column\-level evidence\. Replacing the DSL\-based data fetching module with direct SQL generation leads to a larger drop of 9\.1 points, indicating that the proposed DSL effectively improves the robustness of intermediate view construction by abstracting away dialect\-specific SQL details\.
Table 4:Ablation results on a randomly sampled subset of 110 examples from Spider 2\.0\-Snow\. Nagative value indicate the decline compared to the full ProSPy\.We further examine the role of Python\-based analysis by replacing the data fetching and Python computation stage with SQL generation\. Even with majority voting, this variant suffers a substantial decrease of 16\.3 pp, while removing majority voting further enlarges the drop to 19\.0 pp\. These results suggest that expressing the full analytical process in SQL remains challenging for complex questions, and that test\-time scaling only partially compensates for this limitation\. Overall, the ablation results verify that data profiling, DSL\-guided data fetching, and Python\-based analysis all contribute substantially to the effectiveness of ProSPy\.
## 5Analysis
### 5\.1Efficiency Analysis
In this section, we analyze the efficiency of ProSPy from two perspectives: the number of interaction turns required for Python\-based analysis and the execution time of different stages\.
Figure 3:Time distribution across different stages\.We first examine the number of Python executions needed to derive the final answer\. On Spider 2\.0\-Snow with DeepSeek V3\.2, ProSPy produces the final result within at most three executions, with more than 80% of instances completed in a single execution\. On average, ProSPy requires only 1\.26 executions per instance\. This suggests that Python\-based analysis serves as an effective and efficient mechanism for addressing complex data reasoning tasks\. We further observe that more challenging instances tend to require additional execution attempts\. Specifically, incorrectly solved cases require an average of 1\.33 executions, slightly higher than the 1\.22 executions required for correctly solved cases\. This indicates that difficult cases often demand more iterative reasoning and analysis before reaching a final answer\.
We further evaluate the execution time of different stages on Spider 2\.0\-Snow\. Since data profiling can be performed in parallel and its results can be reused across tasks, we focus on three major online components: \(1\) schema linking, \(2\) data fetching, and \(3\) Python\-based analysis\. As shown in Figure[3](https://arxiv.org/html/2606.05836#S5.F3), schema linking dominates the overall execution time for both models, accounting for 84\.8% of the total runtime with GLM\-5 and 70\.4% with Claude\-Opus\-4\.5\. Python\-based analysis is the second most time\-consuming stage, especially for Claude\-Opus\-4\.5, where it accounts for 20\.2% of the total runtime\. In contrast, data fetching introduces relatively limited overhead, accounting for only 4\.1% and 9\.4% of the total runtime for GLM\-5 and Claude\-Opus\-4\.5, respectively\. These results indicate that improving the efficiency of schema linking is the most critical direction for further reducing the overall runtime of ProSPy\.
### 5\.2Error Analysis
We further analyze the failure cases and categorize them into three types\.Schema linking errorsoccur when the agent fails to identify the required schema elements, often due to parsing failures or excessively large schema contexts\.Data fetching errorsrefer to cases where the agent cannot construct suitable intermediate views for subsequent Python\-based analysis\.Python analysis errorsinclude failures in loading the materialized views, executing the generated analysis code, or producing a valid final answer\.
Figure 4:Error distribution across different stages\.As shown in Figure[4](https://arxiv.org/html/2606.05836#S5.F4), the distribution of errors varies models used\. For DeepSeek V3\.2, data fetching errors constitute the largest source of failures, accounting for 27\.6% of all examples\. In contrast, Claude 4\.5 Opus substantially reduces such errors, but Python analysis becomes the dominant remaining failure type, accounting for 21\.4% of all examples\. Schema linking errors are relatively less frequent for both models, suggesting that the progressive schema pruning module can generally retain the necessary schema\. Overall, these results indicate that main bottlenecks lie in constructing reliable intermediate views and improving the robustness of downstream Python\-based analysis\.
## 6Conclusion
In this work, we introduceProSPy, a profiling\-driven SQL–Python agentic framework for enterprise\-scale Text\-to\-SQL\. ProSPy is designed for real\-world database environments with large schemas, incomplete metadata, SQL dialect variations, and complex analytical requirements\. To address these challenges, ProSPy decomposes data reasoning into a structured process that combines automatic data profiling, progressive schema pruning, dialect\-agnostic data fetching, and Python\-based analysis\. Experiments on Spider 2\.0\-Lite and Spider 2\.0\-Snow show that ProSPy consistently outperforms strong baselines without relying on majority voting\. Further analysis confirms its robustness to SQL dialect variations and its ability to maintain a favorable balance between schema recall and precision\.
## Limitations
Although ProSPy demonstrates strong performance on enterprise\-scale Text\-to\-SQL benchmarks, it still has several limitations\. First, ProSPy decomposes complex reasoning into data fetching and Python\-based analysis, which improves flexibility but may introduce error propagation: incorrect intermediate views can mislead the final Python analysis\. Futhermore, our experiments are conducted on Spider 2\.0\-Lite and Spider 2\.0\-Snow; further evaluation on more diverse real\-world enterprise databases is needed to better understand the generality of ProSPy in practical deployments\.
## Ethical Considerations
This work evaluates Text\-to\-SQL methods on publicly available benchmark datasets, including Spider 2\.0\-Lite and Spider 2\.0\-Snow and we adhere their terms of use and licenses\. The experiments do not involve proprietary, private, or personally identifiable data, and we do not identify direct privacy or copyright risks associated with the data used in this study after manual inspection\. Text\-to\-SQL is a general\-purpose semantic parsing task, and we are not aware of any task\-specific ethical concerns\. AI assistants were used to support code development and the preparation of this manuscript\.
## References
- EviLink: multi\-path schema linking with uncertainty\-guided evidence acquisition for large\-scale text\-to\-sql\.Note:Under reviewCited by:[§2](https://arxiv.org/html/2606.05836#S2.SS0.SSS0.Px2.p3.1)\.
- B\. Cao, W\. Liao, Y\. Sun, D\. Fang, H\. Li, and W\. Lam \(2026\)APEX\-sql: talking to the data via agentic exploration for text\-to\-sql\.arXiv preprint arXiv:2602\.16720\.Cited by:[§1](https://arxiv.org/html/2606.05836#S1.p2.1),[§2](https://arxiv.org/html/2606.05836#S2.SS0.SSS0.Px2.p5.1),[§3\.1](https://arxiv.org/html/2606.05836#S3.SS1.p1.1),[§4\.1](https://arxiv.org/html/2606.05836#S4.SS1.SSS0.Px3.p1.1),[§4\.2](https://arxiv.org/html/2606.05836#S4.SS2.p2.1),[§4\.3](https://arxiv.org/html/2606.05836#S4.SS3.p2.1)\.
- Z\. Cao, Y\. Zheng, Z\. Fan, X\. Zhang, W\. Chen, and X\. Bai \(2024\)Rsl\-sql: robust schema linking in text\-to\-sql generation\.arXiv preprint arXiv:2411\.00073\.Cited by:[§2](https://arxiv.org/html/2606.05836#S2.SS0.SSS0.Px2.p3.1),[§2](https://arxiv.org/html/2606.05836#S2.SS0.SSS0.Px2.p4.1),[§4\.1](https://arxiv.org/html/2606.05836#S4.SS1.SSS0.Px3.p1.1),[§4\.3](https://arxiv.org/html/2606.05836#S4.SS3.p2.1)\.
- M\. Deng, A\. Ramachandran, C\. Xu, L\. Hu, Z\. Yao, A\. Datta, and H\. Zhang \(2025\)ReFoRCE: a text\-to\-sql agent with self\-refinement, consensus enforcement, and column exploration\.arXiv preprint arXiv:2502\.00675\.Cited by:[§A\.1](https://arxiv.org/html/2606.05836#A1.SS1.p1.1),[§2](https://arxiv.org/html/2606.05836#S2.SS0.SSS0.Px2.p3.1),[§3\.1](https://arxiv.org/html/2606.05836#S3.SS1.p1.1),[§3\.2](https://arxiv.org/html/2606.05836#S3.SS2.p3.1),[§4\.1](https://arxiv.org/html/2606.05836#S4.SS1.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.05836#S4.SS1.SSS0.Px3.p1.1),[§4\.2](https://arxiv.org/html/2606.05836#S4.SS2.p1.1)\.
- D\. Gao, H\. Wang, Y\. Li, X\. Sun, Y\. Qian, B\. Ding, and J\. Zhou \(2024\)Text\-to\-sql empowered by large language models: a benchmark evaluation\.Proc\. VLDB Endow\.\.External Links:[Link](https://doi.org/10.14778/3641204.3641221)Cited by:[§1](https://arxiv.org/html/2606.05836#S1.p1.1),[§2](https://arxiv.org/html/2606.05836#S2.SS0.SSS0.Px2.p1.1)\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi,et al\.\(2025\)DeepSeek\-r1 incentivizes reasoning in llms through reinforcement learning\.Nature645\(8081\),pp\. 633–638\.Cited by:[§1](https://arxiv.org/html/2606.05836#S1.p3.1)\.
- Z\. Hao, Q\. Song, R\. Cai, and B\. Xu \(2025\)Text\-to\-sql as dual\-state reasoning: integrating adaptive context and progressive generation\.arXiv preprint arXiv:2511\.21402\.Cited by:[§2](https://arxiv.org/html/2606.05836#S2.SS0.SSS0.Px2.p5.1),[§4\.1](https://arxiv.org/html/2606.05836#S4.SS1.SSS0.Px3.p1.1),[§4\.2](https://arxiv.org/html/2606.05836#S4.SS2.p2.1)\.
- F\. Lei, J\. Chen, Y\. Ye, R\. Cao, D\. Shin, H\. SU, Z\. SUO, H\. Gao, W\. Hu, P\. Yin, V\. Zhong, C\. Xiong, R\. Sun, Q\. Liu, S\. Wang, and T\. Yu \(2025\)Spider 2\.0: evaluating language models on real\-world enterprise text\-to\-SQL workflows\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=XmProj9cPs)Cited by:[§1](https://arxiv.org/html/2606.05836#S1.p2.1),[§1](https://arxiv.org/html/2606.05836#S1.p3.1),[§2](https://arxiv.org/html/2606.05836#S2.SS0.SSS0.Px1.p1.10),[§2](https://arxiv.org/html/2606.05836#S2.SS0.SSS0.Px2.p2.1),[§4\.1](https://arxiv.org/html/2606.05836#S4.SS1.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.05836#S4.SS1.SSS0.Px3.p1.1)\.
- B\. Li, J\. Zhang, J\. Fan, Y\. Xu, C\. Chen, N\. Tang, and Y\. Luo \(2025a\)Alpha\-SQL: zero\-shot text\-to\-SQL using monte carlo tree search\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=kGg1ndttmI)Cited by:[§1](https://arxiv.org/html/2606.05836#S1.p1.1),[§2](https://arxiv.org/html/2606.05836#S2.SS0.SSS0.Px2.p1.1)\.
- H\. Li, S\. Wu, X\. Zhang, X\. Huang, J\. Zhang, F\. Jiang, S\. Wang, T\. Zhang, J\. Chen, R\. Shi, H\. Chen, and C\. Li \(2025b\)OmniSQL: synthesizing high\-quality text\-to\-sql data at scale\.Proc\. VLDB Endow\.,pp\. 4695–4709\.External Links:[Link](https://doi.org/10.14778/3749646.3749723)Cited by:[§1](https://arxiv.org/html/2606.05836#S1.p1.1),[§2](https://arxiv.org/html/2606.05836#S2.SS0.SSS0.Px2.p2.1)\.
- H\. Li, J\. Zhang, C\. Li, and H\. Chen \(2023a\)Resdsql: decoupling schema linking and skeleton parsing for text\-to\-sql\.InProceedings of the AAAI Conference on Artificial Intelligence,pp\. 13067–13075\.Cited by:[§1](https://arxiv.org/html/2606.05836#S1.p1.1),[§2](https://arxiv.org/html/2606.05836#S2.SS0.SSS0.Px2.p2.1)\.
- H\. Li, J\. Zhang, H\. Liu, J\. Fan, X\. Zhang, J\. Zhu, R\. Wei, H\. Pan, C\. Li, and H\. Chen \(2024\)Codes: towards building open\-source language models for text\-to\-sql\.Proceedings of the ACM on Management of Data2\(3\),pp\. 1–28\.Cited by:[§1](https://arxiv.org/html/2606.05836#S1.p1.1),[§2](https://arxiv.org/html/2606.05836#S2.SS0.SSS0.Px2.p2.1)\.
- J\. Li, B\. Hui, G\. Qu, J\. Yang, B\. Li, B\. Li, B\. Wang, B\. Qin, R\. Geng, N\. Huo,et al\.\(2023b\)Can llm already serve as a database interface? a big bench for large\-scale database grounded text\-to\-sqls\.InAdvances in Neural Information Processing Systems,Cited by:[§1](https://arxiv.org/html/2606.05836#S1.p1.1),[§2](https://arxiv.org/html/2606.05836#S2.SS0.SSS0.Px1.p1.10),[§2](https://arxiv.org/html/2606.05836#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Liu, A\. Mei, B\. Lin, B\. Xue, B\. Wang, B\. Xu, B\. Wu, B\. Zhang, C\. Lin, C\. Dong,et al\.\(2025\)Deepseek\-v3\. 2: pushing the frontier of open large language models\.arXiv preprint arXiv:2512\.02556\.Cited by:[§4\.1](https://arxiv.org/html/2606.05836#S4.SS1.SSS0.Px1.p1.1)\.
- X\. Ma, Q\. Liu, D\. Jiang, G\. Zhang, Z\. Ma, and W\. Chen \(2026\)General\-reasoner: advancing llm reasoning across all domains\.InAdvances in Neural Information Processing Systems,pp\. 56596–56618\.Cited by:[§1](https://arxiv.org/html/2606.05836#S1.p3.1)\.
- M\. Pourreza, H\. Li, R\. Sun, Y\. Chung, S\. Talaei, G\. T\. Kakkar, Y\. Gan, A\. Saberi, F\. Ozcan, and S\. O\. Arik \(2025a\)CHASE\-SQL: multi\-path reasoning and preference optimized candidate selection in text\-to\-SQL\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=CvGqMD5OtX)Cited by:[§1](https://arxiv.org/html/2606.05836#S1.p1.1),[§2](https://arxiv.org/html/2606.05836#S2.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.05836#S2.SS0.SSS0.Px2.p4.1)\.
- M\. Pourreza and D\. Rafiei \(2023\)Din\-sql: decomposed in\-context learning of text\-to\-sql with self\-correction\.InAdvances in Neural Information Processing Systems,Cited by:[§1](https://arxiv.org/html/2606.05836#S1.p1.1),[§2](https://arxiv.org/html/2606.05836#S2.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.05836#S2.SS0.SSS0.Px2.p4.1)\.
- M\. Pourreza, S\. Talaei, R\. Sun, X\. Wan, H\. Li, A\. Mirhoseini, A\. Saberi, and S\. O\. Arik \(2025b\)Reasoning\-SQL: reinforcement learning with SQL tailored partial rewards for reasoning\-enhanced text\-to\-SQL\.InSecond Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=HbwkIDWQgN)Cited by:[§1](https://arxiv.org/html/2606.05836#S1.p1.1),[§2](https://arxiv.org/html/2606.05836#S2.SS0.SSS0.Px2.p2.1)\.
- Y\. Wang, P\. Liu, and X\. Yang \(2025\)LinkAlign: scalable schema linking for real\-world large\-scale multi\-database text\-to\-SQL\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,External Links:[Link](https://aclanthology.org/2025.emnlp-main.51/)Cited by:[§1](https://arxiv.org/html/2606.05836#S1.p2.1),[§2](https://arxiv.org/html/2606.05836#S2.SS0.SSS0.Px2.p3.1),[§2](https://arxiv.org/html/2606.05836#S2.SS0.SSS0.Px2.p5.1),[§4\.1](https://arxiv.org/html/2606.05836#S4.SS1.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.05836#S4.SS1.SSS0.Px3.p1.1),[§4\.2](https://arxiv.org/html/2606.05836#S4.SS2.p1.1)\.
- Z\. Wang, Y\. Zheng, Z\. Cao, X\. Zhang, Z\. Wei, P\. Fu, Z\. Luo, W\. Chen, and X\. Bai \(2026\)AutoLink: autonomous schema exploration and expansion for scalable schema linking in text\-to\-sql at scale\.InProceedings of the AAAI Conference on Artificial Intelligence,External Links:[Link](https://arxiv.org/abs/2511.17190)Cited by:[§A\.1](https://arxiv.org/html/2606.05836#A1.SS1.p1.1),[§1](https://arxiv.org/html/2606.05836#S1.p2.1),[§2](https://arxiv.org/html/2606.05836#S2.SS0.SSS0.Px2.p3.1),[§2](https://arxiv.org/html/2606.05836#S2.SS0.SSS0.Px2.p4.1),[§3\.1](https://arxiv.org/html/2606.05836#S3.SS1.p1.1),[§4\.1](https://arxiv.org/html/2606.05836#S4.SS1.SSS0.Px3.p1.1),[§4\.2](https://arxiv.org/html/2606.05836#S4.SS2.p2.1)\.
- Z\. Yao, G\. Sun, L\. Borchmann, G\. Nuti, Z\. Shen, M\. Deng, B\. Zhai, H\. Zhang, A\. Li, and Y\. He \(2025\)Arctic\-text2sql\-r1: simple rewards, strong reasoning in text\-to\-sql\.arXiv preprint arXiv:2505\.20315\.Cited by:[§1](https://arxiv.org/html/2606.05836#S1.p3.1)\.
- T\. Yu, R\. Zhang, K\. Yang, M\. Yasunaga, D\. Wang, Z\. Li, J\. Ma, I\. Li, Q\. Yao, S\. Roman,et al\.\(2018\)Spider: a large\-scale human\-labeled dataset for complex and cross\-domain semantic parsing and text\-to\-sql task\.InProceedings of the 2018 conference on empirical methods in natural language processing,pp\. 3911–3921\.Cited by:[§1](https://arxiv.org/html/2606.05836#S1.p1.1),[§2](https://arxiv.org/html/2606.05836#S2.SS0.SSS0.Px2.p1.1)\.
- S\. Yuan, L\. Chen, M\. Yuan, and Z\. Jin \(2026\)MCTS\-sql: light\-weight llms can master the text\-to\-sql through monte carlo tree search\.InProceedings of the AAAI Conference on Artificial Intelligence,pp\. 34521–34529\.Cited by:[§1](https://arxiv.org/html/2606.05836#S1.p1.1),[§2](https://arxiv.org/html/2606.05836#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Zeng, X\. Lv, Z\. Hou, Z\. Du, Q\. Zheng, B\. Chen, D\. Yin, C\. Ge, C\. Huang, C\. Xie,et al\.\(2026\)Glm\-5: from vibe coding to agentic engineering\.arXiv preprint arXiv:2602\.15763\.Cited by:[§4\.1](https://arxiv.org/html/2606.05836#S4.SS1.SSS0.Px1.p1.1)\.
- H\. Zhang, R\. Cao, L\. Chen, H\. Xu, and K\. Yu \(2023\)ACT\-SQL: in\-context learning for text\-to\-SQL with automatically\-generated chain\-of\-thought\.InFindings of the Association for Computational Linguistics: EMNLP 2023,External Links:[Link](https://aclanthology.org/2023.findings-emnlp.227/)Cited by:[§1](https://arxiv.org/html/2606.05836#S1.p1.1),[§2](https://arxiv.org/html/2606.05836#S2.SS0.SSS0.Px2.p1.1)\.
## Appendix AImplementation Details
In this section, we present the implementation details for each component of ProSPy\.
### A\.1Data Profiling Pipeline
The role of data profiling is to equip the model with rich, type\-aware schema representations for schema linking and SQL generation, avoiding the long, repeated exploration required by prior methodsWanget al\.\([2026](https://arxiv.org/html/2606.05836#bib.bib3)\); Denget al\.\([2025](https://arxiv.org/html/2606.05836#bib.bib2)\)\. For each table, the profiler extracts column\-level statistics, inferred semantic types, representative value samples and lineage\-aware descriptions\. The results are then serializes into a structured JSON document with summary\. The same pipeline can be applied to views given their definitions\.
##### Schema extraction\.
Given a table name, we first extract all column names, physical types, and nullability flags\. Physical types are grouped into eight coarse families: string, integer, float, temporal, boolean, binary, semi\-structured and geospatial\.
##### Semantic type inference\.
Physical types alone are insufficient, as real\-world schemas frequently encode richer semantics inside generic types\. The profiler therefore performs the following inference passes before computing statistics:
- •Identifier detection: numeric or string columns whose names contain tokens such asidoruinare sampled, and those whose unique\-value ratio exceeds a threshold are treated as identifiers and excluded from distributional analysis\.
- •Hidden datetime detection: string columns, together with integer columns carrying time\-related name tokens \(time,date,stamp\), are tested against a dictionary of common datetime formats\. Candidate formats are validated by parsing a sample together with the column’sMIN/MAXvalues to avoid spurious matches\.
- •Hidden numeric detection: non\-identifier string columns are probed by attempting to cast their values asINTEGERorFLOAT\.
- •Categorical detection: string and integer columns with a unique ratio below a threshold and a bounded maximum string length are marked as dimensional or categorical, and subsequently summarized by their top\-kkvalue distribution\.
##### Per\-type statistics\.
Given the inferred type, the profiler computes type\-specific summaries\. Numeric columns receiveMIN,MAX,AVG, andSTDDEV; datetime columns report the earliest and latest parseable timestamps; categorical columns yield the top\-kkvalues with their relative frequencies; and boolean columns report true/false ratios\. In addition, every column is annotated with its null count, null ratio, and top\-k preview values, each truncated to 300 characters to bound the prompt length\.
##### Semi\-structured columns\.
Semi\-structured columns are profiled recursively\. Array indices are normalized to a unified\[\]placeholder so that structurally equivalent paths are merged\. Each discovered leaf path is then profiled as if it were a top\-level column by rewriting it into a scalar expression, while non\-leaf paths are labeled asdict,list\[str\], orlist\[dict\]according to their structural signature\. The resulting sub\-field profiles are attached to the parent column, giving the model access to fine\-grained statistics over nested fields without exposing raw JSON blobs\.
##### Geospatial columns\.
Geospatial columns are summarized by the distribution of their GeoJSON geometry types \(e\.g\., Point, LineString, Polygon\), along with truncated coordinate previews in which only the outer structure and a small number of representative coordinates are retained\. Binary columns that are fully convertible to a geography type are reclassified as geospatial; the remainder are treated as opaque bytes\.
##### Scalability\.
Because exact statistics are prohibitive on very large tables, the profiler adopts size\-aware sampling when the table is too large\. Column\-level profiling is parallelized across a thread pool for acceleration\.
##### Lineage\-aware descriptions\.
When the profiler targets a view defined by an input SQL query, we run a static SQL lineage analysis to recover \(i\) the set of underlying source tables and \(ii\) a mapping from each output column to its originating base\-table columns\. The table description is then composed from the source\-table list, any curated natural\-language descriptions of those tables, and the defining SQL itself\. Column descriptions are synthesized analogously by aggregating the descriptions of all originating base columns into a single provenance\-annotated string, allowing the LLM to reason about derived schemas using the same semantic vocabulary available for base tables\.
##### Final Output
The pipeline emits a structured JSON profile together with a compact textual rendering for each table\. The JSON records its name, inferred semantic category, original physical type, null statistics, preview values, and all type\-specific summaries described above\. Apart from the data profiles, we also analyze the provided external knowledge document and query to gain some knowledge, which will also be utilized in subsequent processes\.
### A\.2Progressive Schema Pruning Details
As illustrated in Figure[2](https://arxiv.org/html/2606.05836#S1.F2), we first partition the original database schema into multiple batches, ensuring that each batch fits within the context window of the LLM\. To reduce redundant profile information, we group tables with similar naming patterns, i\.e\., tables whose names differ only by numerical suffixes, and use the first table available from each group as the representative table for profile display\.
Within each batch, the agent performs progressive table\-level pruning by iteratively excluding tables that are unlikely to be relevant to the input question\. This process is repeated until the retained schema is sufficiently compact\. Once table\-level pruning is completed, the same procedure is applied at the column level to remove irrelevant columns from the retained tables\. We conduct this pruning process in two passes: a coarse pass to reduce the schema broadly, followed by a finer pass to further refine the retained elements\. The resulting compact schema preserves the information necessary for downstream data fetching and Python analysis\.
### A\.3Data Fetching Details
Given the reduced schema, the agent first identifies the tables that need to be joined and decomposes the retrieval process into multiple intermediate views\. It then plans and performs data fetching iteratively, defining new views based on both the raw database tables and previously materialized views\. Each view definition is expressed in a structured DSL, which is subsequently compiled into the target SQL dialect for execution\. The full prompt used for DSL\-based view generation is provided in Section[D](https://arxiv.org/html/2606.05836#A4)\.
## Appendix BCase Study
We present a detailed case study based on instancesf\_local194from the Spider 2\.0\-Snow benchmark to illustrate the full execution pipeline of ProSPy\. This instance targets the Sakila database hosted on Snowflake and involves a complex analytical question requiring multi\-table joins, aggregation, and ranking logic\.
### B\.1Data Profiling Examples
The following JSON excerpt shows the profiling output for thePAYMENTtable in the Sakila database\. The profiler automatically identifies each column’s semantic category, infers the actual data type from the physical type, and computes type\-specific statistics:
`Profiling Output for the PAYMENT Table The profiler simultaneously produces a compact textual rendering that is injected into the LLM context for downstream reasoning\. Each database is profiled once, and the results are reused across all questions targeting the same database\.`
`B\.2 Whole Process Examples B\.2\.1 Data Profiling Beyond the PAYMENT table shown above, the profiler analyzes all tables in the SQLITE\_SAKILA schema \(Section B\.1\)\. It traces join keys along the relational path \(e\.g\., RENTAL\.inventory\_id and INVENTORY\.film\_id\), identifies FILM\_ACTOR as a many\-to\-many bridge through its composite dimensions, and infers key cardinalities of 5,462 rows in FILM\_ACTOR, 16,049 in PAYMENT, and 200 in ACTOR\. These statistics jointly drive the subsequent join\-decomposition and aggregation choices\. B\.2\.2 Progressive Schema Pruning The schema linking agent applies coarse\-to\-fine pruning over the 16\-table Sakila database\. In the coarse pass, tables unrelated to the revenue–actor analysis path \(e\.g\., STORE, STAFF, ADDRESS, CATEGORY\) are eliminated\. In the fine pass, irrelevant columns from retained tables are removed\. The final linked schema retains 6 tables with only the required fields: Linked Schema Output The schema linking achieves 100% table recall and 100% field recall on this instance\. The corresponding table and field precision are 83% and 77%, respectively\. B\.2\.3 Agentic Data Fetching Join graph decomposition\. A single six\-table wide join would balloon to roughly 96,000 rows under the many\-to\-many film\-actor relationship, most of which are irrelevant to the target metric\. The agent therefore partitions the schema into two independent join groups: • Group G1 \(Revenue Path\): PAYMENT ⋈\\bowtie RENTAL ⋈\\bowtie INVENTORY ⋈\\bowtie FILM, which traces the payment chain to compute per\-film revenue\. • Group G2 \(Actor Mapping\): ACTOR ⋈\\bowtie FILM\_ACTOR, which materializes the actor\-to\-film correspondence\. The two groups remain logically connected through the shared film\_id key, whose resolution is deferred to the Python analysis stage rather than realized as an additional SQL join\. Join DSL structure\. The agent produces a structured Join DSL assigning each table a semantic role and declaring the join topology: Join DSL Structure DSL view definitions\. Two view specifications are generated over the respective join groups: View 1: film\_revenue \(from G1\) Data View DSL: film\_revenue View 2: actor\_film\_mapping \(from G2\) Data View DSL: actor\_film\_mapping DSL\-to\-SQL compilation\. Each view definition is transpiled into executable SQL in the target dialect\. For the film\_revenue view, the compiled Snowflake query takes the following form: Compiled SQL: film\_revenue The two\-level structure realizes aggregation push\-down: SUM\(amount\) is evaluated inside the database, yielding 958 per\-film aggregated rows instead of 16,049 raw records \(94% reduction\)\. The second view returns 5,462 actor–film pairs\. In total, ProSPy materializes 6,420 rows locally, versus ∼\\sim96,000 from a naïve wide join\. Because both views use only standard inner\-join and aggregation semantics, the same DSL can be transpiled to any SQL\-compliant backend without modification\. B\.2\.4 Python Analysis The Python agent loads the two CSV views and implements the remaining multi\-step analytical logic: Python Analysis Code The sort\_values followed by groupby\(\)\.head\(3\) idiom is functionally equivalent to a rank\-filtered ROW\_NUMBER\(\) OVER \(PARTITION BY actor\_id ORDER BY total\_revenue DESC\), while remaining dialect\-agnostic\. The resulting output comprises 600 rows \(200 actors ×\\times 3 films per actor\), with columns reporting actor identifiers, film titles, total revenue, per\-actor revenue share, and the actor\-level mean over the top\-3 films Appendix C Hyperparameters For DeepSeek V3\.2, GLM\-5, and Claude 4\.5 Opus, we set the temperature to 1\.01\.0 and top\_ptop\\\_p to 0\.950\.95, with thinking mode enabled for consistency\. Appendix D Prompts Below, we present the full prompts used in the first pass of schema pruning, covering both table and column selection, as well as the prompt for generating data views during the agentic data fetching process\. Prompt for Schema Pruning First Pass \(Part 1/2\) Figure 5: Prompt used for the first\-pass table pruning stage \(Part 1/2\)\. Prompt for Schema Pruning First Pass \(Part 2/2\) Figure 6: Prompt used for the first\-pass table pruning stage \(Part 2/2\)\. Prompt for Schema Pruning Fields \(Part 1/2\) Figure 7: Prompt used for the field pruning stage \(Part 1/2\)\. Prompt for Schema Pruning Fields \(Part 2/2\) Figure 8: Prompt used for the field pruning stage \(Part 2/2\)\. Prompt for Data Fetching \(Part 1/6\) Figure 9: Prompt used for data fetching \(Part 1/6\)\. Prompt for Data Fetching \(Part 2/6\) Figure 10: Prompt used for data fetching \(Part 2/6\)\. Prompt for Data Fetching \(Part 3/6\) Figure 11: Prompt used for data fetching \(Part 3/6\)\. Prompt for Data Fetching \(Part 4/6\) Figure 12: Prompt used for data fetching \(Part 4/6\)\. Prompt for Data Fetching \(Part 5/6\) Figure 13: Prompt used for data fetching \(Part 5/6\)\. Prompt for Data Fetching \(Part 6/6\) Figure 14: Prompt used for data fetching \(Part 6/6\)\.`Similar Articles
SPy: an interpreter and a compiler for a statically typed variant of Python
SPy is a statically typed variant of Python that includes both an interpreter for rapid development and a compiler for high performance, aiming to retain Python's dynamic features while enabling static compilation.
SPEAR: Code-Augmented Agentic Prompt Optimization
SPEAR is a code-augmented agentic prompt optimizer that uses a Python sandbox for structural error analysis, achieving state-of-the-art performance on multiple LLM evaluation suites including industrial judge tasks, BBH, and GSM8K.
AgentSPEX: An Agent SPecification and EXecution Language
AgentSPEX introduces a domain-specific language for specifying modular, interpretable LLM-agent workflows with explicit control flow, state management, and a visual editor, outperforming existing Python-coupled frameworks.
Spy‑code: local codebase graph for AI agents (feedback wanted)
Spy-code is an open-source tool that builds a local codebase graph using tree-sitter, extracting functions, classes, and references to give AI coding agents a structured map of the codebase, currently supporting Rust, Python, TypeScript/JS, and Go.
Skopx - AI agents that autonomously analyze business data
Skopx is a conversational AI analytics platform that lets users ask business questions in plain English, automatically generating insights from connected data sources without SQL. It provides transparent reasoning, role-based access, and integrates with existing tools.