A Semantic-Layer-Mediated Agent for Natural Language to SQL over Heterogeneous Enterprise Databases
Summary
This paper presents a semantic-layer-mediated NL2SQL agent that decouples intent from physical execution by reasoning over a curated semantic model, achieving 94.15% execution accuracy on the Spider2-snow benchmark.
# A Semantic-Layer-Mediated Agent for Natural Language to SQL over Heterogeneous Enterprise Databases
Source: [https://arxiv.org/html/2606.31041](https://arxiv.org/html/2606.31041)
2nd Saksonita Khoeurn, 3rd Ye Ji Yoon\. Ha Jeong Kim and Saksonita Khoeurn contributed equally to this work\. Corresponding author: Saksonita Khoeurn \(saksonita@chungbuk\.ac\.kr\)\.
###### Abstract
Natural language to SQL \(NL2SQL\) over real enterprise databases remains substantially harder than over academic benchmarks: schemas contain hundreds of physical tables with opaque column names, dialects differ across engines, and a single analytical question may require nested aggregation, time\-windowed logic, and multi\-table joins\. Directly prompting a large language model \(LLM\) with raw schema text exposes the model to this complexity all at once and yields brittle queries\. This paper presents the architecture of a semantic\-layer\-mediated NL2SQL agent that decouples *intent* from *physical execution*\. Rather than asking the LLM to write SQL against raw tables, the agent reasons over a curated semantic layer through a compact intermediate representation we call the Semantic Model Query \(SMQ\); a deterministic engine compiles SMQs into dialect\-correct SQL, which the agent then inspects, composes, and executes\. The system follows a single\-tool think–act loop, routes execution across SQLite, BigQuery, and Snowflake backends, and is packaged as an end\-to\-end evaluation harness\. Driven by Gemini 3 Pro, the system attains 94\.15% execution accuracy on the 547\-task Spider2\-snow benchmark \(a suite of real\-world enterprise NL2SQL tasks executed against Snowflake\), the third\-highest entry on the official leaderboard and far above schema\-only baselines\. We describe the component design, the SMQ representation, the agent’s exploration strategy, and the per\-backend results, and we discuss the quality and overfitting tensions inherent to maintaining a semantic layer as a context source for an LLM agent\.
## I Introduction
Translating natural language questions into executable SQL \(NL2SQL\) is a long\-standing goal for democratizing data access\. Recent large language models \(LLMs\) achieve strong results on academic benchmarks such as Spider \[[2](https://arxiv.org/html/2606.31041#bib.bib2)\] and BIRD \[[3](https://arxiv.org/html/2606.31041#bib.bib3)\], where schemas are small and questions map fairly directly to single queries\. Enterprise settings are different in kind, not merely in degree\. The Spider2 benchmark \[[1](https://arxiv.org/html/2606.31041#bib.bib1)\] was introduced precisely to capture this gap: its tasks operate over production data warehouses with hundreds of columns, cryptic naming conventions, dialect\-specific functions, and questions whose gold answers routinely span dozens of lines of SQL with common table expressions \(CTEs\), window functions, and multi\-step aggregation\. Reported accuracies on Spider2 are far below the near\-saturation numbers seen on Spider, confirming that “put the schema in the prompt and ask for SQL” does not transfer to real workloads\.
We argue that the difficulty has two distinct sources\. The first is *grounding*: the model must discover which physical tables and columns are relevant and how they join, out of a large and poorly self\-describing schema\. The second is *composition*: the model must assemble correct, dialect\-valid SQL once the right building blocks are known\. Conflating these two problems in a single free\-form generation step is what makes naive prompting brittle: a single wrong column name or join predicate fails the whole query, and the model has no structured surface on which to recover\.
This paper presents the architecture of *spider2\-daquv\-quvi*, an NL2SQL agent that separates grounding from composition by interposing a *semantic layer* between the LLM and the database\. The semantic layer is a curated, business\-oriented description of each database: tables are wrapped as *semantic models* exposing named *dimensions*, *measures*, and *metrics*, with human\-readable descriptions and the physical column expressions they map to\. The agent never sees raw schema first; instead it issues a compact, structured query, the Semantic Model Query \(SMQ\), against the semantic layer\. A deterministic engine compiles each SMQ into dialect\-correct SQL and returns it\. The agent uses these compiled fragments as *verified building blocks*: it inspects the physical column expressions and join patterns they reveal, composes them into a final query \(adding constructs the SMQ compiler does not support, such as window functions or recursive CTEs\), and executes that query against the appropriate backend\.
The contributions of this paper are:
- A system architecture for NL2SQL that mediates LLM reasoning through a semantic layer and a structured intermediate representation \(SMQ\), decoupling schema grounding from SQL composition \(Sections [III](https://arxiv.org/html/2606.31041#S3)–[IV](https://arxiv.org/html/2606.31041#S4)\)\.
- A single\-tool think–act agent loop in which SMQ compilation is used for *exploration* and direct SQL execution is the single terminal action, together with the prompting strategy that enforces this discipline \(Section [V](https://arxiv.org/html/2606.31041#S5)\)\.
- A multi\-backend execution and evaluation harness that routes queries to SQLite, BigQuery, and Snowflake and scores them under the Spider2 protocol \(Section [VI](https://arxiv.org/html/2606.31041#S6)\)\.
- An empirical study on the 547\-task Spider2\-snow benchmark, reporting 94\.15% execution accuracy with a per\-backend breakdown and a comparison against published leaderboard methods, and a discussion of the practical tension between semantic\-layer quality and overfitting to the evaluation set \(Sections [VII](https://arxiv.org/html/2606.31041#S7)–[VIII](https://arxiv.org/html/2606.31041#S8)\)\.
## II Related Work
Text\-to\-SQL benchmarks\. Spider \[[2](https://arxiv.org/html/2606.31041#bib.bib2)\] established cross\-domain, multi\-table semantic parsing as a standard task; BIRD \[[3](https://arxiv.org/html/2606.31041#bib.bib3)\] added larger, dirtier databases and an emphasis on efficiency and external knowledge\. Spider2 \[[1](https://arxiv.org/html/2606.31041#bib.bib1)\] raised the bar to enterprise\-scale warehouses on BigQuery, Snowflake, and local engines, with long, realistic gold queries; it is the benchmark we target\. Our work evaluates on the Spider2\-snow split\.
LLM methods for text\-to\-SQL\. Prompting and decomposition methods such as DIN\-SQL \[[4](https://arxiv.org/html/2606.31041#bib.bib4)\], DAIL\-SQL \[[5](https://arxiv.org/html/2606.31041#bib.bib5)\], and C3 \[[6](https://arxiv.org/html/2606.31041#bib.bib6)\] improve single\-shot generation through schema linking, few\-shot selection, and self\-correction\. Multi\-agent and pipeline systems such as MAC\-SQL \[[7](https://arxiv.org/html/2606.31041#bib.bib7)\] and CHESS \[[8](https://arxiv.org/html/2606.31041#bib.bib8)\] introduce specialized decomposer/selector/refiner roles and schema\-pruning stages\. These approaches operate directly on physical schemas; our system instead routes reasoning through a curated semantic layer, treating compiled SMQ→SQL fragments as verified grounding evidence rather than relying on the model to recall physical names from a flat schema dump\.
Semantic layers\. Business\-intelligence semantic layers and metrics frameworks \(e\.g\., dbt’s MetricFlow \[[10](https://arxiv.org/html/2606.31041#bib.bib10)\]\) let analysts define dimensions, measures, and metrics once and reuse them across queries\. We repurpose this idea as an *LLM grounding substrate*: the semantic layer is both the model’s view of the data and the source of compilable, dialect\-correct SQL\.
LLM agents and tool use\. The think–act paradigm of ReAct \[[9](https://arxiv.org/html/2606.31041#bib.bib9)\] interleaves reasoning traces with tool calls\. Our agent adopts a strict single\-tool\-per\-step variant in which one tool \(SMQ compilation\) is for exploration and another \(SQL execution\) is the single terminating action, constraining the agent’s action space to reduce error propagation\.
## III System Architecture
Fig\. [1](https://arxiv.org/html/2606.31041#S3.F1) shows the end\-to\-end architecture\. The system is organized into four tiers: \(1\) an *orchestration* tier that loads benchmark instances and drives batched, parallel execution; \(2\) the *QUVI* NL2SQL service that hosts the LLM agent and the semantic layer; \(3\) a deterministic *SMQ\-to\-SQL engine*; and \(4\) a *multi\-backend executor* that runs the final SQL against the correct database\. Each instance is processed as an independent workflow keyed by an instance identifier and timestamp\.

Figure 1: End\-to\-end architecture\. The orchestrator dispatches NL questions to the QUVI service, which runs the agent loop over the semantic layer\. SMQs are compiled to SQL by the engine; the final SQL is routed by instance prefix to the SQLite, BigQuery, or Snowflake executor\. Results are written to submission files and scored by the Spider2 evaluation suite\.

### III\-A Orchestration Tier
The entry point parses a run specification consisting of a database split \(lite/snow\) and an instance selector \(explicit IDs, ranges, prefixes, or all\)\. An instance loader reads the Spider2 instance files, which contain the natural\-language question and metadata\. An execution service then dispatches instances through a thread pool in parallel batches, with backoff on rate limiting \(e\.g\., HTTP 429\), per\-instance timeouts, and bounded retries\. For each instance it mints a workflow identifier and calls the NL2SQL client, which authenticates per database user before posting the question to the QUVI service\.
### III\-B QUVI NL2SQL Service
QUVI hosts the agent and the semantic layer\. On receiving a question it \(i\) looks up the relevant semantic models for the target database, \(ii\) runs the think–act agent loop \(Section [V](https://arxiv.org/html/2606.31041#S5)\) that emits SMQs and SQL, and \(iii\) delegates SMQ compilation to the engine\. The semantic layer is a per\-database directory containing source table definitions, a join graph, date and time\-spine configuration, and one YAML file per semantic model\.
### III\-C SMQ\-to\-SQL Engine
A separate engine endpoint compiles an SMQ into SQL deterministically \(`SmqToSql`\)\. It resolves each referenced dimension/measure/metric to its physical column expression, injects join predicates from the join graph, and emits SQL in the target dialect\. This component is the source of *ground truth physical names*: every table name and column expression the agent uses in its final query is lifted from a compiler output, not hallucinated\.
### III\-D Multi\-Backend Executor
A custom execution server exposes a single endpoint and routes by instance\-ID prefix to the correct backend \(Table [I](https://arxiv.org/html/2606.31041#S3.T1)\)\. Each executor holds the appropriate credentials/driver and returns results as a list of row dictionaries, which the orchestrator serializes to a SQL file and a CSV for scoring\.
TABLE I: Backend routing by instance identifier prefix
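Prefix-based routing of this kind can be sketched in a few lines. The prefixes below (`local`, `bq`, `sf`) and the stub executors are assumptions made for illustration only; the actual mapping is the one given in Table I, and real executors would hold database drivers and credentials.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Executor = Callable[[str], List[Row]]


def make_stub_executor(backend: str) -> Executor:
    """Return a fake executor that echoes the backend it stands in for."""
    def run(sql: str) -> List[Row]:
        return [{"backend": backend, "sql": sql}]
    return run


# Hypothetical prefix table; the real prefixes are defined in Table I.
ROUTES: List[Tuple[str, Executor]] = [
    ("local", make_stub_executor("sqlite")),
    ("bq", make_stub_executor("bigquery")),
    ("sf", make_stub_executor("snowflake")),
]


def route(instance_id: str, sql: str) -> List[Row]:
    """Send sql to the executor whose prefix matches the instance identifier."""
    # Try longer prefixes first so that overlapping prefixes resolve predictably.
    for prefix, executor in sorted(ROUTES, key=lambda r: -len(r[0])):
        if instance_id.startswith(prefix):
            return executor(sql)
    raise ValueError(f"no backend registered for instance {instance_id!r}")


if __name__ == "__main__":
    print(route("bq001", "SELECT 1"))
```

Each executor returns a list of row dictionaries, mirroring the result shape the section describes.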
## IV The Semantic Layer and SMQ
### IV\-A Semantic Models
Each physical table is wrapped by a semantic model that gives it a business name and exposes typed, described data elements\. *Dimensions* are aggregation criteria \(grouping/filtering keys\), *measures* are aggregation targets, and *metrics* are predefined aggregations or derived calculations over measures and dimensions\. Crucially, each element carries \(a\) a human\-readable description that the LLM reads during grounding, and \(b\) an `expr` field holding the exact physical column expression used during compilation\. The abstract name and the physical expression are thus kept separate, which lets the agent reason about *intent* while the engine handles *physical mapping*\. Listing [1](https://arxiv.org/html/2606.31041#LST1) shows an excerpt\.

    - name: RetailAnalyticsSalesModel
      table: ..._snowflake('RETAIL_ANALYTICS_SALES')
      dimensions:
        - name: asin
          type: varchar
          description: Amazon Standard Identification Number
          expr: ASIN
      measures:
        - name: orderedRevenue
          type: float
          description: ordered revenue
          expr: ORDERED_REVENUE

Listing 1: Semantic model excerpt: a dimension and a measure, each pairing a description with a physical expression\.
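The separation between what the LLM reads and what the compiler uses can be made concrete with a small data-structure sketch. The class and method names (`Element`, `SemanticModel`, `grounding_view`, `resolve`) are hypothetical, and whether the real service hides `expr` from the model during grounding is an assumption here; the elided table prefix from Listing 1 is left out rather than guessed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Element:
    name: str          # abstract, business-facing name
    type: str
    description: str   # read by the LLM during grounding
    expr: str          # physical column expression used at compile time


@dataclass
class SemanticModel:
    name: str
    table: str
    dimensions: List[Element] = field(default_factory=list)
    measures: List[Element] = field(default_factory=list)

    def grounding_view(self) -> List[str]:
        """Qualified names and descriptions only (assumed LLM-facing view)."""
        return [
            f"{self.name}__{e.name} ({e.type}): {e.description}"
            for e in self.dimensions + self.measures
        ]

    def resolve(self, element_name: str) -> str:
        """Map an abstract element name to its physical expression."""
        for e in self.dimensions + self.measures:
            if e.name == element_name:
                return e.expr
        raise KeyError(element_name)


# Values taken from Listing 1; the table's elided prefix is omitted.
model = SemanticModel(
    name="RetailAnalyticsSalesModel",
    table="RETAIL_ANALYTICS_SALES",
    dimensions=[Element("asin", "varchar", "Amazon Standard Identification Number", "ASIN")],
    measures=[Element("orderedRevenue", "float", "ordered revenue", "ORDERED_REVENUE")],
)

if __name__ == "__main__":
    print("\n".join(model.grounding_view()))
    print(model.resolve("orderedRevenue"))
```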
### IV\-B The Join Graph
Inter\-model joins are declared once in a per\-database join graph as typed edges: a `from` model, a `to` model, a join key, and an `on` predicate giving the left/right expressions \(including transformations such as `TRIM` on a dirty key\)\. The compiler consults this graph to assemble joins automatically, so the agent does not need to rediscover join predicates for the supported cases\.
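One way to picture the join graph is as a list of typed edges plus a path search that turns a pair of models into JOIN clauses. The sketch below is illustrative only: `JoinEdge`, `join_path`, and `render_joins` are hypothetical names, the models `Orders`, `Customers`, and `Regions` are invented, edges are treated as directed, and model names stand in for the physical table names a real compiler would emit.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class JoinEdge:
    from_model: str
    to_model: str
    key: str
    on_left: str    # left expression, may include a transformation such as TRIM
    on_right: str


def join_path(edges: List[JoinEdge], start: str, goal: str) -> Optional[List[JoinEdge]]:
    """Breadth-first search for the shortest chain of declared edges."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        model, path = queue.popleft()
        if model == goal:
            return path
        for edge in edges:
            if edge.from_model == model and edge.to_model not in seen:
                seen.add(edge.to_model)
                queue.append((edge.to_model, path + [edge]))
    return None  # no declared join connects the two models


def render_joins(path: List[JoinEdge]) -> str:
    """Render one JOIN clause per edge on the path."""
    return "\n".join(f"JOIN {e.to_model} ON {e.on_left} = {e.on_right}" for e in path)


EDGES = [
    JoinEdge("Orders", "Customers", "cust_id", "TRIM(Orders.CUST_ID)", "Customers.CUST_ID"),
    JoinEdge("Customers", "Regions", "region_id", "Customers.REGION_ID", "Regions.REGION_ID"),
]

if __name__ == "__main__":
    print(render_joins(join_path(EDGES, "Orders", "Regions")))
```

A `None` result corresponds to the situation where a needed join relation has not been declared, so the compiler cannot assemble the fragment.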
### IV\-C Semantic Model Query \(SMQ\)
The SMQ is the intermediate representation between the agent and the engine\. It is a compact JSON object with three lists: `metrics` \(query targets\), `filters` \(WHERE conditions\), and `group_by` \(grouping dimensions\)\. Elements are referenced by a uniform naming convention, `ModelName__elementName` for dimensions and measures, and by bare name for metrics\. An example SMQ and its role are shown in Listing [2](https://arxiv.org/html/2606.31041#LST2)\. The SMQ deliberately covers only the common analytical core \(selection, filtering, grouping, declared joins\); it does not express CROSS JOINs, arbitrary subqueries, advanced window functions, or recursive CTEs\. This is by design: the SMQ is an *exploration and grounding* instrument, and the long tail of SQL complexity is handled by the agent composing on top of compiler outputs\.

    {
      "metrics": ["RetailAnalyticsSalesModel__orderedRevenue"],
      "filters": ["RetailAnalyticsSalesModel__period='DAILY'"],
      "group_by": ["RetailAnalyticsSalesModel__asin"]
    }

Listing 2: An SMQ\. The engine compiles it to dialect\-correct SQL and returns the SQL plus a result preview\.
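To show what deterministic compilation of such an SMQ might involve, here is a toy single-model compiler. It is a sketch under stated assumptions, not the paper's engine: `CATALOG` and `compile_smq` are hypothetical, the `period` element and its `PERIOD` column are invented to match the filter in Listing 2, measures are assumed to be aggregated with `SUM`, and joins and dialect handling are left out.

```python
import re

# Hypothetical catalog: model -> physical table and element -> physical expr.
CATALOG = {
    "RetailAnalyticsSalesModel": {
        "table": "RETAIL_ANALYTICS_SALES",
        "elements": {
            "asin": "ASIN",
            "orderedRevenue": "ORDERED_REVENUE",
            "period": "PERIOD",  # assumed; not part of Listing 1
        },
    }
}

# Matches references of the form ModelName__elementName.
REF = re.compile(r"(\w+?)__(\w+)")


def compile_smq(smq: dict) -> str:
    """Compile a single-model SMQ into a SELECT statement."""
    used_models = set()

    def physical(ref: str) -> str:
        model, element = ref.split("__", 1)
        used_models.add(model)
        return CATALOG[model]["elements"][element]

    group_cols = [physical(r) for r in smq.get("group_by", [])]
    # Assumption: measures listed under "metrics" are summed.
    metric_cols = [
        f"SUM({physical(r)}) AS {r.split('__', 1)[1]}" for r in smq.get("metrics", [])
    ]
    # Rewrite each reference inside a filter to its physical expression.
    filters = [REF.sub(lambda m: physical(m.group(0)), f) for f in smq.get("filters", [])]

    if len(used_models) != 1:
        raise NotImplementedError("multi-model SMQs need the join graph; omitted here")
    table = CATALOG[used_models.pop()]["table"]

    lines = [f"SELECT {', '.join(group_cols + metric_cols)}", f"FROM {table}"]
    if filters:
        lines.append("WHERE " + " AND ".join(filters))
    if group_cols:
        lines.append("GROUP BY " + ", ".join(group_cols))
    return "\n".join(lines)


SMQ = {
    "metrics": ["RetailAnalyticsSalesModel__orderedRevenue"],
    "filters": ["RetailAnalyticsSalesModel__period='DAILY'"],
    "group_by": ["RetailAnalyticsSalesModel__asin"],
}

if __name__ == "__main__":
    print(compile_smq(SMQ))
```

Because the output is a pure function of the SMQ and the catalog, the same SMQ always yields the same SQL, which is the property that lets the agent treat compiled fragments as trusted sources of physical names.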
## V Agent Workflow
### V\-A Think–Act Loop
The agent runs a constrained think–act loop\. Each turn it emits a private reasoning block followed by exactly one tool call\. Three tools are available: `getModelDataElements` \(list the metrics/dimensions of selected models\), `convertSmqToSql` \(compile an SMQ, returning the SQL and a five\-row result preview\), and `execute` \(run a final SQL query\)\. Restricting each response to a single tool call narrows the action space and makes the trajectory auditable\. Algorithm [1](https://arxiv.org/html/2606.31041#alg1) summarizes the loop\.
Algorithm 1: Semantic\-layer\-mediated NL2SQL agent loop

     1: Input: question q, semantic models M for target DB, dialect d
     2: E ← getModelDataElements(relevant(M, q))        ▷ discover elements
     3: B ← ∅                                           ▷ verified SQL building blocks
     4: repeat
     5:     write an SMQ s for an intermediate result
     6:     (sql, preview) ← convertSmqToSql(s)
     7:     B ← B ∪ {(expr/table names from sql)}
     8: until building blocks suffice for q
     9: compose final SQL Q from B (add CTEs, window fns, subqueries)
    10: return execute(Q)                               ▷ single terminal action
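The control flow of Algorithm 1 can be sketched in Python with the LLM's decisions and the three tools passed in as plain callables. Everything below is a stand-in: `agent_loop` and the scripted stub functions are hypothetical, the stubs return canned values instead of calling a model or a database, and the default cap of 20 iterations mirrors the limit reported in the evaluation setup.

```python
from typing import Callable, List, Optional


def agent_loop(
    question: str,
    get_elements: Callable[[str], List[str]],
    propose_smq: Callable[[str, List[str], List[str]], Optional[dict]],
    convert_smq_to_sql: Callable[[dict], tuple],
    compose_sql: Callable[[str, List[str]], str],
    execute: Callable[[str], list],
    max_iters: int = 20,
):
    elements = get_elements(question)           # step 2: discover elements
    blocks: List[str] = []                      # step 3: verified building blocks
    for _ in range(max_iters):                  # steps 4-8: explore via SMQs
        smq = propose_smq(question, elements, blocks)
        if smq is None:                         # the blocks gathered so far suffice
            break
        sql, _preview = convert_smq_to_sql(smq)
        blocks.append(sql)
    final_sql = compose_sql(question, blocks)   # step 9: add CTEs, windows, ...
    return execute(final_sql)                   # step 10: single terminal action


# Scripted stand-ins for the LLM and the tools.
def stub_elements(question):
    return ["RetailAnalyticsSalesModel__asin", "RetailAnalyticsSalesModel__orderedRevenue"]


def stub_propose(question, elements, blocks):
    if blocks:
        return None  # one exploration step is enough in this toy run
    return {"metrics": [elements[1]], "filters": [], "group_by": [elements[0]]}


def stub_convert(smq):
    sql = "SELECT ASIN, SUM(ORDERED_REVENUE) AS rev FROM RETAIL_ANALYTICS_SALES GROUP BY ASIN"
    return sql, []


def stub_compose(question, blocks):
    return (
        f"WITH base AS ({blocks[0]}) "
        "SELECT ASIN, RANK() OVER (ORDER BY rev DESC) AS rnk FROM base"
    )


executed: List[str] = []


def stub_execute(sql):
    executed.append(sql)
    return [{"ASIN": "A1", "rnk": 1}]


if __name__ == "__main__":
    print(agent_loop("rank ASINs by revenue", stub_elements, stub_propose,
                     stub_convert, stub_compose, stub_execute))
```

In the toy run the window function appears only in the composed query, never in an SMQ, which reflects the division of labour between the compiler and the agent.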
### V\-B SMQ\-for\-Exploration Discipline
The defining prompt strategy is that `convertSmqToSql` is used for *exploration*, not for producing the final answer\. Its purpose is to reveal how abstract elements map to physical columns, the dialect’s syntax, and join patterns\. The agent extracts physical table names and column expressions from the compiled SQL and assembles its own final query, applying constructs beyond the SMQ’s expressiveness\. A strict constraint forbids referencing semantic model names as physical tables in any executed SQL: only names lifted from a compiler output may appear\. Direct schema introspection \(e\.g\., querying `INFORMATION_SCHEMA`\) is disallowed, forcing all grounding through the semantic layer\. The single `execute` call is terminal: a successful execution ends the workflow\.
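The paper enforces this discipline through prompting. As an illustration of what the constraints mean operationally, the sketch below checks a candidate final query against them after the fact. `check_final_sql` is a hypothetical helper, and its regular expressions are a rough heuristic rather than a SQL parser (for example, `EXTRACT(YEAR FROM col)` would be misread as a table reference).

```python
import re
from typing import List, Set


def check_final_sql(sql: str, semantic_model_names: Set[str], verified_tables: Set[str]) -> List[str]:
    """Return a list of violations of the exploration discipline."""
    problems = []
    if re.search(r"\bINFORMATION_SCHEMA\b", sql, flags=re.IGNORECASE):
        problems.append("direct schema introspection is disallowed")

    # CTE names defined in the query itself are legitimate FROM targets.
    ctes = {n.lower() for n in re.findall(r"\b(\w+)\s+AS\s*\(", sql, flags=re.IGNORECASE)}
    verified = {t.upper() for t in verified_tables}

    # Identifiers that directly follow FROM or JOIN (dotted names allowed).
    referenced = re.findall(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", sql, flags=re.IGNORECASE)
    for name in referenced:
        if name in semantic_model_names:
            problems.append(f"semantic model used as table: {name}")
        elif name.upper() not in verified and name.lower() not in ctes:
            problems.append(f"table not lifted from compiler output: {name}")
    return problems


MODELS = {"RetailAnalyticsSalesModel"}
VERIFIED = {"RETAIL_ANALYTICS_SALES"}

if __name__ == "__main__":
    good = "WITH base AS (SELECT ASIN FROM RETAIL_ANALYTICS_SALES) SELECT * FROM base"
    bad = "SELECT * FROM RetailAnalyticsSalesModel"
    print(check_final_sql(good, MODELS, VERIFIED))
    print(check_final_sql(bad, MODELS, VERIFIED))
```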
### V\-C Why Mediation Helps
This design addresses both difficulty sources from Section [I](https://arxiv.org/html/2606.31041#S1)\. Grounding becomes a guided search over described, business\-named elements rather than a guess over a flat schema; and every physical identifier the model commits to has already been validated by the compiler, so composition errors are confined to the structure the agent adds on top, where they are easier to detect and repair via re\-exploration\.
## VI Execution and Evaluation Harness
Final SQL is dispatched to the backend executor and the returned rows are written as a per\-instance SQL file and CSV under a submission directory, with a human\-readable Markdown trace of the full tool\-call history saved alongside for debugging\. Scoring uses the official Spider2 evaluation suite in two modes: execution\-result comparison \(the submitted result set is compared against the gold result\) and SQL comparison\. Configuration \(service ports, data and semantic\-layer paths, timeouts, retry limits, credentials\) is centralized in a single application config so the same harness runs across the SQLite, BigQuery, and Snowflake environments without code changes\.
## VII Evaluation
### VII\-A Setup
We evaluate on Spider2\-snow, a benchmark of 547 enterprise NL2SQL instances executed against Snowflake, spanning databases of SQLite\-, BigQuery\-, GA4\-, and native\-Snowflake origin\. The underlying LLM is Gemini 3 Pro \(preview\), invoked through the semantic\-layer `smqThink` agent node at temperature 0\.1 with extended thinking enabled at the high setting, a 16,384\-token output budget, and a cap of 20 SMQ iterations per instance; the SMQ→SQL compiler is deterministic\. We find the agent’s accuracy to be sensitive to the model’s thinking configuration: extended thinking is what allows the model to compose the long, multi\-step SQL that Spider2\-snow demands, and disabling it substantially degrades accuracy\. We report execution accuracy: an instance is correct if the executed query’s result set matches the gold result under the official Spider2 execution\-result protocol\.
### VII\-B Overall Result
On Spider2\-snow the system answers 515 of 547 instances correctly, an execution accuracy of 94\.15%\. Given that Spider2 gold queries are long, multi\-step, and dialect\-specific, this result indicates that semantic\-layer mediation, paired with a strong reasoning model and well\-curated per\-database semantic layers, provides highly effective grounding on real enterprise schemas\.
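The headline figure follows directly from the reported counts, as the short sketch below checks. The `results_match` helper is a simplified stand-in for execution-result comparison (order-insensitive, floats rounded to two decimals); it is an assumption for illustration and not the official Spider2 scorer, whose matching rules may differ.

```python
from collections import Counter
from typing import List, Tuple


def normalize(row: Tuple, ndigits: int = 2) -> Tuple:
    """Round floats so that tiny numeric differences do not break equality."""
    return tuple(round(v, ndigits) if isinstance(v, float) else v for v in row)


def results_match(pred: List[Tuple], gold: List[Tuple]) -> bool:
    """Compare two result sets as multisets of rows, ignoring row order."""
    return Counter(map(normalize, pred)) == Counter(map(normalize, gold))


def execution_accuracy(correct: int, total: int) -> float:
    """Percentage of instances whose executed result matches the gold result."""
    return 100.0 * correct / total


if __name__ == "__main__":
    print(results_match([("A", 1.001), ("B", 2.0)], [("B", 2.0), ("A", 1.0)]))
    print(round(execution_accuracy(515, 547), 2))
```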
### VII\-C Per\-Backend Breakdown
Table [II](https://arxiv.org/html/2606.31041#S7.T2) breaks accuracy down by instance class\. Accuracy exceeds 94% on every database\-origin class except native Snowflake \(sf, 55\.6%\), which is also the smallest class \(18 instances\) and thus the highest\-variance\. Backend\-specific dialect handling and the maturity of each database’s semantic layer, rather than the agent loop itself, are the dominant factors in the remaining failures\.
TABLE II: Execution accuracy on Spider2\-snow by database\-origin class \(Gemini 3 Pro\)
### VII\-D Comparison with Published Methods
Table [III](https://arxiv.org/html/2606.31041#S7.T3) places our Spider2\-snow result against entries on the official Spider2 leaderboard \[[12](https://arxiv.org/html/2606.31041#bib.bib12)\]\. General agent and prompting pipelines on raw schemas remain low on this benchmark: DAIL\-SQL \[[5](https://arxiv.org/html/2606.31041#bib.bib5)\] with GPT\-4o reaches only 2\.2%, Spider\-Agent \[[1](https://arxiv.org/html/2606.31041#bib.bib1)\] 23–26% depending on the backbone, and ReFoRCE \[[11](https://arxiv.org/html/2606.31041#bib.bib11)\] 31\.3%\. Our semantic\-layer\-mediated agent \(listed as *QUVI\-3 \+ Gemini\-3\-pro\-preview*\) attains 94\.15%, the third\-highest entry overall\. The large margin over schema\-only baselines is consistent with our central claim that mediating the LLM through a curated semantic layer is decisive for enterprise NL2SQL; we note, however, that the semantic layer encodes per\-database domain knowledge that the zero\-shot baselines do not have access to, so the comparison reflects the value of the *system* \(semantic layer plus agent\), not the LLM alone\.
TABLE III: Execution accuracy on Spider2\-snow: published methods \(Spider2 leaderboard\)
### VII\-E Failure Modes
From the saved execution traces, recurring failure modes include: \(i\) SMQ compilation gaps, where a referenced model lacks a needed join relation and the compiler cannot assemble the fragment; \(ii\) composition errors in the agent\-authored portion of complex queries \(nested time windows, exact rounding and ranking semantics\); and \(iii\) under\-specified semantic layers, where a column needed by the gold query is not yet exposed as a described element\. The first and third are addressed by improving the semantic layer rather than the agent, which motivates the discussion below\.
## VIII Discussion
### VIII\-A Semantic\-Layer Quality is the Lever
Because grounding is delegated to the semantic layer, system accuracy is strongly bounded by how well each database’s models describe the data and declare its joins\. In practice, improving a database’s semantic models—adding missing elements, sharpening descriptions, declaring join edges—directly lifts accuracy on that database\. This makes the semantic layer the primary maintenance surface\.
### VIII\-B The Overfitting Tension
Curating the semantic layer against an evaluation set introduces a Goodhart\-style risk: descriptions can drift from concise, generalizable annotations toward verbose text that effectively encodes the expected answer for known questions\. Such context may raise benchmark accuracy while reducing robustness to unseen questions and schema change—the analogue of reward\-proxy overoptimization\. We therefore treat the semantic layer as code subject to review: descriptions should encode reusable schema knowledge, not question\-specific hints, and structural information \(joins, expressions\) should live in typed fields rather than prose\. Quantifying and regularizing this trade\-off—an intrinsic quality signal for constructed context that is independent of downstream task accuracy—is an open problem and a direction for future work\.
### VIII\-C Limitations
The SMQ deliberately covers only a common analytical core, so the hardest queries depend on agent\-authored SQL whose correctness the engine does not guarantee\. The evaluation is single\-benchmark \(Spider2\-snow\) and uses execution\-result matching, which can over\- or under\-credit edge cases\. Per\-backend results for the smallest classes are high\-variance\. Finally, accuracy is entangled with the maturity of each database’s hand\-authored semantic layer, which varies across databases\.
## IX Conclusion
We presented the architecture of a semantic\-layer\-mediated NL2SQL agent that decouples schema grounding from SQL composition\. By interposing a curated semantic layer and a compact SMQ intermediate representation between the LLM and the database, and by using deterministic SMQ→SQL compilation as a source of verified building blocks within a constrained single\-tool agent loop, the system reaches 94\.15% execution accuracy on the 547\-task Spider2\-snow benchmark, the third\-highest entry on the official leaderboard and far above schema\-only baselines\. The results indicate that mediating LLM reasoning through a structured, business\-oriented layer is an effective strategy for enterprise NL2SQL, and that the central remaining challenge is maintaining semantic\-layer quality without overfitting it to the evaluation set\.
## References
- \[1\] F\. Lei, J\. Chen, Y\. Ye, R\. Cao, *et al\.*, “Spider 2\.0: Evaluating language models on real\-world enterprise text\-to\-SQL workflows,” in *Proc\. Int\. Conf\. Learning Representations \(ICLR\)*, 2025\.
- \[2\] T\. Yu, R\. Zhang, K\. Yang, *et al\.*, “Spider: A large\-scale human\-labeled dataset for complex and cross\-domain semantic parsing and text\-to\-SQL task,” in *Proc\. Conf\. Empirical Methods in Natural Language Processing \(EMNLP\)*, 2018, pp\. 3911–3921\.
- \[3\] J\. Li, B\. Hui, G\. Qu, *et al\.*, “Can LLM already serve as a database interface? A big bench for large\-scale database grounded text\-to\-SQLs \(BIRD\),” in *Advances in Neural Information Processing Systems \(NeurIPS\)*, 2023\.
- \[4\] M\. Pourreza and D\. Rafiei, “DIN\-SQL: Decomposed in\-context learning of text\-to\-SQL with self\-correction,” in *Advances in Neural Information Processing Systems \(NeurIPS\)*, 2023\.
- \[5\] D\. Gao, H\. Wang, Y\. Li, *et al\.*, “Text\-to\-SQL empowered by large language models: A benchmark evaluation,” *Proc\. VLDB Endowment*, vol\. 17, no\. 5, pp\. 1132–1145, 2024\.
- \[6\] X\. Dong, C\. Zhang, Y\. Ge, *et al\.*, “C3: Zero\-shot text\-to\-SQL with ChatGPT,” arXiv:2307\.07306, 2023\.
- \[7\] B\. Wang, C\. Ren, J\. Yang, *et al\.*, “MAC\-SQL: A multi\-agent collaborative framework for text\-to\-SQL,” in *Proc\. Int\. Conf\. Computational Linguistics \(COLING\)*, 2025\.
- \[8\] S\. Talaei, M\. Pourreza, Y\.\-C\. Chang, *et al\.*, “CHESS: Contextual harnessing for efficient SQL synthesis,” arXiv:2405\.16755, 2024\.
- \[9\] S\. Yao, J\. Zhao, D\. Yu, *et al\.*, “ReAct: Synergizing reasoning and acting in language models,” in *Proc\. Int\. Conf\. Learning Representations \(ICLR\)*, 2023\.
- \[10\] dbt Labs, “dbt Semantic Layer and MetricFlow,” Technical documentation, 2024\. \[Online\]\. Available: https://docs\.getdbt\.com/docs/build/about\-metricflow
- \[11\] M\. Deng, A\. Ramachandran, C\. Xu, *et al\.*, “ReFoRCE: A text\-to\-SQL agent with self\-refinement, consensus enforcement, and column exploration,” arXiv:2502\.00675, 2025\.
- \[12\] XLang Lab, “Spider 2\.0 leaderboard,” 2026\. \[Online\]\. Available: https://spider2\-sql\.github\.io/