FD-NL2SQL: Feedback-Driven Clinical NL2SQL that Improves with Use
Summary
FD-NL2SQL is a feedback-driven natural language to SQL system for clinical oncology databases that improves with use through clinician edits and logic-based SQL augmentation. The system decomposes natural language questions into predicates, retrieves expert-verified exemplars, and synthesizes executable SQL with continuous learning capabilities.
# FD-NL2SQL: Feedback-Driven Clinical NL2SQL that Improves with Use

Source: https://arxiv.org/html/2604.15646

Suparno Roy Chowdhury¹*, Tejas Anvekar¹, Manan Roy Choudhury¹, Muhammad Ali Khan², Kaneez Zahra Rubab Khakwani², Mohamad Bassam Sonbol², Irbaz Bin Riaz¹*, Vivek Gupta²

¹Arizona State University ²Mayo Clinic

[Project Page](https://tejasanvekar.github.io/FD-NL2SQL/) | [Demo](https://tejasanvekar.github.io/FD-NL2SQL/try) | [Video](https://youtu.be/VMfOc440JKM) | [Code](https://github.com/TejasAnvekar/FD-NL2SQL)

###### Abstract

Clinicians exploring oncology trial repositories often need ad-hoc, multi-constraint queries over biomarkers, endpoints, interventions, and time, yet writing SQL requires schema expertise. We demonstrate FD-NL2SQL, a feedback-driven clinical NL2SQL assistant for SQLite-based oncology databases. Given a natural-language question, a schema-aware LLM decomposes it into predicate-level sub-questions, retrieves semantically similar expert-verified NL2SQL exemplars via sentence embeddings, and synthesizes executable SQL conditioned on the decomposition, retrieved exemplars, and schema, with post-processing validity checks. To improve with use, FD-NL2SQL incorporates two update signals: (i) clinician edits of generated SQL are approved and added to the exemplar bank; and (ii) lightweight logic-based SQL augmentation applies a single atomic mutation (e.g., operator or column change), retaining variants only if they return non-empty results. A second LLM generates the corresponding natural-language question and predicate decomposition for accepted variants, automatically expanding the exemplar bank without additional annotation. The demo interface exposes decomposition, retrieval, synthesis, and execution results to support interactive refinement and continuous improvement.
## 1 Introduction

**Figure 1:** A clinician question is decomposed into schema-aligned predicate sub-questions; for each predicate, semantically similar expert-approved exemplars are retrieved; this guides the schema-grounded SQL synthesizer. Users can edit and approve the final SQL to update the exemplar bank. To expand coverage with minimal annotation, approved SQL is augmented by a single atomic mutation (e.g., operator or column substitution) and retained only if it returns non-empty results. A second LLM back-translates augmented SQL into an NL question & predicate sub-questions; new samples are added to the bank for continual improvement.

Clinical trial databases are central to modern oncology research and drug development. Public registries such as ClinicalTrials.gov and institutional repositories contain rich structured data, including trial phase, biomarkers, eligibility criteria, endpoints, recruitment status, and sponsor information, supporting competitive intelligence, hypothesis generation, regulatory planning, and translational research. As oncology increasingly shifts toward biomarker-driven and precision trials, efficient access to this structured data is critical.

Yet these databases remain difficult to query. Access typically requires SQL expertise and detailed schema knowledge. Schemas are often complex, spanning multiple relational tables for eligibility, interventions, endpoints, and disease ontologies. Clinicians and translational researchers, though domain experts, are rarely trained in database querying, leading to analyst-mediated workflows that slow iterative exploration. In the high-stakes setting of oncology drug development, this friction directly affects research velocity and decision quality.

Existing tools only partially mitigate this gap. Registry interfaces rely on keyword-based search, which cannot reliably enforce structured multi-constraint filtering (e.g., biomarker + phase + endpoint + recruitment criteria).
Business intelligence dashboards provide predefined reports but remain rigid and cannot cover the combinatorial space of exploratory clinical queries.

Natural Language to SQL (NL2SQL) systems have advanced substantially in recent years, beginning with neural approaches such as Seq2SQL and large-scale benchmarks such as Spider. More recent work has introduced schema-aware reasoning (e.g., RAT-SQL) and constrained decoding for syntactic validity (e.g., PICARD). Large Language Models (LLMs) further demonstrate strong in-context semantic parsing capabilities. However, these systems are largely designed for general-purpose benchmarks and do not explicitly incorporate domain-aware constraint decomposition, exemplar retrieval grounded in clinical schemas, or interactive feedback loops tailored to high-stakes biomedical querying. In specialized domains such as oncology, naive generation without schema-aligned grounding and domain-specific retrieval can lead to brittle or clinically implausible queries.

To address these limitations, we introduce a domain-aware NL2SQL system designed specifically for oncology clinical trial databases. Our approach integrates three core components. First, we perform LLM-guided self-evolving decomposition of a user's question into atomic, schema-aligned sub-questions that each correspond to a filterable predicate. Second, we retrieve semantically similar seed exemplars using Sentence-BERT embeddings, enabling structured grounding in prior validated query patterns. Third, we perform retrieval-guided SQL synthesis with controlled decoding and post-processing to ensure structural validity and constraint satisfaction. This decomposition-retrieval-synthesis architecture improves robustness by aligning generation with both schema structure and domain-specific precedent.

Beyond static query translation, our system operates as a *living clinical review assistant*. Each generated query can be previewed, refined, and corrected interactively.
User feedback on retrieved exemplars and synthesized SQL is incorporated into the seed bank, improving retrieval neighborhoods and generation fidelity over time. This feedback-driven refinement is consistent with emerging paradigms of interactive and adaptive language model systems, but is operationalized here in a structured, database-grounded clinical setting. As clinicians issue more domain-specific queries, the system progressively aligns with real-world oncology reasoning patterns.

From a clinician's perspective, this approach substantially reduces dependency on technical skills, accelerates hypothesis testing, and enables real-time, multi-constraint exploration of trial criteria. By combining domain-aware decomposition, exemplar-guided synthesis, and iterative feedback, the system bridges the gap between oncology expertise and structured data access, transforming static registries into interactive analytical tools.

Our main contributions are:

- A schema-aware, predicate-level decomposition strategy that improves robustness of clinical NL2SQL in complex oncology databases.
- A retrieval-guided SQL synthesis pipeline that grounds LLM generation in expert-verified exemplars for reliable query construction.
- A feedback-driven, self-evolving clinical trial query assistant that improves continuously through clinician interaction.

## 2 Related Work

Recent advances in text-to-SQL have shifted from supervised semantic parsing toward large language model (LLM) prompting and in-context learning. DIN-SQL demonstrates that decomposed prompting improves SQL generation by breaking complex questions into intermediate reasoning steps. Similarly, execution-guided decoding improves robustness by validating generated queries against database constraints during generation. These approaches highlight the importance of structural grounding when generating executable SQL.

Retrieval-based prompting has also emerged as an effective strategy for improving LLM reasoning.
In-context example selection significantly affects downstream generation quality, and retrieval-augmented generation (RAG) shows that grounding outputs in external memory improves reliability. Our method extends this paradigm by retrieving semantically similar question-SQL exemplars using dense embeddings and conditioning synthesis on predicate-aligned decompositions, rather than relying solely on flat prompt demonstrations.

Within the biomedical domain, pretrained scientific language models such as SciBERT have demonstrated gains on domain-specific NLP tasks. However, prior work primarily focuses on unstructured text understanding rather than structured clinical database querying. Our system bridges biomedical language understanding with schema-aware SQL synthesis, enabling clinician-driven, multi-constraint exploration of oncology clinical trial databases.

## 3 FD-NL2SQL

### 3.1 System Architecture and Workflow

**Figure 2:** FD-NL2SQL demo UI and feedback loop. Clinicians issue a natural-language query in the chat interface (right) and view executed results in the table view (left). The system shows the generated SQL, which an expert can *accept*, *modify*, or *reject*; accepted / edited queries are saved back to the exemplar bank to improve future retrieval and synthesis (autofill supports rapid refinement).

As illustrated in Figure 1, FD-NL2SQL is an interactive NL2SQL assistant for oncology clinical-trial databases. The system follows a modular pipeline combining LLM-based reasoning with retrieval over an evolving exemplar bank and lightweight programmatic checks. The demo UI exposes intermediate artifacts, retrieved exemplars, synthesized SQL, and results to support transparent refinement and feedback-driven improvement.

#### Resources

We assume a SQLite database $\mathcal{D}$ with schema metadata and an exemplar bank $\mathcal{S}=\{(s_j, y_j)\}_{j=1}^M$ of expert-approved NL2SQL pairs.
We pre-compute sentence embeddings $\mathbf{e}(s_j)$ for all $s_j$ and maintain an index for fast top-$k$ retrieval.

#### 1) Schema grounding

Before generation, we introspect $\mathcal{D}$ to build a schema dictionary (tables, columns, types, and join keys). This schema context is injected into prompts and used for post-generation validation (e.g., column existence and join feasibility).

#### 2) Decomposition-retrieval

Given a user question $x$, an LLM produces a schema-aligned, WHERE-oriented decomposition $\mathcal{X}(x) = \{x_1, \ldots, x_n\}$ where each $x_i$ targets one atomic predicate (column, operator, value). For each $x_i$, we retrieve the top-$k_r$ nearest exemplars from $\mathcal{S}$ using cosine similarity in embedding space, storing $(s_j, y_j, \text{score}(x_i, s_j))$. To reduce prompt noise, we additionally extract a compact WHERE-pattern hint from each retrieved SQL when available.

#### 3) Retrieval-guided SQL synthesis & execution

A second LLM synthesizes the final SQL $\hat{y}$ conditioned on (i) $x$, (ii) $\mathcal{X}(x)$, and (iii) the retrieved exemplar bundles $\{(x_i, \mathcal{N}_i)\}_{i=1}^n$, where $\mathcal{N}_i$ denotes the set of top-$k_r$ exemplars retrieved for $x_i$. The model is instructed to use retrieved SQL as structural templates while satisfying all constraints in $x$ and avoiding irrelevant literal copying. We parse the output into a single executable SELECT/WITH statement, apply lightweight guards (read-only policy, schema checks, timeout), and execute $\hat{y}$ against SQLite to render results in the UI.

### 3.2 Feedback and Exemplar Bank Expansion

#### 4) Expert approval

The interface in Figure 2 allows users to edit generated SQL. If an expert approves the corrected query $y^{\star}$ for question $x$, the pair $(x, y^{\star})$ is appended to the exemplar bank $\mathcal{S}$ and embedded for future retrieval.
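Steps 2 through 4 above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `embed` is a toy bag-of-words stand-in for the sentence embeddings, the `trials` table and its columns are hypothetical, and `is_read_only` is a deliberately conservative prefix and keyword check (it may reject valid queries whose string literals contain `;` or a blocked keyword). Schema checks and the timeout are omitted.

```python
import math
import re
import sqlite3
from collections import Counter

# Hypothetical exemplar bank S of expert-approved (question, SQL) pairs.
BANK = [
    ("Which trials are phase 3?", "SELECT id FROM trials WHERE phase = 3"),
    ("Which trials started after 2020?", "SELECT id FROM trials WHERE start_year > 2020"),
    ("Which trials study melanoma?", "SELECT id FROM trials WHERE cancer_type = 'melanoma'"),
]

def embed(text):
    """Toy stand-in for a sentence embedding: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(sub_question, bank, k=2):
    """Return the k exemplars closest to one predicate sub-question,
    as (s_j, y_j, score) triples sorted by descending similarity."""
    q = embed(sub_question)
    scored = [(s, y, cosine(q, embed(s))) for s, y in bank]
    return sorted(scored, key=lambda t: t[2], reverse=True)[:k]

FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|pragma)\b", re.I
)

def is_read_only(sql):
    """Accept only a single SELECT/WITH statement with no write keywords."""
    stmt = sql.strip().rstrip(";")
    if ";" in stmt:  # more than one statement
        return False
    if not re.match(r"(select|with)\b", stmt, re.I):
        return False
    return FORBIDDEN.search(stmt) is None

def run_guarded(conn, sql):
    """Execute SQL only if it passes the read-only guard."""
    if not is_read_only(sql):
        raise ValueError("rejected: not a single read-only statement")
    return conn.execute(sql).fetchall()

def approve(bank, question, sql):
    """Append an expert-approved pair so later retrievals can see it."""
    bank.append((question, sql))

# Hypothetical in-memory database standing in for D.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trials (id INTEGER, phase INTEGER, start_year INTEGER, cancer_type TEXT)"
)
conn.executemany(
    "INSERT INTO trials VALUES (?, ?, ?, ?)",
    [(1, 3, 2021, "melanoma"), (2, 2, 2018, "lung"), (3, 3, 2019, "lung")],
)

neighbors = top_k("trials in phase 3", BANK, k=1)
approved_sql = "SELECT id FROM trials WHERE phase = 3 AND cancer_type = 'lung'"
rows = run_guarded(conn, approved_sql)
approve(BANK, "Which phase 3 trials study lung cancer?", approved_sql)
```

In a real deployment the keyword check would likely be paired with a database-level safeguard, for example opening SQLite in read-only mode with `sqlite3.connect("file:trials.db?mode=ro", uri=True)` (the file name is a placeholder), so that a query slipping past the textual guard still cannot write.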
#### 5) SQL augmentation for bank growth

To expand exemplar coverage with minimal manual annotation, FD-NL2SQL augments approved queries in two sub-steps:

- **a) Logic-based SQL mutation.** Starting from an approved query $y^{\star}$, we apply exactly one atomic transformation, such as (i) operator change ($=\rightarrow\geq$, LIKE, etc.), (ii) column substitution within a compatible type group, or (iii) controlled value edits (e.g., year thresholds). The mutated query $\tilde{y}$ is retained only if it executes successfully and returns a non-empty result on $\mathcal{D}$; otherwise it is discarded.
- **b) NL back-translation.** For each retained $\tilde{y}$, a separate LLM generates (i) a natural-language question $\tilde{x}$ consistent with $\tilde{y}$ and (ii) predicate-level sub-questions $\mathcal{X}(\tilde{x})$ aligned with the mutated constraints. The resulting pair $(\tilde{x}, \tilde{y})$ (and optional decomposition) is added back to $\mathcal{S}$, improving retrieval and synthesis coverage over time.

## 4 Experimental Setup

### 4.1 Dataset Construction

#### Seed set

We start with 500 seed questions authored by a Mayo Clinic oncology scientist to reflect realistic evidence-review queries over IOTOX (e.g., cancer type, class of ICI, trial phase, endpoints / follow-up, temporal filters). Each question is paired with a gold SQLite query, verified by execution, and stored as JSON (question, SQL, optional metadata). The seed set serves as (i) the initial exemplar bank for retrieval and (ii) the pool for few-shot demonstrations.

#### Programmatic benchmark expansion

To evaluate generalization beyond the seed distribution, we create a benchmark by applying a *single* atomic transformation to each seed pair $(x, y)$: $(\tilde{x}, \tilde{y}) = T(x, y)$, where $T$ edits either the projection (SELECT) or constraints (WHERE) while preserving the overall query intent. We generate the following variant types: