R^3-SQL: Ranking Reward and Resampling for Text-to-SQL
Summary
Source: https://huggingface.co/papers/2604.25325
Abstract
R^3-SQL addresses inconsistencies in scoring functionally equivalent SQL queries and improves candidate recall through unified reward ranking and agentic resampling techniques.
Modern Text-to-SQL systems generate multiple candidate SQL queries and rank them to select a final prediction. However, existing methods face two limitations. First, they often score functionally equivalent SQL queries inconsistently despite identical execution results. Second, ranking cannot recover when the correct SQL is absent from the candidate pool. We propose R^3-SQL, a Text-to-SQL framework that addresses both issues through a unified reward for ranking and resampling. R^3-SQL first groups candidates by execution result and ranks the groups for consistency. To score each group, it combines a pairwise preference across groups with a pointwise utility from the best group rank and size, capturing relative preference, consistency, and candidate quality. To improve candidate recall, R^3-SQL introduces agentic resampling, which judges the generated candidate pool and selectively resamples when the correct SQL is likely absent. R^3-SQL achieves 75.03% execution accuracy on BIRD-dev, a new state of the art among methods using models with disclosed sizes, with consistent gains across five benchmarks.
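The grouping-and-scoring step described in the abstract can be made concrete with a short sketch. The Python snippet below is illustrative only and not the authors' implementation: it clusters candidate SQL queries by their execution result so that functionally equivalent queries share one score, and combines a toy pointwise utility (group size and rank) with a pairwise-preference term that in the paper would come from a reward model. The SQLite executor, function names, and scoring formula are all assumptions made for demonstration.

```python
# Minimal sketch of result-based grouping and group scoring (not the authors' code).
import sqlite3
from collections import defaultdict

def execution_signature(conn, sql):
    """Run a candidate query and return an order-insensitive result signature."""
    try:
        rows = conn.execute(sql).fetchall()
        return frozenset(rows)           # equivalent queries yield the same signature
    except sqlite3.Error:
        return None                      # queries that fail execution form their own group

def group_candidates(conn, candidates):
    """Cluster candidate SQL strings by execution result (one group per distinct answer)."""
    groups = defaultdict(list)
    for sql in candidates:
        groups[execution_signature(conn, sql)].append(sql)
    return list(groups.values())

def group_score(group, rank, pairwise_preference):
    """Toy group score: a pairwise preference plus a pointwise utility from rank and size."""
    pointwise = len(group) / (1 + rank)  # larger and higher-ranked groups score higher
    return pairwise_preference + pointwise

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b")])
    candidates = [
        "SELECT name FROM t ORDER BY id",
        "SELECT t.name FROM t ORDER BY t.id",  # functionally equivalent to the first
        "SELECT id FROM t",
    ]
    print([len(g) for g in group_candidates(conn, candidates)])  # -> [2, 1]
```

In this sketch the signature ignores row order and duplicates; a real system would match whatever execution-equivalence criterion the benchmark uses, and the resampling decision (whether the correct SQL is likely missing from the pool) is not modeled here.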
Similar Articles
@omarsar0: Nice paper combining the strength of Skills and RAG. Most RAG systems retrieve on every query, whether the model needs …
Research introduces Skill-RAG, a novel approach that combines Skills with Retrieval-Augmented Generation to address inefficiencies in traditional RAG systems that retrieve on every query regardless of whether the model actually needs the information.
SCURank: Ranking Multiple Candidate Summaries with Summary Content Units for Enhanced Summarization
SCURank introduces Summary Content Units to rank candidate summaries, enabling small models distilled from multiple LLMs to outperform traditional metrics and single-LLM distillates.
Caraman at SemEval-2026 Task 8: Three-Stage Multi-Turn Retrieval with Query Rewriting, Hybrid Search, and Cross-Encoder Reranking
This paper describes a system for SemEval-2026 Task 8 that uses a three-stage pipeline involving query rewriting with a fine-tuned Qwen model, hybrid retrieval, and cross-encoder reranking to improve multi-turn retrieval performance.
Rethinking the Necessity of Adaptive Retrieval-Augmented Generation through the Lens of Adaptive Listwise Ranking
This paper proposes AdaRankLLM, an adaptive retrieval framework that challenges the necessity of adaptive RAG by using listwise ranking to dynamically filter retrieved passages. The work shows that adaptive retrieval serves as a noise filter for weaker models while acting as a cost-efficiency optimizer for stronger models, with extensive experiments across multiple datasets and LLMs.
The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains
This paper argues that simple averaging in AI benchmarks fails under data sparsity and difficulty heterogeneity, proposing Item Response Theory (IRT) as a robust alternative to recover ground truth rankings.