When Reranking Hurts: Uncertainty-Based Gating for Few-Shot Reranking
Summary
This paper challenges the assumption that reranking always improves few-shot selection for LLMs, proposing a training-free gated reranking approach that uses model uncertainty to decide when to rerank, reducing computational costs by 15-80% while slightly improving performance.
View Cached Full Text
Cached at: 07/01/26, 05:32 AM
# Uncertainty-Based Gating for Few-Shot Reranking
Source: [https://arxiv.org/html/2606.31087](https://arxiv.org/html/2606.31087)
Orian Dabod1, Amir DN Cohen2, Gabriel Stanovsky1
1The Hebrew University of Jerusalem,2OriginAI / Ramat Gan, Israel orian\.dabod@mail\.huji\.ac\.il
###### Abstract
Few\-shot selection typically assumes that reranking retrieved examples always improves performance\. We challenge this view by identifying that the expensive reranking step can in fact degrade performance\. Instead, we propose*Training\-Free Gated Reranking*, which decides whether to rerank the few\-shot examples based on the model’s uncertainty\. Extensive experiments across 8 LLMs, covering 7 NLU datasets and 9 MT domain\-language combinations, demonstrate that our approach reduces computational costs by 15%\-80% while improving average performance by up to 2%\. These findings indicate that higher computational cost does not guarantee better performance, and that reranking is most beneficial when targeted at high\-uncertainty instances\.
When Reranking Hurts: Uncertainty\-Based Gating for Few\-Shot Reranking
Orian Dabod1, Amir DN Cohen2, Gabriel Stanovsky11The Hebrew University of Jerusalem,2OriginAI / Ramat Gan, Israelorian\.dabod@mail\.huji\.ac\.il
## 1Introduction
Recent work has found that adaptive few\-shot selection can improve the performance of LLMs on in\-context learning tasks\(Agrawalet al\.,[2023](https://arxiv.org/html/2606.31087#bib.bib37); Chitaleet al\.,[2024](https://arxiv.org/html/2606.31087#bib.bib32); Liuet al\.,[2022](https://arxiv.org/html/2606.31087#bib.bib49)\)\. In particular, several works adopt a “retrieve\-then\-rerank” approach, fetching a broad pool of candidates and using trained cross\-encodersLiet al\.\([2023](https://arxiv.org/html/2606.31087#bib.bib43)\); Wanget al\.\([2024b](https://arxiv.org/html/2606.31087#bib.bib42)\); Rubinet al\.\([2022](https://arxiv.org/html/2606.31087#bib.bib50)\)or training\-free scorers to select demonstrations for a specific inference sampleWuet al\.\([2023](https://arxiv.org/html/2606.31087#bib.bib51)\); Penget al\.\([2024](https://arxiv.org/html/2606.31087#bib.bib9)\)\. However, reranking with repeated LLM calls increases computational costs, total token consumption can increase by 13\.4×\\timesfor MT and by 29\.5×\\timesfor NLU\.111Calculated per task, as the total sum of tokens across all models and datasets\.
0101020203030404050506060707080809090100100−6\-6%−4\-4%−2\-2%0%22%Relative BLEU Change \(%\)\(a\)Machine Translation \(MT\) Tasks0101020203030404050506060707080809090100100−2\-2%0%22%Token Saving \(%\)Relative Performance Change \(%\)\(b\)Natural Language Understanding \(NLU\)
Figure 1:Relative performance impact versus computational token savings achieved via selective reranking for Machine Translation \(MT, measured in BLEU\) and Natural Language Understanding \(NLU\) tasks\. The majority of models \(3B–70B\) exhibit performance gains, though the optimal savings window shifts from 15–50% for MT to roughly 50–80% for most NLU models\.To address these concerns, we propose*Training\-Free Gated Reranking*, a simple yet effective approach, which uses the perplexity of the LLM’s initial generation to rerank examples only when the model is uncertain\. Rather than reranking every instance, we apply the reranking step exclusively when the input’s perplexity falls above a predefined threshold\.
Our approach improves efficiency while matching or slightly surpassing performance across various models and datasets\. Figure[1](https://arxiv.org/html/2606.31087#S1.F1)shows that tuning the perplexity threshold for reranking yields computational savings of 15%\-80% while maintaining or surpassing baseline performance\. Our manual analysis confirms that reranking provides clear gains for difficult instances, but can degrade performance when applied to easy instances\. Overall, our approach presents a convenient tradeoff between resources and performance across 8 models ranging from 3B to 70B parameters, covering 9 combinations of language pairs and specialized domains, as well as 7 NLU datasets\.222Code and experimental data are provided in the supplementary material for review, and will be made publicly available upon acceptance\.
## 2Perplexity\-Based Reranking of Few\-Shot Examples
We propose a simple approach that adaptively allocates computational resources to choose few\-shot examples based on the model’s intrinsic uncertainty\.
#### Formal definitions\.
Given an LLMMM, an inputxxand a large pool of candidate demonstrations𝒞=\(\(x1,y1\),…,\(xn,yn\)\)\\mathcal\{C\}=\(\(x\_\{1\},y\_\{1\}\),\\ldots,\(x\_\{n\},y\_\{n\}\)\), each consisting of an input and a corresponding gold label, ranked according to their relevance toxx, we are looking for a function:
T\(M,x,𝒞\)↦\{c1,…,ck\|ci∈𝒞\}T\(M,x,\\mathcal\{C\}\)\\mapsto\\\{c\_\{1\},\\ldots,c\_\{k\}\|c\_\{i\}\\in\\mathcal\{C\}\\\}\(1\)
That is, a selection ofkkdemonstration samples from the pool according toxx\.
#### Method\.
We start by generating a prediction without rerankingy^=M\(x,c1,…,ck\)\\hat\{y\}=M\(x,c\_\{1\},\\ldots,c\_\{k\}\), and use it to compute a normalized uncertainty scoreU∈\[0,1\]U\\in\[0,1\]\. This score is derived from the inverse perplexity of the generation:
U\(M,x,𝒞\)=1−PPLM\(y^∣x,c1,…,ck\)−1\\text\{U\}\(M,x,\\mathcal\{C\}\)=1\-\\text\{PPL\}\_\{M\}\(\\hat\{y\}\\mid x,c\_\{1\},\\ldots,c\_\{k\}\)^\{\-1\}\(2\)
Where, and PPL computes the conditional perplexity ofy^\\hat\{y\}onxxand the firstkkexamples\.UUis in the range\[0,1\]\[0,1\], where larger values indicate that the model is less certain in its own prediction\.
We then define our demonstration selection functionTTbased on a predefined uncertainty thresholdτ\\tau:
T\(M,x,𝒞\)=\{\(\(x1,y1\),…\(xk,yk\)\)U≤τ,rerank\(M,x,𝒞\)U\>τT\(M,x,\\mathcal\{C\}\)=\\begin\{cases\}\(\(x\_\{1\},y\_\{1\}\),\\ldots\(x\_\{k\},y\_\{k\}\)\)&U\\leq\\tau,\\\\ \\operatorname\{rerank\}\(M,x,\\mathcal\{C\}\)&U\>\\tau\\end\{cases\}\(3\)
wherererankrerankis defined as the conditional entropy rerankPenget al\.\([2024](https://arxiv.org/html/2606.31087#bib.bib9)\):
rerank\(M,x,𝒞\)=Top−k\(xi,yi\)∈𝒞\(−PPLM\(x∣\(xi,yi\)\)\)\\operatorname\{rerank\}\(M,x,\\mathcal\{C\}\)=\\operatorname\*\{Top\-k\}\_\{\(x\_\{i\},y\_\{i\}\)\\in\\mathcal\{C\}\}\\Big\(\-\\text\{PPL\}\_\{M\}\(x\\mid\(x\_\{i\},y\_\{i\}\)\)\\Big\)\(4\)
In practice, we calibrate the thresholdτ\\tauto maximize performance on a development set\. During this calibration phase, we apply a moving average with a window size of 5 to smooth local variance\.
Global \(All\)Small ModelsBigger ModelsMethodPerf\.↑\\uparrowSave\. %Perf\.↑\\uparrowSave\. %Perf\.↑\\uparrowSave\. %Machine Translation \(BLEU / COMET\)No Reranking37\.09/81\.05100\.033\.37/79\.46100\.038\.33/81\.57100\.0Full Reranking38\.32/81\.370\.034\.77/79\.890\.039\.51/81\.860\.0Gating \(dev\-set calibrated\)38\.42†/81\.3920\.8634\.95†/79\.9420\.5239\.58†/81\.8820\.98Gating \(test\-set calibrated\)38\.69†/81\.4317\.4235\.19†/79\.9819\.2239\.85†/81\.9216\.82Natural Language Understanding \(Accuracy\)No Reranking79\.88100\.077\.16100\.080\.78100\.0Full Reranking80\.730\.078\.450\.081\.500\.0Gating \(dev\-set calibrated\)80\.8154\.378\.6050\.281\.5555\.6Gating \(test\-set calibrated\)81\.41†47\.1679\.30†46\.6982\.11†47\.32
Table 1:Performance summary across tasks and model sizes\. Performance \(Perf\.\) shows BLEU/COMET for MT and Accuracy for NLU\.Boldindicates row maxima;underlinedvalues exceed Full Reranking performance\. Gating \(dev\-set calibrated\) uses dev\-setτ\\tau, while Gating \(test\-set calibrated\) uses test\-setτ\\tau\.†denotes a statistically significant improvement over No Reranking \(p<0\.05p<0\.05\)\. Exactpp\-values are provided for transparency\.
## 3Evaluation
### 3\.1Experimental Setup
We list below key experimental details, see more details in the Appendix\.
#### Tasks and models\.
We evaluate our approach across 8 LLMs spanning 3B–70B parameters, compared to a full reranking baselinePenget al\.\([2024](https://arxiv.org/html/2606.31087#bib.bib9)\), assessing performance on both natural language understanding \(NLU\) and machine translation \(MT\) tasks\. Specifically, we adopt the NLU benchmarks used byPenget al\.\([2024](https://arxiv.org/html/2606.31087#bib.bib9)\): SST\-2, SST\-5Socheret al\.\([2013](https://arxiv.org/html/2606.31087#bib.bib45)\), CR, AgNews, SubjWanget al\.\([2019](https://arxiv.org/html/2606.31087#bib.bib46)\), MNLIWilliamset al\.\([2018](https://arxiv.org/html/2606.31087#bib.bib47)\), and QNLIWanget al\.\([2019](https://arxiv.org/html/2606.31087#bib.bib46)\)\. For MT, we use three domain specific corporaKoehn and Knowles \([2017](https://arxiv.org/html/2606.31087#bib.bib26)\): EMEA \(medical\)Tiedemann \([2012](https://arxiv.org/html/2606.31087#bib.bib19)\), JRC\-Acquis \(legal\)Steinbergeret al\.\([2006](https://arxiv.org/html/2606.31087#bib.bib20)\), and KDE \(technical\)Tiedemann \([2012](https://arxiv.org/html/2606.31087#bib.bib19)\)\. We consider both translation directions between English and Spanish, Portuguese, and German, yielding six directions per domain\. We sample 1,000 parallel sentence pairs for each domain\-direction combination, partitioned into 200 development examples for threshold tuning and 800 test examples for evaluation\. All results are averaged over 20 random splits for statistical robustness\.
#### Baselines\.
Baselines include retrieval\-basedkk\-shot selection with BM25Robertson and Zaragoza \([2009](https://arxiv.org/html/2606.31087#bib.bib25)\)and dense retrieval using e5\-base\-v2 or multilingual\-e5\-base depending on the languageWanget al\.\([2022](https://arxiv.org/html/2606.31087#bib.bib24),[2024a](https://arxiv.org/html/2606.31087#bib.bib27)\), and always\-rerank \(τ=0\\tau=0, reranking for every input atO\(N\)O\(N\)\)\. We also compare gating signals to simple input proxies \(e\.g\., source length, source entropy\) and an uncertainty proxy \(token\-level logit gap \- TARG\)Wanget al\.\([2025](https://arxiv.org/html/2606.31087#bib.bib7)\)\.
#### Hyperparameters and metrics\.
We fixk=5k=5demonstrations and a candidate pool ofN=100N=100, report MT quality with BLEUPapineniet al\.\([2002](https://arxiv.org/html/2606.31087#bib.bib21)\)and COMET \(wmt22\-comet\-da\)Reiet al\.\([2022](https://arxiv.org/html/2606.31087#bib.bib22)\), and NLU quality with accuracy\. Following scaling laws for inference computeKaplanet al\.\([2020](https://arxiv.org/html/2606.31087#bib.bib41)\)and standard open\-weight serving metricsGriggset al\.\([2024](https://arxiv.org/html/2606.31087#bib.bib48)\), we evaluate efficiency using the unweighted sum of input and output tokens across all processing stages: \(a\) initial draft generation, \(b\) second\-pass generation after reranking for triggered cases, and \(c\) all conditional perplexity computations over theN=100N=100candidates\. We report relative savings against an always\-reranking baseline\.
### 3\.2Results
Our experiments yield several interesting observations\.
#### Always reranking is not the optimal policy for few\-shot selection in MT and NLU\.
Applying reranking to every query is highly inefficient and unnecessary for maintaining quality\. Table[1](https://arxiv.org/html/2606.31087#S2.T1)demonstrates that our practical Empirical Gating method reduces average computational costs by 20\.86% and 54\.3% for MT and NLU, while achieving performance on par with Full Reranking \(38\.42 vs 38\.32 avg BLEU; 80\.81 vs 80\.73 avg Accuracy\)\. Furthermore, using a post\-hoc test\-set calibrated gating threshold achieves comparable cost reductions \(17\.42% / 47\.16%\) while slightly increasing Full Reranking performance \(\+0\.37 avg BLEU, \+0\.68 avg Accuracy\), demonstrating there is room for improvement\.
#### Selective reranking via uncertainty can improve performance, especially for smaller models\.
Selective reranking remains highly competitive with the Full Reranking baseline across model scales\. Table[1](https://arxiv.org/html/2606.31087#S2.T1)highlights that this dynamic is most pronounced for small models, Full Reranking performance \(34\.77 avg BLEU / 78\.45 avg Accuracy\) is slightly improved upon by our retrospective Gating \(test\-set calibrated\) \(\+0\.42 avg BLEU / \+0\.85 avg Accuracy\) and practical Gating \(dev\-set calibrated\) results \(\+0\.18 avg BLEU / \+0\.15 avg Accuracy\) while yielding average token savings of 19\.22% / 46\.69% and 20\.52% / 50\.2%\. Larger models in the optimal Gating \(test\-set calibrated\) settings show smaller yet positive margins \(\+0\.34 avg BLEU / \+0\.61 avg Accuracy\), while the Gating \(dev\-set calibrated\) yields strong savings \(20\.98% and 55\.6%\) and comparable performance \(\+0\.07 avg BLEU / \+0\.05 avg Accuracy\) demonstrating that selective application remains a highly efficient alternative even at scale\.
#### Selective reranking via uncertainty provides efficient way to optimize performance with minimal impact on task performance\.
As Figure[1](https://arxiv.org/html/2606.31087#S1.F1)illustrates, adjusting the gating threshold provides a way to balance cost and quality\. This tradeoff relies on the correlation between the LLM’s initial uncertainty and the utility of the reranker\. This allows us us to to reduce average token consumption from 15% to 80%\.
#### Uncertainty gating outperforms other gating methods\.
Alternative gating signals fail to achieve the same balance of quality and efficiency\. As shown in Table[2](https://arxiv.org/html/2606.31087#S3.T2), where we compare our approach against several alternative baseline metrics used for the same gating purpose: the perplexity of the input text \(Source Entropy\), the count of words \(Word Length\), along with TARG\(Wanget al\.,[2025](https://arxiv.org/html/2606.31087#bib.bib7)\)\.
StrategyTaskScoreSavingsNo RerankingMT37\.09100\.0%NLU79\.88100\.0%Full RerankingMT38\.320\.0%NLU80\.730\.0%TARGMT38\.158\.6%NLU80\.4635\.6%Source EntropyMT38\.1911\.8%NLU80\.4642\.2%Word LengthMT38\.1117\.1%NLU80\.3738\.1%Uncertainty \(Ours\)MT38\.4220\.9%NLU80\.8154\.3%Table 2:Comparison of gating indicators averaged across all models for MT and NLU\. Thresholds \(τ\\tau\) for each strategy are calibrated on a development set and evaluated on a test set\. Uncertainty achieves the best performance\-efficiency balance\.
## 4Manual Qualitative Analysis
Error CauseMTNLUOverallReranking errorsStructural Template212344High Variance4711Other505No Reranking errorsDomain/Term Shift15823High Variance9918Relational Mapping099Bleu Misleading404Other246Table 3:Distribution of error causes when reranking degrades performance compared to baseline vs\. when reranking improved performance on translation and NLU tasks\.To understand the mechanics driving the performance trade\-offs of uncertainty\-based gating, we manually annotate and analyze instances with the highest and lowest uncertainty scores\. Specifically, we examine some of the top 30 instances \(highest uncertainty, reranking beneficial\) and bottom 30 instances \(lowest uncertainty, reranking degraded\) across both Machine Translation \(MT\) and Natural Language Understanding \(NLU\) tasks to diagnose how the few\-shot examples influence the model’s generation\. See Table[3](https://arxiv.org/html/2606.31087#S4.T3)and Appendix Table[7](https://arxiv.org/html/2606.31087#S7.T7)\.
#### High uncertainty reveals weak baseline retrieval\.
Under high uncertainty, the baseline frequently retrieves misleading demonstrations, such as out\-of\-domain examples in MT \(15 instances\) or missed Relational Mappings in NLU \(9 instances\)\. The reranker resolves these issues by correctly identifying the task’s structural logic and domain constraints\. By successfully retrieving examples that enforce low variance, specialized terminology or exact relationships \(e\.g\., mapping a neighborhood to a borough\), reranking effectively grounds the model and reduces hallucinations\.
#### Low uncertainty indicates strong baseline retrieval\.
For low\-uncertainty instances, the baseline already retrieves highly relevant, format\-preserving contexts\. Here, reranking can actually degrade performance\. The primary cause across both MT \(21 instances\) and NLU \(23 instances\) is the disruption of Structural Templates\. While the baseline retrieves near\-exact syntactic templates, the reranker often overrides them with examples that are topically relevant but structurally varied\. Furthermore, reranking can introduce high\-variance examples with divergent meanings or lexical distractions that confuse the model\. \(Note: A small subset of MT errors were "BLEU Misleading," where the reranker’s valid translation was unfairly penalized by the metric
## 5Conclusion
We show that reranking few\-shot examples is not always beneficial and can even hurt performance\. Instead, we introduce*Training\-Free Gated Reranking*, which selectively ranks when the model is uncertain, reducing computation by 15%–80% while improving average performance by up to 2%\. Overall, our findings suggest that targeted reranking can break the usual tradeoff between higher computational cost and better downstream quality\.
## 6Limitations
Our approach presents three primary limitations\. First, the gating mechanism relies on an empirically calibrated threshold from a development set; distribution shifts between development and test data can cause suboptimal thresholding\. Second, calculating the uncertainty score requires generating an initial, non\-reranked prediction\. For tasks with long output sequences, the latency of this initial step may reduce the efficiency gained from skipping the reranking phase\. Finally, the method requires access to token\-level perplexity, limiting its use to models and APIs that expose logits\.
## References
- In\-context examples selection for machine translation\.InFindings of the Association for Computational Linguistics: ACL 2023,A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 8857–8873\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.findings-acl.564),[Link](https://aclanthology.org/2023.findings-acl.564)Cited by:[§1](https://arxiv.org/html/2606.31087#S1.p1.2)\.
- P\. A\. Chitale, J\. Gala, and R\. Dabre \(2024\)An empirical study of in\-context learning in llms for machine translation\.Vol\.abs/2401\.12097\.External Links:[Link](https://arxiv.org/abs/2401.12097)Cited by:[§1](https://arxiv.org/html/2606.31087#S1.p1.2)\.
- T\. Griggs, X\. Liu, J\. Yu, D\. Kim, W\. Chiang, A\. Cheung, and I\. Stoica \(2024\)Mélange: cost efficient large language model serving by exploiting gpu heterogeneity\.Vol\.abs/2404\.14527\.External Links:[Link](https://arxiv.org/abs/2404.14527)Cited by:[§3\.1](https://arxiv.org/html/2606.31087#S3.SS1.SSS0.Px3.p1.3)\.
- J\. Kaplan, S\. McCandlish, T\. Henighan, T\. B\. Brown, B\. Chess, R\. Child, S\. Gray, A\. Radford, J\. Wu, and D\. Amodei \(2020\)Scaling laws for neural language models\.Vol\.abs/2001\.08361\.External Links:[Link](https://arxiv.org/abs/2001.08361)Cited by:[§3\.1](https://arxiv.org/html/2606.31087#S3.SS1.SSS0.Px3.p1.3)\.
- P\. Koehn and R\. Knowles \(2017\)Six challenges for neural machine translation\.InProceedings of the First Workshop on Neural Machine Translation,T\. Luong, A\. Birch, G\. Neubig, and A\. Finch \(Eds\.\),Vancouver,pp\. 28–39\.External Links:[Document](https://dx.doi.org/10.18653/v1/W17-3204),[Link](https://aclanthology.org/W17-3204)Cited by:[§3\.1](https://arxiv.org/html/2606.31087#S3.SS1.SSS0.Px1.p1.1)\.
- X\. Li, K\. Lv, H\. Yan, T\. Lin, W\. Zhu, Y\. Ni, G\. Xie, X\. Wang, and X\. Qiu \(2023\)Unified demonstration retriever for in\-context learning\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 4644–4668\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.acl-long.256),[Link](https://aclanthology.org/2023.acl-long.256)Cited by:[§1](https://arxiv.org/html/2606.31087#S1.p1.2)\.
- J\. Liu, D\. Shen, Y\. Zhang, B\. Dolan, L\. Carin, and W\. Chen \(2022\)What makes good in\-context examples for GPT\-3?\.InProceedings of Deep Learning Inside Out \(DeeLIO 2022\): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures,E\. Agirre, M\. Apidianaki, and I\. Vulić \(Eds\.\),Dublin, Ireland and Online,pp\. 100–114\.External Links:[Document](https://dx.doi.org/10.18653/v1/2022.deelio-1.10),[Link](https://aclanthology.org/2022.deelio-1.10)Cited by:[§1](https://arxiv.org/html/2606.31087#S1.p1.2)\.
- K\. Papineni, S\. Roukos, T\. Ward, and W\. Zhu \(2002\)Bleu: a method for automatic evaluation of machine translation\.InProceedings of the 40th Annual Meeting of the Association for Computational Linguistics,P\. Isabelle, E\. Charniak, and D\. Lin \(Eds\.\),Philadelphia, Pennsylvania, USA,pp\. 311–318\.External Links:[Document](https://dx.doi.org/10.3115/1073083.1073135),[Link](https://aclanthology.org/P02-1040)Cited by:[§3\.1](https://arxiv.org/html/2606.31087#S3.SS1.SSS0.Px3.p1.3)\.
- K\. Peng, L\. Ding, Y\. Yuan, X\. Liu, M\. Zhang, Y\. Ouyang, and D\. Tao \(2024\)Revisiting demonstration selection strategies in in\-context learning\.Vol\.abs/2401\.12087\.External Links:[Link](https://arxiv.org/abs/2401.12087)Cited by:[§1](https://arxiv.org/html/2606.31087#S1.p1.2),[§2](https://arxiv.org/html/2606.31087#S2.SS0.SSS0.Px2.p6.1),[§3\.1](https://arxiv.org/html/2606.31087#S3.SS1.SSS0.Px1.p1.1)\.
- R\. Rei, M\. Treviso, N\. M\. Guerreiro, C\. Zerva, A\. C\. Farinha, C\. Maroti, J\. G\. C\. de Souza, T\. Glushkova, D\. Alves, L\. Coheur, A\. Lavie, and A\. F\. T\. Martins \(2022\)CometKiwi: IST\-unbabel 2022 submission for the quality estimation shared task\.InProceedings of the Seventh Conference on Machine Translation \(WMT\),P\. Koehn, L\. Barrault, O\. Bojar, F\. Bougares, R\. Chatterjee, M\. R\. Costa\-jussà, C\. Federmann, M\. Fishel, A\. Fraser, M\. Freitag, Y\. Graham, R\. Grundkiewicz, P\. Guzman, B\. Haddow, M\. Huck, A\. Jimeno Yepes, T\. Kocmi, A\. Martins, M\. Morishita, C\. Monz, M\. Nagata, T\. Nakazawa, M\. Negri, A\. Névéol, M\. Neves, M\. Popel, M\. Turchi, and M\. Zampieri \(Eds\.\),Abu Dhabi, United Arab Emirates \(Hybrid\),pp\. 634–645\.External Links:[Link](https://aclanthology.org/2022.wmt-1.60)Cited by:[§3\.1](https://arxiv.org/html/2606.31087#S3.SS1.SSS0.Px3.p1.3)\.
- S\. E\. Robertson and H\. Zaragoza \(2009\)The probabilistic relevance framework: bm25 and beyond\.Found\. Trends Inf\. Retr\.3,pp\. 333–389\.External Links:[Link](https://api.semanticscholar.org/CorpusID:207178704)Cited by:[§3\.1](https://arxiv.org/html/2606.31087#S3.SS1.SSS0.Px2.p1.3)\.
- O\. Rubin, J\. Herzig, and J\. Berant \(2022\)Learning to retrieve prompts for in\-context learning\.InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,M\. Carpuat, M\. de Marneffe, and I\. V\. Meza Ruiz \(Eds\.\),Seattle, United States,pp\. 2655–2671\.External Links:[Document](https://dx.doi.org/10.18653/v1/2022.naacl-main.191),[Link](https://aclanthology.org/2022.naacl-main.191)Cited by:[§1](https://arxiv.org/html/2606.31087#S1.p1.2)\.
- R\. Socher, A\. Perelygin, J\. Wu, J\. Chuang, C\. D\. Manning, A\. Ng, and C\. Potts \(2013\)Recursive deep models for semantic compositionality over a sentiment treebank\.InProceedings of the 2013 Conference on Empirical Methods in Natural Language Processing,D\. Yarowsky, T\. Baldwin, A\. Korhonen, K\. Livescu, and S\. Bethard \(Eds\.\),Seattle, Washington, USA,pp\. 1631–1642\.External Links:[Link](https://aclanthology.org/D13-1170)Cited by:[§3\.1](https://arxiv.org/html/2606.31087#S3.SS1.SSS0.Px1.p1.1)\.
- R\. Steinberger, B\. Pouliquen, A\. Widiger, C\. Ignat, T\. Erjavec, D\. Tufiş, and D\. Varga \(2006\)The JRC\-Acquis: a multilingual aligned parallel corpus with 20\+ languages\.InProceedings of the Fifth International Conference on Language Resources and Evaluation \(LREC’06\),N\. Calzolari, K\. Choukri, A\. Gangemi, B\. Maegaard, J\. Mariani, J\. Odijk, and D\. Tapias \(Eds\.\),Genoa, Italy\.External Links:[Link](http://www.lrec-conf.org/proceedings/lrec2006/pdf/340_pdf.pdf)Cited by:[§3\.1](https://arxiv.org/html/2606.31087#S3.SS1.SSS0.Px1.p1.1)\.
- J\. Tiedemann \(2012\)Parallel data, tools and interfaces in OPUS\.InProceedings of the Eighth International Conference on Language Resources and Evaluation \(LREC’12\),N\. Calzolari, K\. Choukri, T\. Declerck, M\. U\. Doğan, B\. Maegaard, J\. Mariani, A\. Moreno, J\. Odijk, and S\. Piperidis \(Eds\.\),Istanbul, Turkey,pp\. 2214–2218\.External Links:[Link](http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf)Cited by:[§3\.1](https://arxiv.org/html/2606.31087#S3.SS1.SSS0.Px1.p1.1)\.
- A\. Wang, A\. Singh, J\. Michael, F\. Hill, O\. Levy, and S\. R\. Bowman \(2019\)GLUE: A multi\-task benchmark and analysis platform for natural language understanding\.In7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6\-9, 2019,External Links:[Link](https://openreview.net/forum?id=rJ4km2R5t7)Cited by:[§3\.1](https://arxiv.org/html/2606.31087#S3.SS1.SSS0.Px1.p1.1)\.
- L\. Wang, N\. Yang, X\. Huang, B\. Jiao, L\. Yang, D\. Jiang, R\. Majumder, and F\. Wei \(2022\)Text embeddings by weakly\-supervised contrastive pre\-training\.ArXiv preprintabs/2212\.03533\.External Links:[Link](https://arxiv.org/abs/2212.03533)Cited by:[§3\.1](https://arxiv.org/html/2606.31087#S3.SS1.SSS0.Px2.p1.3)\.
- L\. Wang, N\. Yang, X\. Huang, L\. Yang, R\. Majumder, and F\. Wei \(2024a\)Multilingual e5 text embeddings: a technical report\.Vol\.abs/2402\.05672\.External Links:[Link](https://arxiv.org/abs/2402.05672)Cited by:[§3\.1](https://arxiv.org/html/2606.31087#S3.SS1.SSS0.Px2.p1.3)\.
- L\. Wang, N\. Yang, and F\. Wei \(2024b\)Learning to retrieve in\-context examples for large language models\.InProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\),Y\. Graham and M\. Purver \(Eds\.\),St\. Julian’s, Malta,pp\. 1752–1767\.External Links:[Link](https://aclanthology.org/2024.eacl-long.105)Cited by:[§1](https://arxiv.org/html/2606.31087#S1.p1.2)\.
- Y\. Wang, L\. wei, and H\. Ling \(2025\)TARG: training\-free adaptive retrieval gating for efficient rag\.Vol\.abs/2511\.09803\.External Links:[Link](https://arxiv.org/abs/2511.09803)Cited by:[§3\.1](https://arxiv.org/html/2606.31087#S3.SS1.SSS0.Px2.p1.3),[§3\.2](https://arxiv.org/html/2606.31087#S3.SS2.SSS0.Px4.p1.1),[§7\.2](https://arxiv.org/html/2606.31087#S7.SS2.SSS0.Px2.p1.1)\.
- A\. Williams, N\. Nangia, and S\. Bowman \(2018\)A broad\-coverage challenge corpus for sentence understanding through inference\.InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long Papers\),M\. Walker, H\. Ji, and A\. Stent \(Eds\.\),New Orleans, Louisiana,pp\. 1112–1122\.External Links:[Document](https://dx.doi.org/10.18653/v1/N18-1101),[Link](https://aclanthology.org/N18-1101)Cited by:[§3\.1](https://arxiv.org/html/2606.31087#S3.SS1.SSS0.Px1.p1.1)\.
- Z\. Wu, Y\. Wang, J\. Ye, and L\. Kong \(2023\)Self\-adaptive in\-context learning: an information compression perspective for in\-context example selection and ordering\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 1423–1436\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.acl-long.79),[Link](https://aclanthology.org/2023.acl-long.79)Cited by:[§1](https://arxiv.org/html/2606.31087#S1.p1.2)\.
- Y\. Yu, W\. Ping, Z\. Liu, B\. Wang, J\. You, C\. Zhang, M\. Shoeybi, and B\. Catanzaro \(2024\)RankRAG: unifying context ranking with retrieval\-augmented generation in llms\.InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 \- 15, 2024,A\. Globersons, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. M\. Tomczak, and C\. Zhang \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2024/hash/db93ccb6cf392f352570dd5af0a223d3-Abstract-Conference.html)Cited by:[§7\.2](https://arxiv.org/html/2606.31087#S7.SS2.SSS0.Px2.p1.1)\.
## 7Appendix
### 7\.1Prompt Template
We utilized the following chat template for all Training\-Free Gated Reranking experiments\. Dynamic fields are denoted in brackets\.
MT Prompt Template\[System Message\]You are an expert translator\. Task: Translate \[Source Language\] to \[Target Language\]\.Rule 1: Output ONLY the translated text\.Rule 2: Do not engage in conversation or explain the translation\.Rule 3: CRITICAL: You have been provided with a "Reference Translation" in the context\. You MUST copy the terminology and structure from the Reference Translation exactly\. Do not rephrase\.\[User Message\]\#\#\# Reference Examples:Input \(\[Source Language\]\): \[Retrieved Source Text 1\]Output \(\[Target Language\]\): \[Retrieved Target Text 1\]Input \(\[Source Language\]\): \[Retrieved Source Text 2\]Output \(\[Target Language\]\): \[Retrieved Target Text 2\]…\#\#\# Current Task:Input \(\[Source Language\]\): \[Input Source Text\]Output \(\[Target Language\]\):
NLU Prompt Template\[System Message\]You are an AI assistant\. Complete the given task by outputting ONLY the required label\. Do not explain\.\[User Message\]\#\#\# Reference Examples:\[Retrieved Source Text 1\] \[Retrieved Target Label 1\]\[Retrieved Source Text 2\] \[Retrieved Target Label 2\]…\#\#\# Current Task:\[Input Source Text\]
### 7\.2Hyperparameter Selection
To select our architectural choices, we conducted a three\-stage ablation study on the KDE dataset using theLlama 3Bmodel\.
#### Retrieval Mechanism and Shot Count\.
First, we determined the optimal retrieval backbone\. As shown in Table[4](https://arxiv.org/html/2606.31087#S7.T4),Dense\-based retrievalcombined with reranking consistently outperforms lexical \(BM25\) and random selection\. Crucially, the performance gap widens as the number of shots \(kk\) increases, peaking atk=5k=5\. This validates our decision to fixk=5k=5for the main experiments, as it maximizes the context window’s utility\.
StrategykkCOMETBLEUChrFZero\-Shot069\.817\.1744\.14Random171\.2718\.7446\.05371\.2619\.1646\.07570\.9419\.1045\.82Dense173\.5623\.2850\.78373\.9824\.3551\.80573\.7924\.7452\.19BM25173\.0923\.8650\.65374\.0325\.1352\.15574\.2924\.9752\.33BM25 Full Reranking173\.8524\.4051\.38374\.2026\.0552\.54574\.5126\.2152\.81Dense Full Reranking173\.7624\.1551\.45374\.6526\.3053\.19574\.8026\.7953\.44
Table 4:Impact of Retrieval Strategy and Shot Count \(kk\)\.Ablation study on the KDE \(En→\\toDe\) dataset\. While static retrieval methods \(Random, BM25, E5\) show diminishing returns or saturation at higherkk, the reranking strategies continue to improve, withE5 \+ TopP\+ConEatk=5k=5establishing the upper bound for performance\. Bold values indicate the optimal shot count for each specific strategy\.
#### Comparison of Reranking Algorithms\.
To determine the optimal training\-free method, we evaluated the effectiveness of Conditional Entropy Rerank function against alternative strategies, including Source Entropy Reranking, TARG Reranking which is based onWanget al\.\([2025](https://arxiv.org/html/2606.31087#bib.bib7)\), RankRAG Reranking which followsYuet al\.\([2024](https://arxiv.org/html/2606.31087#bib.bib5)\)without training\. As shown in Table[5](https://arxiv.org/html/2606.31087#S7.T5), while some reranking methods offer some gains, utilizing the Conditional Entropy Rerank yields the most robust performance\.
StrategyCOMETBLEUChrFSource Entropy Reranking69\.8017\.1744\.14RankRAG Reranking71\.5118\.9246\.71TARG Reranking74\.5126\.3452\.88Conditional Entropy Rerank74\.8026\.7953\.44
Table 5:Re\-Ranking Strategies\.Evaluation on KDE \(En→\\toDe\) usingLlama\-3\.2\-3B\-Instruct\(Llama 3B\) \(k=5k=5\)\.
#### Sensitivity to Candidate Pool Size\.
Finally, we analyzed the computational trade\-off regarding the cardinality of the candidate pool\|𝒞\|=N\|\\mathcal\{C\}\|=N\. As detailed in Table[6](https://arxiv.org/html/2606.31087#S7.T6), increasing the search space from5×k5\\times kto20×k20\\times kyields steady improvements\. However, performance plateaus beyondN=100N=100\(20×k20\\times k\)\. Consequently, we selected a pool size of 100 for all experiments to balance maximum utility with inference latency\.
Dense Full RerankingNCOMETBLEUChrF174\.0424\.9552\.42574\.4326\.0152\.801074\.4226\.2053\.082074\.8826\.8053\.473074\.8327\.0853\.38
Table 6:Candidate Pool Sensitivity\.Increasing pool size \(NN\) yields gains up toN=100N=100\(MULT=20\), after which performance saturates\.
### 7\.3Detailed examples used in Analysis
Table[7](https://arxiv.org/html/2606.31087#S7.T7)shows full details for the examples shown in the analysis section\.
StrategyRetrieved In\-Context Example \(Shot\)Final Model OutputCase 1: Reranking Helped \(Resolves Terminology Mismatch\)Input \(PT\):As actualizações sobre o decurso do estudo e sobre o recrutamento serão apresentadas juntamente com os reavaliação anual/ RPS\.Reference \(EN\):Progress and recruitment updates will be provided alongside the Annual Re\-assessment/PSURs\.BaselineSrc:As actualizações do progresso e dos recrutamentos serão apresentadas juntamente com os PSURs normais\. Tgt:Progress and recruitment updates will be provided alongside the standard PSURs\.Updates on the study course and recruitment will be presented together with the annual review/RPS\.RerankSrc:O sumário da informação de segurança recolhida será apresentado juntamente com os reavaliação anual/ RPSs\. Tgt:Summary safety information gathered will be provided alongside the Annual Re\-assessment/ PSURs\.Updates regarding the course of the study and recruitment will be provided alongside the Annual Re\-assessment/PSURs\.Case 2: Baseline Was Better \(Reranking Caused Hallucination\)Input \(EN\):Like all medicines, Xenical can cause side effects, although not everybody gets them\.Reference \(PT\):Como todos os medicamentos, Xenical pode causar efeitos secundários, no entanto estes não se manifestam em todas as pessoas\.BaselineSrc:Like all medicines, Arixtra can cause side effects, although not everybody gets them\. Tgt:Como todos os medicamentos, Arixtra pode causar efeitos secundários, no entanto, estes não se manifestam em todas as pessoas\.Como todos os medicamentos, Xenical pode causar efeitos secundários, no entanto, estes não se manifestam em todas as pessoas\. Eng: Like all medicines, Xenical can cause side effects although not everybody gets them\.RerankSrc:Take special care with XENICAL Weight loss may also affect the dose of medicines taken for other conditions… Tgt:Tome especial cuidado com XENICAL A perda de peso pode também obrigar a alterar a dose de medicamentos…Tome especial cuidado com XENICAL A perda de peso pode também obrigar a alterar a dose de medicamentos que esteja a tomar para outras doenças \(p\. ex\. colesterol elevado ou diabetes\)\. Eng: Take special care with XENICAL Weight loss may also affect the dose of medicines taken for other conditions \(e\.g\. high cholesterol or diabetes\)\.Case 3: Baseline Was Better \(Reranking Caused Template Mismatch\)Input \(EN\):Like all medicines, STOCRIN can cause side effects, although not everybody gets them\.Reference \(PT\):Como todos os medicamentos, STOCRIN pode causarefeitos secundários, no entanto estes não se manifestam em todas as pessoas\.BaselineSrc:Like all medicines, Cetrotide can cause side effects, although not everybody gets them\. Tgt:Como os demais medicamentos, Cetrotide pode causar efeitos secundários, no entanto estes não se manifestam em todas as pessoas\.Como os demais medicamentos, STOCRIN pode causarefeitos secundários, no entanto estes não se manifestam em todas as pessoas\. Eng: Like the other medicines, STOCRIN can causesecondary effects, however these do not manifest in all people\.RerankSrc:Always notify your doctor immediately if you have these symptoms or any side effects while taking STOCRIN\. Tgt:Informe sempre o seu médico imediatamente se sentir qualquer um destes sintomas ou quaisquer efeitos indesejáveis…Como todos os medicamentos, STOCRIN pode causarefeitos colaterais, embora nem todo mundo os obtenha\. Eng: Like all medicines, STOCRIN can causecollateral effects, although not everyone obtains them\.Case 4: Baseline Was Better \(Reranking Lost Structural Template\)Input Task \(NLU\):Some rooms have balconies\. Can we know Some rooms have balconies off of them that overlook the ocean\.?Reference Label:Maybe\.BaselineTask:There are sea views from some rooms and there is also a swimming pool\. Can we know Half the rooms are oceanview\.? Label:Maybe\.Maybe\.RerankTask:Rooms have private lanais, refrigerators, and coffeemakers\. Can we know Rooms have refigerators\.? Label:Yes\.No\.Case 5: Reranking Helped \(Resolves Relational Mapping\)Input Task \(NLU\):In what borough is there a neighborhood called Huguenot? Can we know A small group of Huguenots also settled on the south shore of Staten Island along the New York Harbor, for which the current neighborhood of Huguenot was named\.?Reference Label:Yes\.BaselineTask:The borough of Staten Island is primarily located on what island? Can we know This situation of boroughs separated by water led to the development of an extensive infrastructure of bridges and tunnels\.? Label:No\.No\.RerankTask:In what borough is the Douglaston neighborhood located? Can we know In contrast, New York City also has neighborhoods that are less densely populated and feature free\-standing dwellings\.? Label:No\.Yes\.
Table 7:Comparison of Baseline vs\. Reranking In\-Context Retrieval on Translation and NLU Quality
### 7\.4Performance Across Tasks and Models
Table[8](https://arxiv.org/html/2606.31087#S7.T8)details the results across individual datasets, while Table[9](https://arxiv.org/html/2606.31087#S7.T9)provides a model\-by\-model breakdown\. Tables[11](https://arxiv.org/html/2606.31087#S7.T11)and[10](https://arxiv.org/html/2606.31087#S7.T10)shows Gated \(test\-set calibrated\) and Gated \(dev\-set calibrated\) results per model and per dataset\. The data indicate that our gating strategy generally yields higher performance than full reranking while simultaneously maintaining consistent computational efficiency\. Although empirical performance occasionally trails that of full reranking, the efficiency gains persist across all configurations\.
DatasetBaselineFull RerankGated \(Test\-Set Calibrated\)SavingsGated \(Dev\-Set Calibrated\)SavingsMachine Translation \(BLEU\)Medical\_EMEA38\.7739\.8940\.46†26%40\.12†29%Medical\_JRC\-Acquis39\.3940\.8341\.07†12%40\.87†16%Medical\_KDE33\.1134\.2634\.53†15%34\.27†18%Natural Language Understanding \(Accuracy\)AgNews91\.6692\.6392\.67†47%92\.34†58%CR93\.7693\.6294\.61†88%93\.7280%MNLI65\.9067\.5067\.75†15%67\.13†45%QNLI72\.1073\.5874\.04†34%73\.6347%SST\-293\.3794\.0294\.34†67%93\.8959%SST\-551\.8151\.3453\.16†62%52\.2560%Subj90\.5692\.4693\.27†17%92\.75†30%
Table 8:Performance breakdown by dataset using Uncertainty\. Gated \(test\-set calibrated\) represents the theoretical upper bound, while Gated \(dev\-set calibrated\) represents practical levels\. Gated values exceeding Full Reranking are in bold\.†\\daggerdenotes statistical significance against the baseline \(p<0\.05p<0\.05\)\.ModelBaselineFull RerankGated \(Test\-Set Calibrated\)SavingsGated \(Dev\-Set Calibrated\)SavingsMachine Translation \(BLEU\)Llama\-3\.2\-3B\-Instruct34\.2235\.6936\.00†19%35\.73†20%Llama\-3\.3\-70B\-Instruct43\.1644\.1644\.38†18%44\.16†24%Meta\-Llama\-3\.1\-8B\-Instruct37\.2838\.5739\.03†22%38\.73†26%Mistral\-7B\-Instruct\-v0\.335\.0036\.2336\.53†11%36\.31†15%Qwen2\.5\-32B\-Instruct41\.3842\.5342\.74†16%42\.52†21%Qwen2\.5\-3B\-Instruct32\.5333\.8634\.38†19%34\.16†21%aya\-expanse\-8b37\.6438\.5938\.84†14%38\.60†19%c4ai\-command\-r7b\-12\-202435\.5236\.9637\.58†21%37\.16†21%Natural Language Understanding \(Accuracy\)Llama\-3\.2\-3B\-Instruct76\.3277\.9978\.47†36%77\.70†46%Llama\-3\.3\-70B\-Instruct84\.1285\.0585\.55†36%84\.9348%Meta\-Llama\-3\.1\-8B\-Instruct79\.0679\.7480\.21†47%79\.6055%Mistral\-7B\-Instruct\-v0\.378\.3178\.5579\.76†62%79\.3068%Qwen2\.5\-32B\-Instruct85\.9186\.6787\.13†49%86\.59†54%Qwen2\.5\-3B\-Instruct78\.0178\.9180\.14†57%79\.4955%aya\-expanse\-8b80\.0680\.7481\.24†47%80\.7849%c4ai\-command\-r7b\-12\-202477\.2478\.2278\.76†43%78\.1460%
Table 9:Performance breakdown by model using Uncertainty\. Gated \(Test\-Set Calibrated\) represents the theoretical upper bound, while Gated \(Dev\-Set Calibrated\) represents practical levels\. Gated values exceeding Full Reranking are in bold\.†\\daggerdenotes statistical significance against the baseline \(p<0\.05p<0\.05\)\.DatasetQwen 3BLlama 3BCommandMistralAyaLlama 8BQwen 32BLlama 70BEMEA \(de→\\rightarrowen\)\\cellcolor\[HTML\]E3F4DE\\cellcolor\[HTML\]D4EECE\\cellcolor\[HTML\]DBF1D5\\cellcolor\[HTML\]DAF0D4\\cellcolor\[HTML\]A2D99C\\cellcolor\[HTML\]B1E0AB\\cellcolor\[HTML\]9FD899\\cellcolor\[HTML\]C0E6B9EMEA \(en→\\rightarrowde\)\\cellcolor\[HTML\]C3E7BC\\cellcolor\[HTML\]D6EFD0\\cellcolor\[HTML\]D8F0D2\\cellcolor\[HTML\]84CC83\\cellcolor\[HTML\]B5E1AE\\cellcolor\[HTML\]BEE5B8\\cellcolor\[HTML\]AADDA4\\cellcolor\[HTML\]A7DBA0EMEA \(en→\\rightarrowes\)\\cellcolor\[HTML\]97D492\\cellcolor\[HTML\]A2D99C\\cellcolor\[HTML\]D9F0D3\\cellcolor\[HTML\]DEF2D9\\cellcolor\[HTML\]84CC83\\cellcolor\[HTML\]5DB96B\\cellcolor\[HTML\]A7DBA0\\cellcolor\[HTML\]C2E7BBEMEA \(es→\\rightarrowen\)\\cellcolor\[HTML\]BAE3B3\\cellcolor\[HTML\]B8E3B2\\cellcolor\[HTML\]9ED798\\cellcolor\[HTML\]D7EFD1\\cellcolor\[HTML\]E5F5E1\\cellcolor\[HTML\]BAE3B3\\cellcolor\[HTML\]B1E0AB\\cellcolor\[HTML\]CBEBC5EMEA \(en→\\rightarrowpt\)\\cellcolor\[HTML\]C9EAC2\\cellcolor\[HTML\]D3EECD\\cellcolor\[HTML\]B6E2AF\\cellcolor\[HTML\]2A924A\+1\.77 BLEU\(72\.1%\)\\cellcolor\[HTML\]6ABF71\\cellcolor\[HTML\]91D28E\\cellcolor\[HTML\]C8E9C1\\cellcolor\[HTML\]A5DB9FEMEA \(pt→\\rightarrowen\)\\cellcolor\[HTML\]F1FAEE\\cellcolor\[HTML\]DEF2D9\\cellcolor\[HTML\]D3EECD\\cellcolor\[HTML\]F0F9EC\\cellcolor\[HTML\]E3F4DE\\cellcolor\[HTML\]D5EFCF\\cellcolor\[HTML\]C9EAC2\\cellcolor\[HTML\]7AC77BJRC\-Acquis \(de→\\rightarrowen\)\\cellcolor\[HTML\]D9F0D3\\cellcolor\[HTML\]BBE4B4\\cellcolor\[HTML\]CBEBC5\\cellcolor\[HTML\]BBE4B4\\cellcolor\[HTML\]E8F6E4\\cellcolor\[HTML\]BEE5B8\\cellcolor\[HTML\]EDF8EA\\cellcolor\[HTML\]DEF2D9JRC\-Acquis \(en→\\rightarrowde\)\\cellcolor\[HTML\]D3EECD\\cellcolor\[HTML\]C9EAC2\\cellcolor\[HTML\]B1E0AB\\cellcolor\[HTML\]E3F4DE\\cellcolor\[HTML\]D2EDCC\\cellcolor\[HTML\]A7DBA0\\cellcolor\[HTML\]D5EFCF\\cellcolor\[HTML\]C0E6B9JRC\-Acquis \(en→\\rightarrowes\)\\cellcolor\[HTML\]A9DCA3\\cellcolor\[HTML\]C3E7BC\\cellcolor\[HTML\]E2F4DD\\cellcolor\[HTML\]D1EDCB\\cellcolor\[HTML\]ECF8E8\\cellcolor\[HTML\]CEECC8\\cellcolor\[HTML\]BCE4B5\\cellcolor\[HTML\]ACDEA6JRC\-Acquis \(es→\\rightarrowen\)\\cellcolor\[HTML\]ECF8E8\\cellcolor\[HTML\]DCF2D7\\cellcolor\[HTML\]BCE4B5\\cellcolor\[HTML\]FFF0E9\\cellcolor\[HTML\]D3EECD\\cellcolor\[HTML\]BDE5B6\\cellcolor\[HTML\]DDF2D8\\cellcolor\[HTML\]D6EFD0JRC\-Acquis \(en→\\rightarrowpt\)\\cellcolor\[HTML\]D7EFD1\\cellcolor\[HTML\]F1FAEE\\cellcolor\[HTML\]E9F7E5\\cellcolor\[HTML\]FFEDE5\\cellcolor\[HTML\]EBF7E7\\cellcolor\[HTML\]E7F6E2\\cellcolor\[HTML\]ECF8E8\\cellcolor\[HTML\]E3F4DEJRC\-Acquis \(pt→\\rightarrowen\)\\cellcolor\[HTML\]FFECE4\\cellcolor\[HTML\]FFECE4\\cellcolor\[HTML\]F4FBF2\\cellcolor\[HTML\]FFF2EC\\cellcolor\[HTML\]FFF1EA\\cellcolor\[HTML\]E4F5DF\\cellcolor\[HTML\]ECF8E8\\cellcolor\[HTML\]E0F3DBKDE \(de→\\rightarrowen\)\\cellcolor\[HTML\]D7EFD1\\cellcolor\[HTML\]92D28F\\cellcolor\[HTML\]B8E3B2\\cellcolor\[HTML\]E8F6E4\\cellcolor\[HTML\]E3F4DE\\cellcolor\[HTML\]B5E1AE\\cellcolor\[HTML\]E0F3DB\\cellcolor\[HTML\]DAF0D4KDE \(en→\\rightarrowde\)\\cellcolor\[HTML\]E1F3DC\\cellcolor\[HTML\]FFF4EE\\cellcolor\[HTML\]DCF2D7\\cellcolor\[HTML\]EFF9EB\\cellcolor\[HTML\]EFF9EB\\cellcolor\[HTML\]EEF8EA\\cellcolor\[HTML\]E6F5E1\\cellcolor\[HTML\]E3F4DEKDE \(en→\\rightarrowes\)\\cellcolor\[HTML\]D8F0D2\\cellcolor\[HTML\]E5F5E1\\cellcolor\[HTML\]E0F3DB\\cellcolor\[HTML\]EDF8EA\\cellcolor\[HTML\]CDECC7\\cellcolor\[HTML\]E2F4DD\\cellcolor\[HTML\]DBF1D6\\cellcolor\[HTML\]E5F5E1KDE \(es→\\rightarrowen\)\\cellcolor\[HTML\]CCEBC6\\cellcolor\[HTML\]D5EFCF\\cellcolor\[HTML\]CFECC9\\cellcolor\[HTML\]ECF8E8\\cellcolor\[HTML\]E0F3DB\\cellcolor\[HTML\]BCE4B5\\cellcolor\[HTML\]A4DA9E\\cellcolor\[HTML\]D8F0D2KDE \(en→\\rightarrowpt\)\\cellcolor\[HTML\]C7E9C0\\cellcolor\[HTML\]C7E9C0\\cellcolor\[HTML\]F2FAF0\\cellcolor\[HTML\]F7FCF5\\cellcolor\[HTML\]F6FCF4\\cellcolor\[HTML\]DDF2D8\\cellcolor\[HTML\]E6F5E1\\cellcolor\[HTML\]BBE4B4KDE \(pt→\\rightarrowen\)\\cellcolor\[HTML\]9CD797\\cellcolor\[HTML\]AEDEA7\\cellcolor\[HTML\]ABDDA5\\cellcolor\[HTML\]F1FAEE\\cellcolor\[HTML\]D8F0D2\\cellcolor\[HTML\]AFDFA8\\cellcolor\[HTML\]D6EFD0\\cellcolor\[HTML\]CAEAC3NLU\_AgNews\\cellcolor\[HTML\]289049\+0\.01 Acc\(72\.9%\)\\cellcolor\[HTML\]38A156\-0\.39 Acc\(66\.2%\)\\cellcolor\[HTML\]79C67A\\cellcolor\[HTML\]137D39\-0\.00 Acc\(80\.7%\)\\cellcolor\[HTML\]8BCF89\\cellcolor\[HTML\]52B365\\cellcolor\[HTML\]5DB96B\\cellcolor\[HTML\]97D492NLU\_CR\\cellcolor\[HTML\]38A156\+0\.11 Acc\(66\.2%\)\\cellcolor\[HTML\]2A924A\+1\.20 Acc\(71\.9%\)\\cellcolor\[HTML\]006027\+0\.43 Acc\(91\.2%\)\\cellcolor\[HTML\]1A843F\-0\.32 Acc\(78\.0%\)\\cellcolor\[HTML\]006529\+0\.13 Acc\(89\.6%\)\\cellcolor\[HTML\]17813D\-0\.61 Acc\(79\.0%\)\\cellcolor\[HTML\]0C7735\+0\.21 Acc\(83\.1%\)\\cellcolor\[HTML\]097532\+0\.05 Acc\(84\.0%\)NLU\_MNLI\\cellcolor\[HTML\]A9DCA3\\cellcolor\[HTML\]88CE87\\cellcolor\[HTML\]62BB6D\\cellcolor\[HTML\]90D18D\\cellcolor\[HTML\]83CB82\\cellcolor\[HTML\]43AC5E\-0\.23 Acc\(62\.0%\)\\cellcolor\[HTML\]B1E0AB\\cellcolor\[HTML\]90D18DNLU\_QNLI\\cellcolor\[HTML\]A8DCA2\\cellcolor\[HTML\]D6EFD0\\cellcolor\[HTML\]349D53\-0\.91 Acc\(67\.6%\)\\cellcolor\[HTML\]248C46\+1\.35 Acc\(74\.3%\)\\cellcolor\[HTML\]C0E6B9\\cellcolor\[HTML\]B7E2B1\\cellcolor\[HTML\]349D53\+0\.01 Acc\(67\.7%\)\\cellcolor\[HTML\]70C274NLU\_SST\-2\\cellcolor\[HTML\]AADDA4\\cellcolor\[HTML\]99D595\\cellcolor\[HTML\]005C25\+0\.09 Acc\(92\.5%\)\\cellcolor\[HTML\]3DA65A\+0\.05 Acc\(64\.3%\)\\cellcolor\[HTML\]4EB264\\cellcolor\[HTML\]258D47\-0\.22 Acc\(74\.0%\)\\cellcolor\[HTML\]3CA559\+0\.34 Acc\(64\.7%\)\\cellcolor\[HTML\]8BCF89NLU\_SST\-5\\cellcolor\[HTML\]2E964D\+1\.95 Acc\(70\.4%\)\\cellcolor\[HTML\]6BC072\\cellcolor\[HTML\]53B466\\cellcolor\[HTML\]4EB264\\cellcolor\[HTML\]56B567\\cellcolor\[HTML\]2E964D\+1\.51 Acc\(70\.4%\)\\cellcolor\[HTML\]46AE60\+0\.47 Acc\(61\.1%\)\\cellcolor\[HTML\]60BA6CNLU\_Subj\\cellcolor\[HTML\]359E53\+1\.25 Acc\(67\.3%\)\\cellcolor\[HTML\]C0E6B9\\cellcolor\[HTML\]E9F7E5\\cellcolor\[HTML\]1E8741\+4\.02 Acc\(76\.6%\)\\cellcolor\[HTML\]D8F0D2\\cellcolor\[HTML\]E7F6E2\\cellcolor\[HTML\]E7F6E3\\cellcolor\[HTML\]D4EECE
Table 10:Performance gains \(BLEU/Accuracy\) and token savings \(%\) evaluated using the Gating \(dev\-set calibrated\) configuration\. Cell colors represent the percentage of tokens saved\.DatasetQwen 3BLlama 3BCommandMistralAyaLlama 8BQwen 32BLlama 70BEMEA \(de→\\rightarrowen\)\\cellcolor\[HTML\]DEF2D9\\cellcolor\[HTML\]CAEAC3\\cellcolor\[HTML\]E1F3DC\\cellcolor\[HTML\]EAF7E6\\cellcolor\[HTML\]CEECC8\\cellcolor\[HTML\]A8DCA2\\cellcolor\[HTML\]C3E7BC\\cellcolor\[HTML\]E4F5DFEMEA \(en→\\rightarrowde\)\\cellcolor\[HTML\]C1E6BA\\cellcolor\[HTML\]CDECC7\\cellcolor\[HTML\]EFF9EB\\cellcolor\[HTML\]58B668\\cellcolor\[HTML\]A8DCA2\\cellcolor\[HTML\]CAEAC3\\cellcolor\[HTML\]D9F0D3\\cellcolor\[HTML\]EFF9ECEMEA \(en→\\rightarrowes\)\\cellcolor\[HTML\]C6E8BF\\cellcolor\[HTML\]86CC85\\cellcolor\[HTML\]DAF0D4\\cellcolor\[HTML\]D1EDCB\\cellcolor\[HTML\]79C67A\\cellcolor\[HTML\]9FD899\\cellcolor\[HTML\]CCEBC6\\cellcolor\[HTML\]D9F0D3EMEA \(es→\\rightarrowen\)\\cellcolor\[HTML\]AFDFA8\\cellcolor\[HTML\]B2E0AC\\cellcolor\[HTML\]9FD899\\cellcolor\[HTML\]F1FAEE\\cellcolor\[HTML\]F1FAEE\\cellcolor\[HTML\]AFDFA8\\cellcolor\[HTML\]C2E7BB\\cellcolor\[HTML\]D1EDCBEMEA \(en→\\rightarrowpt\)\\cellcolor\[HTML\]CAEAC3\\cellcolor\[HTML\]D6EFD0\\cellcolor\[HTML\]C3E7BC\\cellcolor\[HTML\]309950\+1\.85 BLEU\(69\.1%\)\\cellcolor\[HTML\]2B934B\+0\.64 BLEU\(71\.5%\)\\cellcolor\[HTML\]6DC072\\cellcolor\[HTML\]C1E6BA\\cellcolor\[HTML\]B0DFAAEMEA \(pt→\\rightarrowen\)\\cellcolor\[HTML\]EFF9EC\\cellcolor\[HTML\]E7F6E2\\cellcolor\[HTML\]D8F0D2\\cellcolor\[HTML\]FFF4EF\\cellcolor\[HTML\]FFF4EE\\cellcolor\[HTML\]DAF0D4\\cellcolor\[HTML\]CDECC7\\cellcolor\[HTML\]6EC173JRC\-Acquis \(de→\\rightarrowen\)\\cellcolor\[HTML\]E6F5E1\\cellcolor\[HTML\]CBEBC5\\cellcolor\[HTML\]D5EFCF\\cellcolor\[HTML\]A5DB9F\\cellcolor\[HTML\]F2FAF0\\cellcolor\[HTML\]D5EFCF\\cellcolor\[HTML\]F7FCF5\\cellcolor\[HTML\]E8F6E3JRC\-Acquis \(en→\\rightarrowde\)\\cellcolor\[HTML\]D0EDCA\\cellcolor\[HTML\]CDECC7\\cellcolor\[HTML\]A9DCA3\\cellcolor\[HTML\]F0F9ED\\cellcolor\[HTML\]D4EECE\\cellcolor\[HTML\]B5E1AE\\cellcolor\[HTML\]CEECC8\\cellcolor\[HTML\]E7F6E2JRC\-Acquis \(en→\\rightarrowes\)\\cellcolor\[HTML\]ACDEA6\\cellcolor\[HTML\]D8F0D2\\cellcolor\[HTML\]EBF7E7\\cellcolor\[HTML\]D8F0D2\\cellcolor\[HTML\]F4FBF2\\cellcolor\[HTML\]E3F4DE\\cellcolor\[HTML\]D1EDCB\\cellcolor\[HTML\]CFECC9JRC\-Acquis \(es→\\rightarrowen\)\\cellcolor\[HTML\]ECF8E8\\cellcolor\[HTML\]DBF1D6\\cellcolor\[HTML\]C1E6BA\\cellcolor\[HTML\]FFECE4\\cellcolor\[HTML\]EAF7E6\\cellcolor\[HTML\]A4DA9E\\cellcolor\[HTML\]DAF0D4\\cellcolor\[HTML\]E8F6E4JRC\-Acquis \(en→\\rightarrowpt\)\\cellcolor\[HTML\]DBF1D5\\cellcolor\[HTML\]FFF2EB\\cellcolor\[HTML\]E0F3DB\\cellcolor\[HTML\]FEE8DD\\cellcolor\[HTML\]FFEDE5\\cellcolor\[HTML\]F5FBF2\\cellcolor\[HTML\]EEF8EA\\cellcolor\[HTML\]ECF8E8JRC\-Acquis \(pt→\\rightarrowen\)\\cellcolor\[HTML\]FFECE3\\cellcolor\[HTML\]FFECE3\\cellcolor\[HTML\]F4FBF1\\cellcolor\[HTML\]FFEDE5\\cellcolor\[HTML\]FFECE3\\cellcolor\[HTML\]FFECE3\\cellcolor\[HTML\]F2FAF0\\cellcolor\[HTML\]F1FAEEKDE \(de→\\rightarrowen\)\\cellcolor\[HTML\]C3E7BC\\cellcolor\[HTML\]60BA6C\\cellcolor\[HTML\]C7E9C0\\cellcolor\[HTML\]FFF4EE\\cellcolor\[HTML\]EDF8EA\\cellcolor\[HTML\]CCEBC6\\cellcolor\[HTML\]DFF3DA\\cellcolor\[HTML\]D8F0D2KDE \(en→\\rightarrowde\)\\cellcolor\[HTML\]EBF7E7\\cellcolor\[HTML\]F5FBF2\\cellcolor\[HTML\]C6E8BF\\cellcolor\[HTML\]FFEFE8\\cellcolor\[HTML\]FFF5F0\\cellcolor\[HTML\]F1FAEE\\cellcolor\[HTML\]ECF8E8\\cellcolor\[HTML\]C3E7BCKDE \(en→\\rightarrowes\)\\cellcolor\[HTML\]C6E8BF\\cellcolor\[HTML\]DDF2D8\\cellcolor\[HTML\]E3F4DE\\cellcolor\[HTML\]FFF2EC\\cellcolor\[HTML\]DDF2D8\\cellcolor\[HTML\]DAF0D4\\cellcolor\[HTML\]E5F5E1\\cellcolor\[HTML\]F0F9EDKDE \(es→\\rightarrowen\)\\cellcolor\[HTML\]CFECC9\\cellcolor\[HTML\]E0F3DB\\cellcolor\[HTML\]BCE4B5\\cellcolor\[HTML\]FFF5F0\\cellcolor\[HTML\]E1F3DC\\cellcolor\[HTML\]C3E7BC\\cellcolor\[HTML\]E2F4DD\\cellcolor\[HTML\]E1F3DCKDE \(en→\\rightarrowpt\)\\cellcolor\[HTML\]D4EECE\\cellcolor\[HTML\]C9EAC2\\cellcolor\[HTML\]EBF7E7\\cellcolor\[HTML\]FFEEE6\\cellcolor\[HTML\]F5FBF2\\cellcolor\[HTML\]F2FAF0\\cellcolor\[HTML\]EEF8EA\\cellcolor\[HTML\]B1E0ABKDE \(pt→\\rightarrowen\)\\cellcolor\[HTML\]C1E6BA\\cellcolor\[HTML\]EDF8E9\\cellcolor\[HTML\]9BD696\\cellcolor\[HTML\]F2FAF0\\cellcolor\[HTML\]EFF9EB\\cellcolor\[HTML\]C0E6B9\\cellcolor\[HTML\]DAF0D4\\cellcolor\[HTML\]D3EECDNLU\_AgNews\\cellcolor\[HTML\]087432\+0\.35 Acc\(84\.7%\)\\cellcolor\[HTML\]55B567\\cellcolor\[HTML\]78C679\\cellcolor\[HTML\]1D8640\+0\.00 Acc\(77\.0%\)\\cellcolor\[HTML\]359E53\+0\.00 Acc\(67\.6%\)\\cellcolor\[HTML\]86CC85\\cellcolor\[HTML\]FFF0E8\\cellcolor\[HTML\]FFF0E8NLU\_CR\\cellcolor\[HTML\]05712F\+0\.90 Acc\(85\.9%\)\\cellcolor\[HTML\]1F8742\+2\.10 Acc\(76\.5%\)\\cellcolor\[HTML\]005A24\+1\.25 Acc\(93\.2%\)\\cellcolor\[HTML\]127C39\+0\.85 Acc\(81\.0%\)\\cellcolor\[HTML\]005522\+0\.80 Acc\(94\.9%\)\\cellcolor\[HTML\]006027\+0\.55 Acc\(91\.1%\)\\cellcolor\[HTML\]005924\+0\.85 Acc\(93\.6%\)\\cellcolor\[HTML\]067230\+0\.60 Acc\(85\.5%\)NLU\_MNLI\\cellcolor\[HTML\]A2D99C\\cellcolor\[HTML\]A0D99B\\cellcolor\[HTML\]DDF2D8\\cellcolor\[HTML\]FFF0E8\\cellcolor\[HTML\]FFF0E8\\cellcolor\[HTML\]84CC83\\cellcolor\[HTML\]FFF0E8\\cellcolor\[HTML\]FFEFE8NLU\_QNLI\\cellcolor\[HTML\]E9F7E5\\cellcolor\[HTML\]FFF5F0\\cellcolor\[HTML\]F4FBF2\\cellcolor\[HTML\]006D2C\+1\.40 Acc\(87\.2%\)\\cellcolor\[HTML\]D5EFCF\\cellcolor\[HTML\]EBF7E7\\cellcolor\[HTML\]077331\+0\.05 Acc\(84\.8%\)\\cellcolor\[HTML\]42AB5D\+1\.00 Acc\(62\.2%\)NLU\_SST\-2\\cellcolor\[HTML\]66BD6F\\cellcolor\[HTML\]88CE87\\cellcolor\[HTML\]005F26\+0\.60 Acc\(91\.5%\)\\cellcolor\[HTML\]65BD6F\\cellcolor\[HTML\]1C8540\+0\.10 Acc\(77\.4%\)\\cellcolor\[HTML\]2F974E\+0\.45 Acc\(70\.1%\)\\cellcolor\[HTML\]005522\+0\.80 Acc\(94\.8%\)\\cellcolor\[HTML\]72C375NLU\_SST\-5\\cellcolor\[HTML\]3BA458\+3\.10 Acc\(65\.2%\)\\cellcolor\[HTML\]98D594\\cellcolor\[HTML\]70C274\\cellcolor\[HTML\]4DB163\\cellcolor\[HTML\]46AE60\+2\.55 Acc\(61\.0%\)\\cellcolor\[HTML\]29914A\+2\.25 Acc\(72\.5%\)\\cellcolor\[HTML\]17813D\+1\.50 Acc\(78\.9%\)\\cellcolor\[HTML\]39A257\+1\.80 Acc\(65\.8%\)NLU\_Subj\\cellcolor\[HTML\]3AA357\+2\.00 Acc\(65\.3%\)\\cellcolor\[HTML\]FFEFE8\\cellcolor\[HTML\]FFF0E8\\cellcolor\[HTML\]137D39\+4\.55 Acc\(80\.5%\)\\cellcolor\[HTML\]E5F5E1\\cellcolor\[HTML\]FFF0E8\\cellcolor\[HTML\]FFEFE8\\cellcolor\[HTML\]FFEEE7
Table 11:Performance gains \(BLEU/Accuracy\) and token savings \(%\) evaluated using the Gating \(test\-set calibrated\) configuration\. Cell colors represent the percentage of tokens saved\.Similar Articles
Active Learners as Efficient PRP Rerankers
This paper reframes pairwise ranking prompting as active learning from noisy comparisons, introducing a noise-robust framework with a randomized-direction oracle to improve ranking quality under call constraints and address position bias.
Active Learners as Efficient PRP Rerankers
Proposes reframing Pairwise Ranking Prompting (PRP) reranking as active learning from noisy pairwise comparisons, improving NDCG@10 per call under budget constraints, and introduces a randomized-direction oracle that reduces LLM calls per pair.
Rethinking the Necessity of Adaptive Retrieval-Augmented Generation through the Lens of Adaptive Listwise Ranking
This paper proposes AdaRankLLM, an adaptive retrieval framework that challenges the necessity of adaptive RAG by using listwise ranking to dynamically filter retrieved passages. The work shows that adaptive retrieval serves as a noise filter for weaker models while acting as a cost-efficiency optimizer for stronger models, with extensive experiments across multiple datasets and LLMs.
MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval
MemReranker is a reasoning-aware reranking model family (0.6B/4B) designed for agent memory retrieval, addressing limitations in semantic similarity by incorporating LLM knowledge distillation for better temporal and causal reasoning.
Robust Checkpoint Selection for Multimodal LLMs via Agentic Evaluation and Stability-Aware Ranking
This paper addresses the challenge of robust checkpoint selection for multimodal LLMs under evaluation uncertainty, proposing a multi-stage framework that integrates curated real-world data, LLM-based judgment, and ranking protocols with confidence estimation.