Active Learners as Efficient PRP Rerankers
Summary
Proposes reframing Pairwise Ranking Prompting (PRP) reranking as active learning from noisy pairwise comparisons, improving NDCG@10 per call under budget constraints, and introduces a randomized-direction oracle that reduces LLM calls per pair.
# Active Learners as Efficient PRP Rerankers
Source: [https://arxiv.org/html/2605.14236](https://arxiv.org/html/2605.14236)
Jeremías Figueiredo Paschmann, Juan Kaplan, Francisco Nattero, Santiago Mauricio Barron Bucolo, Juan Wisznia, Luciano del Corro \{jfigueiredopaschmann,jkaplan,fnattero,sbarronbucolo,jwisznia,delcorrol\}@udesa\.edu\.ar ELIAS Lab, Departamento de Ingeniería, Universidad de San Andrés
###### Abstract
Pairwise Ranking Prompting (PRP) elicits pairwise preference judgments from an LLM, which are then aggregated into a ranking, usually via classical sorting algorithms. However, judgments are noisy, order-sensitive, and sometimes intransitive, so sorting assumptions don’t match the setting. Because sorting aims to recover a full permutation, truncating it to meet a call budget does not produce a dependable top-$K$. We thus reframe PRP reranking as active learning from noisy pairwise comparisons and show active rankers are drop-in replacements that improve NDCG@10 per call in the call-constrained regime. Our noise-robust framework also introduces a randomized-direction oracle that uses a single LLM call per pair. This approach converts systematic position bias into zero-mean noise, enabling unbiased aggregate ranking without the cost of bidirectional calls. Code available at [https://github.com/jerecoder/IReranker](https://github.com/jerecoder/IReranker)
## 1 Introduction
LLMs are increasingly used for reranking in Retrieval-Augmented Generation (RAG): given a query and a candidate list, rerankers aggregate LLM pairwise preferences into an ordered top-$K$ subset that strongly affects downstream answer quality (Zhou et al., [2025](https://arxiv.org/html/2605.14236#bib.bib19); Zhu et al., [2023](https://arxiv.org/html/2605.14236#bib.bib20); Dong et al., [2024](https://arxiv.org/html/2605.14236#bib.bib3); Sun et al., [2025](https://arxiv.org/html/2605.14236#bib.bib13)). Major cloud providers now offer reranking as managed services, making call efficiency a first-class concern: LLM invocations dominate cost and latency, and the goal is a sorted prefix, not the full list.
Commonly, PRP is paired with classical sorting algorithms (Qin et al., [2024](https://arxiv.org/html/2605.14236#bib.bib9); Sun et al., [2023](https://arxiv.org/html/2605.14236#bib.bib14)): PRP supplies noisy preference judgments while sorting determines which pairs to query. This is structurally mismatched: sorting assumes transitive comparisons, while LLM judgments are stochastic and can violate transitivity. Sorting thus wastes budget polishing an unstable permutation rather than improving the top-$K$.
LLM order effects, where swapping document presentation order can flip the judge’s choice between documents, further complicate matters (Shi et al., [2024](https://arxiv.org/html/2605.14236#bib.bib12); Yin et al., [2025](https://arxiv.org/html/2605.14236#bib.bib18); Jeong et al., [2025](https://arxiv.org/html/2605.14236#bib.bib6)). Standard PRP queries both prompt directions at 2 calls per pair (Qin et al., [2024](https://arxiv.org/html/2605.14236#bib.bib9); Wu et al., [2025](https://arxiv.org/html/2605.14236#bib.bib17)), yet preference cycles persist.
We therefore frame PRP reranking as active learning from noisy pairwise comparisons, choosing adaptively which pairs to query to maximize top-$K$ quality within a budget. This connects to the literature on best-$K$ identification under stochastic feedback (Mohajer et al., [2017](https://arxiv.org/html/2605.14236#bib.bib8); Heckel et al., [2016](https://arxiv.org/html/2605.14236#bib.bib4); Shah and Wainwright, [2018](https://arxiv.org/html/2605.14236#bib.bib11); Ren et al., [2020](https://arxiv.org/html/2605.14236#bib.bib10); Luo et al., [2024](https://arxiv.org/html/2605.14236#bib.bib7)). We also evaluate a cheaper oracle: randomizing the prompt direction yields a one-call estimate that converts position bias into zero-mean noise.
We study two questions: (Q1) *Does active ranking outperform state-of-the-art PRP rerankers at fixed budget (NDCG@10)?* (Q2) *Does randomized-direction prompting improve the NDCG@10–cost trade-off beyond scheduling alone?* The best-performing active scheduler in our experiments is the algorithm of Mohajer et al. ([2017](https://arxiv.org/html/2605.14236#bib.bib8)), which we call Mohajer: it adaptively selects which pairs to query, concentrating comparisons near the top-$K$ boundary.
Q1: On TREC DL2019/2020 with Flan-T5-XL, Mohajer outperforms the best sorting baseline by +9.7 NDCG@10 at $B{=}300$ calls (66.1 vs. 56.4), under the same bidirectional oracle, with the advantage holding across the entire call-constrained regime ($B{=}200$–$450$).
Q2: Randomized-direction prompting improves both strategies, but in different ways. For PRP rerankers, it raises quality at fixed budget: BubbleSort gains +5.5 NDCG@10 at $B{=}300$ (56.4 $\to$ 62.0) simply by halving the call cost per pair and covering more comparisons. For active rankers, the effect is more pronounced: comparing Mohajer under both oracles, the randomized-direction oracle raises the quality ceiling from 66.96 to 68.0 while reducing the calls needed to reach it from $B{=}450$ to $B{=}250$, a 44% reduction. Across BEIR-style tasks, active rankers reach NDCG@10 comparable to QuickSort (Avg. 56.8 for Flan-T5-XL) with up to 7$\times$ fewer calls.
## 2 Related Work
Pairwise LLM reranking. PRP elicits pairwise preferences from an LLM and aggregates them into a ranking (Sun et al., [2023](https://arxiv.org/html/2605.14236#bib.bib14); Qin et al., [2024](https://arxiv.org/html/2605.14236#bib.bib9)), typically via sorting algorithms that assume transitivity and target an unbudgeted complete order.
Order effects. LLM comparisons are direction-sensitive (Shi et al., [2024](https://arxiv.org/html/2605.14236#bib.bib12); Yin et al., [2025](https://arxiv.org/html/2605.14236#bib.bib18); Jeong et al., [2025](https://arxiv.org/html/2605.14236#bib.bib6)), so PRP often queries both prompt orders, doubling cost (Qin et al., [2024](https://arxiv.org/html/2605.14236#bib.bib9); Wu et al., [2025](https://arxiv.org/html/2605.14236#bib.bib17)). Our randomized-direction oracle makes one call per pair, producing aggregate outcomes robust to direction bias.
PRP beyond sorting. PRP-Graph uses adaptive pairings (Luo et al., [2024](https://arxiv.org/html/2605.14236#bib.bib7)) and tournament designs structure comparisons (Chen et al., [2024](https://arxiv.org/html/2605.14236#bib.bib2)). We reframe reranking as active learning from noisy feedback and evaluate active top-$K$ identifiers as drop-in replacements for sorting (Heckel et al., [2016](https://arxiv.org/html/2605.14236#bib.bib4); Shah and Wainwright, [2018](https://arxiv.org/html/2605.14236#bib.bib11); Ren et al., [2020](https://arxiv.org/html/2605.14236#bib.bib10); Mohajer et al., [2017](https://arxiv.org/html/2605.14236#bib.bib8)), using adaptive pairings in the spirit of PRP-Graph but grounded in noise-tolerant active ranking theory.
Complementary paradigms. Setwise and listwise methods (Zhuang et al., [2024](https://arxiv.org/html/2605.14236#bib.bib21); Huang et al., [2025](https://arxiv.org/html/2605.14236#bib.bib5); Wang et al., [2025](https://arxiv.org/html/2605.14236#bib.bib15)) reduce cost by processing multiple documents per call, changing the prompting primitive itself; pairwise and listwise calls differ in token cost, context length, and bias, making raw call counts incommensurable across paradigms. Our goal is to improve scheduling *within* pairwise PRP, which remains widely deployed for its fine-grained signal and constrained-output reliability (Qin et al., [2024](https://arxiv.org/html/2605.14236#bib.bib9)); the two directions are complementary.
Table 1: Average NDCG@10 (%) on TREC DL 2019 and DL 2020 with Flan-T5-XL across LLM call budgets. Bold = best per column; underline = second-best (within each oracle block). † indicates the smallest budget at which a method completes; results at larger budgets are de-emphasized. *Randomized* rows report mean $\pm$ 95% bootstrap CI half-width over 8 oracle seeds (10k resamples); *Bidirectional* is deterministic given outcomes so CIs are omitted.
## 3 Reranking from Noisy Comparisons
Given a query $q$, a first-stage retriever returns $N$ candidates $\mathcal{D}(q)=\{d_1,\dots,d_N\}$ ($N\geq K$). The reranker outputs an ordered top-$K$ list $\mathcal{R}_K(q)=(r_1,\dots,r_K)$ with $r_\ell\in\mathcal{D}(q)$.
Pairwise oracle interface. Algorithms interact with candidates only via noisy pairwise outcomes: for each unordered pair $\{i,j\}$, a call returns $X_{ij}(q)\in\{0,1\}$, where $X_{ij}(q)=1$ means $d_i$ is preferred (judged more relevant to $q$) over $d_j$, i.e. $d_i\succ d_j$, with win probability $p_{ij}(q):=\Pr[X_{ij}(q)=1]$. We assume only *pair-consistency*, $p_{ij}(q)=1-p_{ji}(q)$ for $i\neq j$ (this is enforced via oracle design).
Call-centric cost. We count LLM inference calls: bidirectional uses two per pair, randomized-direction uses one. Since calls dominate PRP cost (Wisznia et al., [2025](https://arxiv.org/html/2605.14236#bib.bib16)), this changes which routines are optimal.
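As a small worked example of this accounting, the number of pairs a budget of $B$ calls can cover differs by a factor of two between the two oracles. This is a sketch that assumes every queried pair is distinct and ignores any caching of repeated pairs; the symbols $P_{\text{bidir}}$ and $P_{\text{rand}}$ are introduced here only for illustration.

```latex
\[
P_{\text{bidir}}(B) = \left\lfloor \frac{B}{2} \right\rfloor,
\qquad
P_{\text{rand}}(B) = B,
\]
\[
\text{e.g. } B = 300:\quad P_{\text{bidir}} = 150, \qquad P_{\text{rand}} = 300 .
\]
```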
### Oracles
Let $\text{LLM}(d_a,d_b)\in\{1,0\}$ denote the outcome of one call, where $1$ means the first document is preferred.
Bidirectional (two calls). The standard PRP oracle (Qin et al., [2024](https://arxiv.org/html/2605.14236#bib.bib9)): $V_{ij}=1$ iff $\text{LLM}(d_i,d_j)=1\wedge\text{LLM}(d_j,d_i)=0$, else $V_{ij}=0$.
Randomized-direction (one call). We randomize input order: $V_{ij}=\text{LLM}(d_i,d_j)$ with probability $1/2$, else $V_{ij}=1-\text{LLM}(d_j,d_i)$. This ensures reciprocity in expectation, i.e. $\Pr[V_{ij}{=}1]=1-\Pr[V_{ji}{=}1]$: each individual call may be position-biased, but averaging over the random direction converts systematic bias into zero-mean noise, preserving pair-consistency (proof in Appendix [E](https://arxiv.org/html/2605.14236#A5)).
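The two oracle definitions can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `judge` callable (returning 1 when the first document shown is preferred) and the toy `always_first` judge are hypothetical stand-ins for an LLM call.

```python
import random
from typing import Callable, Tuple

# Hypothetical judge signature: returns 1 if the FIRST document shown is preferred, else 0.
Judge = Callable[[str, str], int]


def bidirectional_oracle(judge: Judge, d_i: str, d_j: str) -> Tuple[int, int]:
    """Two calls: V_ij = 1 only if d_i wins in both presentation orders."""
    forward = judge(d_i, d_j)   # d_i shown first
    backward = judge(d_j, d_i)  # d_j shown first
    v_ij = 1 if (forward == 1 and backward == 0) else 0
    return v_ij, 2  # (outcome, LLM calls spent)


def randomized_oracle(judge: Judge, d_i: str, d_j: str,
                      rng: random.Random) -> Tuple[int, int]:
    """One call: present the pair in a uniformly random order."""
    if rng.random() < 0.5:
        v_ij = judge(d_i, d_j)
    else:
        v_ij = 1 - judge(d_j, d_i)
    return v_ij, 1  # (outcome, LLM calls spent)


def always_first(a: str, b: str) -> int:
    """Toy, maximally position-biased judge: always prefers the first slot."""
    return 1


rng = random.Random(0)
trials = 10_000
wins = sum(randomized_oracle(always_first, "doc_a", "doc_b", rng)[0]
           for _ in range(trials))
win_rate = wins / trials
bidir = bidirectional_oracle(always_first, "doc_a", "doc_b")
print(f"randomized win rate for doc_a: {win_rate:.3f}")
print("bidirectional (outcome, calls):", bidir)
```

With a judge that always picks the first slot, the randomized oracle's win rate for either document is close to one half, which is the zero-mean behavior described above, while the bidirectional oracle spends two calls and returns 0 for the pair.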
## 4 Selecting Active Rankers for Call-Budgeted Top-$K$ Reranking
Sorting treats every comparison as equally informative\. Under a budget, this uniformity is wasteful\. Active rankers concentrate comparisons on candidates whose relative order remains uncertain\. This is the key mechanism behind our gains: a better schedule for the same comparator, requiring only lightweight bookkeeping with no model training or forward passes\. The dominant cost remains the LLM call itself\.
Our goal is high-quality top-$K$ prefixes under a strict call budget $B$ via the pairwise-oracle interface of §[3](https://arxiv.org/html/2605.14236#S3). We select algorithms with three criteria: (C1) *Top-$K$ objective*: targets best-$K$/prefix identification; (C2) *Noise tolerance*: well-defined under pair-consistency without assuming a global order; (C3) *Anytime behavior*: outputs a competitive top-$K$ prefix as comparisons accrue.
We focus on comparison scheduling gains and benchmark two complementary active rankers: tournament\-based vs\. anchor\-based\. Methods assuming transitivity or targeting a full global ranking are omitted\.
Tournament/heap extraction. Mohajer et al. ([2017](https://arxiv.org/html/2605.14236#bib.bib8)) identifies best-$K$ via tournaments with heap extraction, focusing comparisons on likely contenders (C1–C3). We use one oracle call per match.
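To make the tournament idea concrete, the following is a deliberately simplified sketch, not the algorithm of Mohajer et al. nor the authors' code: it splits the candidates into $K$ groups, finds each group's champion by single elimination, and then repeatedly extracts the best champion and refills from that champion's group. It replaces the heap with a plain rescan of the champions, does no repetition to handle noise, and assumes distinct items. The `prefer` callable is a hypothetical one-comparison oracle.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
Prefer = Callable[[T, T], bool]  # hypothetical: True if the first item is preferred


def knockout(items: Sequence[T], prefer: Prefer) -> Tuple[T, int]:
    """Single-elimination tournament; returns (winner, comparisons used)."""
    alive, calls = list(items), 0
    while len(alive) > 1:
        nxt = []
        for a, b in zip(alive[0::2], alive[1::2]):
            nxt.append(a if prefer(a, b) else b)
            calls += 1
        if len(alive) % 2:  # the odd one out gets a bye
            nxt.append(alive[-1])
        alive = nxt
    return alive[0], calls


def tournament_top_k(items: Sequence[T], k: int, prefer: Prefer) -> Tuple[List[T], int]:
    """Ordered top-k via k group tournaments plus repeated champion extraction."""
    groups = [g for g in (list(items[s::k]) for s in range(k)) if g]
    champs, calls = [], 0
    for g in groups:
        winner, used = knockout(g, prefer)
        champs.append(winner)
        calls += used
    ranked: List[T] = []
    while len(ranked) < k and groups:
        best, used = knockout(champs, prefer)  # best among current champions
        calls += used
        ranked.append(best)
        gi = champs.index(best)  # assumes items are distinct
        groups[gi].remove(best)
        if groups[gi]:
            champs[gi], used = knockout(groups[gi], prefer)  # refill from the same group
            calls += used
        else:
            del groups[gi], champs[gi]
    return ranked, calls


rng = random.Random(0)
docs = list(range(100))
rng.shuffle(docs)
# Noiseless toy comparator: larger integer means more relevant.
top, used = tournament_top_k(docs, 10, lambda a, b: a > b)
print(top, used)  # top holds 99, 98, ..., 90 under this noiseless comparator
```

Even this naive variant stays far below the 4950 comparisons needed to query every pair of 100 candidates, because after the initial group tournaments only the champions and one group are revisited per extracted item.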
Anchor-based Probably Approximately Correct (PAC) best-$K$. Agarwal et al. ([2022](https://arxiv.org/html/2605.14236#bib.bib1)) identifies best-$K$ via anchors and winner sets (C1, C3). We take anchors from a zero-cost BM25 prior and restrict comparisons to the top $K{\times}m$ ($m{=}3$) BM25 prefix, keeping calls low.
Ordered outputs. PAC returns an unordered best-$K$ set, so we apply BubbleSort on the final top-$K$. Mohajer outputs an ordered prefix; BubbleSort polishing is optional and negligible. Any added comparisons count toward the budget.
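The polishing step can be sketched as a budget-aware bubble sort over the selected set. This is an illustrative sketch only: the function name `bubble_polish`, the `prefer` comparator, and the choice to count the budget in comparisons (each costing one or two LLM calls depending on the oracle) are assumptions, and the authors' cached BubbleSort may differ.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def bubble_polish(items: Sequence[T], prefer: Callable[[T, T], bool],
                  budget: int) -> Tuple[List[T], int]:
    """Order a small set best-first with bubble sort, stopping when the budget runs out."""
    order, calls = list(items), 0
    for end in range(len(order) - 1, 0, -1):
        swapped = False
        for i in range(end):
            if calls >= budget:
                return order, calls  # anytime behavior: return the current order
            calls += 1
            # Move the less-preferred item toward the tail.
            if prefer(order[i + 1], order[i]):
                order[i], order[i + 1] = order[i + 1], order[i]
                swapped = True
        if not swapped:  # already ordered, no need for further passes
            break
    return order, calls


polished, spent = bubble_polish([3, 1, 2], lambda a, b: a > b, budget=100)
print(polished, spent)
```

For a set of $K{=}10$ items this uses at most $K(K-1)/2 = 45$ comparisons, and fewer when the input is already nearly ordered.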
| Ranker | Covid | Robust04 | Touche | SciFact | DBPedia | DL19 | DL20 | Avg. NDCG@10 | Avg. Calls/Task |
|---|---|---|---|---|---|---|---|---|---|
| BM25 | 59.5 | 40.7 | 44.2 | 67.9 | 31.9 | 50.6 | 48.0 | 49.0 | - |
| *Flan-T5-L* | | | | | | | | | |
| BubbleSort@10 (Bidirectional) | 70.9 | 44.2 | 44.7 | 69.2 | 41.7 | 63.4 | 58.6 | 56.1 | 679 |
| HeapSort (Bidirectional) | 76.0 | 40.4 | 33.2 | 67.5 | 41.4 | 65.0 | 62.6 | 55.2 | 1230 |
| QuickSort (Bidirectional) | 76.2 | 41.0 | 27.4 | 60.1 | 41.1 | 64.5 | 58.5 | 52.7 | 1954 |
| PAC+Bubble (Bidirectional) | 69.3 | 44.0 | 41.4 | 68.5 | 39.2 | 61.7 | 57.2 | 54.5 | 323 |
| PAC+Bubble (Randomized) | 70.2 | 41.0 | 38.2 | 67.0 | 38.1 | 60.0 | 57.3 | 53.1 | 184 |
| Mohajer+Bubble (Bidirectional) | 76.5 | 37.8 | 26.7 | 54.9 | 40.0 | 63.0 | 56.4 | 50.7 | 423 |
| Mohajer+Bubble (Randomized) | 76.9 | 37.8 | 25.7 | 58.8 | 40.0 | 61.8 | 58.8 | 51.4 | 354 |
| Mohajer (Bidirectional) | 76.5 | 37.5 | 26.4 | 53.8 | 39.7 | 62.6 | 56.1 | 50.4 | 399 |
| Mohajer (Randomized) | 76.2 | 36.2 | 24.4 | 57.5 | 39.1 | 60.6 | 57.2 | 50.2 | 232 |
| *Flan-T5-XL* | | | | | | | | | |
| BubbleSort@10 (Bidirectional) | 74.8 | 55.4 | 42.8 | 71.3 | 43.1 | 68.4 | 67.0 | 60.4 | 941 |
| HeapSort (Bidirectional) | 78.2 | 54.9 | 28.4 | 70.6 | 41.6 | 70.6 | 68.9 | 59.0 | 1409 |
| QuickSort (Bidirectional) | 77.2 | 53.7 | 25.8 | 61.4 | 41.6 | 70.4 | 67.2 | 56.8 | 1669 |
| PAC+Bubble (Bidirectional) | 71.3 | 48.9 | 41.1 | 70.4 | 39.1 | 62.6 | 58.6 | 56.0 | 332 |
| PAC+Bubble (Randomized) | 71.3 | 48.1 | 38.8 | 68.6 | 38.8 | 61.1 | 58.5 | 55.0 | 184 |
| Mohajer+Bubble (Bidirectional) | 76.0 | 53.7 | 25.7 | 61.9 | 40.9 | 66.6 | 67.5 | 56.0 | 427 |
| Mohajer+Bubble (Randomized) | 78.5 | 54.0 | 27.9 | 63.5 | 41.2 | 69.5 | 66.3 | 57.3 | 345 |
| Mohajer (Bidirectional) | 76.0 | 53.6 | 25.4 | 61.2 | 40.7 | 66.6 | 67.3 | 55.8 | 399 |
| Mohajer (Randomized) | 77.6 | 53.2 | 27.2 | 62.8 | 40.4 | 68.7 | 67.6 | 56.8 | 232 |

Table 2: End-to-end BEIR-style NDCG@10 (%) and average pairwise LLM calls. Bold = best per column; underline = second-best per column (within each model block).
## 5 Results
### Setup\.
We rerank the top $N{=}100$ BM25 candidates into an ordered top-$K$ list ($K{=}10$) and report NDCG@10 on BEIR-style tasks (Table [2](https://arxiv.org/html/2605.14236#S4.T2)) and TREC DL2019/2020, capping each method at $B\in\{100,150,\dots,500\}$ LLM calls. The pairwise oracle uses Flan-T5-L/XL under (i) bidirectional and (ii) randomized-direction prompting. BubbleSort uses caching (Wisznia et al., [2025](https://arxiv.org/html/2605.14236#bib.bib16)). Additional Qwen results and code are in the appendix/repository.
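For readers unfamiliar with the metric, NDCG@10 can be sketched as follows. This assumes one common formulation (linear gains with a $\log_2$ rank discount); the text does not state which variant the evaluation tooling uses, so the function names and the gain choice here are illustrative assumptions.

```python
import math
from typing import Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain with linear gains and a log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains: Sequence[float], all_gains: Sequence[float],
              k: int = 10) -> float:
    """NDCG@k: DCG of the produced ranking divided by the DCG of the ideal ranking.

    ranked_gains: relevance labels of the returned documents, in ranked order.
    all_gains:    relevance labels of every judged candidate for the query.
    """
    ideal = dcg_at_k(sorted(all_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


perfect = ndcg_at_k([3, 2, 1], [3, 2, 1])
swapped = ndcg_at_k([1, 3], [1, 3])
print(perfect, round(swapped, 4))
```

Because only the first $k$ positions enter the sum, a reranker is rewarded for getting the top of the list right, which is why a budgeted method can score well without resolving the full permutation.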
### Main findings\.
Table [1](https://arxiv.org/html/2605.14236#S2.T1) reports NDCG@10 on TREC DL2019/2020 (Flan-T5-XL) vs. budget $B$ (CIs and bootstrap tests in Appendix [D](https://arxiv.org/html/2605.14236#A4)). *(i)* In the call-constrained regime ($B\approx 200$–$450$), Mohajer outperforms PRP rerankers under the same oracle. *(ii)* Randomized-direction compresses “time-to-quality”: Mohajer reaches peak quality by $B{=}250$. *(iii)* At high budgets, sorting catches up as global refinement pays off. PAC lags because its two-phase design splits budget across objectives; Mohajer’s tournament concentrates comparisons on likely top candidates. PAC benefits when the BM25 prior is strong (e.g., Touché).
### Bidirectional oracle
Low budgets: sorting is preferable. At $B\in\{100,150\}$, QuickSort reaches $\approx 55.9$ NDCG@10 while Mohajer is in warm-up ($30.1$). Below the warm-up threshold ($\sim$100 calls for $N{=}100$, $K{=}10$) sorting is preferable; above it, active ranking dominates.
Call-constrained regime: active reranking is better. Mohajer leads from $B{=}200$ to $B{=}450$: at $B{=}300$, 66.09 vs. 56.42 (+9.67); at $B{=}350$, 66.28 vs. 56.98 (+9.30); at $B{=}450$, Mohajer+Bubble 67.02 vs. HeapSort 62.81 (+4.21). Paired bootstrap tests (10k query resamples, $p{<}0.05$; Table [A.8](https://arxiv.org/html/2605.14236#A4.T8) in the appendix) confirm these gains are significant: under the randomized oracle, Mohajer+Bubble significantly outperforms BubbleSort at every budget; under bidirectional, from $B{=}200$ onward.
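A paired bootstrap over queries, as referenced above, can be sketched as follows. This is one common formulation (the share of resamples in which the mean per-query difference is not positive); the exact statistic used in the appendix is not specified here, so the function name, the one-sided form, and the toy scores are assumptions.

```python
import random
from typing import Sequence


def paired_bootstrap_p(scores_a: Sequence[float], scores_b: Sequence[float],
                       n_resamples: int = 10_000, seed: int = 0) -> float:
    """One-sided p-value for 'A is not better than B' from per-query paired scores."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        # Resample queries with replacement, keeping each (A, B) pair together.
        sample_mean = sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        if sample_mean <= 0:
            not_better += 1
    return not_better / n_resamples


# Toy per-query NDCG@10 values (hypothetical, for illustration only).
active = [0.6, 0.7, 0.8, 0.9]
sorting = [0.5, 0.6, 0.7, 0.8]
p_value = paired_bootstrap_p(active, sorting, n_resamples=1000)
print(p_value)
```

Pairing matters because per-query difficulty varies widely; resampling the differences rather than the two score lists independently removes that shared variance.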
High budgets: sorting can catch up. At $B{=}500$, HeapSort (68.21) narrowly exceeds Mohajer+Bubble (67.02) as global refinement pays off.
### Randomized oracle: faster “time\-to\-quality” for active ranking
Using one call per pair, randomized-direction covers $\sim 2\times$ as many pairs as bidirectional at the same budget. Mohajer is already strong at $B{=}100$ (61.4) and converges at 68.0 by $B{=}250$, suggesting that broader pair coverage may outweigh spending two calls per pair to reduce order effects. HeapSort surpasses Mohajer at $B{=}300$ (68.50 vs. 68.00), reaching 68.71 at $B{=}500$; sorting remains preferable once budgets are large enough for global refinement to dominate.
### End\-to\-end efficiency in the full pipeline
Table [2](https://arxiv.org/html/2605.14236#S4.T2) reports end-to-end results ($N{=}100$, $K{=}10$). For Flan-T5-XL, PRP baselines use 941–1669 calls/task (Avg. 56.8–60.4 NDCG@10), while Mohajer and PAC under the randomized oracle use 184–345 calls (Avg. 55.0–57.3), a $\sim$3–5$\times$ reduction. Randomized-direction itself accounts for part of the saving: Mohajer drops from 399 to 232 calls per task, making adaptive scheduling with the randomized oracle a strong low-cost design. Touché is an outlier where BM25 is strong and neural reranking headroom is limited.
### Order effects and comparisons\.
Bidirectional prompting reverses the preferred document on 20.6% of pairs, confirming substantial order effects. Mohajer+Bubble achieves competitive or better NDCG@10 than PRP-Graph with fewer comparisons as model size increases. A top-$k$ sweep shows the crossover where global refinement becomes preferable moves earlier as $k$ increases. Details in the appendix.
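The flip rate quoted above can be computed from paired calls as in the following sketch. The function name and input layout are assumptions for illustration; the convention that 1 means "the document shown first was preferred" follows the oracle definition in §3.

```python
from typing import Iterable, Tuple


def flip_rate(outcomes: Iterable[Tuple[int, int]]) -> float:
    """Fraction of pairs whose winner changes when the presentation order is swapped.

    Each element is (LLM(d_i, d_j), LLM(d_j, d_i)), where 1 means the document
    shown first was preferred. Equal values therefore mean the two calls picked
    different documents, i.e. the judge followed position rather than content.
    """
    pairs = list(outcomes)
    if not pairs:
        return 0.0
    flips = sum(1 for forward, backward in pairs if forward == backward)
    return flips / len(pairs)


# (1, 0) and (0, 1) are order-consistent; (1, 1) and (0, 0) are flips.
rate = flip_rate([(1, 0), (1, 1), (0, 1), (0, 0)])
print(rate)
```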
### Latency
Latency is estimated as a sequential upper bound: inference time per comparison times total comparisons, ignoring parallelism (see Appendix [A](https://arxiv.org/html/2605.14236#A1)). Figure [1](https://arxiv.org/html/2605.14236#S5.F1) shows active ranking reaching strong quality earlier (Mohajer at 23.3s, PAC at 10.1s), with sorting overtaking only at long runtimes. Both active rankers support within-query parallelism (independent tournaments / anchor comparisons), potentially reducing wall-clock time by an order of magnitude. Qwen and H100/H200 results are in the appendix.
Figure 1: TREC DL 2019/2020 (Flan-T5-XL, A100): NDCG@10 vs. avg. time per task (randomized oracle). X marks the point at which a method has completed all its scheduled comparisons (convergence).
## 6 Conclusion
We argue that PRP reranking is better modeled as budgeted learning from noisy pairwise comparisons than deterministic sorting. Active rankers yield higher NDCG@10 at low budgets, while sorting helps mainly once large budgets make global refinement affordable. Randomized-direction prompting further improves efficiency: at the same budget, it covers roughly twice as many pairs as bidirectional. For practitioners deploying PRP in RAG pipelines, our results suggest a simple recipe: use Mohajer with the randomized-direction oracle when the call budget exceeds the warm-up threshold ($\sim K{\times}K$ calls), and fall back to sorting when budgets are either very small or large enough for global refinement.
## Limitations
Our study focuses on settings where a reliable pairwise LLM comparator can be elicited with constrained outputs\. Results may vary with prompt design, model family, and decoding settings\. Our cost metric counts LLM calls but omits system\-level overheads \(batching, network latency\); latency measurements are not fully end\-to\-end\. Parallel execution was not implemented, though both algorithms naturally support it \(Appendix[A](https://arxiv.org/html/2605.14236#A1)\)\.
The NDCG@10 gains from randomized\-direction oracles are empirically consistent but not theoretically explained; the hypothesis that independent single\-direction samples benefit adaptive algorithms more than correlated bidirectional samples is plausible but unproven\.
Our PAC best-$K$ method (PAC+Bubble) introduces a candidate pool multiplier hyperparameter $m$ (default $m=3$), which controls the trade-off between comparison cost and coverage of the prior ranking. We did not perform a systematic ablation over $m$ due to computational constraints, and the optimal value likely depends on prior quality and dataset characteristics. While $m=3$ yields strong empirical results in our experiments, future work should investigate data-driven or adaptive selection strategies for this parameter.
Finally, active ranking theory often assumes conditional independence of oracle outputs; real LLM APIs can violate this through hidden state, caching, or nonstationarity\.
## References
- Agarwal et al\. \(2022\)Arpit Agarwal, Sanjeev Khanna, and Prathamesh Patil\. 2022\.[Pac top\-$k$ identification under sst in limited rounds](https://proceedings.mlr.press/v151/agarwal22a.html)\.In*Proceedings of The 25th International Conference on Artificial Intelligence and Statistics*, volume 151 of*Proceedings of Machine Learning Research*, pages 6814–6839\. PMLR\.
- Chen et al\. \(2024\)Yiqun Chen, Qi Liu, Yi Zhang, Weiwei Sun, Daiting Shi, Jiaxin Mao, and Dawei Yin\. 2024\.[Tourrank: Utilizing large language models for documents ranking with a tournament\-inspired strategy](https://doi.org/10.48550/arXiv.2406.11678)\.*CoRR*, abs/2406\.11678\.
- Dong et al\. \(2024\)Jialin Dong, Bahare Fatemi, Bryan Perozzi, Lin F\. Yang, and Anton Tsitsulin\. 2024\.[Don’t forget to connect\! improving rag with graph\-based reranking](https://doi.org/10.48550/arXiv.2405.18414)\.*arXiv preprint arXiv:2405\.18414*\.
- Heckel et al\. \(2016\)Reinhard Heckel, Nihar B\. Shah, Kannan Ramchandran, and Martin J\. Wainwright\. 2016\.[Active ranking from pairwise comparisons and when parametric assumptions don’t help](https://doi.org/10.48550/arXiv.1606.08842)\.*arXiv preprint arXiv:1606\.08842*\.
- Huang et al\. \(2025\)Jerry Huang, Siddarth Madala, Cheng Niu, Julia Hockenmaier, and Tong Zhang\. 2025\.[Contextual relevance and adaptive sampling for LLM\-based document reranking](https://doi.org/10.48550/arXiv.2511.01208)\.*CoRR*, abs/2511\.01208\.
- Jeong et al\. \(2025\)Hawon Jeong, ChaeHun Park, Jimin Hong, Hojoon Lee, and Jaegul Choo\. 2025\.[The comparative trap: Pairwise comparisons amplifies biased preferences of LLM evaluators](https://doi.org/10.18653/v1/2025.blackboxnlp-1.5)\.In*Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP*, pages 79–108, Suzhou, China\. Association for Computational Linguistics\.
- Luo et al\. \(2024\)Jian Luo, Xuanang Chen, Ben He, and Le Sun\. 2024\.[PRP\-graph: Pairwise ranking prompting to LLMs with graph aggregation for effective text re\-ranking](https://doi.org/10.18653/v1/2024.acl-long.313)\.In*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 5766–5776, Bangkok, Thailand\. Association for Computational Linguistics\.
- Mohajer et al\. \(2017\)Soheil Mohajer, Changho Suh, and Adel Elmahdy\. 2017\.[Active learning for top\-$k$ rank aggregation from noisy comparisons](https://proceedings.mlr.press/v70/mohajer17a.html)\.In*Proceedings of the 34th International Conference on Machine Learning*, volume 70 of*Proceedings of Machine Learning Research*, pages 2488–2497\. PMLR\.
- Qin et al\. \(2024\)Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Le Yan, Jiaming Shen, Tianqi Liu, Jialu Liu, Donald Metzler, Xuanhui Wang, and Michael Bendersky\. 2024\.[Large language models are effective text rankers with pairwise ranking prompting](https://doi.org/10.18653/v1/2024.findings-naacl.97)\.In*Findings of the Association for Computational Linguistics: NAACL 2024*, pages 1504–1518, Mexico City, Mexico\. Association for Computational Linguistics\.
- Ren et al\. \(2020\)Wenbo Ren, Jia Liu, and Ness Shroff\. 2020\.[The sample complexity of best\-$k$ items selection from pairwise comparisons](https://proceedings.mlr.press/v119/ren20a.html)\.In*Proceedings of the 37th International Conference on Machine Learning*, volume 119 of*Proceedings of Machine Learning Research*, pages 8051–8072\.
- Shah and Wainwright \(2018\)Nihar B\. Shah and Martin J\. Wainwright\. 2018\.[Simple, robust and optimal ranking from pairwise comparisons](https://jmlr.org/papers/v18/16-206.html)\.*Journal of Machine Learning Research*, 18\(199\):1–38\.
- Shi et al\. \(2024\)Lin Shi, Chiyu Ma, Wenhua Liang, Xingjian Diao, Weicheng Ma, and Soroush Vosoughi\. 2024\.[Judging the judges: A systematic study of position bias in llm\-as\-a\-judge](https://doi.org/10.48550/arXiv.2406.07791)\.*arXiv preprint arXiv:2406\.07791*\.
- Sun et al\. \(2025\)Jiashuo Sun, Xianrui Zhong, Sizhe Zhou, and Jiawei Han\. 2025\.[Dynamicrag: Leveraging outputs of large language model as feedback for dynamic reranking in retrieval\-augmented generation](https://doi.org/10.48550/arXiv.2505.07233)\.*arXiv preprint arXiv:2505\.07233*\.
- Sun et al\. \(2023\)Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, and Zhaochun Ren\. 2023\.[Is ChatGPT good at search? investigating large language models as re\-ranking agents](https://doi.org/10.18653/v1/2023.emnlp-main.923)\.In*Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 14918–14937, Singapore\. Association for Computational Linguistics\.
- Wang et al\. \(2025\)Pinhuan Wang, Zhiqiu Xia, Chunhua Liao, Feiyi Wang, and Hang Liu\. 2025\.[REALM: Recursive relevance modeling for LLM\-based document re\-ranking](https://doi.org/10.18653/v1/2025.emnlp-main.1218)\.In*Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pages 23875–23889, Suzhou, China\. Association for Computational Linguistics\.
- Wisznia et al\. \(2025\)Juan Wisznia, Cecilia Bolaños, Juan Tollo, Giovanni Franco Gabriel Marraffini, Agustín Andrés Gianolini, Noe Fabian Hsueh, and Luciano Del Corro\. 2025\.[Are optimal algorithms still optimal? rethinking sorting in LLM\-based pairwise ranking with batching and caching](https://doi.org/10.18653/v1/2025.acl-short.83)\.In*Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\)*, pages 1064–1072, Vienna, Austria\. Association for Computational Linguistics\.
- Wu et al\. \(2025\)Jingyu Wu, Aditya Shrivastava, Jing Zhu, Alfy Samuel, Anoop Kumar, and Daben Liu\. 2025\.[Llm optimization unlocks real\-time pairwise reranking](https://doi.org/10.48550/arXiv.2511.07555)\.*arXiv preprint arXiv:2511\.07555*\.
- Yin et al\. \(2025\)Haonan Yin, Shai Vardi, and Vidyanand Choudhary\. 2025\.[Fragile preferences: A deep dive into order effects in large language models](https://doi.org/10.48550/arXiv.2506.14092)\.*arXiv preprint arXiv:2506\.14092*\.
- Zhou et al\. \(2025\)Yinxin Zhou, Qin Luo, Bin Feng, and Bang Wang\. 2025\.[Large language models for reranking: A survey](https://doi.org/10.36227/techrxiv.176300630.01740917/v1)\.[https://doi\.org/10\.36227/techrxiv\.176300630\.01740917/v1](https://doi.org/10.36227/techrxiv.176300630.01740917/v1)\.TechRxiv preprint\.
- Zhu et al\. \(2023\)Yutao Zhu, Huaying Yuan, Shuting Wang, Jiongnan Liu, Wenhan Liu, Chenlong Deng, Haonan Chen, Zheng Liu, Zhicheng Dou, and Ji\-Rong Wen\. 2023\.[Large language models for information retrieval: A survey](https://doi.org/10.48550/arXiv.2308.07107)\.*arXiv preprint arXiv:2308\.07107*\.
- Zhuang et al\. \(2024\)Shengyao Zhuang, Honglei Zhuang, Bevan Koopman, and Guido Zuccon\. 2024\.[A setwise approach for effective and highly efficient zero\-shot ranking with large language models](https://doi.org/10.1145/3626772.3657813)\.In*Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval \(SIGIR ’24\)*, pages 38–47\. ACM\.
## Appendix A Parallelization Opportunities
Both our proposed active ranking algorithms exhibit substantial parallelization potential that could significantly reduce wall\-clock latency in practice\.
### Mohajer\.
The Mohajer algorithm’s structure naturally lends itself to parallel execution at all levels. First, the $K$ independent tournaments used for champion selection can be executed concurrently, as each tournament operates on a disjoint subset of candidates with no inter-group dependencies. Second, the heapify operation itself is amenable to parallel construction techniques, allowing the initial heap formation to proceed in $O(\log n)$ depth rather than sequential $O(n)$ time. Finally, within each tournament, pairwise comparisons at the same tree depth are independent and can be batched for simultaneous LLM inference.
### Optimized PAC\.
Our optimized PAC algorithm also presents parallelization opportunities, where comparisons of candidates against the selected anchors are entirely independent. With $K/2$ anchors and a candidate pool of size $K\times\text{pool\_mult}$, these $O(K^{2})$ comparisons can be executed in parallel, subject only to the LLM inference throughput. The subsequent winner-set construction and greedy accumulation require only lightweight aggregation, which remains negligible compared to comparison costs.
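As a worked instance of this count, take the main-text defaults $K=10$ and pool multiplier $m=3$ (the quantity written as pool_mult above). The bound below is a sketch that assumes every pool candidate is compared once against every anchor; the symbol $C_{\text{anchor}}$ is introduced here only for illustration, and the true count may be slightly lower if anchors are themselves drawn from the pool.

```latex
\[
C_{\text{anchor}} \;\le\; \frac{K}{2}\cdot (K m) \;=\; \frac{m K^{2}}{2},
\qquad
K = 10,\; m = 3 \;\Rightarrow\; C_{\text{anchor}} \le 5 \cdot 30 = 150 .
\]
```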
### Latency implications\.
Assuming an LLM inference API with high throughput (e.g., batched GPU inference or distributed serving), parallelization could theoretically reduce Mohajer’s latency from $O(n\log K)$ sequential comparisons to $O(\log Q\cdot\log K)$ parallel rounds, and PAC’s latency from $O(K\sqrt{n})$ comparisons to $O(\sqrt{n})$ rounds with batch size $K$. In our TREC DL experiments, where Mohajer performs $\sim$350 comparisons and PAC $\sim$185 comparisons, parallelization with batch size 10 could reduce round counts to $\sim$35 and $\sim$19 respectively, yielding substantial speedups on systems where comparison latency dominates over LLM throughput.
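The round counts quoted above follow from dividing the comparison count by the batch size and rounding up. This is an idealized sketch that assumes comparisons can always be packed into full batches with no dependencies inside a batch; the symbols $R$, $C$, and $b$ (rounds, comparisons, batch size) are introduced here for illustration.

```latex
\[
R = \left\lceil \frac{C}{b} \right\rceil,
\qquad
R_{\text{Mohajer}} \approx \left\lceil \frac{350}{10} \right\rceil = 35,
\qquad
R_{\text{PAC}} \approx \left\lceil \frac{185}{10} \right\rceil = 19 .
\]
```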
## Appendix B Supplementary Tables
This section provides supplementary results and ablations referenced in the main text\.
\(a\)TREC DL2019
\(b\)TREC DL2020
Table A\.1:Budgeted NDCG@10 \(%\) on TREC DL2019 and DL2020 using Flan\-T5\-XL under bidirectional vs\. randomized\-direction prompting\. Budgets denote the number of LLM calls\. Within each oracle block, bold/underline indicate best/second\-best per budget column among the listed rankers\.Table A\.2:Top\-kksweep with randomized\-direction prompting \(Flan\-T5\-XL\)\. Values are average NDCG@kk\(%\) over TREC DL2019 and DL2020\. Budgets denote the number of LLM calls\. For each\(k,budget\)\(k,\\text\{budget\}\), bold marks the better of BubbleSort vs\. Mohajer\.†indicates the first budget where the method is treated as converged; subsequent budgets are de\-emphasized\.Note\.Mohajer dominates BubbleSort whenkkand the budget are low\. Askkincreases, the crossover point where BubbleSort overtakes Mohajer arrives sooner in terms of number of LLM calls\.
Table A\.3:Comparison to PRP\-Graph on TREC DL2019/2020\. NDCG@10 is reported in percent\. “\#Inf\.” is the average number of inferred pairwise comparisons used by the method \(lower is fewer comparisons\)\. Bold/underline indicate best/second\-best NDCG@10 within each \(DL year, LLM\) block\.Note\.Our results show that with larger Flan models, Mohajer \+ Bubble has increasingly better results than PRP\-Graph with fewer comparisons\. PAC \+ Bubble falls behind in performance but has significantly fewer comparisons\.
Table A\.4: End\-to\-end NDCG@10 \(%\) and average pairwise\-comparison calls per task\. For each reranker and column, bold indicates the better oracle variant \(higher NDCG; lower calls\)\. If the two oracle variants tie after rounding, both are bolded\.

Note\. Active\-learning rankers are more robust to noisy comparisons and profit most from randomized\-direction sampling\. BubbleSort requires many more comparisons as cycles emerge, but these cycles let the latent transitive ordering surface more than under a deterministic oracle, yielding higher NDCG@10\.
Table A\.5: Bidirectional flip\-rate sanity check, Flan\-T5\-XL \(order effects\)\. For each candidate pair $\{d_i, d_j\}$, we query the judge twice \(once as $(d_i, d_j)$ and once as $(d_j, d_i)$\) and report the fraction of pairs whose winner flips\. Results are shown overall, stratified by dataset, and stratified by BM25 rank distance $|r(i) - r(j)|$\.

Table A\.6: Average NDCG@10 \(%\) on TREC DL 2019 with Qwen3\-4B\-Instruct across comparison budgets\. Bold = best per column; underline = second\-best per column \(within each oracle block\)\. † indicates the smallest budget at which a method completes; results at larger budgets are visually de\-emphasized\.

Note\. Though not always the best in performance, active ranking algorithms reach competitive NDCG@10 relative to the classic algorithms, with drastic improvements in terms of average calls per task\.
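The flip\-rate check described for Table A\.5 can be sketched as follows. This is an illustrative reading of the caption, not the authors' script: `judge(a, b)` is a hypothetical callable assumed to return 1 when the first\-listed document is preferred and 0 otherwise.

```python
from itertools import combinations


def flip_rate(docs, judge):
    """Fraction of unordered pairs whose winner changes when the order is swapped.

    `judge(a, b)` is a hypothetical callable: 1 if the first-listed document
    is preferred, 0 if the second-listed one is.
    """
    pairs = list(combinations(docs, 2))
    if not pairs:
        return 0.0
    flips = 0
    for di, dj in pairs:
        winner_fwd = di if judge(di, dj) == 1 else dj  # query as (d_i, d_j)
        winner_rev = dj if judge(dj, di) == 1 else di  # query as (d_j, d_i)
        flips += winner_fwd != winner_rev
    return flips / len(pairs)


# Toy judge with extreme position bias: always prefers the first-listed document.
def always_first(a, b):
    return 1


print(flip_rate(["d1", "d2", "d3"], always_first))  # 1.0
```

A judge with no order sensitivity would give a flip rate of 0 under this definition.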
Table A\.7: TREC DL2019/2020 end\-to\-end NDCG@10 \(%\) and average pairwise\-comparison calls per task using Qwen3\-4B\-Instruct\. Baselines are reported with a bidirectional oracle; our methods \(PAC\+Bubble and Mohajer variants\) include both bidirectional and randomized comparisons\. Within the Qwen block, best is bold and second\-best is underlined per column \(higher NDCG; lower calls\)\.

Note\. Though not always the best in performance, active ranking algorithms reach competitive NDCG@10 relative to the classic algorithms, with drastic improvements in terms of average calls per task\.
## Appendix C Supplementary Graphs
This section presents supplementary figures for the latency experiments referenced in the main text\.



Figure 2: TREC DL 2019 and DL 2020 \(Flan\-T5\-XL\): NDCG@10 vs\. estimated time per task across GPUs, with both oracles shown\. Colors denote rankers; solid lines are randomized and dotted lines are bidirectional oracles\. X marks show when an algorithm has converged\.


Figure 3: TREC DL 2019 and DL 2020 \(Qwen3\-4B\-Instruct\-2507\): NDCG@10 vs\. estimated time per task across GPUs, with both oracles shown\. Colors denote rankers; solid lines are randomized and dotted lines are bidirectional oracles\. X marks show when an algorithm has converged\.
## Appendix D Statistical Significance
To quantify the stability and significance of the reported results, we conduct two complementary non\-parametric bootstrap analyses\. First, we quantify seed\-resampling uncertainty: we evaluate each method under 8 different oracle seeds and report the 95% confidence interval \(CI\) half\-width of the mean NDCG@10 across seeds \(10,000 resamples\)\. These CIs are reported directly in Table [1](https://arxiv.org/html/2605.14236#S2.T1) for the randomized\-direction oracle \(the bidirectional oracle is deterministic given pairwise outcomes\)\.
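A minimal sketch of the seed\-resampling CI, assuming a percentile bootstrap (the text does not state which bootstrap interval is used). The function name and the per\-seed values below are placeholders for illustration, not results from the paper.

```python
import random
import statistics


def bootstrap_ci_half_width(values, n_resamples=10_000, level=0.95, seed=0):
    """Percentile-bootstrap CI half-width for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample the per-seed scores with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lo = means[int(alpha * n_resamples)]
    hi = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return (hi - lo) / 2.0


# Placeholder mean NDCG@10 values (%), one per oracle seed (illustrative only).
per_seed_ndcg = [70.1, 69.8, 70.4, 70.0, 69.6, 70.3, 70.2, 69.9]
half_width = bootstrap_ci_half_width(per_seed_ndcg)
print(f"{statistics.fmean(per_seed_ndcg):.2f} +/- {half_width:.2f}")
```

With only 8 seeds the interval is coarse, which is one reason to pair it with the per\-query test described next.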
Second, we report paired bootstrap significance tests over queries\. For each budget, we resample queries with replacement \(10,000 resamples\) and measure whether the mean difference in NDCG@10 between two methods is significantly different from zero \($p < 0.05$\)\. Table [A\.8](https://arxiv.org/html/2605.14236#A4.T8) reports tests comparing Mohajer\+Bubble against BubbleSort; Table [A\.9](https://arxiv.org/html/2605.14236#A4.T9) compares Mohajer\+Bubble against HeapSort\.
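The paired test over queries can be sketched as below. The two\-sided p\-value construction (twice the smaller tail fraction of resampled mean differences on either side of zero) is an assumption, since the text does not specify it; `paired_bootstrap` and the example scores are hypothetical.

```python
import random
import statistics


def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Return (mean_delta, symbol) for aligned per-query scores of A and B.

    symbol is "up" (A significantly better), "down" (A significantly worse),
    or "=" (not significant), mirroring the arrows used in the tables.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be aligned per query")
    deltas = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    n = len(deltas)
    # Resampling the per-query deltas keeps each query's pair of scores together.
    boot = [statistics.fmean(rng.choices(deltas, k=n)) for _ in range(n_resamples)]
    frac_le0 = sum(m <= 0 for m in boot) / n_resamples
    frac_ge0 = sum(m >= 0 for m in boot) / n_resamples
    p_value = min(1.0, 2.0 * min(frac_le0, frac_ge0))  # assumed two-sided rule
    mean_delta = statistics.fmean(deltas)
    if p_value >= alpha:
        return mean_delta, "="
    symbol = "up" if mean_delta > 0 else "down"
    return mean_delta, symbol


# Placeholder per-query NDCG@10 for two methods at one budget (illustrative only).
method_a = [0.72, 0.65, 0.80, 0.58, 0.91, 0.77]
method_b = [0.70, 0.66, 0.74, 0.55, 0.90, 0.71]
print(paired_bootstrap(method_a, method_b))
```

Running this once per budget column would yield one cell of a table laid out like Tables A\.8 and A\.9.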
Table A\.8: Paired bootstrap significance over queries for Mohajer\+Bubble vs\. BubbleSort on TREC DL19\+DL20 \(Flan\-T5\-XL\)\. Each cell shows the direction and mean $\Delta$NDCG@10 difference\. $\uparrow$/$\downarrow$ indicates a statistically significant difference \(paired bootstrap over queries, 10,000 resamples, $p < 0.05$\); $=$ indicates not significant\.

Table A\.9: Paired bootstrap significance over queries for Mohajer\+Bubble vs\. HeapSort on TREC DL19\+DL20 \(Flan\-T5\-XL\)\. Each cell shows the direction and mean $\Delta$NDCG@10 \(A$-$B\)\. $\uparrow$/$\downarrow$ indicates a statistically significant difference \(paired bootstrap over queries, 10,000 resamples, $p < 0.05$\); $=$ indicates not significant\.
## Appendix E Proof of Aggregate Unbiasedness for Randomized\-Direction Oracle
Here we show that, despite the order bias present in each individual inference call, the randomized\-direction oracle still yields comparison results that do not depend on input order\. Namely, we show that the probability of preferring document A over document B does not depend on the order in which the pair is fed to the comparator, and thus fulfills the property $\Pr[V_{ij}=1] = 1 - \Pr[V_{ji}=1]$\. Let $V_{ij}$ be the output of the randomized\-direction oracle, and let $\text{LLM}(a, b) = 1$ denote that the LLM prefers the first\-listed document and $\text{LLM}(a, b) = 0$ that it prefers the second\. By definition, the oracle selects the input order $(d_i, d_j)$ or $(d_j, d_i)$ with equal probability $p = 0.5$\. The probability of the oracle preferring $d_i$ over $d_j$ is given by:
$$
\begin{aligned}
\Pr[V_{ij}=1] &= \frac{1}{2}\Pr[\text{LLM}(d_i,d_j)=1] + \frac{1}{2}\Pr[\text{LLM}(d_j,d_i)=0] \\
&= \frac{1}{2}\left(1-\Pr[\text{LLM}(d_i,d_j)=0]\right) + \frac{1}{2}\left(1-\Pr[\text{LLM}(d_j,d_i)=1]\right) \\
&= 1-\frac{1}{2}\left(\Pr[\text{LLM}(d_j,d_i)=1] + \Pr[\text{LLM}(d_i,d_j)=0]\right) \\
&= 1-\Pr[V_{ji}=1]
\end{aligned}
$$

The last step applies the first line with $i$ and $j$ exchanged\. This result confirms that the oracle is reciprocal in expectation\. While any single LLM inference may be biased toward a specific position, the randomization of the oracle ensures that the aggregate estimator is symmetric and unbiased with respect to the document order\.
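The reciprocity property can be checked numerically with a small sketch. The wrapper `make_randomized_oracle`, the callable `llm_prefers_first`, and the bias probabilities are hypothetical illustrations, not the released implementation; `llm_prefers_first(a, b)` is assumed to return 1 when the first\-listed document is preferred and 0 otherwise.

```python
import random


def make_randomized_oracle(llm_prefers_first, rng=None):
    """Wrap a position-sensitive judge into a one-call randomized-direction oracle."""
    if rng is None:
        rng = random.Random()

    def oracle(di, dj):
        # Flip a fair coin for the presentation order, then make a single call.
        if rng.random() < 0.5:
            return llm_prefers_first(di, dj)   # shown as (d_i, d_j): 1 means d_i wins
        return 1 - llm_prefers_first(dj, di)   # shown as (d_j, d_i): 0 means d_i wins

    return oracle


def pr_v(p_first, i, j):
    """Exact Pr[V_ij = 1], given p_first[(a, b)] = Pr[LLM(a, b) = 1]."""
    return 0.5 * p_first[(i, j)] + 0.5 * (1.0 - p_first[(j, i)])


# Hypothetical judge with a strong first-position bias on the pair (A, B).
p_first = {("A", "B"): 0.9, ("B", "A"): 0.6}
total = pr_v(p_first, "A", "B") + pr_v(p_first, "B", "A")
print(abs(total - 1.0) < 1e-12)  # True: reciprocal despite the position bias
```

Note that reciprocity holds for any values in `p_first`; the randomization does not remove per\-call noise, it only makes the noise symmetric across the two orders.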