Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference

arXiv cs.LG 06/05/26, 04:00 AM Papers

Summary

This paper introduces PRECISE, an extension of Prediction-Powered Inference that combines a small set of human labels with a large set of LLM judgments to produce unbiased and variance-reduced estimates of ranking evaluation metrics like Precision@K. The method is validated on the ESCI benchmark and in a production A/B test, where it correctly identified the best system variant using only 100 human labels, confirmed by a +407 bps sales improvement.

arXiv:2606.05308v1 Announce Type: new Abstract: With PRECISE, we extended Prediction-Powered Inference to produce bias-corrected estimates of ranking evaluation metrics by combining a small human-labeled set with a large LLM-judged set. PPI is provably unbiased regardless of the LLM judge's error profile. We make it applicable to hierarchical metrics like Precision@K, where annotations are per-document but the metric is per-query, by reducing the output-space computation from O(2^|C|) to O(2^K). On the ESCI benchmark, augmenting 30 human annotations with Claude 3 Sonnet judgments reduces the standard error of Precision@4 estimates from 4.45 to 3.50 (a 21% relative reduction). In a production system, our framework correctly identified the best of three system variants from 100 human labels and 2 hours of domain-expert annotation; A/B testing confirmed this ranking with +407 bps in daily sales.

# Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference
Source: [https://arxiv.org/html/2606.05308](https://arxiv.org/html/2606.05308)
###### Abstract

With PRECISE¹, we extended Prediction-Powered Inference to produce bias-corrected estimates of ranking evaluation metrics by combining a small human-labeled set with a large LLM-judged set. PPI is provably unbiased regardless of the LLM judge's error profile. We make it applicable to hierarchical metrics like Precision@K, where annotations are per-document but the metric is per-query, by reducing the output-space computation from $O(2^{|C|})$ to $O(2^{K})$. On the ESCI benchmark, augmenting 30 human annotations with Claude 3 Sonnet judgments reduces the standard error of Precision@4 estimates from 4.45 to 3.50 (a 21% relative reduction). In a production system, our framework correctly identified the best of three system variants from 100 human labels and 2 hours of domain-expert annotation; A/B testing confirmed this ranking with +407 bps in daily sales.

¹ Extended abstract; see the full PRECISE paper at: [https://doi.org/10.1609/aaai.v40i47.41427](https://doi.org/10.1609/aaai.v40i47.41427)

Abhishek Divekar, Amazon, adivekar@amazon.com

## 1 Introduction

Human evaluation is expensive, yet smaller labeled sets produce wide confidence intervals that cannot distinguish genuine system improvements from noise. LLM-as-a-Judge approaches attempt to address this, but carry systematic biases that distort evaluation metrics when used as substitutes for human annotation (Chen et al., [2024](https://arxiv.org/html/2606.05308#bib.bib6)).

Most prior work addresses this tension by building better judges through prompt engineering, fine-tuning, or multi-agent debate. We take an orthogonal approach: accept that LLM judges are biased and *correct for the bias statistically*. Our framework extends Prediction-Powered Inference (PPI; Angelopoulos et al., [2023](https://arxiv.org/html/2606.05308#bib.bib3)), a semi-supervised estimation method that combines a small gold set (human labels) with a large LLM-annotated set. The gold set measures the judge's systematic error and corrects for it. The resulting estimate is provably unbiased, and each additional LLM-judged example reduces the variance of the metric estimate without introducing new bias.

A challenge arises for metrics that aggregate granular judgments into a higher\-level score: for Precision@K, human annotations are collected per\-document but the metric is computed per\-query\. Standard PPI cannot handle this granularity mismatch\. We resolve this through a sparse reformulation of the output space \(§[2](https://arxiv.org/html/2606.05308#S2)\), and validate on a public benchmark and a production A/B test \(§[3](https://arxiv.org/html/2606.05308#S3)\)\.

## 2 Method

Let $\mathcal{D}_{g}=\{(x^{(i)}_{g},y^{(i)}_{g})\}_{i=1}^{n}$ be a small gold set with human labels and $\mathcal{D}_{u}=\{x^{(i)}_{u}\}_{i=1}^{N}$ a large set ($N\gg n$) annotated by an LLM $M$. The PPI++ estimator (Angelopoulos et al., [2024](https://arxiv.org/html/2606.05308#bib.bib4)) evaluates:

$$\hat{\mu}_{\text{PPI}} = \underbrace{\frac{\lambda}{N}\sum_{i=1}^{N}\tilde{\mu}_{u}^{(i)}}_{\text{LLM-based estimate}} + \underbrace{\frac{1}{n}\sum_{i=1}^{n}\Big[\phi_{i}-\lambda\,\tilde{\mu}_{g}^{(i)}\Big]}_{\text{bias correction}} \tag{1}$$

where $\phi_{i}$ is the human-grounded metric on the $i$-th gold query, and $\tilde{\mu}_{u}^{(i)},\tilde{\mu}_{g}^{(i)}$ are the LLM-based metric estimates obtained by marginalizing over the judge's output distribution. The parameter $\lambda\in[0,1]$ is tuned to minimize the variance of $\hat{\mu}_{\text{PPI}}$; the estimator remains unbiased for any $\lambda>0$.
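As a hedged aside, the variance-minimizing weight can be worked out directly from Eq. (1). The calculation below assumes that gold and unlabeled queries are i.i.d. draws from the same distribution and that the two sets are independent. The shorthand $\sigma_{\phi}^{2}$, $\sigma_{f}^{2}$, and $\sigma_{\phi f}$ (variance of the human-grounded metric, variance of the LLM-based metric, and their covariance) is introduced here for illustration and is not the paper's notation.

```latex
\operatorname{Var}\big(\hat{\mu}_{\text{PPI}}\big)
  = \frac{\lambda^{2}\sigma_{f}^{2}}{N}
  + \frac{\sigma_{\phi}^{2} - 2\lambda\,\sigma_{\phi f} + \lambda^{2}\sigma_{f}^{2}}{n},
\qquad
\frac{\partial}{\partial\lambda}\operatorname{Var}\big(\hat{\mu}_{\text{PPI}}\big) = 0
\;\Longrightarrow\;
\lambda^{\star} = \frac{\sigma_{\phi f}}{\big(1 + n/N\big)\,\sigma_{f}^{2}}
```

Under these assumptions, $\lambda^{\star}$ is zero when the judge is uncorrelated with the human-grounded metric, and approaches one when $N\gg n$ and the judge tracks it closely. In practice the moments would be replaced by sample estimates and the result clipped to $[0,1]$.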

The bias-correction term (second summand) measures how the LLM judge deviates from human ground truth on the gold set, then subtracts this deviation from the LLM-only estimate. When the LLM is well-calibrated, setting $\lambda\simeq 1$ allows the full unlabeled set to drive variance reduction. When the LLM is heavily biased, we can shrink $\lambda$ toward 0 ($\lambda\simeq 0$) and the estimator relies on gold estimates.
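The two summands of Eq. (1) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `ppi_estimate`, its argument names, and the plug-in rule used when `lam` is omitted (the usual variance-minimizing choice for mean estimation, clipped to $[0,1]$) are assumptions made for this example.

```python
import numpy as np


def ppi_estimate(phi_gold, mu_llm_gold, mu_llm_unlabeled, lam=None):
    """Sketch of the estimator in Eq. (1); names here are illustrative only.

    phi_gold:         human-grounded metric per gold query, shape (n,)
    mu_llm_gold:      LLM-based metric on the same gold queries, shape (n,)
    mu_llm_unlabeled: LLM-based metric per unlabeled query, shape (N,)
    lam:              weight in [0, 1]; if None, a plug-in variance-minimizing
                      value is used (an assumption of this sketch).
    """
    phi_gold = np.asarray(phi_gold, dtype=float)
    mu_llm_gold = np.asarray(mu_llm_gold, dtype=float)
    mu_llm_unlabeled = np.asarray(mu_llm_unlabeled, dtype=float)
    n, N = len(phi_gold), len(mu_llm_unlabeled)

    if lam is None:
        # Plug-in lambda: Cov(phi, mu_llm) / ((1 + n/N) * Var(mu_llm)).
        var_f = np.concatenate([mu_llm_gold, mu_llm_unlabeled]).var(ddof=1)
        cov = np.cov(phi_gold, mu_llm_gold, ddof=1)[0, 1]
        raw = 0.0 if var_f == 0 else cov / ((1 + n / N) * var_f)
        lam = float(np.clip(raw, 0.0, 1.0))

    llm_term = lam * mu_llm_unlabeled.mean()           # first summand
    correction = (phi_gold - lam * mu_llm_gold).mean()  # second summand
    return llm_term + correction, lam


# Toy data: the judge overestimates every gold query by exactly 0.1.
phi = np.array([0.5, 1.0, 0.25, 0.75])
llm_gold = phi + 0.1
llm_unlabeled = np.array([0.6, 0.8, 1.0, 0.6, 0.5])

est_full, _ = ppi_estimate(phi, llm_gold, llm_unlabeled, lam=1.0)
est_gold, _ = ppi_estimate(phi, llm_gold, llm_unlabeled, lam=0.0)
print(est_full, est_gold)
```

With `lam=1.0` the constant overestimate of 0.1 is removed from the LLM-only mean, and with `lam=0.0` the function falls back to the plain gold-set mean, matching the two regimes described above.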

### Hierarchical metrics\.

For Precision@K, annotations are at the query-document level but metrics are calculated per-query. The naive PPI output space is $\{0,1\}^{|C|}$ (one binary relevance variable per corpus document), which is computationally intractable when $|C|$ is in the millions.

We observe that Precision@K depends only on the top-$K$ retrieved documents; thus, it reduces to a scaled dot product over sparse vectors: $\phi(\hat{y},y)=\hat{y}^{\top}y/K$. Because only $K$ positions contribute, the probability mass of all non-retrieved documents collapses into a single weight on the all-zero $K$-vector. This reduces the output space to $\{0,1\}^{K}$.

For each query, the LLM judge provides per-document relevance probabilities $\tilde{p}^{\prime}(d_{k})$ for the $k$-th ranked result. We form a joint distribution over $K$-length binary vectors assuming conditional independence across documents:

$$\tilde{p}(y)=\prod_{k=1}^{K}\tilde{p}^{\prime}(d_{k})^{y_{k}}\,\big(1-\tilde{p}^{\prime}(d_{k})\big)^{(1-y_{k})} \tag{2}$$

where $y_{k}$ is the $k$-th element of $y\in\{0,1\}^{K}$. The LLM-based estimates $\tilde{\mu}^{(i)}$ in Eq. [1](https://arxiv.org/html/2606.05308#S2.E1) are then computed by summing $\phi(\hat{y},y)\cdot\tilde{p}(y)$ over all $2^{K}$ vectors. For typical $K\leq 10$ this sum is tractable.
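The marginalization over $\{0,1\}^{K}$ can be sketched as a direct enumeration. This is an illustrative example, not the authors' code: the function name `expected_precision_at_k` is hypothetical, and $\hat{y}$ is assumed to be the all-ones $K$-vector, meaning every top-$K$ slot holds a retrieved document.

```python
from itertools import product

import numpy as np


def expected_precision_at_k(probs):
    """Sum phi(y_hat, y) * p(y) over all 2^K binary relevance vectors.

    probs: judge relevance probabilities for the top-K retrieved documents.
    Assumption of this sketch: y_hat is the all-ones K-vector.
    """
    probs = np.asarray(probs, dtype=float)
    K = len(probs)
    y_hat = np.ones(K)
    total = 0.0
    for bits in product((0, 1), repeat=K):
        y = np.array(bits, dtype=float)
        # Eq. (2): product of per-document Bernoulli probabilities.
        p_y = np.prod(np.where(y == 1, probs, 1.0 - probs))
        # phi(y_hat, y) = y_hat^T y / K, weighted by the probability of y.
        total += (y_hat @ y / K) * p_y
    return total


judge_probs = [0.9, 0.8, 0.5, 0.2]  # made-up probabilities for K = 4
mu_tilde = expected_precision_at_k(judge_probs)
print(mu_tilde)
```

Because this $\phi$ is linear in $y$, the enumerated value should coincide with the plain average of the per-document probabilities, which gives a cheap sanity check. The loop visits $2^{K}$ vectors, so it is only practical for small $K$.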

## 3 Results

We validate on the ESCI retrieval benchmark (Reddy et al., [2022](https://arxiv.org/html/2606.05308#bib.bib5)) using Claude 3 Sonnet and Haiku as LLM judges, with $n=30$ gold annotations and $N=60{,}000$ unlabeled queries.

![Refer to caption](https://arxiv.org/html/2606.05308v1/figure_esci_ppi_results.png)

Figure 1: Sampling distributions for Precision@4 on ESCI ($N=60$K, Claude 3 Sonnet). Top: $n=30$; bottom: $n=100$. PPI (green) is tighter than gold-only (red) and centered on the true value (yellow dashed). LLM-only estimates (cyan, blue) are biased.

Table 1: Precision@4 estimation on ESCI ($n=30$ gold, $N=60$K LLM-judged). Sonnet reduces standard error from 4.45 to 3.50 (21% relative reduction); Haiku achieves the lowest bias at 12× lower cost.

### Variance reduction and cost.

Table [1](https://arxiv.org/html/2606.05308#S3.T1) shows that PPI with Sonnet reduces standard error from 4.45 to 3.50 (a 21% relative reduction) while maintaining low bias (0.70 vs. 1.04 for gold-only). Haiku achieves comparable quality (SE: 3.86, bias: 0.29) at 12× lower inference cost. Figure [1](https://arxiv.org/html/2606.05308#S3.F1) illustrates why: the PPI sampling distribution (green) is narrower than gold-only (red) because the LLM signal reduces variance, and it stays centered on the true value (yellow dashed) because the bias-correction term removes the judge's systematic error. Separately, we found the framework plateaus at a 100× unlabeled-to-gold ratio: $N=3{,}000$ LLM queries provide nearly identical standard error to $N=60{,}000$ with $n=30$ labeled examples.
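The reported figures can be checked with a short back-of-envelope calculation. The first two expressions only restate numbers from the text. The third is not a result from the paper: it assumes that the gold-only standard error scales as $1/\sqrt{n}$, and the symbol $n_{\text{eq}}$ (the gold-only sample size that would match the PPI standard error) is introduced here for illustration.

```latex
\frac{4.45 - 3.50}{4.45} \approx 0.213,
\qquad
\frac{N}{n} = \frac{3{,}000}{30} = 100,
\qquad
n_{\text{eq}} \approx 30\left(\frac{4.45}{3.50}\right)^{2} \approx 48.5
```

Under that scaling assumption, the Sonnet-augmented estimate from 30 gold labels would be roughly as precise as a gold-only estimate from about 49 labels.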

### Production A/B test\.

In a production search system, our Precision@K formulation ranked three system variants (Control, T1, T2) using $n=100$ human labels, produced in 2 hours of expert annotation, and $N=8{,}400$ LLM judgments. The predicted ranking (T1 > T2 > Control) was confirmed by A/B testing: T1 yielded +407 bps in daily sales and +571 bps in click-through rate. Without PPI correction, LLM-only estimates could not distinguish between variants, because systematic upward bias inflated all estimates similarly; introducing semi-supervised estimation restored discriminative power by correcting for this bias.

Though we validate it on Precision@K, the hierarchical formulation applies in principle to any metric that aggregates fine\-grained judgments \(e\.g\., per\-claim factuality, per\-turn dialogue quality\)\.

## 4 Future Work

Several promising directions remain for future work\. We describe a few of them\.

Synthetic covariates. Reliance on human labels is a major drawback of our estimation method. LLM-generated synthetic datasets can start from fixed gold labels and provide good textual covariates, which may be usable for estimation (Yu et al., [2023](https://arxiv.org/html/2606.05308#bib.bib7); Kowshik et al., [2024](https://arxiv.org/html/2606.05308#bib.bib12)).

Doubly robust estimation (Oosterhuis, [2023](https://arxiv.org/html/2606.05308#bib.bib11)) shares a theoretical grounding with LLM bias correction and could offer a pathway toward real-time, bias-corrected metric inference. Adopting this paradigm would enable online evaluation for our method.

Multiple judges. A complementary line of work aggregates verdicts across several LLM judges, which may match human ratings more closely than a single model (Zheng et al., [2023](https://arxiv.org/html/2606.05308#bib.bib2)). However, the alternative of folding several rubrics into one judge prompt and tuning it jointly turns out to be brittle in practice (Darshan and Divekar, [2026](https://arxiv.org/html/2606.05308#bib.bib10)). A natural extension of our method, then, is to adopt multi-objective optimization procedures, rather than depending on a single all-purpose evaluator.

Agentic critics. Agentic systems increasingly rely on LLM-based critics to score and refine their own outputs (Yuksekgonul et al., [2025](https://arxiv.org/html/2606.05308#bib.bib15); Rudman et al., [2026](https://arxiv.org/html/2606.05308#bib.bib13)), yet these critics inherit the same biases PRECISE corrects. Extending our approach to produce calibrated critic signals from minimal human labels is a promising direction for more reliable agent optimization.

## Acknowledgments

Anirban Majumder contributed to the original version of this work (Divekar and Majumder, [2026](https://arxiv.org/html/2606.05308#bib.bib1)).

## Ethics Statement

All user queries in the production A/B test were anonymized to remove personally identifiable information before being processed by LLM judges or human annotators\. Human annotation was performed by domain experts during normal working hours as part of their regular responsibilities; no crowdworker labor was used\. The framework reduces \(but does not eliminate\) the need for human annotation; it is not intended to replace human evaluation entirely, but to make small human annotation budgets go further\.

## Limitations

Our framework has three limitations. First, we have validated the hierarchical PPI extension only on Precision@K for retrieval; generalization to other hierarchical metrics (e.g., per-claim factuality, per-turn dialogue quality) remains untested. Second, the conditional independence assumption across documents in Eq. [2](https://arxiv.org/html/2606.05308#S2.E2) may not hold when relevance of one document depends on the presence of another (e.g., diversity-sensitive ranking); relaxing this assumption is left to future work. Third, the framework requires a small gold set from the same distribution as the unlabeled set; distribution shift between the gold and unlabeled queries (e.g., from temporal drift) could degrade the bias correction.

## References

- A. N. Angelopoulos, S. Bates, C. Fannjiang, M. I. Jordan, and T. Zrnic (2023). Prediction-powered inference. Science 382(6671), pp. 669–674. [Link](https://doi.org/10.1126/science.adi6000).
- A. N. Angelopoulos, J. C. Duchi, and T. Zrnic (2024). PPI++: efficient prediction-powered inference. arXiv:2311.01453. [Link](https://arxiv.org/abs/2311.01453).
- G. H. Chen, S. Chen, Z. Liu, F. Jiang, and B. Wang (2024). Humans or LLMs as the judge? A study on judgement biases. In Proc. of EMNLP 2024. [Link](https://aclanthology.org/2024.emnlp-main.474/).
- P. Darshan and A. Divekar (2026). When gradients collide: failure modes of multi-objective prompt optimization for LLM judges. arXiv:2605.26046. [Link](https://arxiv.org/abs/2605.26046).
- A. Divekar and A. Majumder (2026). PRECISE: reducing the bias of LLM evaluations using prediction-powered ranking estimation. Proc. of AAAI 2026. [Link](https://ojs.aaai.org/index.php/AAAI/article/view/41427), [Document](https://dx.doi.org/10.1609/aaai.v40i47.41427).
- S. S. Kowshik, A. Divekar, and V. Malik (2024). CorrSynth - a correlated sampling method for diverse dataset generation from LLMs. In Proc. of EMNLP 2024, Y. Al-Onaizan, M. Bansal, and Y. Chen (Eds.), Miami, Florida, USA, pp. 16076–16095. [Link](https://aclanthology.org/2024.emnlp-main.899/), [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.899).
- H. Oosterhuis (2023). Doubly robust estimation for correcting position bias in click feedback for unbiased learning to rank. ACM Trans. Inf. Syst. 41(3). ISSN 1046-8188. [Link](https://doi.org/10.1145/3569453), [Document](https://dx.doi.org/10.1145/3569453).
- C. K. Reddy, L. Màrquez, F. Valero, N. Rao, H. Zaragoza, S. Bandyopadhyay, A. Biswas, A. Xing, and K. Subbian (2022). Shopping queries dataset: a large-scale ESCI benchmark for improving product search. arXiv:2206.06588. [Link](https://arxiv.org/abs/2206.06588).
- W. Rudman, A. Divekar, K. Jain, S. Joseph, S. S. R. Offner, M. Lease, K. Mahowald, G. Durrett, and J. J. Li (2026). VESTA: visual exploration with statistical tool agents. arXiv:2606.00384. [Link](https://arxiv.org/abs/2606.00384).
- Y. Yu, Y. Zhuang, J. Zhang, Y. Meng, A. Ratner, R. Krishna, J. Shen, and C. Zhang (2023). Large language model as attributed training data generator: a tale of diversity and bias. In Proc. of NeurIPS 2023. [Link](https://dl.acm.org/doi/10.5555/3666122.3668555).
- M. Yuksekgonul, F. Bianchi, J. Boen, S. Liu, P. Lu, Z. Huang, C. Guestrin, and J. Zou (2025). Optimizing generative AI by backpropagating language model feedback. Nature 639, pp. 609–616. [Link](https://www.nature.com/articles/s41586-025-08661-4).
- L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica (2023). Judging LLM-as-a-judge with MT-bench and chatbot arena. In Proc. of NeurIPS 2023. [Link](https://dl.acm.org/doi/10.5555/3666122.3668142).
