Metric Match: A Subset Selection Approach to Evaluating LLM Judge Reliability
Summary
This paper introduces Metric Match, a method for selecting a subset of samples for human annotation to estimate LLM judge reliability more efficiently, reducing annotation costs by 32.5% and achieving a win-rate of 0.838 against random selection.
# Metric Match: A Subset Selection Approach to Evaluating LLM Judge Reliability
Source: [https://arxiv.org/html/2606.15029](https://arxiv.org/html/2606.15029)
Alyssa Unell \(Department of Computer Science, Stanford University, aunell@stanford\.edu\); Natalie Dullerud \(Department of Computer Science, Stanford University, ndulleru@stanford\.edu\); Naomi Boneh \(Department of Computer Science, Stanford University, naomicyb@stanford\.edu\); Meena Jagadeesan \(Department of Computer Science, Stanford University, meenaj@seas\.upenn\.edu\); Tatsu Hashimoto \(Department of Computer Science, Stanford University, thashim@stanford\.edu\); Nigam Shah \(Department of Medicine, Stanford University, nigam@stanford\.edu\); Sanmi Koyejo \(Department of Computer Science, Stanford University, sanmi@stanford\.edu\)
###### Abstract
LLM judges are used to reduce the need for costly human labor in evaluating open\-ended text generation\. However, the reliability of these judges depends critically on their alignment with human raters, a property that itself depends on costly human annotations\. In this work, we develop a method \(Metric Match\) for estimating correlation\-based reliability metrics of LLM judges from limited annotations\. Metric Match selects a subset of samples for human annotation such that the subset matches the population reliability metric with respect to acquired synthetic labels\. We empirically show that Metric Match achieves a win\-rate of 0\.838 against random subset selection across four different correlation metrics and 15 datasets, with an 18\.7% decrease in average estimation error and a 32\.5% reduction in annotation needs\. We provide a cost model and highlight a medical case study where our method saves $1,041\.67 compared to random selection for expert annotation\. Further, we shift our task from reliability estimation to reliability classification of whether a given judge is above a deployment threshold, where Metric Match again outperforms random selection\. All project code is publicly available at [https://github\.com/som\-shahlab/MetricMatch](https://github.com/som-shahlab/MetricMatch), and we additionally provide an installable package for ease of use\.
## 1 Introduction
Large language models \(LLMs\) are increasingly used for text generation tasks, but their rapid adoption has outpaced our ability to evaluate them at scale\[[47](https://arxiv.org/html/2606.15029#bib.bib17),[36](https://arxiv.org/html/2606.15029#bib.bib25),[51](https://arxiv.org/html/2606.15029#bib.bib18),[10](https://arxiv.org/html/2606.15029#bib.bib24)\]\. Accordingly, the LLM judge framework\[[22](https://arxiv.org/html/2606.15029#bib.bib23)\], in which one LLM evaluates the outputs of another model, has emerged as a scalable alternative to human annotation\. Scalability gains are particularly relevant in expert\-oriented contexts such as healthcare\[[5](https://arxiv.org/html/2606.15029#bib.bib1)\], where human labeling is slow and expensive\. Recent work has explored this direction through human\-labeled benchmarks\[[18](https://arxiv.org/html/2606.15029#bib.bib52)\]and reference\-free evaluation methods\[[45](https://arxiv.org/html/2606.15029#bib.bib22)\]\.
To responsibly deploy LLM judges in high\-stakes domains such as healthcare\[[30](https://arxiv.org/html/2606.15029#bib.bib110)\], it is necessary to evaluate the reliability of LLM\-generated annotations with respect to human labels\. Specifically, this evaluation serves as a signal to practitioners on whether the LLM judge can reliably replace a costly human annotator\. LLM judge reliability\[[7](https://arxiv.org/html/2606.15029#bib.bib67),[31](https://arxiv.org/html/2606.15029#bib.bib79),[17](https://arxiv.org/html/2606.15029#bib.bib77)\]is often measured using statistical measures in the inter\-rater reliability literature\[[43](https://arxiv.org/html/2606.15029#bib.bib19),[29](https://arxiv.org/html/2606.15029#bib.bib2)\]and standard correlation coefficients\[[44](https://arxiv.org/html/2606.15029#bib.bib3),[26](https://arxiv.org/html/2606.15029#bib.bib4)\]\. However, a key challenge is that these metrics require human annotations in order to calculate the associated reliability score for the target LLM\-judge system, creating a bottleneck for evaluation\. In fact, LLM judge evaluation encounters the same scalability problem that the LLM\-as\-a\-judge framework aims to solve in the first place: human annotations on the full datasets are required to exactly compute reliability\.
In order to mitigate this scalability issue, standard approaches aim to estimate judge reliability using a limited budget for human annotations, focusing on producing an unbiased estimator\. One approach is to collect annotations on a randomly chosen subset for estimation of the reliability metric, which produces an unbiased estimate\[[15](https://arxiv.org/html/2606.15029#bib.bib66)\]\. Random selection or classical statistical sampling permits straightforward finite\-sample analysis of the metric estimates\. Another approach is to generate synthetic labels using another LLM judge and then to correct the systemic bias incurred by these labels\. The bias is typically estimated on a randomly chosen subset, and modern approaches such as Prediction\-Powered Inference \(PPI\)\[[2](https://arxiv.org/html/2606.15029#bib.bib39)\]combine bias correction with the construction of a confidence set\.
In this paper, we take a different perspective: rather than aiming to obtain an unbiased estimator for judge reliability, our goal is to predict judge reliability in order to minimize estimation error\. In low\-annotation regimes, high variance can render an unbiased estimate practically uninformative, making minimization of total estimation error the more relevant objective\. We find that for this estimation task, it is useful to move outside of the set of unbiased estimators\. Specifically, while standard approaches collect human annotations on a randomly chosen subset, we leverage the structure of the reliability metric on the synthetic labels across items\. This structure in turn determines which subset of items to annotate\.
Our main contribution is a new estimation approach \(Metric Match\) for evaluating LLM judge reliability, which combines limited human annotations with access to synthetic labels from other LLMs\. Metric Match leverages a novel subset selection approach: we collect human annotations from a carefully constructed subset which most closely matches the full population in terms of inter\-model reliability between the LLM judge scores and the synthetic labels\. This approach uses the inter\-model metric to guide subset selection for the estimate of the human\-model metric of interest\.
We empirically evaluate Metric Match across a diverse set of models \(Claude\-3\.5\-Sonnet \[[3](https://arxiv.org/html/2606.15029#bib.bib101)\], GPT\-4\.1 \[[38](https://arxiv.org/html/2606.15029#bib.bib103)\], GPT\-5 \[[39](https://arxiv.org/html/2606.15029#bib.bib102)\], Deepseek\-R1 \[[23](https://arxiv.org/html/2606.15029#bib.bib99)\], and Gemini\-2\.5\-pro \[[16](https://arxiv.org/html/2606.15029#bib.bib100)\]\) and datasets \(HANNA \[[12](https://arxiv.org/html/2606.15029#bib.bib7)\], MedVAL \[[1](https://arxiv.org/html/2606.15029#bib.bib78)\], SummEval \[[19](https://arxiv.org/html/2606.15029#bib.bib5)\], and MSLR \[[49](https://arxiv.org/html/2606.15029#bib.bib8)\]\)\. In each context, we consider different sizes of sampling budgets and different correlation metrics \(ICC \[[43](https://arxiv.org/html/2606.15029#bib.bib19)\], Krippendorff’s $\alpha$ \[[29](https://arxiv.org/html/2606.15029#bib.bib2)\], Spearman’s $\rho$ rank correlation \[[44](https://arxiv.org/html/2606.15029#bib.bib3)\], and Kendall’s $\tau$ rank correlation \[[26](https://arxiv.org/html/2606.15029#bib.bib4)\]\)\. Our results are as follows:
1. Estimation error: We empirically show that Metric Match outperforms baselines such as annotating on randomly collected subsets, bias correction, and stratified sampling \(Figure [2](https://arxiv.org/html/2606.15029#S4.F2)\)\. We decrease estimation error by an average of 18\.7% when compared to random selection\. This corresponds to a 32\.5% reduction in annotation requirements \(Figure [3](https://arxiv.org/html/2606.15029#S4.F3)\)\. We provide a cost model to calculate the impact of improved estimation ability, showing cost savings of up to $1,041\.67 for a given dataset, MedVAL \[[1](https://arxiv.org/html/2606.15029#bib.bib78)\]\.
2. Win\-rate evaluation: We then perform a systematic comparison of Metric Match against random selection, the de facto approach in practice\. We observe an average win rate of 0\.838 against random selection, when estimation error is averaged across contexts and budgets\. We also find the average consistently exceeds 0\.65 for each distinct budget and metric \(Table [1\(a\)](https://arxiv.org/html/2606.15029#S4.T1.st1)\)\. We perform a similar analysis for a more fine\-grained win\-rate at a per\-trial level \(Table [1\(b\)](https://arxiv.org/html/2606.15029#S4.T1.st2)\)\.
3. Reliability classification: Finally, we turn to the downstream task of reliability classification\. Specifically, a practitioner may use an LLM judge only if the estimated reliability coefficient is above a pre\-specified deployment threshold\. When we shift our task from reliability estimation to reliability classification, Metric Match has a win rate of 0\.652 compared to random selection \(Table [2](https://arxiv.org/html/2606.15029#S4.T2)\)\.
Accordingly, our work constitutes a substantial step towards scalable LLM judge evaluation, allowing for practitioners to accelerate judge development and improve early failure detection, while aligning with human preferences and maintaining reliability estimation accuracy with fewer annotations\.
## 2 Related Works
### 2\.1 LLM Evaluation and Human Annotation
Evaluating the outputs of large language models is a longstanding challenge, particularly as these systems are deployed in open\-ended and domain\-specific settings\[[47](https://arxiv.org/html/2606.15029#bib.bib17),[36](https://arxiv.org/html/2606.15029#bib.bib25)\]\. As model success moves away from traditional multiple\-choice evaluation style and into messier, real world environments, evaluation of model success becomes more nuanced than current lexical approaches can capture\[[41](https://arxiv.org/html/2606.15029#bib.bib26),[33](https://arxiv.org/html/2606.15029#bib.bib27),[37](https://arxiv.org/html/2606.15029#bib.bib47)\]\. Human evaluation has historically served as the gold standard in these instances due to its sensitivity to nuanced qualities such as coherence, factuality, and appropriateness while also capturing a notion of target quality that we aim to leverage in model alignment\[[51](https://arxiv.org/html/2606.15029#bib.bib18),[40](https://arxiv.org/html/2606.15029#bib.bib48)\]\. However, human annotation is costly and difficult to scale, especially in domains such as medicine and law, where annotators require expert knowledge\[[7](https://arxiv.org/html/2606.15029#bib.bib67)\]\. Expert oncologist labeling can be up to $500 per hour, and with some tasks taking multiple hours with multiple modes of disagreement, it quickly highlights the limitations of human reliance for LLM output evaluation\[[48](https://arxiv.org/html/2606.15029#bib.bib29)\]\. Prior work has examined various dimensions of annotation quality, including inter\-annotator agreement, annotator bias, and the reliability of crowd\-sourced labels, highlighting the problems that arise even if we had unlimited access to human annotations regarding internal human disagreement\[[31](https://arxiv.org/html/2606.15029#bib.bib79),[14](https://arxiv.org/html/2606.15029#bib.bib28)\]\. 
Recent work has explored how practitioners can extend these metrics to allow weaker annotators to enhance the scale of stronger annotators\[[8](https://arxiv.org/html/2606.15029#bib.bib30),[9](https://arxiv.org/html/2606.15029#bib.bib32),[27](https://arxiv.org/html/2606.15029#bib.bib31)\]\.
### 2\.2 LLM\-as\-a\-Judge for Scalable Evaluation
The *LLM\-as\-a\-judge* paradigm, popularized by Gu et al\. \[[22](https://arxiv.org/html/2606.15029#bib.bib23)\], employs LLMs as automated evaluators to score or rank the outputs of other models\. While this approach offers high throughput, subsequent research has identified critical failure modes, including position bias, verbosity bias, and self\-preference bias \[[22](https://arxiv.org/html/2606.15029#bib.bib23),[31](https://arxiv.org/html/2606.15029#bib.bib79)\]\. To mitigate these issues, recent benchmarks such as AlpacaEval \[[18](https://arxiv.org/html/2606.15029#bib.bib52)\] introduce length\-controlled metrics via regression\-based debiasing, while JudgeBench \[[45](https://arxiv.org/html/2606.15029#bib.bib22)\] focuses on objective correctness in knowledge\-heavy domains where human stylistic preferences may mislead automated judges\. Further, comparing judge responses to a subset of human responses using reliability metrics such as ICC, Krippendorff’s $\alpha$, Kendall’s $\tau$, and Spearman’s $\rho$ serves as a signal of the aforementioned judge biases and shortcomings \[[28](https://arxiv.org/html/2606.15029#bib.bib33),[4](https://arxiv.org/html/2606.15029#bib.bib34),[26](https://arxiv.org/html/2606.15029#bib.bib4),[44](https://arxiv.org/html/2606.15029#bib.bib3)\]\. LLM judges are frequently compared to human outputs in order to decide whether a system is acceptable to serve as a human proxy or whether further iteration is needed \[[13](https://arxiv.org/html/2606.15029#bib.bib37)\]\. This approach leaves open the question of how to use judge outputs effectively when human annotations are scarce\.
### 2\.3 Efficient Sampling and Annotation
A body of work on bias correction uses a small set of human labels to “rectify” the bias of automated predictions\. Modern approaches such as Prediction\-Powered Inference \(PPI\) additionally provide valid statistical guarantees on population parameters \[[2](https://arxiv.org/html/2606.15029#bib.bib39),[20](https://arxiv.org/html/2606.15029#bib.bib46)\]\. Bias correction is typically performed on a random subset of data on which human labels are collected\. The choice of which points to label is itself a variance\-reduction problem, and a related line of work frames it through importance sampling: rather than sampling uniformly, points are drawn \(or reweighted\) according to their expected contribution to the estimator’s variance, so that the corrected estimate is more accurate for a fixed labeling budget \[[11](https://arxiv.org/html/2606.15029#bib.bib43),[52](https://arxiv.org/html/2606.15029#bib.bib42)\]\.
A handful of works \(e\.g\., \[[53](https://arxiv.org/html/2606.15029#bib.bib44),[50](https://arxiv.org/html/2606.15029#bib.bib45)\]\) have explored how to move beyond random selection, and instead construct this subset in an active manner\. Such works build a latent factor model on synthetic labels to determine model uncertainty, which then guides active annotation selection\. Our approach is similar in motivation, as we use synthetic labels to construct a subset of data points on which to collect human annotations\. However, these approaches use model uncertainty estimates to sample from uncertain regions of the distribution and also learn the parameters of latent factor models\. These bias correction methods, in addition to PPI and importance sampling, aim to improve the confidence intervals around the estimate, as opposed to optimizing the point estimate itself\.
Our work contributes to a broader body of work on sampling and annotation\. To maximize the information of limited annotation budgets, researchers have looked to methodological antecedents in active learning \(AL\) and optimal experimental design\[[15](https://arxiv.org/html/2606.15029#bib.bib66),[35](https://arxiv.org/html/2606.15029#bib.bib40),[24](https://arxiv.org/html/2606.15029#bib.bib41)\]\. Active learning focuses on iteratively selecting samples with the highest uncertainty or diversity for labeling\[[42](https://arxiv.org/html/2606.15029#bib.bib38)\], while experimental design approaches optimize the one\-step selection of samples to maximize information gain\[[53](https://arxiv.org/html/2606.15029#bib.bib44),[50](https://arxiv.org/html/2606.15029#bib.bib45)\]\.
## 3 Preliminaries
Figure 1: Overview of the LLM judge evaluation framework and our approach\. Given text samples $\mathcal{X}$ and a judge model $M$ to evaluate, we obtain scores $\{M(x),M'(x)\}_{x\in\mathcal{X}}$ for $M'\in\mathcal{M}$\. We select $b$ texts for human annotation\. Equipped with a measure of reliability, function $T$, we estimate a target parameter $\hat{\rho}$\. We show that our improved estimation of $\hat{\rho}$ leads to downstream impacts\.

In this section, we outline the problem setup, describe our proposed method \(Metric Match\) and provide intuition for it, and discuss the experimental setup\.
### 3\.1 Problem Statement
We study the problem of LLM judge evaluation under limited annotation budgets\. Assume we have a large set of text items $\mathcal{X}$, and let $H:\mathcal{X}\rightarrow\mathcal{Y}:=[1,K]$ denote the ground\-truth human scoring function from items to scores, where $K$ is the greatest possible score\. The human scoring function may correspond to a single human rater or an average of multiple raters\. Let $M:\mathcal{X}\rightarrow\mathcal{Y}$ denote the scoring function produced by the LLM judge\. To generate synthetic labels, we have access to a family of models $M'\in\mathcal{M}$, each of which yields a synthetic scoring function $M':\mathcal{X}\rightarrow\mathcal{Y}$\. Note that an average over multiple human raters may produce non\-integer scores, whereas the LLM judge will always produce integer scores\.
We wish to measure the reliability of judge $M$ with respect to human annotator $H$ on textual data\. We consider a reliability metric $T:(\mathcal{Y}\times\mathcal{Y})^{|\mathcal{X}|}\rightarrow\mathbb{R}$, which operates on pairs of scores on a set of text samples and outputs a number quantifying the reliability between scorers\. Given the chosen metric, we want an estimate of the true \(population\-level\) relationship between $M$ and $H$ on $\mathcal{X}$, as captured by $\rho=T\big(\{(M(x),H(x))\}_{x\in\mathcal{X}}\big)$\. With a budget of size $b\geq 1$, we can construct any $S\subseteq\mathcal{X}$ of size $|S|=b$ and obtain human annotations on the subset, $\{H(x)\}_{x\in S}$\. Using the selected human annotations on $S$ and unlimited query access to annotations from $M'\in\mathcal{M}$, we provide an estimate $\widehat{\rho}_{S}$ of $\rho$\.
To exploit the availability of synthetic scoring functions $M':\mathcal{X}\rightarrow\mathcal{Y}$, $M'\in\mathcal{M}$, we introduce inter\-model \(IM\) reliability quantities as measured by $T$, which inform Metric Match\. Denote the inter\-model population reliability between $M$ and $M'$ as $\rho^{\text{IM}}$:

$$\rho^{\text{IM}}=T\big(\{M(x),M'(x)\}_{x\in\mathcal{X}}\big)$$

Unlike the human\-model population metric, the inter\-model population metric is directly calculable, as we can cheaply obtain scores $M(x)$, $M'(x)$ for every $x\in\mathcal{X}$\. Similarly, given any subset $S\subseteq\mathcal{X}$, let $\widehat{\rho}^{\text{IM}}_{S}$ signify the finite sample estimate of $\rho^{\text{IM}}$ on $S$, i\.e\.,

$$\widehat{\rho}^{\text{IM}}_{S}=T\big(\{M(x),M'(x)\}_{x\in S}\big)$$
### 3\.2 Method: Metric Matching
We introduce our method, which we term Metric Match, for subset selection in order to reduce estimation error in reliability evaluation of LLM judges with respect to human raters\. We focus on careful subset selection in order to minimize finite sample error given a budget $b$, relying on synthetic scorers $M'\in\mathcal{M}$ to inform our subset construction\. The key insight in the Metric Match algorithm \(Algorithm [1](https://arxiv.org/html/2606.15029#algorithm1)\) is to construct sets of text examples $S\subseteq\mathcal{X}$ whose inter\-model empirical estimate $\widehat{\rho}_{S}^{\text{IM}}$ closely resembles the inter\-model population metric $\rho^{\text{IM}}$\. We characterize the intuition for and present details of our algorithm in the rest of this section\.
Algorithm 1: Metric Match

Data: Metric function $T$; budget $b$; population text data $\mathcal{X}$; score functions $M$, $M'$ from LLM judges $M$ and $M'\in\mathcal{M}$, respectively; $C$, number of candidate subsets over which to search

Result: Subset $S^{\ast}\subseteq\mathcal{X}$, $\lvert S^{\ast}\rvert=b$; estimator $\widehat{\rho}_{S^{\ast}}$

1. Initialize $\delta_{\min}\leftarrow\infty$
2. Initialize $S^{\ast}\leftarrow\emptyset$
3. $\rho^{\text{IM}}\leftarrow\frac{1}{|\mathcal{M}|}\sum_{M'\in\mathcal{M}}T\big(\{M(x),M'(x)\}_{x\in\mathcal{X}}\big)$
4. for $j=1,\dots,C$ do
    5. Sample $S_{j}$ s\.t\. $|S_{j}|=b$ from $\mathcal{X}$ without replacement
    6. $\widehat{\rho}_{S_{j}}^{\text{IM}}\leftarrow\frac{1}{|\mathcal{M}|}\sum_{M'\in\mathcal{M}}T\big(\{M(x),M'(x)\}_{x\in S_{j}}\big)$
    7. $\delta_{j}=\lvert\widehat{\rho}_{S_{j}}^{\text{IM}}-\rho^{\text{IM}}\rvert$
    8. if $\delta_{j}<\delta_{\min}$ then
        9. $S^{\ast}\leftarrow S_{j}$
        10. $\delta_{\min}\leftarrow\delta_{j}$
11. end for
12. Calculate human\-model metric estimate $\widehat{\rho}_{S^{\ast}}=T\big(Y_{S^{\ast}}^{(M)},Y_{S^{\ast}}^{(H)}\big)$

return $S^{\ast}$, $\widehat{\rho}_{S^{\ast}}$

The intuition is as follows\. In an oracle setting, where $H(x)$ is available for all $x\in\mathcal{X}$, the ideal subset $S\subseteq\mathcal{X}$ of size $b$ to construct for estimation is simply the subset that minimizes the estimation error $|\widehat{\rho}_{S}-\rho|$\. While we do not operate in an oracle setting, we have access to judges $M'\in\mathcal{M}$, and we can compute $M(x),M'(x)$ for every $x\in\mathcal{X}$\. This motivates constructing a subset that minimizes the proxy \(inter\-model\) estimation error, $S^{\ast}:=\arg\min_{S\subseteq\mathcal{X},|S|=b}\lvert\widehat{\rho}^{\text{IM}}_{S}-\rho^{\text{IM}}\rvert$\. The success of this method implicitly assumes that the *relative* inter\-model estimation errors are good enough proxies for the relative human\-model estimation errors: we formalize this in Section [4\.2\.1](https://arxiv.org/html/2606.15029#S4.SS2.SSS1)\. Given this subset, we compute the estimator on the human\-model metric, $\widehat{\rho}_{S^{\ast}}:=T\big(\{M(x),H(x)\}_{x\in S^{\ast}}\big)$\.
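The selection loop of Algorithm 1 can be sketched in a few lines of Python\. The version below is a minimal illustration under stated assumptions, not the released package: it uses only the standard library, takes Spearman’s $\rho$ as an example choice of $T$, and the helper names \(`ranks`, `pearson`, `spearman`, `inter_model`, `metric_match`\) are hypothetical\. Returning 0 when one scorer is constant on a subset is also an assumption made here for simplicity\.

```python
import random
from statistics import mean


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def pearson(a, b):
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    # Assumption: report 0.0 when a scorer is constant (correlation undefined).
    return cov / (va * vb) ** 0.5 if va > 0 and vb > 0 else 0.0


def spearman(a, b):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def inter_model(T, judge, synthetic, idx):
    """Average of T(M, M') over all synthetic scorers M', restricted to idx."""
    m = [judge[i] for i in idx]
    return mean(T(m, [s[i] for i in idx]) for s in synthetic)


def metric_match(T, judge, synthetic, b, C=20, seed=0):
    """Pick the candidate subset whose inter-model metric is closest
    to the inter-model metric on the full population."""
    rng = random.Random(seed)
    n = len(judge)
    rho_im = inter_model(T, judge, synthetic, range(n))
    best, best_delta = None, float("inf")
    for _ in range(C):
        S = rng.sample(range(n), b)  # b distinct indices (no replacement)
        delta = abs(inter_model(T, judge, synthetic, S) - rho_im)
        if delta < best_delta:
            best, best_delta = S, delta
    return best, best_delta
```

Here `judge` is the list of scores $M(x)$ and `synthetic` is a list of score lists, one per $M'\in\mathcal{M}$\. Human annotations are then collected only on the returned indices, and the human\-model estimate is $T$ applied to the judge and human scores on those items\.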
There remain two important details that represent parametric choices about our algorithm:
- Aggregation across synthetic labels: Our selection procedure thus far has assumed there is a single synthetic scorer\. However, we have access to a full ensemble of scorers $\mathcal{M}$\. We compute an inter\-model metric for each pair $(M,M')$ over $M'\in\mathcal{M}$, and average the pairwise computations, resulting in:

  $$\rho^{\text{IM}}=\frac{1}{|\mathcal{M}|}\sum_{M'\in\mathcal{M}}T\big(\{M(x),M'(x)\}_{x\in\mathcal{X}}\big)\qquad\widehat{\rho}_{S}^{\text{IM}}=\frac{1}{|\mathcal{M}|}\sum_{M'\in\mathcal{M}}T\big(\{M(x),M'(x)\}_{x\in S}\big)$$

  We ablate with an alternate aggregation approach in Appendix [E\.1](https://arxiv.org/html/2606.15029#A5.SS1)\.
- Candidate Subset Selection: To solve $S^{\ast}:=\arg\min_{S\subseteq\mathcal{X},|S|=b}\lvert\widehat{\rho}^{\text{IM}}_{S}-\rho^{\text{IM}}\rvert$, we would need to search over ${|\mathcal{X}|\choose b}$ subsets, which is computationally expensive\. To reduce the computational overhead, we sample $C$ i\.i\.d\. candidate subsets, $S_{1},\dots,S_{C}\sim\mathcal{X}^{b}$, from which $S^{\ast}$ is selected as follows: $S^{\ast}:=\arg\min_{S_{j}\in\{S_{1},\dots,S_{C}\}}\lvert\widehat{\rho}^{\text{IM}}_{S_{j}}-\rho^{\text{IM}}\rvert$\. The number of candidate subsets to consider constitutes a hyperparameter of Metric Match\. In our experiments, we set $C=20$\. We vary $C$ and report performance results in Appendix [E\.2](https://arxiv.org/html/2606.15029#A5.SS2)\. We find that even at relatively small $C$ \($C=10$\), we see gains in performance as measured by win\-rate and estimation error\. See further analysis of $C$ in Section [4\.2\.1](https://arxiv.org/html/2606.15029#S4.SS2.SSS1)\.
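To give a sense of why exhaustive search is impractical, the snippet below counts the size\-$b$ subsets for a population of 300 items, the truncated dataset size used in the experiments, and compares it with the $C=20$ candidates actually scored\. The budget $b=30$ is an illustrative value chosen here, not one taken from the paper\.

```python
import math

n = 300  # items per dataset after truncation
b = 30   # illustrative annotation budget (assumption)
C = 20   # candidate subsets scored by Metric Match

total = math.comb(n, b)  # number of distinct size-b subsets
print(f"distinct subsets: a {len(str(total))}-digit number")
print(f"candidates scored: {C}")
```

Each candidate costs one evaluation of $T$ per synthetic scorer on $b$ items, so the search cost grows linearly in $C$ rather than combinatorially in $|\mathcal{X}|$\.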
### 3\.3 Experimental Setup
##### Baselines\.
We compare to the following simple baselines: random selection, stratified sampling, and random selection with a bias correction component\. \(Footnote 2: Due to the structure of the reliability metrics that we consider, we note that even random sampling and bias correction may not be unbiased \(Appendix [A](https://arxiv.org/html/2606.15029#A1)\)\. However, we take these estimators to be baselines in the context of our goal of minimizing estimation error\.\)
- Random selection: A classic approach to estimation of $\rho$ is to select $b$ samples i\.i\.d\. from $\mathcal{X}$ to construct $S$, and estimate $\widehat{\rho}_{S}=T\big(\{M(x),H(x)\}_{x\in S}\big)$\. Random selection remains the de facto method in practice\. However, random selection incurs high variance when $b$ is small through finite sample error\.
- Stratified sampling: Stratified sampling induces a uniform distribution over the judge score space in order to construct $S$\. In other words, given $b$, sample $b/K$ text examples from each of $\{x\in\mathcal{X}\mid M(x)=k\}$ for $k=1,\dots,K$ to ensure full coverage of the score set $\mathcal{Y}$ in the sampled judge scores\. \(Footnote 3: In practice, we partition $\mathcal{Y}$ into quantiles, and sample uniformly from the quantiles\. As LLM judge scores are from a discrete set of integers, the procedure is effectively equivalent\. See further implementation details in Appendix [D](https://arxiv.org/html/2606.15029#A4)\.\)
- Bias correction \(BC\): In bias correction, we leverage the human annotations for the purpose of debiasing the synthetic labels\. We use finite sample error calculated on the inter\-model reliability metric as a stand\-in for finite sample error in the human\-model metric\. Formally, we sample $S\sim\mathcal{X}^{b}$, and calculate the estimate $\widehat{\rho}:=\underbrace{\rho^{\text{IM}}}_{\text{Inter-model metric}}+\underbrace{\widehat{\rho}_{S}-\widehat{\rho}^{\text{IM}}_{S}}_{\text{Bias correction term}}$\.

We use $\widehat{\rho}$ to denote a generic estimator for $\rho$ under any method \(implicitly with access to $b$ samples\)\. Note that in stratified sampling, the important component of the estimator is the construction of subset $S$, and given $S$, $\widehat{\rho}=\widehat{\rho}_{S}$\. The bias correction estimator has a different structure, due to incorporation of synthetic labels and an additive correction term\.
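The two non\-trivial baselines above can be sketched as follows\. This is an illustrative reading of the definitions, not the authors’ implementation: it stratifies on the exact judge scores observed rather than on quantiles, tops up at random when $b$ is not divisible by the number of strata, uses a single synthetic scorer for bias correction, and takes Pearson correlation as a stand\-in for $T$\. All function names are hypothetical\.

```python
import random
from collections import defaultdict
from statistics import mean


def pearson(a, b):
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    # Assumption: report 0.0 when a scorer is constant (correlation undefined).
    return cov / (va * vb) ** 0.5 if va > 0 and vb > 0 else 0.0


def stratified_subset(judge, b, seed=0):
    """Draw about b / K indices from each stratum {x : M(x) = k}."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for i, score in enumerate(judge):
        strata[score].append(i)
    per_stratum = b // len(strata)
    chosen = []
    for k in sorted(strata):
        chosen += rng.sample(strata[k], min(per_stratum, len(strata[k])))
    # Assumption: fill any remaining budget uniformly at random.
    taken = set(chosen)
    rest = [i for i in range(len(judge)) if i not in taken]
    chosen += rng.sample(rest, b - len(chosen))
    return chosen


def bias_corrected(T, judge, synthetic, human_on_S, S):
    """rho_hat = rho_IM + (rho_hat_S - rho_hat_S_IM), one synthetic scorer."""
    m_S = [judge[i] for i in S]
    rho_im = T(judge, synthetic)                     # inter-model, population
    rho_s = T(m_S, human_on_S)                       # human-model, subset
    rho_s_im = T(m_S, [synthetic[i] for i in S])     # inter-model, subset
    return rho_im + rho_s - rho_s_im
```

One consequence visible in the sketch: if the human labels on $S$ coincide with the synthetic labels, the correction term vanishes and the estimate reduces to the population inter\-model metric\.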
##### Evaluation procedure\.
To evaluate Metric Match \(Section [3\.2](https://arxiv.org/html/2606.15029#S3.SS2)\), we measure the estimation error compared to the baselines and also measure the win\-rate of Metric Match compared to the random selection baseline\.
- Estimation error\. Given a method $A$, we directly compute estimation error as $\epsilon^{A}=|\widehat{\rho}-\rho|$, where $\widehat{\rho}$ indicates the estimator given by the method to be evaluated\. We compute the average estimation error over $N=40$ trials\. Let $\epsilon^{\text{rand}}$ be the estimation error incurred by random selection\.
- Win\-rate\. A “win” for any method corresponds to an estimation error smaller than the estimation error of random selection\. Due to the ubiquitous use of random selection in practice, we focus on the win\-rate against random selection\. We consider two variants of the win\-rate\.
  - Micro\-average win\-rate: Over $N$ trials \(with estimation errors $\epsilon_{i}$, $i=1,\dots,N$\), the micro\-average win\-rate is equal to $\frac{1}{N}\sum_{i=1}^{N}\mathbb{1}[\epsilon^{\text{method}}_{i}<\epsilon^{\text{rand}}_{i}]$\. We compute this statistic for each of the 75 dataset/evaluation axis/judge model contexts\. We also report the average of the micro\-average win\-rate over the 75 contexts as a summary metric\.
  - Macro\-average win\-rate: This variant captures the win signal over averaged errors, calculated as $\mathbb{1}\Big[\frac{1}{N}\sum_{i=1}^{N}\epsilon^{\text{method}}_{i}<\frac{1}{N}\sum_{i=1}^{N}\epsilon^{\text{rand}}_{i}\Big]\in\{0,1\}$\. We compute a win signal for each of the 75 contexts, and average the win signal to produce the macro\-average win\-rate as a summary metric\.
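The two win\-rate variants differ only in where the comparison happens: per trial for the micro\-average, and on trial\-averaged errors for the macro\-average\. A minimal sketch, with hypothetical function names and toy error values that are not results from the paper:

```python
def micro_win_rate(method_errors, random_errors):
    """Fraction of paired trials where the method's error is strictly smaller."""
    wins = sum(m < r for m, r in zip(method_errors, random_errors))
    return wins / len(method_errors)


def macro_win(method_errors, random_errors):
    """1 if the method's mean error over trials beats random selection's, else 0."""
    mean_m = sum(method_errors) / len(method_errors)
    mean_r = sum(random_errors) / len(random_errors)
    return int(mean_m < mean_r)


def summarize(contexts):
    """contexts: list of (method_errors, random_errors) pairs, one per context.
    Returns the two summary metrics averaged over contexts."""
    micro = sum(micro_win_rate(m, r) for m, r in contexts) / len(contexts)
    macro = sum(macro_win(m, r) for m, r in contexts) / len(contexts)
    return micro, macro


# Toy data: two contexts with two trials each (illustrative values only).
toy = [
    ([0.1, 0.5], [0.2, 0.2]),  # wins one trial, but loses on average
    ([0.1, 0.1], [0.2, 0.3]),  # wins both trials and on average
]
micro, macro = summarize(toy)
```

The first toy context shows how the two variants can disagree: a method can win half of the individual trials while still having the larger mean error\.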
##### Reliability metrics\.
To capture notions of reliability for evaluation of an LLM judge, we consider the following reliability metrics: intraclass correlation coefficient \(ICC\) \[[43](https://arxiv.org/html/2606.15029#bib.bib19)\], Krippendorff’s $\alpha$ \[[29](https://arxiv.org/html/2606.15029#bib.bib2)\], Spearman’s $\rho$ rank correlation \[[44](https://arxiv.org/html/2606.15029#bib.bib3)\], and Kendall’s $\tau$ rank correlation \[[26](https://arxiv.org/html/2606.15029#bib.bib4)\]\. Based on previous literature \[[7](https://arxiv.org/html/2606.15029#bib.bib67),[13](https://arxiv.org/html/2606.15029#bib.bib37),[34](https://arxiv.org/html/2606.15029#bib.bib85)\], these metrics encompass a simple suite of correlative measures between LLM judge scores and human scores\. We provide formulae for each metric computation and comparative exposition on these metrics in Appendix [A](https://arxiv.org/html/2606.15029#A1)\.
##### Datasets and axes\.
We provide empirical evaluations on 4 diverse real\-world datasets: MSLR \[[49](https://arxiv.org/html/2606.15029#bib.bib8)\], HANNA \[[12](https://arxiv.org/html/2606.15029#bib.bib7)\], MedVAL \[[1](https://arxiv.org/html/2606.15029#bib.bib78)\], and SummEval \[[19](https://arxiv.org/html/2606.15029#bib.bib5)\]\. Each dataset is annotated along multiple axes \(e\.g\., coherence, relevance, surprise\), with a total of 15 sets of human annotations by which to evaluate sampling performance, as shown in Table [3](https://arxiv.org/html/2606.15029#A2.T3)\. All datasets contain samples that are annotated by more than one human; we take the average score of each item across its raters as the true score and compare it to the LLM judge score\. All datasets use Likert scaling, with a range of 1–5 for SummEval and HANNA, 0–2 for MSLR, and 1–4 for MedVAL \[[25](https://arxiv.org/html/2606.15029#bib.bib83)\]\. The full range of datasets, axes, and raters per dataset is presented in Appendix [B](https://arxiv.org/html/2606.15029#A2)\. The evaluation axes span from more objective to more subjective scores, capturing a broad spectrum of evaluation types\. We truncate each dataset to contain 300 human\-labeled samples and perform evaluations on subsets of this annotation set\. We perform the evaluations over 40 trials\. Therefore, with 5 target judge models and 15 sets of human annotations, we evaluate in a total of $5\times 15=75$ contexts over all datasets, axes, and target judge models\.
##### Judges and synthetic labels\.
Our model suite consists of Claude-3.5-Sonnet \[[3](https://arxiv.org/html/2606.15029#bib.bib101)\], GPT-4.1 \[[38](https://arxiv.org/html/2606.15029#bib.bib103)\], GPT-5 \[[39](https://arxiv.org/html/2606.15029#bib.bib102)\], Deepseek-R1 \[[23](https://arxiv.org/html/2606.15029#bib.bib99)\], and Gemini-2.5-pro \[[16](https://arxiv.org/html/2606.15029#bib.bib100)\], with additional model ablations in Appendix [I](https://arxiv.org/html/2606.15029#A9). We run experiments where each model serves as the reference judge model $M$ and the remaining models comprise the synthetic label collection $\mathcal{M}$.
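The leave-one-out rotation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the helper `leave_one_out` and the placeholder axis names are hypothetical, and only the five model names and the count of 15 annotation sets come from the text.

```python
from itertools import product

# Judge models named in the text.
JUDGES = ["Claude-3.5-Sonnet", "GPT-4.1", "GPT-5", "Deepseek-R1", "Gemini-2.5-pro"]

# Hypothetical placeholder names for the 15 dataset/axis annotation sets.
AXES = [f"axis_{k}" for k in range(15)]


def leave_one_out(judges):
    """Yield (reference_judge, synthetic_label_models) pairs.

    Each model takes a turn as the reference judge M; the remaining
    models form the synthetic label collection.
    """
    for i, reference in enumerate(judges):
        yield reference, judges[:i] + judges[i + 1:]


# One evaluation context per (annotation set, reference judge) pair.
contexts = [
    (axis, reference, tuple(others))
    for axis, (reference, others) in product(AXES, leave_one_out(JUDGES))
]

print(len(contexts))  # 5 judges x 15 annotation sets
```

Each context pairs one annotation set with one reference judge, which is how 5 judges and 15 annotation sets give the 75 contexts.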
## 4 Estimation Evaluation
In this section, we discuss empirical results for metric matching. First, we compare the estimation error of Metric Match against baselines and compute the annotation savings of using Metric Match instead of random selection (Section [4.1](https://arxiv.org/html/2606.15029#S4.SS1)). Then, we compute a win-rate of Metric Match against random selection and discuss a connection between win-rate and subset estimation error rank (Section [4.2](https://arxiv.org/html/2606.15029#S4.SS2)). Finally, we evaluate Metric Match on an alternate task of reliability classification given a deployment threshold, highlighting the practical benefit of improved reliability estimation (Section [4.3](https://arxiv.org/html/2606.15029#S4.SS3)).




Figure 2: Metric matched selection outperforms baselines. We report the estimation error by annotation budget across our suite of target metrics, averaged over all datasets. The relative improved estimation error across all metrics is 18.7%.

### 4.1 Estimation Error
We quantify the impact of Metric Match on reliability estimation against random selection, showcasing raw improvement in estimation across metrics, datasets, and budgets in Section [4.1.1](https://arxiv.org/html/2606.15029#S4.SS1.SSS1). From this improvement, we calculate relative annotation savings in Section [4.1.2](https://arxiv.org/html/2606.15029#S4.SS1.SSS2) and the associated cost savings in Section [4.1.3](https://arxiv.org/html/2606.15029#S4.SS1.SSS3).
#### 4.1.1 Raw Estimation Improvement
Figure [2](https://arxiv.org/html/2606.15029#S4.F2) shows the average estimation error of metric matching and the suite of comparable baselines. Across metrics, we see improvement in estimation error with metric matching, most prominently at low budgets, demonstrating the utility of metric matching in labor-constrained domains where human annotation is extremely costly. Notably, we do not observe a degradation in our performance at higher budgets, which we do see in the bias-correction baseline (Random\_bc). We calculate the relative improved estimation error across all metrics as 18.7%, with disaggregated metrics as follows: 26.7% for ICC, 9.1% for Krippendorff’s $\alpha$, 21.0% for Spearman’s $\rho$, and 17.9% for Kendall’s $\tau$.
Our results illustrate that for each of the chosen reliability metrics, Metric Match achieves a lower estimation error than baselines that naively collect human annotations. We display disaggregated results for dataset/axis contexts in Appendix [H](https://arxiv.org/html/2606.15029#A8). In dataset/axis contexts in which metric matching degrades performance relative to random selection, we hypothesize that the key assumption on rank correlation between inter-model estimation error and human-model estimation error is violated. We validate this hypothesis empirically in Appendix [C](https://arxiv.org/html/2606.15029#A3), and provide intuitive analysis in Section [4.2.1](https://arxiv.org/html/2606.15029#S4.SS2.SSS1).
Figure 3: Metric matched selection achieves the same estimation error as random selection with fewer annotations. We highlight the average relative savings over each reliability metric between metric matched selection and random selection, with an average relative improvement of 32.5%.
#### 4.1.2 Relative Annotation Savings
Figure [3](https://arxiv.org/html/2606.15029#S4.F3) shows the average annotations saved by using metric matching for sample selection. For each budget given to random selection, we compute the budget required for Metric Match to match the estimation error of random selection according to Figure [2](https://arxiv.org/html/2606.15029#S4.F2). We plot the relationship between the random selection budget and our required budget. As such, a line with a slope of 1 would indicate that Metric Match performs as well as random selection, while points below that line indicate Metric Match outperforming random selection. We see gains across all metrics observed, with an average relative improvement of 32.5%. Disaggregated results are reported in Appendix [H](https://arxiv.org/html/2606.15029#A8).
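The budget-matching step can be sketched as below. This is an illustration under stated assumptions rather than the authors' procedure: it assumes error curves measured at a fixed grid of budgets and uses linear interpolation between grid points, which the text does not specify. The function name `required_budget` and the error values are hypothetical.

```python
def required_budget(budgets, mm_errors, target_error):
    """Smallest budget at which the Metric Match error curve reaches target_error.

    Assumes linear interpolation between measured budgets (our assumption).
    Returns None if the curve never reaches the target on the measured grid.
    """
    if mm_errors[0] <= target_error:
        return float(budgets[0])
    points = list(zip(budgets, mm_errors))
    for (b0, e0), (b1, e1) in zip(points, points[1:]):
        if e1 <= target_error < e0:
            # Fraction of the way from b0 to b1 at which the error hits the target.
            frac = (e0 - target_error) / (e0 - e1)
            return b0 + frac * (b1 - b0)
    return None


# Made-up error curves, for illustration only.
budgets = [5, 10, 15, 20]
mm_errors = [0.20, 0.15, 0.12, 0.10]
random_errors = [0.25, 0.20, 0.16, 0.13]

for R, err in zip(budgets, random_errors):
    needed = required_budget(budgets, mm_errors, err)
    relative_savings = (R - needed) / R
    print(R, round(needed, 2), round(relative_savings, 3))
```

Plotting `needed` against `R` gives the kind of curve described above, where points below the slope-1 line correspond to positive relative savings.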
#### 4.1.3 Annotation Cost Savings
Let $R$ be the random sampling budget, $t$ be the estimated time per annotation (in hours), and $c$ be the cost per unit time of annotation (in dollars per hour). Given $R$, we determine the required annotation budget $E_m(R)$ by identifying the point at which the estimation error of Metric Match selection equals the estimation error of random selection at $R$. We quantify cost savings as:

$$C(R) = (R - E_m(R)) \cdot t \cdot c \tag{1}$$

where $E_m(R)$ denotes the Metric Matched annotation budget that achieves equivalent estimation accuracy to random sampling at budget $R$. For MedVAL, assuming 5 minutes per annotation ($t = \frac{1}{12}$ hours) at a cost of \$500 per annotator hour (as described in \[[48](https://arxiv.org/html/2606.15029#bib.bib29)\]), with $R = 50$ and $E_m(R) = 25$, this becomes for ICC:

$$\$1{,}041.67 = (50 - 25) \cdot \frac{1}{12} \cdot 500 \tag{2}$$
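The cost model in Equation (1) is simple enough to check numerically. The sketch below uses a hypothetical function name, `annotation_cost_savings`, and plugs in the MedVAL figures quoted above.

```python
def annotation_cost_savings(random_budget, matched_budget,
                            hours_per_annotation, dollars_per_hour):
    """Equation (1): C(R) = (R - E_m(R)) * t * c."""
    return (random_budget - matched_budget) * hours_per_annotation * dollars_per_hour


# MedVAL / ICC case from the text: R = 50, E_m(R) = 25,
# 5 minutes per annotation, $500 per annotator hour.
savings = annotation_cost_savings(50, 25, 5 / 60, 500)
print(f"${savings:,.2f}")  # $1,041.67
```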
### 4.2 Win-rate against random selection baseline
Table [1](https://arxiv.org/html/2606.15029#S4.T1) reports the win-rate of Metric Match against random selection, comparing each method’s estimation error with respect to the target parameter. In terms of macro win-rate, we beat random selection with an average win rate across metrics of 0.838, indicating robustness of metric matching for estimation error across metrics. In micro win-rate, we beat random selection in 53.8% of empirical trials. We include disaggregated win rate tables for individual axis performance in Appendix [H](https://arxiv.org/html/2606.15029#A8), as well as win rates against the other two baselines in Appendix [D](https://arxiv.org/html/2606.15029#A4).
Table 1: Metric matched selection beats random selection for minimizing estimation error. Comparison of Macro (a) and Micro (b) win rates across different budgets. Macro win rates achieve an average of 0.838, while micro win rates achieve an average of 0.538.

(a) Macro Win Rate

| Budget | $\alpha$ | ICC | $\rho$ | $\tau$ |
| --- | --- | --- | --- | --- |
| 5 | 0.853 | 1.000 | 0.947 | 0.987 |
| 10 | 0.827 | 0.987 | 0.947 | 0.973 |
| 15 | 0.747 | 0.947 | 0.907 | 0.907 |
| 20 | 0.733 | 0.920 | 0.960 | 0.920 |
| 25 | 0.720 | 0.907 | 0.880 | 0.907 |
| 30 | 0.733 | 0.893 | 0.867 | 0.893 |
| 35 | 0.653 | 0.853 | 0.893 | 0.773 |
| 40 | 0.653 | 0.840 | 0.800 | 0.760 |
| 45 | 0.667 | 0.800 | 0.800 | 0.747 |
| 50 | 0.653 | 0.773 | 0.787 | 0.693 |
| Avg | 0.724 | 0.892 | 0.879 | 0.856 |

(b) Micro Win Rate

| Budget | $\alpha$ | ICC | $\rho$ | $\tau$ |
| --- | --- | --- | --- | --- |
| 5 | 0.574 | 0.641 | 0.580 | 0.575 |
| 10 | 0.569 | 0.616 | 0.595 | 0.584 |
| 15 | 0.537 | 0.591 | 0.562 | 0.552 |
| 20 | 0.507 | 0.559 | 0.556 | 0.546 |
| 25 | 0.511 | 0.554 | 0.560 | 0.555 |
| 30 | 0.501 | 0.548 | 0.551 | 0.535 |
| 35 | 0.476 | 0.522 | 0.540 | 0.521 |
| 40 | 0.467 | 0.511 | 0.529 | 0.505 |
| 45 | 0.476 | 0.505 | 0.527 | 0.509 |
| 50 | 0.465 | 0.484 | 0.528 | 0.509 |
| Avg | 0.508 | 0.553 | 0.553 | 0.539 |
Importantly, while macro win-rate effectively constitutes a measure of the frequency (across contexts) with which our *expected* estimation error beats the *expected* random error, micro win-rate is an empirical estimate of the probability of dominating random selection in a single trial. We see that in both cases, the win rate averaged across budgets is greater than $0.5$ for every metric, although the micro win rate for Krippendorff’s $\alpha$ and ICC falls below $0.5$ at the largest budgets.
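The distinction between the two win rates can be made concrete with a short sketch. It assumes, as our reading of the text rather than a confirmed detail, that each context stores per-trial estimation errors for both methods, that macro compares per-context means, and that micro compares paired trials. The function names and error values are illustrative.

```python
from statistics import mean


def macro_win_rate(contexts):
    """Fraction of contexts where mean Metric Match error beats mean random error."""
    wins = [mean(mm) < mean(rnd) for mm, rnd in contexts]
    return sum(wins) / len(wins)


def micro_win_rate(contexts):
    """Fraction of individual (paired) trials where Metric Match error beats random."""
    wins = [m < r for mm, rnd in contexts for m, r in zip(mm, rnd)]
    return sum(wins) / len(wins)


# Made-up per-trial errors: (metric_match_errors, random_errors) per context.
example_contexts = [
    ([0.10, 0.30, 0.10], [0.20, 0.20, 0.20]),
    ([0.05, 0.25, 0.25], [0.20, 0.20, 0.20]),
]

print(macro_win_rate(example_contexts))
print(micro_win_rate(example_contexts))
```

In this toy example Metric Match has the lower mean error in both contexts but wins only half of the individual trials, which mirrors how a macro win rate can sit well above the micro win rate.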
The improved performance of Metric Match generalizes across model scales (see Appendix [I](https://arxiv.org/html/2606.15029#A9) for small model results), and Metric Match also outperforms random sampling on alternative metrics including mean squared error (Appendix [G](https://arxiv.org/html/2606.15029#A7)). We additionally show results when matching on complementary metrics in Appendix [F](https://arxiv.org/html/2606.15029#A6), highlighting that alternative signals of inter-model reliability may also be helpful for optimal subset selection.
#### 4.2.1 Analysis of win-rate and expected subset rank
As noted in Section [3.2](https://arxiv.org/html/2606.15029#S3.SS2), the performance of metric matching relies on the joint distribution over the ranks of the inter-model and human-model estimation errors in subsets $S \subseteq \mathcal{X}$. Here, we clarify the precise relationship between micro-average win-rate and the expected rank of subset $S^{*}$ over possible subsets with respect to the human-model estimation error. Consider the candidate subsets $S_{1}, \dots, S_{C} \sim \mathcal{X}$: let $\epsilon_{j} := |\widehat{\rho}_{S_{j}} - \rho|$ and $\epsilon_{j}^{\text{IM}} := |\widehat{\rho}^{\text{IM}}_{S_{j}} - \rho^{\text{IM}}|$, and let $j^{*}$ be the index of $S^{*}$ in $\{1, \dots, C\}$. The expected micro-average win-rate of metric matching is exactly $\mathbb{E}\big[\mathbb{1}[\epsilon_{j^{*}} < \epsilon^{\text{rand}}]\big] = \Pr(\epsilon_{j^{*}} < \epsilon^{\text{rand}})$. Let $R$ denote the rank operator with respect to human-model estimation error. Recall that the rank of $S^{*}$ with respect to inter-model estimation error is $1$ by definition. The expected micro-average win-rate of metric matching depends linearly on the expected rank of the $j^{*}$th subset with respect to human-model estimation error, $\mathbb{E}[R[\epsilon_{j^{*}}]]$. We provide an explicit formula, derivation and empirical validation of this relationship in Appendix [C](https://arxiv.org/html/2606.15029#A3).
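One way to see why the dependence is linear is the following sketch. It rests on simplifying assumptions of ours that are not necessarily those of Appendix C: the random baseline picks its subset uniformly at random from the same $C$ candidates, independently of $j^{*}$; the human-model errors $\epsilon_{1}, \dots, \epsilon_{C}$ have no ties; and rank $1$ denotes the smallest error. Conditional on $R[\epsilon_{j^{*}}] = r$, exactly $C - r$ candidates have strictly larger error, so

$$\Pr\big(\epsilon_{j^{*}} < \epsilon^{\text{rand}} \mid R[\epsilon_{j^{*}}] = r\big) = \frac{C - r}{C}, \qquad \Pr\big(\epsilon_{j^{*}} < \epsilon^{\text{rand}}\big) = 1 - \frac{\mathbb{E}\big[R[\epsilon_{j^{*}}]\big]}{C}.$$

Under these assumptions the win-rate exceeds $0.5$ exactly when the expected rank is below $C/2$. The constants in the appendix may differ, for example if the random subset is drawn from a different pool.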
### 4.3 Reliability Classification
In practice, a key motivation for estimating judge reliability is to determine whether the judge is sufficiently reliable to be used for the downstream task. This corresponds to a classification task of determining whether judge reliability is above a certain threshold $t$ \[[20](https://arxiv.org/html/2606.15029#bib.bib46)\]. We evaluate the performance of Metric Match relative to random selection for the purposes of this classification task in Table [2](https://arxiv.org/html/2606.15029#S4.T2). Given an estimate $\rho$, we derive a prediction based on $\mathbb{1}[\rho > t]$. We compute a macro-averaged win-rate and micro-averaged win-rate for Metric Match relative to random selection at thresholds of $t \in \{0.6, 0.7, 0.8\}$. Here, a win is when Metric Match produces a correct classification while random selection produces an incorrect classification. We average win rates across thresholds.
Table 2: Metric matched selection beats random selection for reliability classification task. Comparison of Macro (a) and Micro (b) win rates across different budgets. Macro win rates achieve an average of 0.652, while micro win rates achieve an average of 0.570.

(a) Macro Win Rate

| Budget | $\alpha$ | ICC | $\rho$ | $\tau$ |
| --- | --- | --- | --- | --- |
| 5 | 0.658 | 0.519 | 0.825 | 0.832 |
| 10 | 0.763 | 0.566 | 0.794 | 0.907 |
| 15 | 0.767 | 0.558 | 0.731 | 0.788 |
| 20 | 0.535 | 0.534 | 0.644 | 0.451 |
| 25 | 0.612 | 0.595 | 0.805 | 0.742 |
| 30 | 0.659 | 0.547 | 0.733 | 0.678 |
| 35 | 0.606 | 0.556 | 0.789 | 0.684 |
| 40 | 0.527 | 0.542 | 0.702 | 0.647 |
| 45 | 0.458 | 0.535 | 0.690 | 0.786 |
| 50 | 0.545 | 0.563 | 0.468 | 0.739 |
| Avg | 0.613 | 0.551 | 0.718 | 0.725 |

(b) Micro Win Rate

| Budget | $\alpha$ | ICC | $\rho$ | $\tau$ |
| --- | --- | --- | --- | --- |
| 5 | 0.585 | 0.507 | 0.648 | 0.700 |
| 10 | 0.603 | 0.526 | 0.654 | 0.724 |
| 15 | 0.598 | 0.521 | 0.599 | 0.618 |
| 20 | 0.496 | 0.512 | 0.579 | 0.504 |
| 25 | 0.542 | 0.547 | 0.646 | 0.582 |
| 30 | 0.581 | 0.517 | 0.574 | 0.608 |
| 35 | 0.579 | 0.502 | 0.586 | 0.586 |
| 40 | 0.473 | 0.473 | 0.602 | 0.617 |
| 45 | 0.468 | 0.483 | 0.528 | 0.744 |
| 50 | 0.484 | 0.487 | 0.503 | 0.682 |
| Avg | 0.541 | 0.508 | 0.592 | 0.637 |
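The win definition for the classification task can be sketched as follows. The function name, the "loss" and "tie" labels, and the example values are hypothetical. The text defines only what counts as a win, so how ties enter the reported win rates is left open here.

```python
from collections import Counter


def classification_outcome(mm_estimate, random_estimate, true_value, threshold):
    """Compare threshold decisions made from two estimates of the reliability metric.

    A "win" follows the text: Metric Match classifies correctly while random
    selection does not. "loss" and "tie" are our labels for the other cases.
    """
    truth = true_value > threshold
    mm_correct = (mm_estimate > threshold) == truth
    random_correct = (random_estimate > threshold) == truth
    if mm_correct and not random_correct:
        return "win"
    if random_correct and not mm_correct:
        return "loss"
    return "tie"


# Made-up estimates for one trial, evaluated at the three thresholds from the text.
thresholds = (0.6, 0.7, 0.8)
outcomes = [classification_outcome(0.74, 0.66, 0.72, t) for t in thresholds]
print(outcomes)
print(Counter(outcomes))
```

Only the middle threshold separates the two estimates in this example, which illustrates why gains in estimation error translate into classification wins mainly when the true reliability lies near the deployment threshold.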
## 5 Discussion
We develop a new method, Metric Match, for estimating the reliability of LLM judges using synthetic labels to best select a representative subset for human annotation. Empirically, Metric Match reduces estimation error by an average of 18.7% relative to random selection, outperforming baselines including bias correction and stratified sampling. This improved estimation ability reduces the number of annotations required to reach a target error by an average of 32.5%. In a case study on the MedVAL dataset, our cost model translates this reduction into savings of up to \$1,041.67. To characterize the consistency of these gains, we conduct a systematic win-rate comparison against random selection, the de facto approach in practice, observing an average win rate of 0.838 on estimation error when averaged across contexts and budgets. We provide theoretical justification for this estimation approach, proving that under rank alignment assumptions between human-model and inter-model metrics on subsets, metric matching outperforms random selection. Finally, we consider reliability classification as an alternative task to reliability estimation, in which a practitioner compares the estimated parameter against a deployment threshold to decide whether a judge is fit for use. For this task, Metric Match attains a win rate of 0.652 on the binary accept-versus-reject decision. Together, these results demonstrate that improved estimation of the population-level target parameter enables practitioners to make more accurate decisions using fewer human annotations.
While this work highlights the performance of metric matching for reliability estimation of correlation-based metrics, future work may engage with a wider set of metrics. We include preliminary analysis for mean squared error as a target parameter in Appendix [G](https://arxiv.org/html/2606.15029#A7), showing that metric matching outperforms random selection on non-correlation-based metrics, although with smaller annotation gains and absolute estimation error savings. Quantifying improvement a priori remains an open question. Reliability is use-case dependent, and this estimation approach should not be conflated with accuracy. Additionally, we do not address how to improve the judge score nor how to select the optimal judge from the ensemble. These features are assumed static in our set-up, but are more fluid in real-world implementation. Finally, this work does not address how to incorporate human feedback online to improve estimation actively. Integrating active learning approaches is likely to improve performance beyond our current batch setting, though this constitutes a different problem formulation.
More broadly, Metric Match offers a promising approach for practitioners to validate LLM judges more efficiently, reducing both the cost and time required for expert annotation. We hope that our work serves as a starting point for organizations to iterate faster on judge development, identify failure modes earlier, and detect misalignment with human preferences while achieving equivalent estimation accuracy with fewer labeled examples.
## 6 Acknowledgments
AU acknowledges support from the Stanford Graduate Fellowship and NSF GRFP\. MJ acknowledges partial support from a Stanford AI Lab Postdoctoral fellowship\. SK acknowledges support by NSF 2046795 and 2205329, IES R305C240046, ARPA\-H, the MacArthur Foundation, Schmidt Sciences, and HAI\. TH was supported by a grant by HAI, DSO labs, gifts from Open Philanthropy, Amazon, Schmidt Sciences, the Tianqiao and Chrissy Chen Foundation and a grant under the NSF CAREER IIS\-2338866, ONR N00014\-24\-1\-2609, and DARPA Cooperative Agreement HR00112520013\. This work does not necessarily reflect the position or policy of the government and no official endorsement should be inferred\.
## References
- \[1\]A\. Aali, V\. Bikia, M\. Varma, N\. Chiou, S\. Ostmeier, A\. Singhvi, M\. Paschali, A\. Kumar, A\. Johnston, K\. Amador\-Martinez, E\. J\. P\. Guerrero, P\. N\. C\. Rivera, S\. Gatidis, C\. Bluethgen, E\. P\. Reis, E\. D\. Z\. van Rilland, P\. L\. Hosamani, K\. R\. Keet, M\. Go, E\. Ling, D\. B\. Larson, C\. Langlotz, R\. Daneshjou, J\. Hom, S\. Koyejo, E\. Alsentzer, and A\. S\. Chaudhari\(2025\)MedVAL: toward expert\-level medical text validation with language models\.External Links:2507\.03152Cited by:[item 1](https://arxiv.org/html/2606.15029#S1.I1.i1.p1.1),[§1](https://arxiv.org/html/2606.15029#S1.p6.3),[§3\.3](https://arxiv.org/html/2606.15029#S3.SS3.SSS0.Px4.p1.3)\.
- \[2\]A\. N\. Angelopoulos, S\. Bates, C\. Fannjiang, M\. I\. Jordan, and T\. Zrnic\(2023\)Prediction\-powered inference\.Science382\(6671\),pp\. 669–674\.External Links:[Document](https://dx.doi.org/10.1126/science.adi6000),[Link](https://www.science.org/doi/abs/10.1126/science.adi6000),https://www\.science\.org/doi/pdf/10\.1126/science\.adi6000Cited by:[§1](https://arxiv.org/html/2606.15029#S1.p3.1),[§2\.3](https://arxiv.org/html/2606.15029#S2.SS3.p1.1)\.
- \[3\]Anthropic\(2024\)Claude 3\.5 sonnet model card addendum\.Technical reportAnthropic\.Note:Accessed: 2026\-05\-05External Links:[Link](https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf)Cited by:[§1](https://arxiv.org/html/2606.15029#S1.p6.3),[§3\.3](https://arxiv.org/html/2606.15029#S3.SS3.SSS0.Px5.p1.2)\.
- \[4\]R\. Artstein and M\. Poesio\(2008\-12\)Inter\-coder agreement for computational linguistics\.Comput\. Linguist\.34\(4\),pp\. 555–596\.External Links:ISSN 0891\-2017,[Link](https://doi.org/10.1162/coli.07-034-R2),[Document](https://dx.doi.org/10.1162/coli.07-034-R2)Cited by:[§2\.2](https://arxiv.org/html/2606.15029#S2.SS2.p1.3)\.
- \[5\]E\. Asgari, N\. Montaña\-Brown, M\. Dubois,et al\.\(2025\)A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation\.npj Digital Medicine8,pp\. 274\.External Links:[Document](https://dx.doi.org/10.1038/s41746-025-01670-7)Cited by:[§1](https://arxiv.org/html/2606.15029#S1.p1.1)\.
- \[6\]J\. Bai, S\. Bai, Y\. Chu, Z\. Cui, K\. Dang, X\. Deng, Y\. Fan, W\. Ge, Y\. Han, F\. Huang, B\. Hui, L\. Ji, M\. Li, J\. Lin, R\. Lin, D\. Liu, G\. Liu, C\. Lu, K\. Lu, J\. Ma, R\. Men, X\. Ren, X\. Ren, C\. Tan, S\. Tan, J\. Tu, P\. Wang, S\. Wang, W\. Wang, S\. Wu, B\. Xu, J\. Xu, A\. Yang, H\. Yang, J\. Yang, S\. Yang, Y\. Yao, B\. Yu, H\. Yuan, Z\. Yuan, J\. Zhang, X\. Zhang, Y\. Zhang, Z\. Zhang, C\. Zhou, J\. Zhou, X\. Zhou, and T\. Zhu\(2023\)Qwen technical report\.External Links:2309\.16609,[Link](https://arxiv.org/abs/2309.16609)Cited by:[§I\.1](https://arxiv.org/html/2606.15029#A9.SS1.p1.1)\.
- \[7\]S\. Bedi, H\. Cui, M\. Fuentes, A\. Unell, M\. Wornow, J\. M\. Banda, N\. Kotecha, T\. Keyes, Y\. Mai, M\. Oez, H\. Qiu, S\. Jain, L\. Schettini, M\. Kashyap, J\. A\. Fries, A\. Swaminathan, P\. Chung, F\. Nateghi, A\. Aali, A\. Nayak, S\. Vedak, S\. S\. Jain, B\. Patel, O\. Fayanju, S\. Shah, E\. Goh, D\. Yao, B\. Soetikno, E\. Reis, S\. Gatidis, V\. Divi, R\. Capasso, R\. Saralkar, C\. Chiang, J\. Jindal, T\. Pham, F\. Ghoddusi, S\. Lin, A\. S\. Chiou, C\. Hong, M\. Roy, M\. F\. Gensheimer, H\. Patel, K\. Schulman, D\. Dash, D\. Char, L\. Downing, F\. Grolleau, K\. Black, B\. Mieso, A\. Zahedivash, W\. Yim, H\. Sharma, T\. Lee, H\. Kirsch, J\. Lee, N\. Ambers, C\. Lugtu, A\. Sharma, B\. Mawji, A\. Alekseyev, V\. Zhou, V\. Kakkar, J\. Helzer, A\. Revri, Y\. Bannett, R\. Daneshjou, J\. Chen, E\. Alsentzer, K\. Morse, N\. Ravi, N\. Aghaeepour, V\. Kennedy, A\. Chaudhari, T\. Wang, S\. Koyejo, M\. P\. Lungren, E\. Horvitz, P\. Liang, M\. Pfeffer, and N\. H\. Shah\(2025\)MedHELM: holistic evaluation of large language models for medical tasks\.External Links:2505\.23802,[Link](https://arxiv.org/abs/2505.23802)Cited by:[Appendix A](https://arxiv.org/html/2606.15029#A1.SS0.SSS0.Px1.p4.7),[§1](https://arxiv.org/html/2606.15029#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.15029#S2.SS1.p1.1),[§3\.3](https://arxiv.org/html/2606.15029#S3.SS3.SSS0.Px3.p1.3)\.
- \[8\]S\. R\. Bowman, J\. Hyun, E\. Perez, E\. Chen, C\. Pettit, S\. Heiner, K\. Lukošiūtė, A\. Askell, A\. Jones, A\. Chen, A\. Goldie, A\. Mirhoseini, C\. McKinnon, C\. Olah, D\. Amodei, D\. Amodei, D\. Drain, D\. Li, E\. Tran\-Johnson, J\. Kernion, J\. Kerr, J\. Mueller, J\. Ladish, J\. Landau, K\. Ndousse, L\. Lovitt, N\. Elhage, N\. Schiefer, N\. Joseph, N\. Mercado, N\. DasSarma, R\. Larson, S\. McCandlish, S\. Kundu, S\. Johnston, S\. Kravec, S\. E\. Showk, S\. Fort, T\. Telleen\-Lawton, T\. Brown, T\. Henighan, T\. Hume, Y\. Bai, Z\. Hatfield\-Dodds, B\. Mann, and J\. Kaplan\(2022\)Measuring progress on scalable oversight for large language models\.External Links:2211\.03540,[Link](https://arxiv.org/abs/2211.03540)Cited by:[§2\.1](https://arxiv.org/html/2606.15029#S2.SS1.p1.1)\.
- \[9\]C\. Burns, P\. Izmailov, J\. H\. Kirchner, B\. Baker, L\. Gao, L\. Aschenbrenner, Y\. Chen, A\. Ecoffet, M\. Joglekar, J\. Leike, I\. Sutskever, and J\. Wu\(2024\)Weak\-to\-strong generalization: eliciting strong capabilities with weak supervision\.InProceedings of the 41st International Conference on Machine Learning,ICML’24\.Cited by:[§2\.1](https://arxiv.org/html/2606.15029#S2.SS1.p1.1)\.
- \[10\]A\. Celikyilmaz, E\. Clark, and J\. Gao\(2021\)Evaluation of text generation: a survey\.External Links:2006\.14799,[Link](https://arxiv.org/abs/2006.14799)Cited by:[§1](https://arxiv.org/html/2606.15029#S1.p1.1)\.
- \[11\]A\. T\. Chaganty, A\. Paranjape, P\. Liang, and C\. D\. Manning\(2017\)Importance sampling for unbiased on\-demand evaluation of knowledge base population\.InProceedings of the 2017 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),Cited by:[§2\.3](https://arxiv.org/html/2606.15029#S2.SS3.p1.1)\.
- \[12\]C\. Chhun, F\. M\. Suchanek, and C\. Clavel\(2024\)Do language models enjoy their own stories? Prompting large language models for automatic story evaluation\.Transactions of the Association for Computational Linguistics12,pp\. 1122–1142\.External Links:[Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00689),https://direct\.mit\.edu/tacl/article\-pdf/doi/10\.1162/tacl\_a\_00689/2470807/tacl\_a\_00689\.pdf,ISSN 2307\-387X,[Link](https://doi.org/10.1162/tacl%5C_a%5C_00689)Cited by:[§1](https://arxiv.org/html/2606.15029#S1.p6.3),[§3\.3](https://arxiv.org/html/2606.15029#S3.SS3.SSS0.Px4.p1.3)\.
- \[13\]C\. Chiang and H\. Lee\(2023\)Can large language models be an alternative to human evaluations?\.External Links:2305\.01937,[Link](https://arxiv.org/abs/2305.01937)Cited by:[§2\.2](https://arxiv.org/html/2606.15029#S2.SS2.p1.3),[§3\.3](https://arxiv.org/html/2606.15029#S3.SS3.SSS0.Px3.p1.3)\.
- \[14\]E\. Clark, T\. August, S\. Serrano, N\. Haduong, S\. Gururangan, and N\. A\. Smith\(2021\-08\)All that’s ‘human’ is not gold: evaluating human evaluation of generated text\.InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\),C\. Zong, F\. Xia, W\. Li, and R\. Navigli \(Eds\.\),Online,pp\. 7282–7296\.External Links:[Link](https://aclanthology.org/2021.acl-long.565/),[Document](https://dx.doi.org/10.18653/v1/2021.acl-long.565)Cited by:[§2\.1](https://arxiv.org/html/2606.15029#S2.SS1.p1.1)\.
- \[15\]W\. G\. Cochran\(1977\)Sampling techniques\.3 edition,John Wiley & Sons\.Cited by:[§1](https://arxiv.org/html/2606.15029#S1.p3.1),[§2\.3](https://arxiv.org/html/2606.15029#S2.SS3.p3.1)\.
- \[16\]G\. Comanici, E\. Bieber, M\. Schaekermann, I\. Pasupat, N\. Sachdeva, I\. Dhillon, M\. Blistein, O\. Ram, D\. Zhang, E\. Rosen, L\. Marris, S\. Petulla, C\. Gaffney, A\. Aharoni, N\. Lintz, T\. C\. Pais, H\. Jacobsson, I\. Szpektor, N\. Jiang, K\. Haridasan, A\. Omran, N\. Saunshi, D\. Bahri, G\. Mishra, E\. Chu, T\. Boyd, B\. Hekman, A\. Parisi, C\. Zhang, K\. Kawintiranon, T\. Bedrax\-Weiss, O\. Wang, Y\. Xu, O\. Purkiss, U\. Mendlovic, I\. Deutel, N\. Nguyen, A\. Langley, F\. Korn, L\. Rossazza, A\. Ramé, S\. Waghmare, H\. Miller, N\. Byrd, A\. Sheshan, R\. Hadsell, S\. Bhardwaj, P\. Janus, T\. Rissa, D\. Horgan, A\. Abdagic, L\. Belenki, J\. Allingham, A\. Singh, T\. Guidroz, S\. Srinivasan, H\. Schmit, K\. Chiafullo, A\. Elisseeff, N\. Jha, P\. Kolhar, L\. Berrada, F\. Ding, X\. Si, S\. B\. Mallick, F\. Och, S\. Erell, E\. Ni, T\. Latkar, S\. Yang, P\. Sirkovic, Z\. Feng, R\. Leland, R\. Hornung, G\. Wu, C\. Blundell, H\. Alvari, P\. Huang, C\. Yip, S\. Deur, L\. Liu, G\. Surita, P\. Duque, D\. Damen, J\. Jia, A\. Guez, M\. Mircea, A\. Sinha, A\. Magni, P\. Stradomski, T\. Marian, V\. Galić, W\. Chen, H\. Husain, A\. Singhal, D\. Grewe, F\. Aubet, S\. Song, L\. Blanco, L\. Rechis, L\. Ho, R\. Munoz, K\. Zheng, J\. Hamrick, K\. Mather, H\. Taitelbaum, E\. Rutherford, Y\. Lei, K\. Chen, A\. Shukla, E\. Moreira, E\. Doi, B\. Isik, N\. Shabat, D\. Rogozińska, K\. Kolipaka, J\. Chang, E\. Vušak, S\. Venkatachary, S\. Noghabi, T\. Bharti, Y\. Jun, A\. Zaks, S\. Green, J\. Challagundla, W\. Wong, M\. Mohammad, D\. Hirsch, Y\. Cheng, I\. Naim, L\. Proleev, D\. Vincent, A\. Singh, M\. Krikun, D\. Krishnan, Z\. Ghahramani, A\. Atias, R\. Aggarwal, C\. Kirov, D\. Vytiniotis, C\. Koh, A\. Chronopoulou, P\. Dogra, V\. Ion, G\. Tyen, J\. Lee, F\. Weissenberger, T\. Strohman, A\. Balakrishna, J\. Rae, M\. Velic, R\. de Liedekerke, O\. Elyada, W\. Yuan, C\. Liu, L\. Shani, S\. Kishchenko, B\. Alessio, Y\. Li, R\. Song, S\. Kwei, O\. Jankowski, A\. Pappu, Y\. 
Namiki, Y\. Ma, N\. Tripuraneni, C\. Cherry, M\. Ikonomidis, Y\. Ling, C\. Ji, B\. Westberg, A\. Wright, D\. Yu, D\. Parkinson, S\. Ramaswamy, J\. Connor, S\. H\. Yeganeh, S\. Grover, G\. Kenwright, L\. Litchev, C\. Apps, A\. Tomala, F\. Halim, A\. Castro\-Ros, Z\. Li, A\. Boral, P\. Sho, M\. Yarom, E\. Malmi, D\. Klinghoffer, R\. Lin, A\. Ansell, P\. K\. S, S\. Zhao, S\. Zuo, A\. Santoro, H\. Cheng, S\. Demmessie, Y\. Liu, N\. Brichtova, A\. Culp, N\. Braun, D\. Graur, W\. Ng, N\. Mehta, A\. Phillips, P\. Sundberg, V\. Godbole, F\. Liu, Y\. Katariya, D\. Rim, M\. Seyedhosseini, S\. Ammirati, J\. Valfridsson, M\. Malihi, T\. Knight, A\. Toor, T\. Lampe, A\. Ittycheriah, L\. Chiang, C\. Yeung, A\. Fréchette, J\. Rao, H\. Wang, H\. Srivastava, R\. Zhang, R\. Rhodes, A\. Brand, D\. Weesner, I\. Figotin, F\. Gimeno, R\. Fellinger, P\. Marcenac, J\. Leal, E\. Marcus, V\. Cotruta, R\. Cabrera, S\. Luo, D\. Garrette, V\. Axelrod, S\. Baltateanu, D\. Barker, D\. Chen, H\. Toma, B\. Ingram, J\. Riesa, C\. Kulkarni, Y\. Zhang, H\. Liu, C\. Wang, M\. Polacek, W\. Wu, K\. Hui, A\. N\. Reyes, Y\. Su, M\. Barnes, I\. Malhi, A\. Siddiqui, Q\. Feng, M\. Damaschin, D\. Pighin, A\. Steiner, S\. Yang, R\. S\. Boppana, S\. Ivanov, A\. Kandoor, A\. Shah, A\. Mujika, D\. Huang, C\. A\. Choquette\-Choo, M\. Patel, T\. Yu, T\. Creswell, Jerry, Liu, C\. Barros, Y\. Razeghi, A\. Roy, P\. Culliton, B\. Xiong, J\. Pan, T\. Strohmann, T\. Powell, B\. Seal, D\. DeCarlo, P\. Shyam, K\. Katircioglu, X\. Wang, C\. Hardin, I\. Odisho, J\. Broder, O\. Chang, A\. Nair, A\. Shtefan, M\. O’Brien, M\. Agarwal, S\. Potluri, S\. Goyal, A\. Jhindal, S\. Thakur, Y\. Stuken, J\. Lyon, K\. Toutanova, F\. Feng, A\. Wu, B\. Horn, A\. Wang, A\. Cullum, G\. Taubman, D\. Shrivastava, C\. Shi, H\. Tomlinson, R\. Patel, T\. Tu, A\. M\. Oflazer, F\. Pongetti, M\. Yang, A\. A\. Taïga, V\. Perot, N\. W\. Pierse, F\. Han, Y\. Drori, I\. Iturrate, A\. Chakrabarti, L\. Yeung, D\. Dopson, Y\. Chen, A\. Kulshreshtha, T\. 
Guo, P\. Pham, T\. Schuster, J\. Chen, A\. Polozov, J\. Xing, H\. Zhou, P\. Kacham, D\. Kukliansky, A\. Miech, S\. Yaroshenko, E\. Chi, S\. Douglas, H\. Fei, M\. Blondel, P\. Myla, L\. Madmoni, X\. Wu, D\. Keysers, K\. Kjems, I\. Albuquerque, L\. Yu, J\. D’sa, M\. Plantan, V\. Ionescu, J\. S\. Elias, A\. Gupta, M\. R\. Vuyyuru, F\. Alcober, T\. Zhou, K\. Ji, F\. Hartmann, S\. Puttagunta, H\. Song, E\. Amid, A\. Stefanoiu, A\. Lee, P\. Pucciarelli, E\. Wang, A\. Raul, S\. Petrov, I\. Tian, V\. Anklin, N\. Nti, V\. Gomes, M\. Schumacher, G\. Vesom, A\. Panagopoulos, K\. Bousmalis, D\. Andor, J\. Jacob, Y\. Zhang, B\. Rosgen, M\. Kecman, M\. Tung, A\. Belias, N\. Goodman, P\. Covington, B\. Wieder, N\. Saxena, E\. Davoodi, M\. Huang, S\. Maddineni, V\. Roulet, F\. Campbell\-Ajala, P\. G\. Sessa, Xintian, Wu, G\. Lai, P\. Collins, A\. Haig, V\. Sakenas, X\. Xu, M\. Giustina, L\. E\. Shafey, P\. Charoenpanit, S\. Garg, J\. Ainslie, B\. Severson, M\. G\. Arenas, S\. Pathak, S\. Rajayogam, J\. Feng, M\. Bakker, S\. Li, N\. Wichers, J\. Rogers, X\. Geng, Y\. Li, R\. Jagerman, C\. Jia, N\. Olmert, D\. Sharon, M\. Mauger, S\. Mariserla, H\. Ma, M\. Mohabey, K\. Kim, A\. Andreev, S\. Pollom, J\. Love, V\. Jain, P\. Agrawal, Y\. Schroecker, A\. Fortin, M\. Warmuth, J\. Liu, A\. Leach, I\. Blok, G\. P\. Girirajan, R\. Aharoni, B\. Uria, A\. Sozanschi, D\. Goldberg, L\. Ionita, M\. T\. Ribeiro, M\. Zlocha, V\. Birodkar, S\. Lachgar, L\. Yuan, H\. Choudhury, M\. Ginsberg, F\. Zheng, G\. Dibb, E\. Graves, S\. Lokhande, G\. Rasskin, G\. Muraru, C\. Quick, S\. Tata, P\. Sermanet, A\. Chawla, I\. Karo, Y\. Wang, S\. Zhang, O\. Keller, A\. Dragan, G\. Su, I\. Chou, X\. Liu, Y\. Tao, S\. Prabhakara, M\. Wilson, R\. Liu, S\. Wang, G\. Evans, D\. Du, A\. Castaño, G\. Prasad, M\. E\. Mahdy, S\. Gerlach, M\. Reid, J\. Kahn, A\. Zait, T\. S\. Pillai, T\. Ulrich, G\. Wang, J\. Wassenberg, E\. Farkash, K\. Yalasangi, C\. Wang, M\. Bauza, S\. Bucher, T\. Liu, J\. Yan, G\. Leung, V\. 
Sindhwani, P\. Barnes, A\. Singh, I\. Jurin, J\. Chang, N\. K\. Bhumihar, S\. Eiger, G\. Citovsky, B\. Withbroe, Z\. Li, S\. Xue, N\. D\. Santo, G\. Stoyanov, Y\. Raimond, S\. Zheng, Y\. Gao, V\. Listík, S\. Kwasiborski, R\. Saputro, A\. Ozturel, G\. Mallya, K\. Majmundar, R\. West, P\. Caron, J\. Wei, L\. Castrejon, S\. Vikram, D\. Ramachandran, N\. Dhawan, J\. Park, S\. Smoot, G\. van den Driessche, Y\. Blau, C\. Malik, W\. Liang, R\. Hirsch, C\. N\. dos Santos, E\. Weinstein, A\. van den Oord, S\. Lall, N\. FitzGerald, Z\. Jiang, X\. Yang, D\. Webster, A\. Elqursh, A\. Pope, G\. Rotival, D\. Raposo, W\. Zhu, J\. Dean, S\. Alabed, D\. Tran, A\. Gupta, Z\. Gleicher, J\. Austin, E\. Rosseel, M\. Umekar, D\. Das, Y\. Sun, K\. Chen, K\. Misiunas, X\. Zhou, Y\. Di, A\. Loo, J\. Newlan, B\. Li, V\. Ramasesh, Y\. Xu, A\. Chen, S\. Gandhe, R\. Soricut, N\. Gupta, S\. Hu, S\. El\-Sayed, X\. Garcia, I\. Brusilovsky, P\. Chen, A\. Bolt, L\. Huang, A\. Gurney, Z\. Zhang, A\. Pritzel, J\. Wilkiewicz, B\. Seybold, B\. K\. Shamanna, F\. Fischer, J\. Dean, K\. Gill, R\. Mcilroy, A\. Bhowmick, J\. Selier, A\. Yang, D\. Cheng, V\. Magay, J\. Tan, D\. Varma, C\. Walder, T\. Kocisky, R\. Nakashima, P\. Natsev, M\. Kwong, I\. Gog, C\. Zhang, S\. Dieleman, T\. Jimma, A\. Ryabtsev, S\. Brahma, D\. Steiner, D\. Du, A\. Žužul, M\. Žanić, M\. Raghavachari, W\. Gierke, Z\. Zheng, D\. Petrova, Y\. Dauphin, Y\. Liu, I\. Kessler, S\. Hand, C\. Duvarney, S\. Kim, H\. Lee, L\. Hussenot, J\. Hui, J\. Smith, D\. Jain, J\. Xia, G\. S\. Tomar, K\. Amiri, D\. Phan, F\. Fuchs, T\. Weyand, N\. Tomasev, A\. Cordell, X\. Liu, J\. Mallinson, P\. Joshi, A\. Crawford, A\. Suggala, S\. Chien, N\. Fernando, M\. Sanchez\-Vargas, D\. Williams, P\. Crone, X\. Luo, I\. Karpov, J\. Shan, T\. Thurk, R\. Strudel, P\. Voigtlaender, P\. Patil, T\. Dozat, A\. Khodaei, S\. Singla, P\. Ambroszczyk, Q\. Wu, Y\. Chang, B\. Roark, C\. Hegde, T\. Ding, A\. Filos, Z\. Wu, A\. S\. Pinto, S\. Liu, S\. Khanna, A\. Pandey, S\. 
Mcloughlin, Q\. Li, S\. Haves, A\. Zhou, E\. Buchatskaya, I\. Leal, P\. de Boursac, N\. Akazawa, N\. Anderson, T\. Chen, K\. Somandepalli, C\. Liang, S\. Goenka, S\. Winkler, A\. Grushetsky, Y\. Ding, J\. Smith, F\. Ye, J\. Pont\-Tuset, E\. Li, R\. Li, T\. Golany, D\. Wegner, T\. Jiang, O\. Barak, Y\. Shangguan, E\. Vértes, R\. Wong, J\. Bornschein, A\. Tudor, M\. Bevilacqua, T\. Schaul, A\. S\. Rawat, Y\. Zhao, K\. Axiotis, L\. Meng, C\. McLean, J\. Lai, J\. Beattie, N\. Kushman, Y\. Liu, B\. Kutzman, F\. Lang, J\. Ye, P\. Netrapalli, P\. Mishra, M\. Khan, M\. Goel, R\. Willoughby, D\. Tian, H\. Zhuang, J\. Chen, Z\. Tsai, T\. Kementsietsidis, A\. Khare, J\. Keeling, K\. Xu, N\. Waters, F\. Altché, A\. Popat, B\. Mittal, D\. Saxton, D\. E\. Badawy, M\. Mathieu, Z\. Zheng, H\. Zhou, N\. Ranka, R\. Shin, Q\. Duan, T\. Salimans, I\. Mihailescu, U\. Shaham, M\. Chang, Y\. Assael, N\. Dikkala, M\. Izzard, V\. Cohen\-Addad, C\. Graves, V\. Feinberg, G\. Chung, D\. Strouse, D\. Karmon, S\. Sharifzadeh, Z\. Ashwood, K\. Pham, J\. Blanton, A\. Vasiloff, J\. Barber, M\. Geller, A\. Zhou, F\. Zubach, T\. Huang, L\. Zhang, H\. Gupta, M\. Young, J\. Proskurnia, R\. Votel, V\. Gabeur, G\. Barcik, A\. Tripathi, H\. Yu, G\. Yan, B\. Changpinyo, F\. Pavetić, A\. Coyle, Y\. Fujii, J\. G\. Mendez, T\. Zhou, H\. Rajamani, B\. Hechtman, E\. Cao, D\. Juan, Y\. Tan, V\. Dalibard, Y\. Du, N\. Clay, K\. Yao, W\. Jia, D\. Vijaykumar, Y\. Zhou, X\. Bai, W\. Hung, S\. Pecht, G\. Todorov, N\. Khadke, P\. Gupta, P\. Lahoti, A\. Autef, K\. Duddu, J\. Lee\-Thorp, A\. Bykovsky, T\. Misiunas, S\. Flennerhag, S\. Thangaraj, J\. McGiffin, Z\. Nado, M\. Kunesch, A\. Noever, A\. Hertz, M\. Liang, V\. Stone, E\. Palmer, S\. Daruki, A\. Pramanik, S\. Põder, A\. Kyker, M\. Khan, E\. Sluzhaev, M\. Ritter, A\. Ruderman, W\. Zhou, C\. Nagpal, K\. Vodrahalli, G\. Necula, P\. Barham, E\. Pavlick, J\. Hartford, I\. Shafran, L\. Zhao, M\. Mikuła, T\. Eccles, H\. Shimokawa, K\. Garg, L\. Vilnis, H\. Chen, I\. 
Shumailov, K\. Lee, A\. Abdelhamed, M\. Xie, V\. Cohen, E\. Hlavnova, D\. Malkin, C\. Sitawarin, J\. Lottes, P\. Coquinot, T\. Yu, S\. Kumar, J\. Zhang, A\. Mahendru, Z\. Ahmed, J\. Martens, T\. Chen, A\. Boag, D\. Peng, C\. Devin, A\. Klimovskiy, M\. Phuong, D\. Vainstein, J\. Xie, B\. Ramabhadran, N\. Howard, X\. Yu, G\. Goswami, J\. Cui, S\. Shleifer, M\. Pinto, C\. Yeh, M\. Yang, S\. Javanmardi, D\. Ethier, C\. Lee, J\. Orbay, S\. Kotecha, C\. Bromberg, P\. Shaw, J\. Thornton, A\. G\. Rosenthal, S\. Gu, M\. Thomas, I\. Gemp, A\. Ayyar, A\. Ushio, A\. Selvan, J\. Wee, C\. Liu, M\. Majzoubi, W\. Yu, J\. Abernethy, T\. Liechty, R\. Pan, H\. Nguyen, Qiong, Hu, S\. Perrin, A\. Arora, E\. Pitler, W\. Wang, K\. Shivakumar, F\. Prost, B\. Limonchik, J\. Wang, Y\. Gao, T\. Cour, S\. Buch, H\. Gui, M\. Ivanova, P\. Neubeck, K\. Chan, L\. Kim, H\. Chen, N\. Goyal, D\. Chung, L\. Liu, Y\. Su, A\. Petrushkina, J\. Shen, A\. Joulin, Y\. Xu, S\. X\. Lin, Y\. Kulizhskaya, C\. Chelba, S\. Vasudevan, E\. Collins, V\. Bashlovkina, T\. Lu, D\. Fritz, J\. Park, Y\. Zhou, C\. Su, R\. Tanburn, M\. Sushkov, M\. Rasquinha, J\. Li, J\. Prendki, Y\. Li, P\. LV, S\. Sharma, H\. Fitoussi, H\. Huang, A\. Dai, P\. Dao, M\. Burrows, H\. Prior, D\. Qin, G\. Pundak, L\. L\. Sjoesund, A\. Khurshudov, Z\. Zhu, A\. Webson, E\. Kemp, T\. Tan, S\. Agrawal, S\. Sargsyan, L\. Cheng, J\. Stephan, T\. Kwiatkowski, D\. Reid, A\. Byravan, A\. H\. Michaely, N\. Heess, L\. Zhou, S\. Goenka, V\. Carpenter, A\. Levskaya, B\. Wang, R\. Roberts, R\. Leblond, S\. Chikkerur, S\. Ginzburg, M\. Chang, R\. Riachi, Chuqiao, Xu, Z\. Borsos, M\. Pliskin, J\. Pawar, M\. Lustman, H\. Kirkwood, A\. Anand, A\. Chaudhary, N\. Kalb, K\. Milan, S\. Augenstein, A\. Goldie, L\. Prince, K\. Raman, Y\. Sun, V\. Xia, A\. Cohen, Z\. Huo, J\. Camp, S\. Ellis, L\. Zilka, D\. V\. Torres, L\. Patel, S\. Arora, B\. Chan, J\. Adler, K\. Ayoub, J\. Liang, F\. Jamil, J\. Jiang, S\. Baumgartner, H\. Sun, Y\. Karov, Y\. Akulov, H\. 
Zheng, I\. Cai, C\. Fantacci, J\. Rubin, A\. R\. Acha, M\. Wang, N\. D’Souza, R\. Sathyanarayana, S\. Dai, S\. Rowe, A\. Simanovsky, O\. Goldman, Y\. Kuang, X\. Pan, A\. Rosenberg, T\. Rojas\-Esponda, P\. Dutta, A\. Zeng, I\. Jurenka, G\. Farquhar, Y\. Bansal, S\. Iqbal, B\. Roelofs, G\. Joung, P\. Beak, C\. Ryu, R\. Poplin, Y\. Wu, J\. Alayrac, S\. Buthpitiya, O\. Ronneberger, C\. Habtegebriel, W\. Li, P\. Cavallaro, A\. Wei, G\. Bensky, T\. Denk, H\. Ganapathy, J\. Stanway, P\. Joshi, F\. Bertolini, J\. Lo, O\. Ma, Z\. Charles, G\. Sampemane, H\. Sahni, X\. Chen, H\. Askham, D\. Gaddy, P\. Young, J\. Tan, M\. Eyal, A\. Bražinskas, L\. Zhong, Z\. Wu, M\. Epstein, K\. Bailey, A\. Hard, K\. Lee, S\. Goldshtein, A\. Ruiz, M\. Badawi, M\. Lochbrunner, J\. Kearns, A\. Brown, F\. Pardo, T\. Weber, H\. Yang, P\. Jiang, B\. Akin, Z\. Fu, M\. Wainwright, C\. Zou, M\. Gaba, P\. Manzagol, W\. Kan, Y\. Song, K\. Zainullina, R\. Lin, J\. Ko, S\. Deshmukh, A\. Jindal, J\. Svensson, D\. Tyam, H\. Zhao, C\. Kaeser\-Chen, S\. Baird, P\. Moradi, J\. Hall, Q\. Guo, V\. Tsang, B\. Liang, F\. Pereira, S\. Ganesh, I\. Korotkov, J\. Adamek, S\. Thiagarajan, V\. Tran, C\. Chen, C\. Tar, S\. Jain, I\. Dasgupta, T\. Bilal, D\. Reitter, K\. Zhao, G\. Vezzani, Y\. Gehman, P\. Mehta, L\. Beltrone, X\. Dotiwalla, S\. Guadarrama, Z\. Abbas, S\. Karp, P\. Georgiev, C\. Ferng, M\. Brockschmidt, L\. Peng, C\. Hirnschall, V\. Verma, Y\. Bi, Y\. Xiao, A\. Dabush, K\. Xu, P\. Wallis, R\. Parker, Q\. Wang, Y\. Xu, I\. Safarli, D\. Tewari, Y\. Zhang, S\. Kim, A\. Gesmundo, M\. Thomas, S\. Levi, A\. Chowdhury, K\. Rao, P\. Garst, S\. Conway\-Rahman, H\. Ran, K\. McKinney, Z\. Xiao, W\. Yu, R\. Agrawal, A\. Stjerngren, C\. Ionescu, J\. Chen, V\. Sharma, J\. Chiu, F\. Liu, K\. Franko, C\. Sanford, X\. Cai, P\. Michel, S\. Ganapathy, J\. Labanowski, Z\. Garrett, B\. Vargas, S\. Sun, B\. Gale, T\. Buschmann, G\. Desjardins, N\. Ghelani, P\. Jain, M\. Verma, C\. Asawaroengchai, J\. Eisenschlos, J\. 
Harlalka, H\. Kazawa, D\. Metzler, J\. Howland, Y\. Jian, J\. Ades, V\. Shah, T\. Gangwani, S\. Lee, R\. Ring, S\. M\. Hernandez, D\. Reich, A\. Sinha, A\. Sathe, J\. Kovac, A\. Gill, A\. Kannan, A\. D’olimpio, M\. Sevenich, J\. Whang, B\. Kim, K\. C\. Sim, J\. Chen, J\. Zhang, S\. Lall, Y\. Matias, B\. Jia, A\. Friesen, S\. Nasso, A\. Thapliyal, B\. Perozzi, T\. Yu, A\. Shekhawat, S\. Huda, P\. Grabowski, E\. Wang, A\. Sreevatsa, H\. Dib, M\. Hassen, P\. Schuh, V\. Milutinovic, C\. Welty, M\. Quinn, A\. Shah, B\. Wang, G\. Barth\-Maron, J\. Frye, N\. Axelsson, T\. Zhu, Y\. Ma, I\. Giannoumis, H\. Sedghi, C\. Ye, Y\. Luan, K\. Aydin, B\. Chandra, V\. Sampathkumar, R\. Huang, V\. Lavrenko, A\. Eleryan, Z\. Hong, S\. Hansen, S\. M\. Carthy, B\. Samanta, D\. Ćevid, X\. Wang, F\. Li, M\. Voznesensky, M\. Hoffman, A\. Terzis, V\. Sehwag, G\. Fidel, L\. He, M\. Cai, Y\. He, A\. Feng, M\. Nikoltchev, S\. Phatale, J\. Chase, R\. Lawton, M\. Zhang, T\. Ouyang, M\. Tragut, M\. H\. Manshadi, A\. Narayanan, J\. Shen, X\. Gao, T\. Bolukbasi, N\. Roy, X\. Li, D\. Golovin, L\. Panait, Z\. Qin, G\. Han, T\. Anthony, S\. Kudugunta, V\. Patraucean, A\. Ray, X\. Chen, X\. Yang, T\. Bhatia, P\. Talluri, A\. Morris, A\. Ražnatović, B\. Brownfield, J\. An, S\. Peng, P\. Kane, C\. Zheng, N\. Duduta, J\. Kessinger, J\. Noraky, S\. Liu, K\. Rong, P\. Veličković, K\. Rush, A\. Goldin, F\. Wei, S\. M\. R\. Garlapati, C\. Pantofaru, O\. Kwon, J\. Ni, E\. Noland, J\. D\. Trapani, F\. Beaufays, A\. G\. Roy, Y\. Chow, A\. Turker, G\. Cideron, L\. Mei, J\. Clark, Q\. Dou, M\. Bošnjak, R\. Leith, Y\. Du, A\. Yazdanbakhsh, M\. Nasr, C\. Kwak, S\. S\. Sheth, A\. Kaskasoli, A\. Anand, B\. Lakshminarayanan, S\. Jerome, D\. Bieber, C\. Chu, A\. Senges, T\. Shen, M\. Sridhar, N\. Ndebele, B\. Beyret, S\. Mohamed, M\. Chen, M\. Freitag, J\. Guo, L\. Liu, P\. Roit, H\. Chen, S\. Yan, T\. Stone, J\. Co\-Reyes, J\. Cole, S\. Scellato, S\. Azizi, H\. Hashemi, A\. Jin, A\. Iyer, M\. Valentine, A\. György, A\. 
Ahuja, D\. H\. Diaz, C\. Lee, N\. Clement, W\. Kong, D\. Garmon, I\. Watts, K\. Bhatia, K\. Gupta, M\. Miecnikowski, H\. Vallet, A\. Taly, E\. Loper, S\. Joshi, J\. Atwood, J\. Chick, M\. Collier, F\. Iliopoulos, R\. Trostle, B\. Gunel, R\. Leal\-Cavazos, A\. M\. Hrafnkelsson, M\. Guzman, X\. Ju, A\. Forbes, J\. Emond, K\. Chauhan, B\. Caine, L\. Xiao, W\. Zeng, A\. Moufarek, D\. Murphy, M\. Meng, N\. Gupta, F\. Riedel, A\. Das, E\. Lawal, S\. Narayan, T\. Sosea, J\. Swirhun, L\. Friso, B\. Neyshabur, J\. Lu, S\. Girgin, M\. Wunder, E\. Yvinec, A\. Pyne, V\. Carbune, S\. Rijhwani, Y\. Guo, T\. Doshi, A\. Briukhov, M\. Bain, A\. Hitron, X\. Wang, A\. Gupta, K\. Chen, C\. Du, W\. Zhang, D\. Shah, A\. Akula, M\. Dylla, A\. Kachra, W\. Kuo, T\. Zou, L\. Wang, L\. Xu, J\. Zhu, J\. Snyder, S\. Menon, O\. Firat, I\. Mordatch, Y\. Yuan, N\. Ponomareva, R\. Blevins, L\. Moore, W\. Wang, P\. Chen, M\. Scholz, A\. Dwornik, J\. Lin, S\. Li, D\. Antognini, T\. I, X\. Song, M\. Miller, U\. Kalra, A\. Raveret, O\. Akerlund, F\. Wu, A\. Nystrom, N\. Godbole, T\. Liu, H\. DeBalsi, J\. Zhao, B\. Liu, A\. Caciularu, L\. Lax, U\. Khandelwal, V\. Langston, E\. Bailey, S\. Lattanzi, Y\. Wang, N\. Kovelamudi, S\. Mondal, G\. Guruganesh, N\. Hua, O\. Roval, P\. Wesołowski, R\. Ingale, J\. Halcrow, T\. Sohn, C\. Angermueller, B\. Raad, E\. Stickgold, E\. Lu, A\. Kosik, J\. Xie, T\. Lillicrap, A\. Huang, L\. L\. Zhang, D\. Paulus, C\. Farabet, A\. Wertheim, B\. Wang, R\. Joshi, C\. Ko, Y\. Wu, S\. Agrawal, L\. Lin, X\. Sheng, P\. Sung, T\. Breland\-King, C\. Butterfield, S\. Gawde, S\. Singh, Q\. Zhang, R\. Apte, S\. Shetty, A\. Hutter, T\. Li, E\. Salesky, F\. Lebron, J\. Kanerva, M\. Paganini, A\. Nguyen, R\. Vallu, J\. Peter, S\. Velury, D\. Kao, J\. Hoover, A\. Bortsova, C\. Bishop, S\. Jakobovits, A\. Agostini, A\. Agarwal, C\. Liu, C\. Kwong, S\. Tavakkol, I\. Bica, A\. Greve, A\. GP, J\. Marcus, L\. Hou, T\. Duerig, R\. Moroshko, D\. Lacey, A\. Davis, J\. Amelot, G\. Wang, F\. 
Kim, T\. Strinopoulos, H\. Wan, C\. L\. Lan, S\. Krishnan, H\. Tang, P\. Humphreys, J\. Bai, I\. H\. Shtacher, D\. Machado, C\. Pang, K\. Burke, D\. Liu, R\. Aravamudhan, Y\. Song, E\. Hirst, A\. Singh, B\. Jou, L\. Bai, F\. Piccinno, C\. K\. Fu, R\. Alazard, B\. Meiri, D\. Winter, C\. Chen, M\. Zhang, J\. Heitkaemper, J\. Lambert, J\. Lee, A\. Frömmgen, S\. Rogulenko, P\. Nair, P\. Niemczyk, A\. Bulyenov, B\. Xu, H\. Shemtov, M\. Zadimoghaddam, S\. Toropov, M\. Wirth, H\. Dai, S\. Gollapudi, D\. Zheng, A\. Kurakin, C\. Lee, K\. Bullard, N\. Serrano, I\. Balazevic, Y\. Li, J\. Schalkwyk, M\. Murphy, M\. Zhang, K\. Sequeira, R\. Datta, N\. Agrawal, C\. Sutton, N\. Attaluri, M\. Chiang, W\. Farhan, G\. Thornton, K\. Lin, T\. Choma, H\. Nguyen, K\. Dasgupta, D\. Robinson, I\. Comşa, M\. Riley, A\. Pillai, B\. Mustafa, B\. Golan, A\. Zandieh, J\. Lespiau, B\. Porter, D\. Ross, S\. Rajayogam, M\. Agarwal, S\. Venugopalan, B\. Shahriari, Q\. Yan, H\. Xu, T\. Tobin, P\. Dubov, H\. Shi, A\. Recasens, A\. Kovsharov, S\. Borgeaud, L\. Dery, S\. Vasanth, E\. Gribovskaya, L\. Qiu, M\. Mahdieh, W\. Skut, E\. Nielsen, C\. Zheng, A\. Yu, C\. G\. Bostock, S\. Gupta, A\. Archer, C\. Rawles, E\. Davies, A\. Svyatkovskiy, T\. Tsai, Y\. Halpern, C\. Reisswig, B\. Wydrowski, B\. Chang, J\. Puigcerver, M\. H\. Taege, J\. Li, E\. Schnider, X\. Li, D\. Dena, Y\. Xu, U\. Telang, T\. Shi, H\. Zen, K\. Kastner, Y\. Ko, N\. Subramaniam, A\. Kumar, P\. Blois, Z\. Dai, J\. Wieting, Y\. Lu, Y\. Zeldes, T\. Xie, A\. Hauth, A\. Ţifrea, Y\. Li, S\. El\-Husseini, D\. Abolafia, H\. Zhou, W\. Ding, S\. Ghalebikesabi, C\. Guía, A\. Maksai, Á\. Weisz, S\. Arik, N\. Sukhanov, A\. Świetlik, X\. Jia, L\. Yu, W\. Wang, M\. Brand, D\. Bloxwich, S\. Kirmani, Z\. Chen, A\. Go, P\. Sprechmann, N\. Kannen, A\. Carin, P\. Sandhu, I\. Edkins, L\. Nooteboom, J\. Gupta, L\. Maggiore, J\. Azizi, Y\. Pritch, P\. Yin, M\. Gupta, D\. Tarlow, D\. Smith, D\. Ivanov, M\. Babaeizadeh, A\. Goel, S\. Kambala, G\. Chu, M\. 
Kastelic, M\. Liu, H\. Soltau, A\. Stone, S\. Agrawal, M\. Kim, K\. Soparkar, S\. Tadepalli, O\. Bunyan, R\. Soh, A\. Kannan, D\. Kim, B\. J\. Chen, A\. Halumi, S\. Roy, Y\. Wang, O\. Sercinoglu, G\. Gibson, S\. Bhatnagar, M\. Sano, D\. von Dincklage, Q\. Ren, B\. Mitrevski, M\. Olšák, J\. She, C\. Doersch, Jilei, Wang, B\. Liu, Q\. Tan, T\. Yakar, T\. Warkentin, A\. Ramirez, C\. Lebsack, J\. Dillon, R\. Mathews, T\. Cobley, Z\. Wu, Z\. Chen, J\. Simon, S\. Nath, T\. Sainath, A\. Bendebury, R\. Julian, B\. Mankalale, D\. Ćurko, P\. Zacchello, A\. R\. Brown, K\. Sodhia, H\. Howard, S\. Caelles, A\. Gupta, G\. Evans, A\. Bulanova, L\. Katzen, R\. Goldenberg, A\. Tsitsulin, J\. Stanton, B\. Schillings, V\. Kovalev, C\. Fry, R\. Shah, K\. Lin, S\. Upadhyay, C\. Li, S\. Radpour, M\. Maggioni, J\. Xiong, L\. Haas, J\. Brennan, A\. Kamath, N\. Savinov, A\. Nagrani, T\. Yacovone, R\. Kappedal, K\. Andriopoulos, L\. Lao, Y\. Li, G\. Rozhdestvenskiy, K\. Hashimoto, A\. Audibert, S\. Austin, D\. Rodriguez, A\. Ruoss, G\. Honke, D\. Karkhanis, X\. Xiong, Q\. Wei, J\. Huang, Z\. Leng, V\. Premachandran, S\. Bileschi, G\. Evangelopoulos, T\. Mensink, J\. Pavagadhi, D\. Teplyashin, P\. Chang, L\. Xue, G\. Tanzer, S\. Goldman, K\. Patel, S\. Li, J\. Wiesner, I\. Zheng, I\. Stewart\-Binks, J\. Han, Z\. Li, L\. Luo, K\. Lenc, M\. Lučić, F\. Xue, R\. Mullins, A\. Guseynov, C\. Chang, I\. Galatzer\-Levy, A\. Zhang, G\. Bingham, G\. Hu, A\. Hartman, Y\. Ma, J\. Griffith, A\. Irpan, C\. Radebaugh, S\. Yue, L\. Fan, V\. Ungureanu, C\. Sorokin, H\. Teufel, P\. Li, R\. Anil, D\. Paparas, T\. Wang, C\. Lin, H\. Peng, M\. Shum, G\. Petrovic, D\. Brady, R\. Nguyen, K\. Macherey, Z\. Li, H\. Singh, M\. Yenugula, M\. Iinuma, X\. Chen, K\. Kopparapu, A\. Stern, S\. Dave, C\. Thekkath, F\. Perot, A\. Kumar, F\. Li, Y\. Xiao, M\. Bilotti, M\. H\. Bateni, I\. Noble, L\. Lee, A\. Vázquez\-Reina, J\. Salazar, X\. Yang, B\. Wang, E\. Gruzewska, A\. Rao, S\. Raghuram, Z\. Xu, E\. Ben\-David, J\. 
Mei, S\. Dalmia, Z\. Zhang, Y\. Liu, G\. Bansal, H\. Pankov, S\. Schwarcz, A\. Burns, C\. Chan, S\. Sanghai, R\. Liang, E\. Liang, A\. He, A\. Stuart, A\. Narayanan, Y\. Zhu, C\. Frank, B\. Fatemi, A\. Sabne, O\. Lang, I\. Bhattacharya, S\. Settle, M\. Wang, B\. McMahan, A\. Tacchetti, L\. B\. Soares, M\. Hadian, S\. Cabi, T\. Chung, N\. Putikhin, G\. Li, J\. Chen, A\. Tarango, H\. Michalewski, M\. Kazemi, H\. Masoom, H\. Sheftel, R\. Shivanna, A\. Vadali, R\. Comanescu, D\. Reid, J\. Moore, A\. Neelakantan, M\. Sander, J\. Herzig, A\. Rosenberg, M\. Dehghani, J\. Choi, M\. Fink, R\. Hayes, E\. Ge, S\. Weng, C\. Ho, J\. Karro, K\. Krishna, L\. N\. Thiet, A\. Skerry\-Ryan, D\. Eppens, M\. Andreetto, N\. Sarma, S\. Bonacina, B\. K\. Ayan, M\. Nawhal, Z\. Shan, M\. Dusenberry, S\. Thakoor, S\. Gubbi, D\. D\. Nguyen, R\. Tsarfaty, S\. Albanie, J\. Mitrović, M\. Gandhi, B\. Chen, A\. Epasto, G\. Stephanov, Y\. Jin, S\. Gehman, A\. Amini, J\. Weber, F\. Behbahani, S\. Xu, M\. Allamanis, X\. Chen, M\. Ott, C\. Sha, M\. Jastrzebski, H\. Qi, D\. Greene, X\. Wu, A\. Toki, D\. Vlasic, J\. Shapiro, R\. Kotikalapudi, Z\. Shen, T\. Saeki, S\. Xie, A\. Cassirer, S\. Bharadwaj, T\. Kiyono, S\. Bhojanapalli, E\. Rosenfeld, S\. Ritter, J\. Mao, J\. G\. Oliveira, Z\. Egyed, B\. Bandemer, E\. Parisotto, K\. Kinoshita, J\. Pluto, P\. Maniatis, S\. Li, Y\. Guo, G\. Ghiasi, J\. Tarbouriech, S\. Chatterjee, J\. Jin, Katrina, Xu, J\. Palomaki, S\. Arnold, M\. Sewak, F\. Piccinini, M\. Sharma, B\. Albrecht, S\. Purser\-haskell, A\. Vaswani, C\. Chen, M\. Wisniewski, Q\. Cao, J\. Aslanides, N\. M\. Phu, M\. Sieb, L\. Agubuzu, A\. Zheng, D\. Sohn, M\. Selvi, A\. Andreassen, K\. Subudhi, P\. Eruvbetine, O\. Woodman, T\. Mery, S\. Krause, X\. Ren, X\. Ma, J\. Luo, D\. Chen, W\. Fan, H\. Griffiths, C\. Schuler, A\. Li, S\. Zhang, J\. Sarr, S\. Luo, R\. Patana, M\. Watson, D\. Naboulsi, M\. Collins, S\. Sidhwani, E\. Hoogeboom, S\. Silver, E\. Caveness, X\. Zhao, M\. Rodriguez, M\. Deines, L\. 
Bai, P\. Griffin, M\. Tagliasacchi, E\. Xue, S\. R\. Babbula, B\. Pang, N\. Ding, G\. Shen, E\. Peake, R\. Crocker, S\. S\. Raghvendra, D\. Swisher, W\. Han, R\. Singh, L\. Wu, V\. Pchelin, T\. Munkhdalai, D\. Alon, G\. Bacon, E\. Robles, J\. Bulian, M\. Johnson, G\. Powell, F\. T\. Ferreira, Y\. Li, F\. Benzing, M\. Velimirović, H\. Soyer, W\. Kong, Tony, Nguyên, Z\. Yang, J\. Liu, J\. van Amersfoort, D\. Gillick, B\. Sun, N\. Rauschmayr, K\. Zhang, S\. Zhan, T\. Zhou, A\. Frolov, C\. Yang, D\. Vnukov, L\. Rouillard, H\. Li, A\. Mandhane, N\. Fallen, R\. Venkataraman, C\. H\. Hu, J\. Brennan, J\. Lee, J\. Chang, M\. Sundermeyer, Z\. Pan, R\. Ke, S\. Tong, A\. Fabrikant, W\. Bono, J\. Gu, R\. Foley, Y\. Mao, M\. Delakis, D\. Bhaswar, R\. Frostig, N\. Li, A\. Zipori, C\. Hope, O\. Kozlova, S\. Mishra, J\. Djolonga, C\. Schiff, M\. A\. Merey, E\. Briakou, P\. Morgan, A\. Wan, A\. Hassidim, R\. Skerry\-Ryan, K\. Sengupta, M\. Jasarevic, P\. Kallakuri, P\. Kunkle, H\. Brennan, T\. Lieber, H\. Mansoor, J\. Walker, B\. Zhang, A\. Xie, G\. Žužić, A\. Chukwuka, A\. Druinsky, D\. Cho, R\. Yao, F\. Naeem, S\. Butt, E\. Kim, Z\. Jia, M\. Jordan, A\. Lelkes, M\. Kurzeja, S\. Wang, J\. Zhao, A\. Over, A\. Chakladar, M\. Prasetya, N\. Jha, S\. Ganapathy, Y\. Cong, P\. Shroff, C\. Saroufim, S\. Miryoosefi, M\. Hammad, T\. Nasir, W\. Xi, Y\. Gao, Y\. Maeng, B\. Hora, C\. Cheng, P\. Haghani, Y\. Lewenberg, C\. Lu, M\. Matysiak, N\. Raisinghani, H\. Wang, L\. Baugher, R\. Sukthankar, M\. Giang, J\. Schultz, N\. Fiedel, M\. Chen, C\. Lee, T\. Dey, H\. Zheng, S\. Paul, C\. Smith, A\. Ly, Y\. Wang, R\. Bansal, B\. Perz, S\. Ricco, S\. Blank, V\. Keshava, D\. Sharma, M\. Chow, K\. Lad, K\. Jalan, S\. Osindero, C\. Swanson, J\. Scott, A\. Ilić, X\. Li, S\. R\. Jonnalagadda, A\. S\. Soudagar, Y\. Xiong, B\. Batsaikhan, D\. Jarrett, N\. Kumar, M\. Shah, M\. Lawlor, A\. Waters, M\. Graham, R\. May, S\. Ramos, S\. Lefdal, Z\. Cankara, N\. Cano, B\. O’Donoghue, J\. Borovik, F\. Liu, J\. 
Grimstad, M\. Alnahlawi, K\. Tsihlas, T\. Hudson, N\. Grigorev, Y\. Jia, T\. Huang, T\. P\. Igwe, S\. Lebedev, X\. Tang, I\. Krivokon, F\. Garcia, M\. Tan, E\. Jia, P\. Stys, S\. Vashishth, Y\. Liang, B\. Venkatraman, C\. Gu, A\. Kementsietsidis, C\. Zhu, J\. Jung, Y\. Bai, M\. J\. Hosseini, F\. Ahmed, A\. Gupta, X\. Yuan, S\. Ashraf, S\. Nigam, G\. Vasudevan, P\. Awasthi, A\. M\. Gilady, Z\. Mariet, R\. Eskander, H\. Li, H\. Hu, G\. Garrido, P\. Schlattner, G\. Zhang, R\. Saxena, P\. Dević, K\. Muralidharan, A\. Murthy, Y\. Zhou, M\. Choi, A\. Wongpanich, Z\. Wang, P\. Shah, Y\. Xu, Y\. Huang, S\. Spencer, A\. Chen, J\. Cohan, J\. Wang, J\. Tompson, J\. Wu, R\. Haroun, H\. Li, B\. Huergo, F\. Yang, T\. Yin, J\. Wendt, M\. Bendersky, R\. Chaabouni, J\. Snaider, J\. Ferret, A\. Jindal, T\. Thompson, A\. Xue, W\. Bishop, S\. M\. Phal, A\. Sharma, Y\. Sung, P\. Radhakrishnan, M\. Shomrat, R\. Ingle, R\. Vij, J\. Gilmer, M\. D\. Istin, S\. Sobell, Y\. Lu, E\. Nottage, D\. Sadigh, J\. Willcock, T\. Zhang, S\. Xu, S\. Brown, K\. Lee, G\. Wang, Y\. Zhu, Y\. Tay, C\. Kim, A\. Gutierrez, A\. Sharma, Y\. Xian, S\. Seo, C\. Cui, E\. Pochernina, C\. Baetu, K\. Jastrzębski, M\. Ly, M\. Elhawaty, D\. Suh, E\. Sezener, P\. Wang, N\. Yuen, G\. Tucker, J\. Cai, Z\. Yang, C\. Wang, A\. Muzio, H\. Qian, J\. Yoo, D\. Lockhart, K\. R\. McKee, M\. Guo, M\. Mehrotra, A\. Mendonça, S\. V\. Mehta, S\. Ben, C\. Tekur, J\. Mu, M\. Zhu, V\. Krakovna, H\. Lee, A\. Maschinot, S\. Cevey, H\. Choe, A\. Bai, H\. Srinivasan, D\. Gasaway, N\. Young, P\. Siegler, D\. Holtmann\-Rice, V\. Piratla, K\. Baumli, R\. Yogev, A\. Hofer, H\. van Hasselt, S\. Grant, Y\. Chervonyi, D\. Silver, A\. Hogue, A\. Agarwal, K\. Wang, P\. Singh, F\. Flynn, J\. Lipschultz, R\. David, L\. Bellot, Y\. Yang, L\. Le, F\. Graziano, K\. Olszewska, K\. Hui, A\. Maurya, N\. Parotsidis, W\. Chen, T\. Oguntebi, J\. Kelley, A\. Baddepudi, J\. Mauerer, G\. Shaw, A\. Siegman, L\. Yang, S\. Shetty, S\. Roy, Y\. Song, W\. 
Stokowiec, R\. Burnell, O\. Savant, R\. Busa\-Fekete, J\. Miao, S\. Ghosh, L\. MacDermed, P\. Lippe, M\. Dektiarev, Z\. Behrman, F\. Mentzer, K\. Nguyen, M\. Wei, S\. Verma, C\. Knutsen, S\. Dasari, Z\. Yan, P\. Mitrichev, X\. Wang, V\. Shejwalkar, J\. Austin, S\. Sunkara, N\. Potti, Y\. Virin, C\. Wright, G\. Liu, O\. Riva, E\. Pot, G\. Kochanski, Q\. Le, G\. Balasubramaniam, A\. Dhar, Y\. Liao, A\. Bloniarz, D\. Shukla, E\. Cole, J\. Lee, S\. Zhang, S\. Kafle, S\. Vashishtha, P\. Mahmoudieh, G\. Chen, R\. Hoffmann, P\. Srinivasan, A\. D\. Lago, Y\. B\. Shalom, Z\. Wang, M\. Elabd, A\. Sharma, J\. Oh, S\. Kothawade, M\. Le, M\. Monteiro, S\. Yang, K\. Alarakyia, R\. Geirhos, D\. Mincu, H\. Garnes, H\. Kobayashi, S\. Mariooryad, K\. Krasowiak, Zhixin, Lai, S\. Mourad, M\. Wang, F\. Bu, O\. Aharoni, G\. Chen, A\. Goyal, V\. Zubov, A\. Bapna, E\. Dabir, N\. Kothari, K\. Lamerigts, N\. D\. Cao, J\. Shar, C\. Yew, N\. Kulkarni, D\. Mahaarachchi, M\. Joshi, Z\. Zhu, J\. Lichtarge, Y\. Zhou, H\. Muckenhirn, V\. Selo, O\. Vinyals, P\. Chen, A\. Brohan, V\. Mehta, S\. Cogan, R\. Wang, T\. Geri, W\. Ko, W\. Chen, F\. Viola, K\. Shivam, L\. Wang, M\. C\. Elish, R\. A\. Popa, S\. Pereira, J\. Liu, R\. Koster, D\. Kim, G\. Zhang, S\. Ebrahimi, P\. Talukdar, Y\. Zheng, P\. Poklukar, A\. Mikhalap, D\. Johnson, A\. Vijayakumar, M\. Omernick, M\. Dibb, A\. Dubey, Q\. Hu, A\. Suman, V\. Aggarwal, I\. Kornakov, F\. Xia, W\. Lowe, A\. Kolganov, T\. Xiao, V\. Nikolaev, S\. Hemingray, B\. Li, J\. Iljazi, M\. Rybiński, B\. Sandhu, P\. Lu, T\. Luong, R\. Jenatton, V\. Govindaraj, Hui, Li, G\. Dulac\-Arnold, W\. Park, H\. Wang, A\. Modi, J\. Pouget\-Abadie, K\. Greller, R\. Gupta, R\. Berry, P\. Ramachandran, J\. Xie, L\. McCafferty, J\. Wang, K\. Gupta, H\. Lim, B\. Bratanič, A\. Brock, I\. Akolzin, J\. Sproch, D\. Karliner, D\. Kim, A\. Goedeckemeyer, N\. Shazeer, C\. Schmid, D\. Calandriello, P\. Bhatia, K\. Choromanski, C\. Montgomery, D\. Dua, A\. Ramalho, H\. King, Y\. Gao, L\. 
Nguyen, D\. Lindner, D\. Pitta, O\. Johnson, K\. Salama, D\. Ardila, M\. Han, E\. Farnese, S\. Odoom, Z\. Wang, X\. Ding, N\. Rink, R\. Smith, H\. T\. Lehri, E\. Cohen, N\. Vats, T\. He, P\. Gopavarapu, A\. Paszke, M\. Patel, W\. V\. Gansbeke, L\. Loher, L\. Castro, M\. Voitovich, T\. von Glehn, N\. George, S\. Niklaus, Z\. Eaton\-Rosen, N\. Rakićević, E\. Jue, S\. Perel, C\. Zhang, Y\. Bahat, A\. Pouget, Z\. Xing, F\. Huot, A\. Shenoy, T\. Bos, V\. Coriou, B\. Richter, N\. Noy, Y\. Wang, S\. Ontanon, S\. Qin, G\. Makarchuk, D\. Hassabis, Z\. Li, M\. Sharma, K\. Venkatesan, I\. Kemaev, R\. Daniel, S\. Huang, S\. Shah, O\. Ponce, Warren, Chen, M\. Faruqui, J\. Wu, S\. Andačić, S\. Payrits, D\. McDuff, T\. Hume, Y\. Cao, M\. Tessler, Q\. Wang, Y\. Wang, I\. Rendulic, E\. Agustsson, M\. Johnson, T\. Lando, A\. Howard, S\. G\. S\. Padmanabhan, M\. Daswani, A\. Banino, M\. Kilgore, J\. Heek, Z\. Ji, A\. Caceres, C\. Li, N\. Kassner, A\. Vlaskin, Z\. Liu, A\. Grills, Y\. Hou, R\. Sukkerd, G\. Cheon, N\. Shetty, L\. Markeeva, P\. Stanczyk, T\. Iyer, Y\. Gong, S\. Gao, K\. Gopalakrishnan, T\. Blyth, M\. Reynolds, A\. Bhoopchand, M\. Bilenko, D\. Gharibian, V\. Zayats, A\. Faust, A\. Singh, M\. Ma, H\. Jiao, S\. Vijayanarasimhan, L\. Aroyo, V\. Yadav, S\. Chakera, A\. Kakarla, V\. Meshram, K\. Gregor, G\. Botea, E\. Senter, D\. Jia, G\. Kovacs, N\. Sharma, S\. Baur, K\. Kang, Y\. He, L\. Zhuo, M\. Kostelac, I\. Laish, S\. Peng, L\. O’Bryan, D\. Kasenberg, G\. R\. Rao, E\. Leurent, B\. Zhang, S\. Stevens, A\. Salazar, Y\. Zhang, I\. Lobov, J\. Walker, A\. Porter, M\. Redshaw, H\. Ke, A\. Rao, A\. Lee, H\. Lam, M\. Moffitt, J\. Kim, S\. Qiao, T\. Koo, R\. Dadashi, X\. Song, M\. Sundararajan, P\. Xu, C\. Kawamoto, Y\. Zhong, C\. Barbu, A\. Reddy, M\. Verzetti, L\. Li, G\. Papamakarios, H\. Klimczak\-Plucińska, M\. Cassin, K\. Kavukcuoglu, R\. Swavely, A\. Vaucher, J\. Zhao, R\. Hemsley, M\. Tschannen, H\. Ge, G\. Menghani, Y\. Yu, N\. Ha, W\. He, X\. Wu, M\. Song, R\. 
Sterneck, S\. Zinke, D\. A\. Calian, A\. Marsden, A\. C\. Ruiz, M\. Hessel, A\. Gueta, B\. Lee, B\. Farris, M\. Gupta, Y\. Li, M\. Saleh, V\. Misra, K\. Xiao, P\. Mendolicchio, G\. Buttimore, V\. Krayvanova, N\. Nayakanti, M\. Wiethoff, Y\. Pande, A\. Mirhoseini, N\. Lao, J\. Liu, Y\. Hua, A\. Chen, Y\. Malkov, D\. Kalashnikov, S\. Gupta, K\. Audhkhasi, Y\. Zhai, S\. Kopalle, P\. Jain, E\. Ofek, C\. Meyer, K\. Baatarsukh, H\. Strejček, J\. Qian, J\. Freedman, R\. Figueira, M\. Sokolik, O\. Bachem, R\. Lin, D\. Kharrat, C\. Hidey, P\. Xu, D\. Duan, Y\. Li, M\. Ersoy, R\. Everett, K\. Cen, R\. Santamaria\-Fernandez, A\. Taubenfeld, I\. Mackinnon, L\. Deng, P\. Zablotskaia, S\. Viswanadha, S\. Goel, D\. Yates, Y\. Deng, P\. Choy, M\. Chen, A\. Sinha, A\. Mossin, Y\. Wang, A\. Szlam, S\. Hao, P\. K\. Rubenstein, M\. Toksoz\-Exley, M\. Aperghis, Y\. Zhong, J\. Ahn, M\. Isard, O\. Lacombe, F\. Luisier, C\. Anastasiou, Y\. Kalley, U\. Prabhu, E\. Dunleavy, S\. Bijwadia, J\. Mao\-Jones, K\. Chen, R\. Pasumarthi, E\. Wood, A\. Dostmohamed, N\. Hurley, J\. Simsa, A\. Parrish, M\. Pajarskas, M\. Harvey, O\. Skopek, Y\. Kochinski, J\. Rey, V\. Rieser, D\. Zhou, S\. J\. Lee, T\. Acharya, G\. Li, J\. Jiang, X\. Zhang, B\. Gipson, E\. Mahintorabi, M\. Gelmi, N\. Khajehnouri, A\. Yeh, K\. Lee, L\. Matthey, L\. Baker, T\. Pham, H\. Fu, A\. Pak, P\. Gupta, C\. Vasconcelos, A\. Sadovsky, B\. Walker, S\. Hsiao, P\. Zochbauer, A\. Marzoca, N\. Velan, J\. Zeng, G\. Baechler, D\. Driess, D\. Jain, Y\. Huang, L\. Tao, J\. Maggs, N\. Levine, J\. Schneider, E\. Gemzer, S\. Petit, S\. Han, Z\. Fisher, D\. Zelle, C\. Biles, E\. Ie, A\. Fadeeva, C\. Liu, J\. V\. Franco, A\. Collister, H\. Zhang, R\. Wang, R\. Zhao, L\. Kieliger, K\. Shuster, R\. Zhu, B\. Gong, L\. Chan, R\. Sun, S\. Basu, R\. Zimmermann, J\. Hayes, A\. Bapna, J\. Snoek, W\. Yang, P\. Datta, J\. A\. Abdallah, K\. Kilgour, L\. Li, S\. Mah, Y\. Jun, M\. Rivière, A\. Karmarkar, T\. Spalink, T\. Huang, L\. Gonzalez, D\. Tran, A\. 
Nowak, J\. Palowitch, M\. Chadwick, E\. Talius, H\. Mehta, T\. Sellam, P\. Fränken, M\. Nicosia, K\. He, A\. Kini, D\. Amos, S\. Basu, H\. Jobe, E\. Shaw, Q\. Xu, C\. Evans, D\. Ikeda, C\. Yan, L\. Jin, L\. Wang, S\. Yadav, I\. Labzovsky, R\. Sampath, A\. Ma, C\. Schumann, A\. Siddhant, R\. Shah, J\. Youssef, R\. Agarwal, N\. Dabney, A\. Tonioni, M\. Ambar, J\. Li, I\. Guyon, B\. Li, D\. Soergel, B\. Fang, G\. Karadzhov, C\. Udrescu, T\. Trinh, V\. Raunak, S\. Noury, D\. Guo, S\. Gupta, M\. Finkelstein, D\. Petek, L\. Liang, G\. Billock, P\. Sun, D\. Wood, Y\. Song, X\. Yu, T\. Matejovicova, R\. Cohen, K\. Andra, D\. D’Ambrosio, Z\. Deng, V\. Nallatamby, E\. Songhori, R\. Dangovski, A\. Lampinen, P\. Botadra, A\. Hillier, J\. Cao, N\. Baddi, A\. Kuncoro, T\. Yoshino, A\. Bhagatwala, M\. Ranzato, R\. Schaeffer, T\. Liu, S\. Ye, O\. Sarvana, J\. Nham, C\. Kuang, I\. Gao, J\. Baek, S\. Mittal, A\. Wahid, A\. Gergely, B\. Ni, J\. Feldman, C\. Muir, P\. Lamblin, W\. Macherey, E\. Dyer, L\. Kilpatrick, V\. Campos, M\. Bhutani, S\. Fort, Y\. Ahmad, A\. Severyn, K\. Chatziprimou, O\. Ferludin, M\. Dimarco, A\. Kusupati, J\. Heyward, D\. Bahir, K\. Villela, K\. Millican, D\. Marcus, S\. Bahargam, C\. Unlu, N\. Roth, Z\. Wei, S\. Gopal, D\. Ghoshal, E\. Lee, S\. Lin, J\. Lees, D\. Lee, A\. Hosseini, C\. Fan, S\. Neel, M\. Wu, Y\. Altun, H\. Cai, E\. Piqueras, J\. Woodward, A\. Bissacco, S\. Haykal, M\. Bordbar, P\. Sundaram, S\. Hodkinson, D\. Toyama, G\. Polovets, A\. Myers, A\. Sinha, T\. Levinboim, K\. Krishnakumar, R\. Chhaparia, T\. Sholokhova, N\. B\. Gundavarapu, G\. Jawahar, H\. Qureshi, J\. Hu, N\. Momchev, M\. Rahtz, R\. Wu, A\. P\. S, K\. Dhamdhere, M\. Guo, U\. Gupta, A\. Eslami, M\. Schain, M\. Blokzijl, D\. Welling, D\. Orr, L\. Bolelli, N\. Perez\-Nieves, M\. Sirotenko, A\. Prasad, A\. Kar, B\. D\. B\. Pigem, T\. Terzi, G\. Weisz, D\. Ghosh, A\. Mavalankar, D\. Madeka, K\. Daugaard, H\. Adam, V\. Shah, D\. Berman, M\. Tran, S\. Baker, E\. Andrejczuk, G\. 
Chole, G\. Raboshchuk, M\. Mirzazadeh, T\. Kagohara, S\. Wu, C\. Schallhart, B\. Orlando, C\. Wang, A\. Rrustemi, H\. Xiong, H\. Liu, A\. Vezer, N\. Ramsden, S\. Chang, S\. Mudgal, Y\. Li, N\. Vieillard, Y\. Hoshen, F\. Ahmad, A\. Slone, A\. Hua, N\. Potikha, M\. Rossini, J\. Stritar, S\. Prakash, Z\. Wang, X\. Dong, A\. Nazari, E\. Nehoran, K\. Tekelioglu, Y\. Li, K\. Badola, T\. Funkhouser, Y\. Li, V\. Yerram, R\. Ganeshan, D\. Formoso, K\. Langner, T\. Shi, H\. Li, Y\. Yamamori, A\. Panda, A\. Saade, A\. S\. Scarpati, C\. Breaux, C\. Carey, Z\. Zhou, C\. Hsieh, S\. Bridgers, A\. Butryna, N\. Gupta, V\. Tulsyan, S\. Woo, E\. Eltyshev, W\. Grathwohl, C\. Parks, S\. Benjamin, R\. Panigrahy, S\. Dodhia, D\. D\. Freitas, C\. Sauer, W\. Song, F\. Alet, J\. Tolins, C\. Paduraru, X\. Zhou, B\. Albert, Z\. Zhang, L\. Shu, M\. Bansal, S\. Nguyen, A\. Globerson, O\. Xiao, J\. Manyika, T\. Hennigan, R\. Rong, J\. Matak, A\. Bakalov, A\. Sharma, D\. Sinopalnikov, A\. Pierson, S\. Roller, G\. Brown, M\. Gao, T\. Fukuzawa, A\. Ghafouri, K\. Vassigh, I\. Barr, Z\. Wang, A\. Korsun, R\. Jayaram, L\. Ren, T\. Zaman, S\. Khan, Y\. Lunts, D\. Deutsch, D\. Uthus, N\. Katz, M\. Samsikova, A\. Khalifa, N\. Sethi, J\. Sun, L\. Tang, U\. Alon, X\. Luo, D\. Yu, A\. Nayyar, B\. Petrini, W\. Truong, V\. Hellendoorn, N\. Chinaev, C\. Alberti, W\. Wang, J\. Hu, V\. Mirrokni, A\. Balashankar, A\. Aharon, A\. Mehta, A\. Iscen, J\. Kready, L\. Manning, A\. Mohananey, Y\. Chen, A\. Tripathi, A\. Wu, I\. Petrovski, D\. Hwang, M\. Baeuml, S\. Chandrakaladharan, Y\. Liu, R\. Coaguila, M\. Chen, S\. Ma, P\. Tafti, S\. Tatineni, T\. Spitz, J\. Ye, P\. Vicol, M\. Rosca, A\. Puigdomènech, Z\. Yahav, S\. Ghemawat, H\. Lin, P\. Kirk, Z\. Nabulsi, S\. Brin, B\. Bohnet, K\. Caluwaerts, A\. S\. Veerubhotla, D\. Zheng, Z\. Dai, P\. Petrov, Y\. Xu, R\. Mehran, Z\. Xu, L\. Zintgraf, J\. Choi, S\. A\. Hombaiah, R\. Thoppilan, S\. Reddi, L\. Lew, L\. Li, K\. Webster, K\. Sawhney, L\. Lamprou, S\. Shakeri, M\. 
Lunayach, J\. Chen, S\. Bagri, A\. Salcianu, Y\. Chen, Y\. Donchev, C\. Magister, S\. Nørly, V\. Rodrigues, T\. Izo, H\. Noga, J\. Zou, T\. Köppe, W\. Zhou, K\. Lee, X\. Long, D\. Eisenbud, A\. Chen, C\. Schenck, C\. M\. To, P\. Zhong, E\. Taropa, M\. Truong, O\. Levy, D\. Martins, Z\. Zhang, C\. Semturs, K\. Zhang, A\. Yakubovich, P\. Moreno, L\. McConnaughey, D\. Lu, S\. Redmond, L\. Weerts, Y\. Bitton, T\. Refice, N\. Lacasse, A\. Conmy, C\. Tallec, J\. Odell, H\. Forbes\-Pollard, A\. Socala, J\. Hoech, P\. Kohli, A\. Walton, R\. Wang, M\. Sazanovich, K\. Zhu, A\. Kapishnikov, R\. Galt, M\. Denton, B\. Murdoch, C\. Sikora, K\. Mohamed, W\. Wei, U\. First, T\. McConnell, L\. C\. Cobo, J\. Qin, T\. Avrahami, D\. Balle, Y\. Watanabe, A\. Louis, A\. Kraft, S\. Ariafar, Y\. Gu, E\. Rives, C\. Yoon, A\. Rusu, J\. Cobon\-Kerr, C\. Hahn, J\. Luo, Yuvein, Zhu, N\. Ahuja, R\. Benenson, R\. L\. Kaufman, H\. Yu, L\. Hightower, J\. Zhang, D\. Ni, L\. A\. Hendricks, G\. Wang, G\. Yona, L\. Jain, P\. Barrio, S\. Bhupatiraju, S\. Velusamy, A\. Dafoe, S\. Riedel, T\. Thomas, Z\. Yuan, M\. Bellaiche, S\. Panthaplackel, K\. Kloboves, S\. Jauhari, C\. Akbulut, T\. Davchev, E\. Gladchenko, D\. Madras, A\. Chuklin, T\. Hill, Q\. Yuan, M\. Madhavan, L\. Leonhard, D\. Scandinaro, Q\. Chen, N\. Niu, A\. Douillard, B\. Damoc, Y\. Onoe, F\. Pedregosa, F\. Bertsch, C\. Leichner, J\. Pagadora, J\. Malmaud, S\. Ponda, A\. Twigg, O\. Duzhyi, J\. Shen, M\. Wang, R\. Garg, J\. Chen, U\. Evci, J\. Lee, L\. Liu, K\. Kojima, M\. Yamaguchi, A\. Rajendran, A\. Piergiovanni, V\. K\. Rajendran, M\. Fornoni, G\. Ibagon, H\. Ragan, S\. M\. Khan, J\. Blitzer, A\. Bunner, G\. Sun, T\. Kosakai, S\. Lundberg, N\. Elue, K\. Guu, S\. Park, J\. Park, A\. Narayanaswamy, C\. Wu, J\. Mudigonda, T\. Cohn, H\. Mu, R\. Kumar, L\. Graesser, Y\. Zhang, R\. Killam, V\. Zhuang, M\. Giménez, W\. A\. Jishi, R\. Ley\-Wild, A\. Zhai, K\. Osawa, D\. Cedillo, J\. Liu, M\. Upadhyay, M\. Sieniek, R\. Sharma, T\. Paine, A\. 
Angelova, S\. Addepalli, C\. Parada, K\. Majumder, A\. Lamp, S\. Kumar, X\. Deng, A\. Myaskovsky, T\. Sabolić, J\. Dudek, S\. York, F\. de Chaumont Quitry, J\. Nie, D\. Cattle, A\. Gunjan, B\. Piot, W\. Khawaja, S\. Bang, S\. Wang, S\. Khodadadeh, R\. R, P\. Rawlani, R\. Powell, K\. Lee, J\. Griesser, G\. Oh, C\. Magalhaes, Y\. Li, S\. Tokumine, H\. N\. Vogel, D\. Hsu, A\. BC, D\. Jindal, M\. Cohen, Z\. Yang, J\. Yuan, D\. de Cesare, T\. Bruguier, J\. Xu, M\. Roy, A\. Jacovi, D\. Belov, R\. Arya, P\. Meadowlark, S\. Cohen\-Ganor, W\. Ye, P\. Morris\-Suzuki, P\. Banzal, G\. Song, P\. Ponnuramu, F\. Zhang, G\. Scrivener, S\. Zaiem, A\. R\. Rochman, K\. Han, B\. Ghazi, K\. Lee, S\. Drath, D\. Suo, A\. Girgis, P\. Shenoy, D\. Nguyen, D\. Eck, S\. Gupta, L\. Yan, J\. Carreira, A\. Gulati, R\. Sang, D\. Mirylenka, E\. Cooney, E\. Chou, M\. Ling, C\. Fan, B\. Coleman, G\. Tubone, R\. Kumar, J\. Baldridge, F\. Hernandez\-Campos, A\. Lazaridou, J\. Besley, I\. Yona, N\. Bulut, Q\. Wellens, A\. Pierigiovanni, J\. George, R\. Green, P\. Han, C\. Tao, G\. Clark, C\. You, A\. Abdolmaleki, J\. Fu, T\. Chen, A\. Chaugule, A\. Chandorkar, A\. Rahman, W\. Thompson, P\. Koanantakool, M\. Bernico, J\. Ren, A\. Vlasov, S\. Vassilvitskii, M\. Kula, Y\. Liang, D\. Kim, Y\. Huang, C\. Ye, D\. Lepikhin, and W\. Helmholz\(2025\)Gemini 2\.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities\.External Links:2507\.06261,[Link](https://arxiv.org/abs/2507.06261)Cited by:[§1](https://arxiv.org/html/2606.15029#S1.p6.3),[§3\.3](https://arxiv.org/html/2606.15029#S3.SS3.SSS0.Px5.p1.2)\.
- \[17\]E\. Croxford, Y\. Gao, N\. Pellegrino, K\. Wong, G\. Wills, E\. First, M\. Schnier, K\. Burton, C\. Ebby, J\. Gorski, M\. Kalscheur, S\. Khalil, M\. Pisani, T\. Rubeor, P\. Stetson, F\. Liao, C\. Goswami, B\. Patterson, and M\. Afshar\(2025\-06\)Development and validation of the provider documentation summarization quality instrument for large language models\.Journal of the American Medical Informatics Association32\(6\),pp\. 1050–1060\.External Links:[Document](https://dx.doi.org/10.1093/jamia/ocaf068)Cited by:[Appendix A](https://arxiv.org/html/2606.15029#A1.SS0.SSS0.Px1.p4.7),[§1](https://arxiv.org/html/2606.15029#S1.p2.1)\.
- \[18\]Y\. Duboiset al\.\(2024\)AlpacaEval 2\.0: automatic evaluation of instruction\-following models\.arXiv preprint arXiv:2401\.04088\.Cited by:[§1](https://arxiv.org/html/2606.15029#S1.p1.1),[§2\.2](https://arxiv.org/html/2606.15029#S2.SS2.p1.3)\.
- \[19\]A\. R\. Fabbri, W\. Kryściński, B\. McCann, C\. Xiong, R\. Socher, and D\. Radev\(2021\)SummEval: re\-evaluating summarization evaluation\.External Links:2007\.12626,[Link](https://arxiv.org/abs/2007.12626)Cited by:[§1](https://arxiv.org/html/2606.15029#S1.p6.3),[§3\.3](https://arxiv.org/html/2606.15029#S3.SS3.SSS0.Px4.p1.3)\.
- \[20\]C\. Feng, M\. Shen, A\. Balashankar, C\. Gerner\-Beuerle, and M\. R\. D\. Rodrigues\(2026\)Noisy but valid: robust statistical evaluation of LLMs with imperfect judges\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=hEhxreaLdU)Cited by:[§2\.3](https://arxiv.org/html/2606.15029#S2.SS3.p1.1),[§4\.3](https://arxiv.org/html/2606.15029#S4.SS3.p1.4)\.
- \[21\]A\. Grattafiori, A\. Dubey, A\. J\. a…, Z\. Yang, Z\. Zhao, and Z\. Ma\(2024\)The llama 3 herd of models\.External Links:2407\.21783,[Link](https://arxiv.org/abs/2407.21783)Cited by:[§I\.1](https://arxiv.org/html/2606.15029#A9.SS1.p1.1)\.
- \[22\]J\. Gu, X\. Jiang, Z\. Shi, H\. Tan, X\. Zhai, C\. Xu, W\. Li, Y\. Shen, S\. Ma, H\. Liu, S\. Wang, K\. Zhang, Y\. Wang, W\. Gao, L\. Ni, and J\. Guo\(2025\)A survey on llm\-as\-a\-judge\.External Links:2411\.15594,[Link](https://arxiv.org/abs/2411.15594)Cited by:[§1](https://arxiv.org/html/2606.15029#S1.p1.1),[§2\.2](https://arxiv.org/html/2606.15029#S2.SS2.p1.3)\.
- \[23\]D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi, X\. Zhang, X\. Yu, Y\. Wu, Z\. F\. Wu, Z\. Gou, Z\. Shao, Z\. Li, Z\. Gao, A\. Liu, B\. Xue, B\. Wang, B\. Wu, B\. Feng, C\. Lu, C\. Zhao, C\. Deng, C\. Ruan, D\. Dai, D\. Chen, D\. Ji, E\. Li, F\. Lin, F\. Dai, F\. Luo, G\. Hao, G\. Chen, G\. Li, H\. Zhang, H\. Xu, H\. Ding, H\. Gao, H\. Qu, H\. Li, J\. Guo, J\. Li, J\. Chen, J\. Yuan, J\. Tu, J\. Qiu, J\. Li, J\. L\. Cai, J\. Ni, J\. Liang, J\. Chen, K\. Dong, K\. Hu, K\. You, K\. Gao, K\. Guan, K\. Huang, K\. Yu, L\. Wang, L\. Zhang, L\. Zhao, L\. Wang, L\. Zhang, L\. Xu, L\. Xia, M\. Zhang, M\. Zhang, M\. Tang, M\. Zhou, M\. Li, M\. Wang, M\. Li, N\. Tian, P\. Huang, P\. Zhang, Q\. Wang, Q\. Chen, Q\. Du, R\. Ge, R\. Zhang, R\. Pan, R\. Wang, R\. J\. Chen, R\. L\. Jin, R\. Chen, S\. Lu, S\. Zhou, S\. Chen, S\. Ye, S\. Wang, S\. Yu, S\. Zhou, S\. Pan, S\. S\. Li, S\. Zhou, S\. Wu, T\. Yun, T\. Pei, T\. Sun, T\. Wang, W\. Zeng, W\. Liu, W\. Liang, W\. Gao, W\. Yu, W\. Zhang, W\. L\. Xiao, W\. An, X\. Liu, X\. Wang, X\. Chen, X\. Nie, X\. Cheng, X\. Liu, X\. Xie, X\. Liu, X\. Yang, X\. Li, X\. Su, X\. Lin, X\. Q\. Li, X\. Jin, X\. Shen, X\. Chen, X\. Sun, X\. Wang, X\. Song, X\. Zhou, X\. Wang, X\. Shan, Y\. K\. Li, Y\. Q\. Wang, Y\. X\. Wei, Y\. Zhang, Y\. Xu, Y\. Li, Y\. Zhao, Y\. Sun, Y\. Wang, Y\. Yu, Y\. Zhang, Y\. Shi, Y\. Xiong, Y\. He, Y\. Piao, Y\. Wang, Y\. Tan, Y\. Ma, Y\. Liu, Y\. Guo, Y\. Ou, Y\. Wang, Y\. Gong, Y\. Zou, Y\. He, Y\. Xiong, Y\. Luo, Y\. You, Y\. Liu, Y\. Zhou, Y\. X\. Zhu, Y\. Huang, Y\. Li, Y\. Zheng, Y\. Zhu, Y\. Ma, Y\. Tang, Y\. Zha, Y\. Yan, Z\. Z\. Ren, Z\. Ren, Z\. Sha, Z\. Fu, Z\. Xu, Z\. Xie, Z\. Zhang, Z\. Hao, Z\. Ma, Z\. Yan, Z\. Wu, Z\. Gu, Z\. Zhu, Z\. Liu, Z\. Li, Z\. Xie, Z\. Song, Z\. Pan, Z\. Huang, Z\. Xu, Z\. Zhang, and Z\. Zhang\(2025\)DeepSeek\-r1 incentivizes reasoning in llms through reinforcement learning\.Nature645\(8081\),pp\. 
633–638\.External Links:ISSN 1476\-4687,[Link](http://dx.doi.org/10.1038/s41586-025-09422-z),[Document](https://dx.doi.org/10.1038/s41586-025-09422-z)Cited by:[§1](https://arxiv.org/html/2606.15029#S1.p6.3),[§3\.3](https://arxiv.org/html/2606.15029#S3.SS3.SSS0.Px5.p1.2)\.
- \[24\]X\. Huan, J\. Jagalur, and Y\. Marzouk\(2024\)Optimal experimental design: formulations and computations\.Acta Numerica33,pp\. 715–840\.External Links:ISSN 1474\-0508,[Link](http://dx.doi.org/10.1017/S0962492924000023),[Document](https://dx.doi.org/10.1017/s0962492924000023)Cited by:[§2\.3](https://arxiv.org/html/2606.15029#S2.SS3.p3.1)\.
- \[25\]A\. Joshi, S\. Kale, S\. Chandel, and D\. K\. Pal\(2015\)Likert scale: explored and explained\.British journal of applied science & technology7\(4\),pp\. 396\.Cited by:[§3\.3](https://arxiv.org/html/2606.15029#S3.SS3.SSS0.Px4.p1.3)\.
- \[26\]M\. G\. Kendall\(1948\)Rank correlation methods\.Griffin,London\.Cited by:[§1](https://arxiv.org/html/2606.15029#S1.p2.1),[§1](https://arxiv.org/html/2606.15029#S1.p6.3),[§2\.2](https://arxiv.org/html/2606.15029#S2.SS2.p1.3),[§3\.3](https://arxiv.org/html/2606.15029#S3.SS3.SSS0.Px3.p1.3)\.
- \[27\]Z\. Kenton, N\. Y\. Siegel, J\. Kramar, J\. Brown\-Cohen, S\. Albanie, J\. Bulian, R\. Agarwal, D\. Lindner, Y\. Tang, N\. Goodman, and R\. Shah\(2024\)On scalable oversight with weak LLMs judging strong LLMs\.InThe Thirty\-eighth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=O1fp9nVraj)Cited by:[§2\.1](https://arxiv.org/html/2606.15029#S2.SS1.p1.1)\.
- \[28\]T\. K\. Koo and M\. Y\. Li\(2016\)A guideline of selecting and reporting intraclass correlation coefficients for reliability research\.Journal of Chiropractic Medicine15\(2\),pp\. 155–163\.External Links:ISSN 1556\-3707,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.jcm.2016.02.012),[Link](https://www.sciencedirect.com/science/article/pii/S1556370716000158)Cited by:[§2\.2](https://arxiv.org/html/2606.15029#S2.SS2.p1.3)\.
- \[29\]K\. Krippendorff\(1970\)Estimating the reliability, systematic error, and random error of interval data\.Educational and Psychological Measurement30\(1\),pp\. 61–70\.Cited by:[Appendix A](https://arxiv.org/html/2606.15029#A1.SS0.SSS0.Px2.p1.4),[§1](https://arxiv.org/html/2606.15029#S1.p2.1),[§1](https://arxiv.org/html/2606.15029#S1.p6.3),[§3\.3](https://arxiv.org/html/2606.15029#S3.SS3.SSS0.Px3.p1.3)\.
- \[30\]C\. Li, Z\. Akhtar, M\. Kwak, Y\. Ji, H\. Zhang, T\. Obi, Y\. Ren, X\. Wu, S\. Sivarajkumar, H\. P\. Lehmann, S\. Visweswaran, M\. J\. Becich, D\. L\. Mowery, R\. Liu, H\. Sun, and Y\. Wang\(2026\)A scoping review of llm\-as\-a\-judge in healthcare and the medjudge framework\.External Links:2604\.25933,[Link](https://arxiv.org/abs/2604.25933)Cited by:[§1](https://arxiv.org/html/2606.15029#S1.p2.1)\.
- \[31\]H\. Li, Q\. Dong, J\. Chen, H\. Su, Y\. Zhou, Q\. Ai, Z\. Ye, and Y\. Liu\(2024\)LLMs\-as\-judges: a comprehensive survey on llm\-based evaluation methods\.Cited by:[Appendix A](https://arxiv.org/html/2606.15029#A1.SS0.SSS0.Px1.p4.7),[§1](https://arxiv.org/html/2606.15029#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.15029#S2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2606.15029#S2.SS2.p1.3)\.
- \[32\]D\. Liljequist, B\. Elfving, and K\. Skavberg Roaldsen\(2019\-07\)Intraclass correlation – a discussion and demonstration of basic features\.PLOS ONE14,pp\. 1–35\.External Links:[Document](https://dx.doi.org/10.1371/journal.pone.0219854),[Link](https://doi.org/10.1371/journal.pone.0219854)Cited by:[Appendix A](https://arxiv.org/html/2606.15029#A1.SS0.SSS0.Px1.p3.13),[Appendix A](https://arxiv.org/html/2606.15029#A1.SS0.SSS0.Px1.p4.7)\.
- \[33\]C\. Lin\(2004\-07\)ROUGE: a package for automatic evaluation of summaries\.InText Summarization Branches Out,Barcelona, Spain,pp\. 74–81\.External Links:[Link](https://aclanthology.org/W04-1013/)Cited by:[§2\.1](https://arxiv.org/html/2606.15029#S2.SS1.p1.1)\.
- \[34\]Y\. Liu, D\. Iter, Y\. Xu, S\. Wang, R\. Xu, and C\. Zhu\(2023\)G\-eval: nlg evaluation using gpt\-4 with better human alignment\.arXiv preprint arXiv:2303\.16634\.Cited by:[§3\.3](https://arxiv.org/html/2606.15029#S3.SS3.SSS0.Px3.p1.3)\.
- \[35\]D\. J\. C\. MacKay\(1992\-07\)Information\-based objective functions for active data selection\.Neural Computation4\(4\),pp\. 590–604\.External Links:ISSN 0899\-7667,[Link](https://doi.org/10.1162/neco.1992.4.4.590),[Document](https://dx.doi.org/10.1162/neco.1992.4.4.590)Cited by:[§2\.3](https://arxiv.org/html/2606.15029#S2.SS3.p3.1)\.
- \[36\]J\. G\. Meyer, R\. J\. Urbanowicz, P\. C\. N\. Martin,et al\.\(2023\)ChatGPT and large language models in academia: opportunities and challenges\.BioData Mining16\(1\),pp\. 20\.External Links:[Document](https://dx.doi.org/10.1186/s13040-023-00339-9)Cited by:[§1](https://arxiv.org/html/2606.15029#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.15029#S2.SS1.p1.1)\.
- \[37\]J\. Novikova, O\. Dušek, A\. Cercas Curry, and V\. Rieser\(2017\-09\)Why we need new evaluation metrics for NLG\.InProceedings of the 2017 Conference on Empirical Methods in Natural Language Processing,M\. Palmer, R\. Hwa, and S\. Riedel \(Eds\.\),Copenhagen, Denmark,pp\. 2241–2252\.External Links:[Link](https://aclanthology.org/D17-1238/),[Document](https://dx.doi.org/10.18653/v1/D17-1238)Cited by:[§2\.1](https://arxiv.org/html/2606.15029#S2.SS1.p1.1)\.
- \[38\]OpenAI, J\. Achiam, S\. A\. …\. C\. Zhang, M\. Zhang, S\. Zhao, T\. Zheng, J\. Zhuang, W\. Zhuk, and B\. Zoph\(2024\)GPT\-4 technical report\.External Links:2303\.08774,[Link](https://arxiv.org/abs/2303.08774)Cited by:[§I\.1](https://arxiv.org/html/2606.15029#A9.SS1.p1.1),[§1](https://arxiv.org/html/2606.15029#S1.p6.3),[§3\.3](https://arxiv.org/html/2606.15029#S3.SS3.SSS0.Px5.p1.2)\.
- \[39\]OpenAI\(2026\-04\)GPT\-5\.5 System Card\.Note:[https://openai\.com/index/gpt\-5\-5\-system\-card/](https://openai.com/index/gpt-5-5-system-card/)Cited by:[§1](https://arxiv.org/html/2606.15029#S1.p6.3),[§3\.3](https://arxiv.org/html/2606.15029#S3.SS3.SSS0.Px5.p1.2)\.
- \[40\]L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Gray, J\. Schulman, J\. Hilton, F\. Kelton, L\. Miller, M\. Simens, A\. Askell, P\. Welinder, P\. Christiano, J\. Leike, and R\. Lowe\(2022\)Training language models to follow instructions with human feedback\.InAdvances in Neural Information Processing Systems,A\. H\. Oh, A\. Agarwal, D\. Belgrave, and K\. Cho \(Eds\.\),External Links:[Link](https://arxiv.org/abs/2203.02155)Cited by:[§2\.1](https://arxiv.org/html/2606.15029#S2.SS1.p1.1)\.
- \[41\]K\. Papineni, S\. Roukos, T\. Ward, and W\. Zhu\(2002\-07\)Bleu: a method for automatic evaluation of machine translation\.InProceedings of the 40th Annual Meeting of the Association for Computational Linguistics,P\. Isabelle, E\. Charniak, and D\. Lin \(Eds\.\),Philadelphia, Pennsylvania, USA,pp\. 311–318\.External Links:[Link](https://aclanthology.org/P02-1040/),[Document](https://dx.doi.org/10.3115/1073083.1073135)Cited by:[§2\.1](https://arxiv.org/html/2606.15029#S2.SS1.p1.1)\.
- \[42\]O\. Sener and S\. Savarese\(2018\)Active learning for convolutional neural networks: a core\-set approach\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=H1aIuk-RW)Cited by:[§2\.3](https://arxiv.org/html/2606.15029#S2.SS3.p3.1)\.
- \[43\]P\. E\. Shrout and J\. L\. Fleiss\(1979\)Intraclass correlations: uses in assessing rater reliability\.Psychological Bulletin86\(2\),pp\. 420–428\.External Links:[Document](https://dx.doi.org/10.1037/0033-2909.86.2.420)Cited by:[Appendix A](https://arxiv.org/html/2606.15029#A1.SS0.SSS0.Px1.p2.3),[§1](https://arxiv.org/html/2606.15029#S1.p2.1),[§1](https://arxiv.org/html/2606.15029#S1.p6.3),[§3\.3](https://arxiv.org/html/2606.15029#S3.SS3.SSS0.Px3.p1.3)\.
- \[44\]C\. Spearman\(1904\)The proof and measurement of association between two things\.The American Journal of Psychology15\(1\),pp\. 72–101\.Cited by:[§1](https://arxiv.org/html/2606.15029#S1.p2.1),[§1](https://arxiv.org/html/2606.15029#S1.p6.3),[§2\.2](https://arxiv.org/html/2606.15029#S2.SS2.p1.3),[§3\.3](https://arxiv.org/html/2606.15029#S3.SS3.SSS0.Px3.p1.3)\.
- \[45\]S\. Tan, S\. Zhuang, K\. Montgomery, W\. Y\. Tang, A\. Cuadron, C\. Wang, R\. A\. Popa, and I\. Stoica\(2025\)JudgeBench: a benchmark for evaluating llm\-based judges\.External Links:2410\.12784,[Link](https://arxiv.org/abs/2410.12784)Cited by:[§1](https://arxiv.org/html/2606.15029#S1.p1.1),[§2\.2](https://arxiv.org/html/2606.15029#S2.SS2.p1.3)\.
- \[46\]G\. Team, A\. Kamath, J\. Ferret, S\. Pathak, N\. Vieillard, R\. Merhej, S\. Perrin, T\. Matejovicova, A\. Ramé, M\. Rivière, L\. Rouillard, T\. Mesnard, G\. Cideron, J\. Grill, S\. Ramos, E\. Yvinec, M\. Casbon, E\. Pot, I\. Penchev, G\. Liu, F\. Visin, K\. Kenealy, L\. Beyer, X\. Zhai, A\. Tsitsulin, R\. Busa\-Fekete, A\. Feng, N\. Sachdeva, B\. Coleman, Y\. Gao, B\. Mustafa, I\. Barr, E\. Parisotto, D\. Tian, M\. Eyal, C\. Cherry, J\. Peter, D\. Sinopalnikov, S\. Bhupatiraju, R\. Agarwal, M\. Kazemi, D\. Malkin, R\. Kumar, D\. Vilar, I\. Brusilovsky, J\. Luo, A\. Steiner, A\. Friesen, A\. Sharma, A\. Sharma, A\. M\. Gilady, A\. Goedeckemeyer, A\. Saade, A\. Feng, A\. Kolesnikov, A\. Bendebury, A\. Abdagic, A\. Vadi, A\. György, A\. S\. Pinto, A\. Das, A\. Bapna, A\. Miech, A\. Yang, A\. Paterson, A\. Shenoy, A\. Chakrabarti, B\. Piot, B\. Wu, B\. Shahriari, B\. Petrini, C\. Chen, C\. L\. Lan, C\. A\. Choquette\-Choo, C\. Carey, C\. Brick, D\. Deutsch, D\. Eisenbud, D\. Cattle, D\. Cheng, D\. Paparas, D\. S\. Sreepathihalli, D\. Reid, D\. Tran, D\. Zelle, E\. Noland, E\. Huizenga, E\. Kharitonov, F\. Liu, G\. Amirkhanyan, G\. Cameron, H\. Hashemi, H\. Klimczak\-Plucińska, H\. Singh, H\. Mehta, H\. T\. Lehri, H\. Hazimeh, I\. Ballantyne, I\. Szpektor, I\. Nardini, J\. Pouget\-Abadie, J\. Chan, J\. Stanton, J\. Wieting, J\. Lai, J\. Orbay, J\. Fernandez, J\. Newlan, J\. Ji, J\. Singh, K\. Black, K\. Yu, K\. Hui, K\. Vodrahalli, K\. Greff, L\. Qiu, M\. Valentine, M\. Coelho, M\. Ritter, M\. Hoffman, M\. Watson, M\. Chaturvedi, M\. Moynihan, M\. Ma, N\. Babar, N\. Noy, N\. Byrd, N\. Roy, N\. Momchev, N\. Chauhan, N\. Sachdeva, O\. Bunyan, P\. Botarda, P\. Caron, P\. K\. Rubenstein, P\. Culliton, P\. Schmid, P\. G\. Sessa, P\. Xu, P\. Stanczyk, P\. Tafti, R\. Shivanna, R\. Wu, R\. Pan, R\. Rokni, R\. Willoughby, R\. Vallu, R\. Mullins, S\. Jerome, S\. Smoot, S\. Girgin, S\. Iqbal, S\. Reddy, S\. Sheth, S\. Põder, S\. Bhatnagar, S\. R\. Panyam, S\. Eiger, S\. 
Zhang, T\. Liu, T\. Yacovone, T\. Liechty, U\. Kalra, U\. Evci, V\. Misra, V\. Roseberry, V\. Feinberg, V\. Kolesnikov, W\. Han, W\. Kwon, X\. Chen, Y\. Chow, Y\. Zhu, Z\. Wei, Z\. Egyed, V\. Cotruta, M\. Giang, P\. Kirk, A\. Rao, K\. Black, N\. Babar, J\. Lo, E\. Moreira, L\. G\. Martins, O\. Sanseviero, L\. Gonzalez, Z\. Gleicher, T\. Warkentin, V\. Mirrokni, E\. Senter, E\. Collins, J\. Barral, Z\. Ghahramani, R\. Hadsell, Y\. Matias, D\. Sculley, S\. Petrov, N\. Fiedel, N\. Shazeer, O\. Vinyals, J\. Dean, D\. Hassabis, K\. Kavukcuoglu, C\. Farabet, E\. Buchatskaya, J\. Alayrac, R\. Anil, Dmitry, Lepikhin, S\. Borgeaud, O\. Bachem, A\. Joulin, A\. Andreev, C\. Hardin, R\. Dadashi, and L\. Hussenot\(2025\)Gemma 3 technical report\.External Links:2503\.19786,[Link](https://arxiv.org/abs/2503.19786)Cited by:[§I\.1](https://arxiv.org/html/2606.15029#A9.SS1.p1.1)\.
- \[47\]A\. J\. Thirunavukarasu, D\. S\. W\. Ting, K\. Elangovan,et al\.\(2023\)Large language models in medicine\.Nature Medicine29\(9\),pp\. 1930–1940\.External Links:[Document](https://dx.doi.org/10.1038/s41591-023-02448-8)Cited by:[§1](https://arxiv.org/html/2606.15029#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.15029#S2.SS1.p1.1)\.
- \[48\]A\. Unell, N\. Codella, S\. Preston, P\. Argaw, W\. Yim, Z\. Gero, C\. Wong, R\. Jena, E\. Horvitz, A\. K\. Hall, R\. R\. Zhong, J\. Li, S\. Jain, M\. Wei, M\. P\. Lungren, and H\. Poon\(2025\-09\)CancerGUIDE: cancer guideline understanding via internal disagreement estimation\.Note:arXivExternal Links:[Link](https://www.microsoft.com/en-us/research/publication/cancerguide-cancer-guideline-understanding-via-internal-disagreement-estimation/)Cited by:[§2\.1](https://arxiv.org/html/2606.15029#S2.SS1.p1.1),[§4\.1\.3](https://arxiv.org/html/2606.15029#S4.SS1.SSS3.p3.2)\.
- \[49\]L\. L\. Wang, Y\. Otmakhova, J\. DeYoung, T\. H\. Truong, B\. E\. Kuehl, E\. Bransom, and B\. C\. Wallace\(2023\)Automated metrics for medical multi\-document summarization disagree with human evaluations\.InProceedings of the 61th Annual Meeting of the Association for Computational Linguistics \(Long Papers\),Toronto, Canada\.Cited by:[§1](https://arxiv.org/html/2606.15029#S1.p6.3),[§3\.3](https://arxiv.org/html/2606.15029#S3.SS3.SSS0.Px4.p1.3)\.
- \[50\]S\. Wu, Y\. Nair, and E\. J\. Candès\(2026\)Efficient evaluation of llm performance with statistical guarantees\.External Links:2601\.20251,[Link](https://arxiv.org/abs/2601.20251)Cited by:[§2\.3](https://arxiv.org/html/2606.15029#S2.SS3.p2.1),[§2\.3](https://arxiv.org/html/2606.15029#S2.SS3.p3.1)\.
- \[51\]W\. Yuan, G\. Neubig, and P\. Liu\(2021\)BARTScore: evaluating generated text as text generation\.InAdvances in Neural Information Processing Systems,M\. Ranzato, A\. Beygelzimer, Y\. Dauphin, P\.S\. Liang, and J\. W\. Vaughan \(Eds\.\),Vol\.34,pp\. 27263–27277\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2021/file/e4d2b6e6fdeca3e60e0f1a62fee3d9dd-Paper.pdf)Cited by:[§1](https://arxiv.org/html/2606.15029#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.15029#S2.SS1.p1.1)\.
- \[52\]L\. Zhao, G\. Sukthankar, and R\. Sukthankar\(2012\)Importance\-weighted label prediction for active learning with noisy annotations\.InProceedings of the 21st International Conference on Pattern Recognition \(ICPR\),Cited by:[§2\.3](https://arxiv.org/html/2606.15029#S2.SS3.p1.1)\.
- \[53\]T\. Zrnic and E\. J\. Candès\(2024\)Active statistical inference\.InProceedings of the 41st International Conference on Machine Learning,ICML’24\.Cited by:[§2\.3](https://arxiv.org/html/2606.15029#S2.SS3.p2.1),[§2\.3](https://arxiv.org/html/2606.15029#S2.SS3.p3.1)\.
## Appendix A Metric Functions
Here, we show equations and population information about our chosen metrics\.
##### Intraclass Correlation Coefficient
The Intraclass Correlation Coefficient (ICC) was originally proposed as an extension of the *interclass* correlation coefficient (Pearson's correlation coefficient, PCC). It measures the extent to which the total variance in observed data is due to differences between groups rather than within groups. From this perspective, the ICC is understood within the analysis of variance (ANOVA) framework. Unlike the PCC, the ICC pools the data when computing the mean.
In the generic version, the ICC quantifies inter-rater reliability between $k$ raters on $n$ subjects, an application first introduced in Shrout and Fleiss \[[43](https://arxiv.org/html/2606.15029#bib.bib19)\]. ICC measures reliability by decomposing the total variance in human evaluations into between-subjects variance and within-subjects error variance. It determines the reliability of ratings by comparing the variability of different ratings of the same individuals to the total variation across all ratings and all individuals. As we only consider two raters, the human and the LLM, we consider the case $k=2$.
Analogously, modern ICC estimators derive ICC through the random effects model framework. In the random effects model, $X_{ij}$, rating $j$ on subject $i$, with $i\in[n]$, $j\in[k]$, is modeled as

$$X_{ij}=\mu+\alpha_{i}+c_{j}+\varepsilon_{ij}$$

such that $\mu$ is an unobserved overall mean, $\alpha_{i}$ is an unobserved random effect shared by all ratings on subject $i$, $c_{j}$ is an unobserved random effect shared by all subject ratings by rater $j$, and $\varepsilon_{ij}$ is an unobserved noise term. Each class of terms is assumed to be identically distributed with expected value $0$, and the terms are assumed to be uncorrelated. For certain random effects models, either $\alpha_{i}$ or $c_{j}$ is neglected or considered fixed. We refer to Liljequist et al. \[[32](https://arxiv.org/html/2606.15029#bib.bib9)\] for a comprehensive overview of ICC definitions and derivations relating classical estimators to the random effects model.
In our specific use case, we use a two-way consistency average, i.e. $\mathrm{ICC}(3,k)$, as this formulation treats *raters* as fixed effects (i.e. $c_{j}$ is fixed), meaning the same evaluation panel assesses all LLM outputs, and estimates reliability for the average rating across $k$ evaluators rather than individual rater consistency. The numerator ($\mathrm{MS}_{R}-\mathrm{MS}_{E}$) captures the true variance between different LLM responses after removing measurement error, while the denominator represents the total variance in averaged ratings. This makes $\mathrm{ICC}(3,k)$ particularly sensitive to systematic differences in how evaluators rate different model outputs while accounting for random measurement error within the evaluation process. Under the random effects model for $\mathrm{ICC}(3,k)$, the population ICC is

$$\rho=\frac{\sigma^{2}_{\alpha}}{\sigma^{2}_{\alpha}+\sigma^{2}_{\varepsilon}/k}$$

We use the associated formula as the ICC metric for our experiments due to the appropriateness of the setting, the random effects model, and its use in previous empirical work \[[7](https://arxiv.org/html/2606.15029#bib.bib67),[17](https://arxiv.org/html/2606.15029#bib.bib77),[31](https://arxiv.org/html/2606.15029#bib.bib79)\]. See the table below for formulas reproduced from \[[32](https://arxiv.org/html/2606.15029#bib.bib9)\] for the intraclass correlation coefficient. We use $\mathrm{ICC}(3,k)$ in all of our analyses.
| Name | Notation | Rater Model | Use Case | Formula |
| --- | --- | --- | --- | --- |
| One-way single | ICC(1,1) | Random | Agreement of 1 random rater | $\frac{\mathrm{MS}_{R}-\mathrm{MS}_{E}}{\mathrm{MS}_{R}+(k-1)\mathrm{MS}_{E}}$ |
| One-way average | ICC(1,k) | Random | Agreement of average random raters | $\frac{\mathrm{MS}_{R}-\mathrm{MS}_{E}}{\mathrm{MS}_{R}}$ |
| Two-way absolute single | ICC(2,1) | Random | Absolute agreement of 1 random rater | $\frac{\mathrm{MS}_{R}-\mathrm{MS}_{E}}{\mathrm{MS}_{R}+(k-1)\mathrm{MS}_{E}+\frac{k}{n}(\mathrm{MS}_{C}-\mathrm{MS}_{E})}$ |
| Two-way absolute average | ICC(2,k) | Random | Absolute agreement of average raters | $\frac{\mathrm{MS}_{R}-\mathrm{MS}_{E}}{\mathrm{MS}_{R}+\frac{1}{n}(\mathrm{MS}_{C}-\mathrm{MS}_{E})}$ |
| Two-way consistency single | ICC(3,1) | Fixed | Consistency of 1 fixed rater | $\frac{\mathrm{MS}_{R}-\mathrm{MS}_{E}}{\mathrm{MS}_{R}+(k-1)\mathrm{MS}_{E}}$ |
| Two-way consistency average | ICC(3,k) | Fixed | Consistency of average fixed raters | $\frac{\mathrm{MS}_{R}-\mathrm{MS}_{E}}{\mathrm{MS}_{R}}$ |
| Pearson correlation | $r$ | N/A | Correlation only (not agreement) | $r=\frac{\sum(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sqrt{\sum(x_{i}-\bar{x})^{2}\sum(y_{i}-\bar{y})^{2}}}$ |

Notation:
- $\mathrm{MS}_{R}$: Mean square between targets (rows)
- $\mathrm{MS}_{C}$: Mean square between raters (columns)
- $\mathrm{MS}_{E}$: Residual mean square (error)
- $n$: Number of targets
- $k$: Number of raters
Formulas:
$$\begin{aligned}
\mathrm{MS}_{R}&=\frac{k}{n-1}\sum_{i=1}^{n}(S_{i}-\overline{X}_{\mathrm{tot}})^{2}\\
\mathrm{MS}_{C}&=\frac{n}{k-1}\sum_{j=1}^{k}(M_{j}-\overline{X}_{\mathrm{tot}})^{2}\\
\mathrm{MS}_{E}&=\frac{\sum_{i=1}^{n}\sum_{j=1}^{k}(x_{ij}-M_{j})^{2}-k\sum_{i=1}^{n}(S_{i}-\overline{X}_{\mathrm{tot}})^{2}}{(n-1)(k-1)}\\
S_{i}&=\frac{1}{k}\sum_{j=1}^{k}x_{ij}\\
M_{j}&=\frac{1}{n}\sum_{i=1}^{n}x_{ij}\\
\overline{X}_{\mathrm{tot}}&=\frac{1}{k\cdot n}\sum_{i=1}^{n}\sum_{j=1}^{k}x_{ij}
\end{aligned}$$
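As an illustration of how the mean-square formulas above combine into the two-way consistency average, the following is a minimal sketch in plain Python. The function name `icc3k` and the list-of-rows input layout are assumptions made for this example and are not taken from the project's package.

```python
from typing import Sequence


def icc3k(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(3,k) = (MS_R - MS_E) / MS_R for an n x k matrix (rows = targets, columns = raters).

    Illustrative sketch only; assumes a complete matrix with n >= 2 and k >= 2.
    """
    n = len(ratings)
    k = len(ratings[0])
    if n < 2 or k < 2:
        raise ValueError("need at least two targets and two raters")

    grand_mean = sum(sum(row) for row in ratings) / (n * k)              # X_tot
    row_means = [sum(row) / k for row in ratings]                        # S_i
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]   # M_j

    # Between-target sum of squares, k * sum_i (S_i - X_tot)^2.
    ss_rows = k * sum((s - grand_mean) ** 2 for s in row_means)
    ms_r = ss_rows / (n - 1)

    # Residual mean square: deviations from rater means minus the row term.
    ss_within_cols = sum(
        (ratings[i][j] - col_means[j]) ** 2 for i in range(n) for j in range(k)
    )
    ms_e = (ss_within_cols - ss_rows) / ((n - 1) * (k - 1))

    return (ms_r - ms_e) / ms_r


# A judge that is always one point above the human is perfectly *consistent*.
print(icc3k([[1, 2], [2, 3], [3, 4], [4, 5]]))
```

Because the consistency form subtracts rater means before measuring error, a constant offset between the two raters does not lower the score; an absolute-agreement form such as ICC(2,k) would penalize it through the $\mathrm{MS}_{C}$ term.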
##### Krippendorff's $\alpha$
Krippendorff's alpha, introduced in Krippendorff \[[29](https://arxiv.org/html/2606.15029#bib.bib2)\], is a statistical measure of reliability common in the inter-rater reliability literature. The metric operates on the same reliability matrix as the intraclass correlation coefficient, $X\in\mathbb{R}^{n\times k}$, such that $X_{ij}$ denotes the rating from rater $j$ on subject $i$. Unlike ICC, Krippendorff's alpha is not generally positioned in the context of the random effects model or the analysis of variance (ANOVA) framework. The measure is highly versatile, designed to handle a variety of data types (nominal, ordinal, interval, and ratio) and common research obstacles including missing values.
The general form of the coefficient is based on the ratio of observed disagreement amongst raters to the disagreement expected by chance,

$$\alpha=1-\frac{D_{o}}{D_{e}}$$

where $D_{o}$ is the observed disagreement and $D_{e}$ is the disagreement expected by chance. To define $D_{o}$ and $D_{e}$, define a unit as the ratings of all raters for a single item $i$, and let (note that $n$ and $k$ are reused here with different meanings than in the ICC definitions):
- $n$ be the total number of values assigned to all units.
- $m_{i}$ be the number of coders who responded to subject $i$.
- $v,k$ represent specific values (categories) in the data.
Then, the observed disagreement, $D_{o}$, is the average of the distances between all pairs of values assigned to the same units. It is defined as:

$$D_{o}=\frac{1}{n}\sum_{i}\frac{1}{m_{i}-1}\sum_{v}\sum_{k}n_{iv}n_{ik}\delta^{2}_{vk}$$

where
- $n_{iv}$ and $n_{ik}$ are the number of times values $v$ and $k$ were assigned to unit $i$.
- $\delta^{2}_{vk}$ is a distance function (metric) that depends on the level of measurement (nominal, ordinal, interval, or ratio).
The expected disagreement, $D_{e}$, is the disagreement that would occur if the values were assigned to units completely at random, constrained only by the overall frequency of each value in the dataset:

$$D_{e}=\frac{1}{n(n-1)}\sum_{v}\sum_{k}n_{v}n_{k}\delta^{2}_{vk}$$

where
- $n_{v}$ and $n_{k}$ are the total number of times values $v$ and $k$ were used across all units.
Finally, we turn to the distance functions ($\delta^{2}_{vk}$). The power of $\alpha$ lies in the metric $\delta^{2}_{vk}$, which accounts for the magnitude of disagreement:
- Nominal: $\delta^{2}_{vk}=\begin{cases}0&\text{if }v=k\\ 1&\text{if }v\neq k\end{cases}$
- Ordinal: $\delta^{2}_{vk}=\left(\sum_{g=v}^{k}n_{g}-\frac{n_{v}+n_{k}}{2}\right)^{2}$
- Interval: $\delta^{2}_{vk}=(v-k)^{2}$
- Ratio: $\delta^{2}_{vk}=\left(\frac{v-k}{v+k}\right)^{2}$
Substituting $D_{o}$ and $D_{e}$ back into the original formula:

$$\alpha=1-(n-1)\frac{\sum_{i}\frac{1}{m_{i}-1}\sum_{v}\sum_{k}n_{iv}n_{ik}\delta^{2}_{vk}}{\sum_{v}\sum_{k}n_{v}n_{k}\delta^{2}_{vk}}$$

This formula ensures that when agreement is perfect, $D_{o}=0$ and $\alpha=1$. When disagreement matches chance, $D_{o}=D_{e}$ and $\alpha=0$.
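The definitions of $D_{o}$ and $D_{e}$ above can be sketched directly for the interval distance $\delta^{2}_{vk}=(v-k)^{2}$. The sketch below is illustrative only: the function name `krippendorff_alpha_interval` is hypothetical, missing ratings are represented as `None`, and units with fewer than two ratings are dropped before counting $n$. That last choice is an assumption of this sketch, made so that the $m_{i}-1$ denominator is never zero.

```python
from typing import List, Optional, Sequence


def krippendorff_alpha_interval(units: Sequence[Sequence[Optional[float]]]) -> float:
    """Krippendorff's alpha with the interval distance (v - k)^2.

    `units[i]` holds the ratings given to item i; None marks a missing rating.
    Illustrative sketch; assumes the pooled values are not all identical (D_e > 0).
    """
    # Keep only observed values, and only units with at least two of them.
    cleaned: List[List[float]] = [[x for x in u if x is not None] for u in units]
    cleaned = [u for u in cleaned if len(u) >= 2]

    values = [x for u in cleaned for x in u]
    n = len(values)  # total number of values across the retained units

    # Summing over all ordered pairs of ratings within a unit equals
    # sum_v sum_k n_iv * n_ik * delta^2_vk (identical values contribute 0).
    d_o = sum(
        sum((a - b) ** 2 for a in u for b in u) / (len(u) - 1) for u in cleaned
    ) / n

    # The same pairwise sum over the pooled values gives sum_v sum_k n_v * n_k * delta^2_vk.
    d_e = sum((a - b) ** 2 for a in values for b in values) / (n * (n - 1))

    return 1.0 - d_o / d_e


print(krippendorff_alpha_interval([[1, 1], [2, 2], [3, 3]]))  # perfect agreement
```

The pairwise loops are quadratic in the number of values, which keeps the correspondence with the formulas obvious; a frequency-table implementation would be preferable for large datasets.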
##### Spearman $r$
Spearman's correlation coefficient, which we denote as $r$, evaluates the monotonic relationship between two rank variables. In particular, let $X$ and $Y$ be random variables with paired samples $(X_{i},Y_{i})$ and rank variables $R[X_{i}],R[Y_{i}]$, $i=1,\dots,n$. Spearman's correlation coefficient is the Pearson correlation coefficient on the *rank* variables. Unlike Pearson correlation, which measures linear relationships, Spearman correlation measures the strength and direction of the monotonic association between two ranked variables. The equation representing Spearman's rank correlation is shown below:

$$r=\frac{\mathrm{Cov}[R[X],R[Y]]}{\sigma_{R[X]}\sigma_{R[Y]}}$$

If there are no ties in ranks, the coefficient can be calculated as

$$r=1-\frac{6\sum d_{i}^{2}}{n(n^{2}-1)}$$

where $d_{i}=R[X_{i}]-R[Y_{i}]$, $i=1,\dots,n$.
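To make the relationship between the two expressions concrete, the sketch below computes Spearman's coefficient both as the Pearson correlation of ranks and through the tie-free shortcut based on $d_{i}$. All function names are hypothetical, and assigning average ranks to tied values is an assumption of this sketch (a common convention), not something specified in the text above.

```python
import math
from typing import List, Sequence


def average_ranks(xs: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """General definition: Pearson correlation of the rank variables."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_no_ties(x: Sequence[float], y: Sequence[float]) -> float:
    """Shortcut 1 - 6 * sum(d_i^2) / (n (n^2 - 1)); valid only without ties."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n * n - 1))


human = [10, 20, 30, 40, 50]
judge = [1, 3, 2, 5, 4]
print(spearman(human, judge), spearman_no_ties(human, judge))
```

With tied ratings, which are common on Likert scales, only the rank-Pearson form remains valid, so `spearman` is the safer default.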
##### Kendall's $\tau$
Kendall's tau rank correlation measures the ordinal association between two variables based on the similarity of the orderings of the data. Like Spearman's $r$, it is a rank correlation measure and operates on ranks of pairs of random variables, $(X_{i},Y_{i})$, $i=1,\dots,n$. Unlike Spearman's $r$, Kendall's $\tau$ counts concordant and discordant pairs (of pairs) to calculate rank correlation. A pair of pairs $(X_{i},Y_{i})$, $(X_{j},Y_{j})$ with $i<j$ is *concordant* if the sort order on $X_{i},X_{j}$ matches the sort order on $Y_{i},Y_{j}$, i.e. if $\mathrm{sign}\big(R[X_{i}]-R[X_{j}]\big)=\mathrm{sign}\big(R[Y_{i}]-R[Y_{j}]\big)$. Otherwise, the pair is discordant. Then $\tau$ is calculated as

$$\tau=\frac{n_{C}-n_{D}}{\frac{1}{2}n(n-1)}\tag{3}$$

where $n_{C}$ is the number of concordant pairs and $n_{D}$ is the number of discordant pairs. The value ranges from $-1$ (perfect inversion) to $+1$ (perfect agreement).
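The pair-counting definition above translates into a short quadratic-time sketch. The function name `kendall_tau` is hypothetical. As a labeled assumption of this sketch, a pair that is tied in either variable is counted as neither concordant nor discordant, which differs slightly from the wording above for tied data.

```python
from typing import Sequence


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """tau = (n_C - n_D) / (n (n - 1) / 2), counting pairs i < j.

    Illustrative sketch; tied pairs contribute to neither count (an assumption).
    """
    n = len(x)
    n_concordant = 0
    n_discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # The sign of a value difference equals the sign of the rank difference.
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                n_concordant += 1
            elif s < 0:
                n_discordant += 1
    return (n_concordant - n_discordant) / (0.5 * n * (n - 1))


# Two adjacent swaps out of ten pairs: 8 concordant, 2 discordant.
print(kendall_tau([1, 2, 3, 4, 5], [1, 3, 2, 5, 4]))
```

Working on raw values rather than explicit ranks is sufficient here because only the sign of each difference is used.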
## Appendix B Dataset Details
The axes of evaluation for each dataset and number of raters per data point are detailed in Table [3](https://arxiv.org/html/2606.15029#A2.T3).

Table 3: Datasets, evaluation axes, and number of raters per datapoint.

| Dataset | Axes of Evaluation | Raters per datapoint |
| --- | --- | --- |
| SummEval | Coherence, Consistency, Fluency, Relevance | 8 |
| HANNA | Relevance, Coherence, Empathy, Surprise, Engagement, Complexity | 3 |
| MedVAL | Safety | 1–3 |
| MSLR | Fluency, Population, Intervention, Outcome | 1–2 |
## Appendix C Rank Estimation Error Derivation
The performance of metric matching relies on the human-model estimation error rank of $S^{*}$, which is, by definition, the subset with rank $1$ in the inter-model estimation error. We develop the intuition more formally in this section.
As in the main paper, consider candidate subsets $S_{1},\dots,S_{C}\sim\mathcal{X}$. Let $\epsilon_{j}=|\widehat{\rho}_{S_{j}}-\rho|$ be the human-model estimation error for subset $S_{j}$, $j=1,\dots,C$. Similarly, let $\epsilon_{j}^{\text{IM}}=|\widehat{\rho}^{\text{IM}}_{S_{j}}-\rho^{\text{IM}}|$ be the inter-model estimation error. As a starting point, consider the best-case scenario, in which the rank of each subset with respect to human-model estimation error exactly matches its rank with respect to inter-model estimation error. For now, we elide the probabilistic component and simply assume that this holds for all subsets $S\subseteq\mathcal{X}$ of size $b$. Formally, $R[\epsilon_{j}]=R[\epsilon^{\text{IM}}_{j}]$. This ensures that when we choose the subset $S^{*}$ with the smallest difference $|\widehat{\rho}_{S^{*}}^{\text{IM}}-\rho^{\text{IM}}|$, we choose the optimal subset w.r.t. the human-model metric as well, i.e.

$$\begin{aligned}
S^{*}&=\arg\min_{S\subseteq\mathcal{X},|S|=b}|\widehat{\rho}_{S}^{\text{IM}}-\rho^{\text{IM}}|\\
&\equiv\arg\min_{S\subseteq\mathcal{X},|S|=b}|\widehat{\rho}_{S}-\rho|
\end{aligned}$$

This is an extremely strong condition and is likely never satisfied in practice. Furthermore, we do not even need it to hold in order for metric matching to select the optimal subset with respect to the human-model metric. We solely require that the *top* rank matches, i.e. that $R[\epsilon_{j}]=R[\epsilon^{\text{IM}}_{j}]$ for any $j=1,\dots,C$ such that the rank equals $1$. In this case, metric matching will always choose the optimal subset (over the candidate subsets $S_{1},\dots,S_{C}$, not over all subsets).
Again, in practice, this is unlikely to occur but it gives some intuition for how to understand when metric matching should work \(and in particular, when metric matching should beat random selection\)\. On a high\-level, when the rank relationship between human\-model estimation error and inter\-model estimation error with respect to subsets of sizebbis as close as possible to that condition, we can expect to perform well\. We formalize this more below by comparison to random\.
In particular, consider theexpectedmicro win\-rate of our method\. This depends both on the expectation of estimation errors ofCCrandom subsets, and the estimation error of aC\+1C\+1th subset, selected by random selection\. We provide a brief derivation that shows: when the expected rankR\[ϵj∗\]R\[\\epsilon\_\{j^\{\*\}\}\]of the human\-model estimation error of thebestsubset \(rank11\) w\.r\.t\. the inter\-model estimation is less thanC\+12\\frac\{C\+1\}\{2\}, we beat random the majority of the time\. Furthermore, our expected micro win\-rate is linear in the expected rankR\[ϵj∗\]R\[\\epsilon\_\{j^\{\*\}\}\]\.
###### Lemma 1\.
Given the notation introduced in this section, the expected micro win rate is

$$\mathbb{E}\big[\mathbb{1}[\epsilon_{j^*} < \epsilon^{\text{rand}}]\big] = \Pr[\epsilon_{j^*} < \epsilon^{\text{rand}}] = 1 - \frac{\mathbb{E}[R[\epsilon_{j^*}]]}{C+1}$$

This admits a straightforward derivation. Given budget $b$ and randomly sampled candidate subsets $S_1, \dots, S_C$ from which $S^* = S_{j^*}$ is selected by the condition that $R[\epsilon^{\text{IM}}_{j^*}] = 1$, consider the rank of subset $j^*$ w.r.t. the human-model estimation error over subsets $S_1, \dots, S_C$ and the subset selected by random selection $S_r$. Since all subsets are selected i.i.d. from the same distribution over $\mathcal{X}^b$, we can concretely analyze the chance that $S_r$ has a better rank than $S_{j^*}$ in the human-model estimation error.

$$\begin{aligned}
\Pr[\epsilon_{j^*} < \epsilon^{\text{rand}}] &= \sum_{j=1}^{C} \Pr[R[\epsilon^{\text{rand}}] > j, R[\epsilon_{j^*}] = j] \\
&= \sum_{j=1}^{C} \Pr[R[\epsilon^{\text{rand}}] > j \mid R[\epsilon_{j^*}] = j] \cdot \Pr[R[\epsilon_{j^*}] = j] \\
&= \sum_{j=1}^{C} \Pr[R[\epsilon^{\text{rand}}] > j] \cdot \Pr[R[\epsilon_{j^*}] = j] \\
&= \sum_{j=1}^{C} \Big(1 - \frac{j}{C+1}\Big) \Pr[R[\epsilon_{j^*}] = j] \\
&= 1 - \frac{\mathbb{E}[R[\epsilon_{j^*}]]}{C+1}
\end{aligned}$$

The third to fourth line in the above derivation follows from the formula for the random variable $\epsilon^{\text{rand}}$ having greater than rank $j$ out of $C+1$ variables (estimation errors $\epsilon_j$ on each of $S_1, \dots, S_C$ and $S_r$).
###### Condition 1 (Expected rank condition).
$$\mathbb{E}\big[R[\epsilon_{j^*}]\big] < \frac{C+1}{2}$$
###### Theorem 2\.
If the expected rank condition (Condition [1](https://arxiv.org/html/2606.15029#Thmcondition1)) is satisfied, then the expected micro win-rate over random selection is greater than $0.5$, i.e. metric matching beats random the majority of the time.
This follows directly from the lemma\. We include empirical results validating the theory in Figures[4](https://arxiv.org/html/2606.15029#A3.F4)and[5](https://arxiv.org/html/2606.15029#A3.F5)\.
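The last two steps of the lemma's derivation are pure arithmetic, so they can be checked exactly. The sketch below is illustrative only: the function names and the example rank distribution are hypothetical, and the rank of $\epsilon_{j^*}$ is read as its position among the $C$ candidate subsets. It evaluates the sum $\sum_{j}(1 - \frac{j}{C+1})\Pr[R[\epsilon_{j^*}] = j]$ and the closed form $1 - \frac{\mathbb{E}[R[\epsilon_{j^*}]]}{C+1}$ with exact fractions.

```python
from fractions import Fraction


def win_rate_from_rank_pmf(rank_pmf):
    """Sum over ranks j = 1..C of (1 - j/(C+1)) * Pr[R = j]."""
    C = len(rank_pmf)
    return sum(
        (1 - Fraction(j, C + 1)) * p for j, p in enumerate(rank_pmf, start=1)
    )


def win_rate_from_expected_rank(rank_pmf):
    """Closed form from the lemma: 1 - E[R] / (C + 1)."""
    C = len(rank_pmf)
    expected_rank = sum(j * p for j, p in enumerate(rank_pmf, start=1))
    return 1 - expected_rank / (C + 1)


# Hypothetical rank distribution for C = 4 candidates, skewed toward rank 1.
skewed = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
# Uniform ranks: the selected subset is no better than a random candidate.
uniform = [Fraction(1, 4)] * 4

assert win_rate_from_rank_pmf(skewed) == win_rate_from_expected_rank(skewed)
assert win_rate_from_rank_pmf(uniform) == Fraction(1, 2)
```

With uniform ranks the expected rank is exactly $\frac{C+1}{2}$ and the win rate is $0.5$, which is the boundary case of Condition 1. Any distribution with a smaller expected rank pushes the win rate above $0.5$.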
Figure 4: Empirical micro win-rate versus empirical average human-model estimation error rank percentile, $\frac{\sum_{t=1}^{N} R[\epsilon_{j^{*(t)}}]}{C+1}$, over $N = 40$ trials in metric-matched $\alpha$ estimation. As expected, the relationship is linear across budgets.

Figure 5: Observed micro win-rate versus observed average human-model estimation error rank percentile, $\frac{\sum_{t=1}^{N} R[\epsilon_{j^{*(t)}}]}{C+1}$, in metric-matched ICC estimation. As expected, the relationship is linear across budgets.
## Appendix D Additional Baseline Information and Comparison
We present win rates for stratified sampling as well as random sampling with bias correction\. For stratified sampling, the average macro win rate across metrics for estimation error is 0\.794 and across threshold classification is 0\.637\.
Table 4: Stratified Metric Matching Performance: Estimation Error and Threshold Classification

| Budget | Macro $\alpha$ | Macro ICC | Macro $\rho$ | Macro $\tau$ | Micro $\alpha$ | Micro ICC | Micro $\rho$ | Micro $\tau$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5 | 0.827 | 0.933 | 0.947 | 0.867 | 0.563 | 0.624 | 0.597 | 0.595 |
| 10 | 0.760 | 0.960 | 0.947 | 0.933 | 0.546 | 0.609 | 0.602 | 0.588 |
| 15 | 0.733 | 0.907 | 0.853 | 0.880 | 0.542 | 0.595 | 0.585 | 0.569 |
| 20 | 0.720 | 0.880 | 0.867 | 0.813 | 0.515 | 0.560 | 0.582 | 0.565 |
| 25 | 0.693 | 0.907 | 0.813 | 0.840 | 0.514 | 0.553 | 0.582 | 0.575 |
| 30 | 0.733 | 0.813 | 0.800 | 0.787 | 0.515 | 0.561 | 0.579 | 0.563 |
| 35 | 0.707 | 0.787 | 0.827 | 0.787 | 0.517 | 0.543 | 0.576 | 0.544 |
| 40 | 0.680 | 0.747 | 0.720 | 0.707 | 0.506 | 0.529 | 0.565 | 0.534 |
| 45 | 0.640 | 0.667 | 0.747 | 0.707 | 0.528 | 0.519 | 0.565 | 0.539 |
| 50 | 0.667 | 0.653 | 0.773 | 0.733 | 0.513 | 0.520 | 0.575 | 0.547 |
| Avg | 0.716 | 0.825 | 0.829 | 0.805 | 0.526 | 0.561 | 0.581 | 0.562 |

Macro and Micro columns report estimation error win rates.

Table 5: Stratified Metric Matching Performance: Classification Win Rate (Macro and Micro)

| Budget | Macro $\alpha$ | Macro ICC | Macro $\rho$ | Macro $\tau$ | Micro $\alpha$ | Micro ICC | Micro $\rho$ | Micro $\tau$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5 | 0.688 | 0.585 | 0.784 | 0.824 | 0.575 | 0.554 | 0.629 | 0.644 |
| 10 | 0.655 | 0.611 | 0.719 | 0.717 | 0.545 | 0.575 | 0.611 | 0.628 |
| 15 | 0.665 | 0.623 | 0.701 | 0.676 | 0.557 | 0.585 | 0.597 | 0.591 |
| 20 | 0.571 | 0.625 | 0.703 | 0.619 | 0.534 | 0.564 | 0.603 | 0.554 |
| 25 | 0.676 | 0.611 | 0.685 | 0.738 | 0.562 | 0.571 | 0.598 | 0.595 |
| 30 | 0.598 | 0.620 | 0.713 | 0.697 | 0.547 | 0.573 | 0.613 | 0.575 |
| 35 | 0.623 | 0.625 | 0.679 | 0.603 | 0.530 | 0.571 | 0.595 | 0.554 |
| 40 | 0.527 | 0.655 | 0.672 | 0.423 | 0.493 | 0.574 | 0.573 | 0.469 |
| 45 | 0.532 | 0.630 | 0.604 | 0.653 | 0.510 | 0.574 | 0.556 | 0.591 |
| 50 | 0.484 | 0.616 | 0.577 | 0.479 | 0.483 | 0.564 | 0.559 | 0.486 |
| Avg | 0.602 | 0.620 | 0.684 | 0.643 | 0.534 | 0.570 | 0.593 | 0.569 |

For random selection with bias correction, the average macro win rate across metrics for estimation error is 0.863 and across threshold classification is 0.392.
Table 6: Random with Bias Correction Metric Matching Performance: Estimation Error (Macro and Micro)

| Budget | Macro $\alpha$ | Macro ICC | Macro $\rho$ | Macro $\tau$ | Micro $\alpha$ | Micro ICC | Micro $\rho$ | Micro $\tau$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5 | 0.880 | 0.960 | 0.853 | 0.867 | 0.540 | 0.595 | 0.547 | 0.562 |
| 10 | 0.787 | 1.000 | 0.947 | 0.987 | 0.537 | 0.569 | 0.572 | 0.591 |
| 15 | 0.773 | 0.973 | 0.947 | 0.933 | 0.540 | 0.579 | 0.572 | 0.575 |
| 20 | 0.707 | 0.907 | 0.973 | 0.947 | 0.532 | 0.554 | 0.583 | 0.588 |
| 25 | 0.720 | 0.880 | 0.933 | 0.960 | 0.545 | 0.558 | 0.590 | 0.589 |
| 30 | 0.733 | 0.840 | 0.920 | 0.867 | 0.532 | 0.546 | 0.598 | 0.569 |
| 35 | 0.733 | 0.813 | 0.920 | 0.827 | 0.543 | 0.536 | 0.613 | 0.581 |
| 40 | 0.747 | 0.827 | 0.947 | 0.867 | 0.548 | 0.539 | 0.602 | 0.559 |
| 45 | 0.760 | 0.800 | 0.947 | 0.813 | 0.563 | 0.551 | 0.622 | 0.569 |
| 50 | 0.733 | 0.733 | 0.947 | 0.827 | 0.569 | 0.540 | 0.631 | 0.579 |
| Avg | 0.757 | 0.873 | 0.933 | 0.889 | 0.545 | 0.557 | 0.593 | 0.576 |

Table 7: Random with Bias Correction Metric Matching Performance: Thresholding (Macro and Micro)

| Budget | Macro $\alpha$ | Macro ICC | Macro $\rho$ | Macro $\tau$ | Micro $\alpha$ | Micro ICC | Micro $\rho$ | Micro $\tau$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5 | 0.398 | 0.537 | 0.363 | 0.431 | 0.443 | 0.531 | 0.429 | 0.463 |
| 10 | 0.289 | 0.484 | 0.387 | 0.611 | 0.372 | 0.526 | 0.443 | 0.542 |
| 15 | 0.270 | 0.505 | 0.309 | 0.548 | 0.361 | 0.533 | 0.385 | 0.511 |
| 20 | 0.246 | 0.506 | 0.285 | 0.455 | 0.310 | 0.522 | 0.374 | 0.482 |
| 25 | 0.204 | 0.496 | 0.306 | 0.614 | 0.318 | 0.533 | 0.343 | 0.533 |
| 30 | 0.165 | 0.505 | 0.336 | 0.521 | 0.275 | 0.527 | 0.382 | 0.522 |
| 35 | 0.165 | 0.518 | 0.290 | 0.569 | 0.275 | 0.534 | 0.354 | 0.551 |
| 40 | 0.141 | 0.520 | 0.342 | 0.419 | 0.235 | 0.538 | 0.369 | 0.517 |
| 45 | 0.158 | 0.539 | 0.271 | 0.504 | 0.213 | 0.557 | 0.327 | 0.565 |
| 50 | 0.167 | 0.507 | 0.259 | 0.540 | 0.227 | 0.542 | 0.367 | 0.526 |
| Avg | 0.220 | 0.512 | 0.315 | 0.521 | 0.303 | 0.534 | 0.377 | 0.521 |
## Appendix E Ablation Studies
### E.1 Average Aggregation
We show that the way we aggregate model information impacts the performance of our algorithm. Given a model suite $\mathcal{M} = \{M_1, M_2, \ldots, M_n\}$ and a target model $T \in \mathcal{M}$, we present an alternative aggregation approach in Figure [6](https://arxiv.org/html/2606.15029#A5.F6)b. We calculate the target metric by $\hat{\rho} = f\left(T, \frac{1}{|\mathcal{M}|} \sum_{M_i \in \mathcal{M}} M_i\right)$. This essentially constructs an average score function from the scorers in $\mathcal{M}$, $M^{\text{avg}}: \mathcal{X} \rightarrow \mathcal{Y}$ s.t. $M^{\text{avg}}(x) = \frac{1}{|\mathcal{M}|} \sum_{M' \in \mathcal{M}} M'(x)$. We posit that this fails to capture all necessary signal from individual judges, thus causing the alternative aggregation method that we employ in the main body to perform better.
Figure 6: Aggregation approaches impact effectiveness of Metric Match, with the optimal approach being to calculate the reliability of our target model with each ensemble member individually and average these results (a) as opposed to averaging the ensemble labels and calculating the reliability coefficient with a single "averaged" ensemble point (b).

Estimation error win rates as shown in Table [8](https://arxiv.org/html/2606.15029#A5.T8) are higher for this aggregation method, indicating that although the gains are less significant, they are equally if not more reliable.
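The two aggregation strategies contrasted in Figure 6 can be sketched as follows. This is an illustrative sketch, not the project's implementation: the function names are hypothetical, Pearson correlation stands in for the generic metric $f$, and the ensemble is passed as a plain list of score vectors (whether the target itself belongs in that list is left to the caller).

```python
import numpy as np


def pearson(a, b):
    """Stand-in for the metric f; any reliability coefficient could be used."""
    return float(np.corrcoef(a, b)[0, 1])


def pairwise_then_average(target, ensemble, metric=pearson):
    """Strategy (a): score the target against each judge, then average."""
    return float(np.mean([metric(target, judge) for judge in ensemble]))


def average_then_score(target, ensemble, metric=pearson):
    """Strategy (b): average the judges' labels, then score once."""
    averaged_labels = np.mean(np.asarray(ensemble, dtype=float), axis=0)
    return metric(target, averaged_labels)


# Toy scores (hypothetical): one target judge and two ensemble judges.
target = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ensemble = [
    np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
    np.array([2.0, 1.0, 4.0, 3.0, 5.0]),
]
print(pairwise_then_average(target, ensemble))
print(average_then_score(target, ensemble))
```

The two strategies generally give different values on the same data, because averaging labels first smooths out disagreement between individual judges before the metric is computed.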
Table 8: Average-Pairwise Metric Matching Performance: Estimation Error and Threshold Classification Macro Win Rate

| Budget | Est. Error $\alpha$ | Est. Error ICC | Est. Error $\tau$ | Est. Error $\rho$ | Threshold $\alpha$ | Threshold ICC | Threshold $\tau$ | Threshold $\rho$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5 | 0.867 | 0.760 | 0.587 | 0.507 | 0.631 | 0.625 | 0.797 | 0.671 |
| 10 | 0.920 | 0.840 | 0.760 | 0.667 | 0.623 | 0.667 | 0.739 | 0.611 |
| 15 | 0.893 | 0.787 | 0.773 | 0.733 | 0.750 | 0.653 | 0.661 | 0.791 |
| 20 | 0.867 | 0.733 | 0.693 | 0.680 | 0.644 | 0.612 | 0.571 | 0.714 |
| 25 | 0.853 | 0.733 | 0.680 | 0.707 | 0.667 | 0.585 | 0.630 | 0.774 |
| 30 | 0.773 | 0.720 | 0.733 | 0.773 | 0.795 | 0.516 | 0.730 | 0.655 |
| 35 | 0.840 | 0.680 | 0.627 | 0.733 | 0.675 | 0.477 | 0.649 | 0.623 |
| 40 | 0.760 | 0.720 | 0.640 | 0.587 | 0.694 | 0.550 | 0.630 | 0.700 |
| 45 | 0.853 | 0.747 | 0.640 | 0.653 | 0.611 | 0.567 | 0.600 | 0.660 |
| 50 | 0.880 | 0.800 | 0.720 | 0.720 | 0.581 | 0.613 | 0.556 | 0.651 |
| Avg | 0.851 | 0.752 | 0.685 | 0.676 | 0.667 | 0.587 | 0.656 | 0.685 |
### E.2 Number of Candidate Subsets $C$
Here, we vary the number of candidate subsets considered for variance matching from $C = 10$ to $C = 200$, and report the estimation errors, with the random baseline shown as the red horizontal line, in Figure [8](https://arxiv.org/html/2606.15029#A5.F8) for ICC and Figure [7](https://arxiv.org/html/2606.15029#A5.F7) for alpha estimation. We note that these values would correspond to macro-average win-rate and thus do not represent the results we would expect to see with micro-average win-rate. We do not observe an interpretable trend across datasets or budgets between macro-average and number of candidate subsets. Additionally, due to computational constraints, we only study up to $200$ candidate subsets, which may be well below the number required to observe any relationship (recall that the total number of possible subsets is $\binom{|\mathcal{X}|}{b}$).
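To give a sense of scale for the closing remark, the sketch below compares a candidate count against the total number of size-$b$ subsets. The function names and the pool size in the usage line are hypothetical; the datasets' actual sizes are not restated here.

```python
import math


def num_possible_subsets(pool_size: int, budget: int) -> int:
    """Total number of size-`budget` subsets of a pool: C(|X|, b)."""
    return math.comb(pool_size, budget)


def fraction_searched(pool_size: int, budget: int, num_candidates: int) -> float:
    """Upper bound on the share of subsets covered by the sampled candidates.

    It is an upper bound because independently sampled candidates may repeat.
    """
    return num_candidates / num_possible_subsets(pool_size, budget)


# Hypothetical pool of 1,000 samples with a budget of 50 and C = 200 candidates.
print(fraction_searched(1000, 50, 200))
```

Even for modest pools the binomial coefficient grows so quickly that a few hundred candidates cover a vanishing fraction of the search space, which is consistent with the caveat above.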
Figure 7: Ablation study over number of candidate subsets considered: average estimation error in $\alpha$ metric matching at each budget for each dataset for runs of metric matching with $C = 10$ up to $C = 200$ candidate subsets with $N = 40$ trials each. Panels: (a) HANNA, (b) MedVAL, (c) MSLR, (d) SummEval.

Figure 8: Ablation study over number of candidate subsets considered: average estimation error in ICC metric matching at each budget for each dataset for runs of metric matching with $C = 10$ up to $C = 200$ candidate subsets with $N = 40$ trials each. Panels: (a) HANNA, (b) MedVAL, (c) MSLR, (d) SummEval.
## Appendix F Variance Matching
We present an alternative algorithm as an extension to metric matching called Variance Matching. We find that Variance Matching performs worse than metric matching on average, but in cases where metric matching performs very poorly, variance matching is more robust to these outlying cases.
Algorithm 2: Weighted Variance Matching

Data: Metric function $T$; MSB function `MSB`, MSE function `MSE`; weights $\alpha, \beta$; budget $b$; generated text data $X = \{x_i\}_{i=1}^{n} \in \mathcal{X}^n$; scores $\{y_i^{(M)}\}_{i=1}^{n}$, $\{y_i^{(M')}\}_{i=1}^{n}$ from LLM judges $M$ and $M'$, respectively; $C$, number of candidate subsets over which to search

Result: Subset $S^* \subseteq X$, $\lvert S^* \rvert = b$

1. Initialize $\delta_{\min} \leftarrow \infty$
2. Initialize $S^* \leftarrow \emptyset$
3. $\rho^{\text{IM}} \leftarrow \alpha \cdot \texttt{MSB}\big(\{y^{(M)}_i\}_{i=1}^{n}, \{y^{(M')}_i\}_{i=1}^{n}\big) + \beta \cdot \texttt{MSE}\big(\{y^{(M)}_i\}_{i=1}^{n}, \{y^{(M')}_i\}_{i=1}^{n}\big)$
4. for $j = 1, \dots, C$ do
5. &nbsp;&nbsp;&nbsp;&nbsp;Sample $S_j = \{x_{i_1}, \dots, x_{i_b}\}$ uniformly from $X$ without replacement
6. &nbsp;&nbsp;&nbsp;&nbsp;$\widehat{\rho}_{S_j}^{\text{IM}} \leftarrow \alpha \cdot \texttt{MSB}\big(\{y^{(M)}_{i_k}\}_{k=1}^{b}, \{y^{(M')}_{i_k}\}_{k=1}^{b}\big) + \beta \cdot \texttt{MSE}\big(\{y^{(M)}_{i_k}\}_{k=1}^{b}, \{y^{(M')}_{i_k}\}_{k=1}^{b}\big)$
7. &nbsp;&nbsp;&nbsp;&nbsp;$\delta_j = \lvert \widehat{\rho}_{S_j}^{\text{IM}} - \rho^{\text{IM}} \rvert$
8. &nbsp;&nbsp;&nbsp;&nbsp;if $\delta_j < \delta_{\min}$ then
9. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$S^* \leftarrow S_j$
10. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\delta_{\min} \leftarrow \delta_j$
11. return $S^*$

We find that the optimal hyperparameters for this mixture are $\alpha = 0.9$ and $\beta = 0.1$. We highlight results of this algorithm in Table [9](https://arxiv.org/html/2606.15029#A6.T9).
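The search loop of Algorithm 2 can be sketched in Python as below. This is a minimal sketch under stated assumptions rather than the released package's API: the function name and signature are hypothetical, and `msb` and `mse` are supplied by the caller. The `msb_placeholder` used in the usage lines is a stand-in only, since the MSB function is not defined in this appendix.

```python
import numpy as np


def weighted_variance_matching(y_m, y_mp, budget, num_candidates,
                               msb, mse, alpha=0.9, beta=0.1, seed=0):
    """Return indices of the candidate subset whose weighted MSB/MSE score
    is closest to the score computed on the full population."""
    rng = np.random.default_rng(seed)
    y_m = np.asarray(y_m, dtype=float)
    y_mp = np.asarray(y_mp, dtype=float)
    n = len(y_m)

    def score(idx):
        return alpha * msb(y_m[idx], y_mp[idx]) + beta * mse(y_m[idx], y_mp[idx])

    rho_im = score(np.arange(n))          # step 3: population-level value
    best_idx, delta_min = None, np.inf    # steps 1-2
    for _ in range(num_candidates):       # step 4
        # Step 5: uniform sample of `budget` items without replacement.
        idx = rng.choice(n, size=budget, replace=False)
        delta = abs(score(idx) - rho_im)  # steps 6-7
        if delta < delta_min:             # steps 8-10
            best_idx, delta_min = idx, delta
    return best_idx, delta_min


def mse_fn(a, b):
    """Mean squared error between two judges' scores."""
    return float(np.mean((a - b) ** 2))


def msb_placeholder(a, b):
    # ASSUMPTION: placeholder only; substitute the paper's MSB definition.
    return float((np.mean(a) - np.mean(b)) ** 2)


# Synthetic judge scores (hypothetical data) to exercise the loop.
data_rng = np.random.default_rng(1)
scores_m = data_rng.integers(1, 6, size=200)
scores_mp = data_rng.integers(1, 6, size=200)
subset, gap = weighted_variance_matching(
    scores_m, scores_mp, budget=20, num_candidates=50,
    msb=msb_placeholder, mse=mse_fn)
```

The returned indices identify which samples to send for human annotation. The defaults `alpha=0.9` and `beta=0.1` mirror the hyperparameters reported above.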
Table 9: Variance Matching Performance: Estimation Error and Threshold Classification

| Budget | Est. Error $\alpha$ | Est. Error ICC | Est. Error $\tau$ | Est. Error $\rho$ | Threshold $\alpha$ | Threshold ICC | Threshold $\tau$ | Threshold $\rho$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5 | 0.867 | 0.840 | 0.760 | 0.800 | 0.655 | 0.505 | 0.585 | 0.572 |
| 10 | 0.920 | 0.947 | 0.880 | 0.880 | 0.728 | 0.539 | 0.710 | 0.653 |
| 15 | 0.973 | 0.947 | 0.773 | 0.827 | 0.667 | 0.548 | 0.587 | 0.627 |
| 20 | 0.933 | 0.907 | 0.827 | 0.840 | 0.506 | 0.496 | 0.333 | 0.511 |
| 25 | 0.867 | 0.867 | 0.773 | 0.787 | 0.574 | 0.609 | 0.500 | 0.640 |
| 30 | 0.880 | 0.947 | 0.800 | 0.840 | 0.620 | 0.583 | 0.720 | 0.640 |
| 35 | 0.933 | 0.907 | 0.787 | 0.800 | 0.563 | 0.642 | 0.464 | 0.488 |
| 40 | 0.760 | 0.827 | 0.707 | 0.733 | 0.466 | 0.539 | 0.452 | 0.449 |
| 45 | 0.907 | 0.893 | 0.773 | 0.800 | 0.433 | 0.586 | 0.461 | 0.504 |
| 50 | 0.853 | 0.867 | 0.733 | 0.760 | 0.679 | 0.522 | 0.558 | 0.442 |
| Avg | 0.889 | 0.895 | 0.781 | 0.807 | 0.589 | 0.557 | 0.537 | 0.552 |

All columns report win rates.

## Appendix G Mean Squared Error
We introduce results with non-correlation-based metrics, such as Mean Squared Error (MSE). We show that variance matching as opposed to Metric Match on the MSE metric provides larger gains, although Metric Match still outperforms random for estimation error and threshold classification at the macro level for small budgets.
Table 10: MSE Win Rates by Method: Macro and Micro Win Rates. MM = Metric Matching, VM = Variance Matching.

| Budget | Est. Error MM Macro | Est. Error MM Micro | Est. Error VM Macro | Est. Error VM Micro | Threshold MM Macro | Threshold MM Micro | Threshold VM Macro | Threshold VM Micro |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5 | 0.827 | 0.475 | 0.733 | 0.481 | 0.568 | 0.481 | 0.512 | 0.511 |
| 10 | 0.720 | 0.492 | 0.813 | 0.517 | 0.524 | 0.469 | 0.550 | 0.531 |
| 15 | 0.693 | 0.486 | 0.800 | 0.517 | 0.565 | 0.448 | 0.676 | 0.520 |
| 20 | 0.587 | 0.469 | 0.733 | 0.503 | 0.462 | 0.445 | 0.567 | 0.508 |
| 25 | 0.613 | 0.459 | 0.800 | 0.512 | 0.415 | 0.439 | 0.550 | 0.505 |
| 30 | 0.587 | 0.450 | 0.760 | 0.500 | 0.398 | 0.434 | 0.547 | 0.495 |
| 35 | 0.613 | 0.447 | 0.827 | 0.522 | 0.387 | 0.442 | 0.498 | 0.508 |
| 40 | 0.573 | 0.449 | 0.640 | 0.502 | 0.392 | 0.435 | 0.512 | 0.481 |
| 45 | 0.520 | 0.442 | 0.680 | 0.499 | 0.378 | 0.504 | 0.454 | 0.497 |
| 50 | 0.587 | 0.446 | 0.627 | 0.492 | 0.419 | 0.450 | 0.467 | 0.485 |
| Avg | 0.632 | 0.462 | 0.741 | 0.504 | 0.451 | 0.455 | 0.533 | 0.505 |

## Appendix H Results by Dataset Axis
Here we present disaggregated results at the dataset level in order to highlight the robustness of this method across datasets, metrics, and budgets. We highlight that poor performance for certain (metric, dataset) pairs is empirically driven by low true reliability parameters, indicating that Metric Match may be more susceptible than variance matching when the judge is misaligned from human constructs.










Figure 9: Annotation savings for both Metric Match as well as associated Variance Match selection methods.

We also highlight the true reliability coefficient value per (dataset, axis) pair in Table [11](https://arxiv.org/html/2606.15029#A8.T11), along with the estimation error (Table [12](https://arxiv.org/html/2606.15029#A8.T12)) and reliability classification (Table [13](https://arxiv.org/html/2606.15029#A8.T13)) win rate between Metric Match and random selection at the individual dataset level.
| Dataset | Axis | Alpha | ICC | Rho | Tau | MSE |
| --- | --- | --- | --- | --- | --- | --- |
| hanna | Coherence | -0.230 | 0.679 | 0.456 | 0.391 | 4.275 |
| hanna | Complexity | 0.181 | 0.605 | 0.332 | 0.277 | 1.923 |
| hanna | Empathy | 0.194 | 0.526 | 0.302 | 0.256 | 1.688 |
| hanna | Engagement | 0.025 | 0.671 | 0.404 | 0.344 | 2.592 |
| hanna | Relevance | 0.404 | 0.755 | 0.603 | 0.516 | 1.700 |
| hanna | Surprise | 0.128 | 0.246 | 0.197 | 0.160 | 3.103 |
| medval | Risk | 0.655 | 0.812 | 0.706 | 0.629 | 1.196 |
| mslr | fluency | 0.625 | 0.779 | 0.655 | 0.643 | 0.169 |
| mslr | intervention | 0.541 | 0.700 | 0.538 | 0.487 | 0.556 |
| mslr | outcome | -0.113 | -0.005 | 0.000 | -0.000 | 1.430 |
| mslr | population | 0.377 | 0.545 | 0.372 | 0.341 | 0.642 |
| summeval | coherence | 0.644 | 0.787 | 0.668 | 0.548 | 1.143 |
| summeval | consistency | 0.654 | 0.812 | 0.762 | 0.718 | 0.987 |
| summeval | fluency | -0.263 | 0.522 | 0.420 | 0.365 | 4.222 |
| summeval | relevance | 0.527 | 0.774 | 0.576 | 0.481 | 0.951 |

Table 11: True population values of disaggregated datasets across metrics, averaged over our model suite.

| Dataset | Axis | Alpha | ICC | Rho | Tau | MSE |
| --- | --- | --- | --- | --- | --- | --- |
| hanna | Coherence | 0.00 | 0.76 | 0.94 | 0.94 | 0.00 |
| hanna | Complexity | 1.00 | 1.00 | 1.00 | 0.94 | 0.80 |
| hanna | Empathy | 1.00 | 0.90 | 0.72 | 0.66 | 0.78 |
| hanna | Engagement | 0.82 | 0.92 | 0.90 | 0.84 | 0.24 |
| hanna | Relevance | 0.70 | 0.98 | 0.96 | 0.94 | 0.48 |
| hanna | Surprise | 0.38 | 0.68 | 0.76 | 0.70 | 0.66 |
| medval | Risk | 0.98 | 1.00 | 1.00 | 0.94 | 0.84 |
| mslr | fluency | 1.00 | 1.00 | 0.98 | 1.00 | 0.96 |
| mslr | intervention | 0.90 | 1.00 | 0.94 | 0.94 | 0.90 |
| mslr | outcome | 0.10 | 0.20 | 0.18 | 0.30 | 0.14 |
| mslr | population | 0.96 | 1.00 | 0.96 | 0.94 | 1.00 |
| summeval | coherence | 1.00 | 0.96 | 0.94 | 0.84 | 0.92 |
| summeval | consistency | 0.96 | 1.00 | 0.98 | 0.98 | 0.94 |
| summeval | fluency | 0.08 | 0.98 | 0.94 | 0.94 | 0.00 |
| summeval | relevance | 0.98 | 1.00 | 0.98 | 0.94 | 0.82 |

Table 12: Disaggregated results for estimation error win rate of Metric Match compared to random selection.

| Dataset | Axis | Alpha | ICC | Rho | Tau | MSE |
| --- | --- | --- | --- | --- | --- | --- |
| hanna | Coherence | 0.00 | 0.29 | 0.83 | 0.93 | 0.00 |
| hanna | Complexity | 0.85 | 0.47 | 0.70 | 0.81 | 0.70 |
| hanna | Empathy | 0.49 | 0.19 | 0.50 | 0.51 | 0.56 |
| hanna | Engagement | 0.56 | 0.56 | 0.71 | 0.69 | 0.17 |
| hanna | Relevance | 0.59 | 0.63 | 0.89 | 0.88 | 0.10 |
| hanna | Surprise | 0.49 | 0.43 | 0.70 | 0.76 | 0.64 |
| medval | Risk | 0.67 | 0.91 | 0.65 | 0.76 | 0.40 |
| mslr | fluency | 0.77 | 0.67 | 0.87 | 0.84 | 1.00 |
| mslr | intervention | 0.72 | 0.64 | 0.80 | 0.82 | 0.66 |
| mslr | outcome | 0.36 | 0.06 | 0.38 | 0.44 | 0.00 |
| mslr | population | 0.85 | 0.84 | 0.86 | 0.94 | 0.73 |
| summeval | coherence | 0.82 | 0.74 | 0.73 | 0.75 | 0.42 |
| summeval | consistency | 0.38 | 0.61 | 0.63 | 0.67 | 0.55 |
| summeval | fluency | — | 0.26 | 0.80 | 0.83 | 0.00 |
| summeval | relevance | 0.73 | 0.60 | 0.64 | 0.71 | 0.68 |

Table 13: Disaggregated results for reliability classification win rate of Metric Match compared to random selection.
## Appendix I Target and Ensemble Ablations
### I.1 Small models
Here we present the same analysis but with a smaller, less powerful suite of models to show generalizability. Models include GPT-4o-mini [[38](https://arxiv.org/html/2606.15029#bib.bib103)], Meta-Llama-3.1-8B-Instruct [[21](https://arxiv.org/html/2606.15029#bib.bib104)], Google-gemma-3-1b-it [[46](https://arxiv.org/html/2606.15029#bib.bib105)], and Qwen2.5-7B-Instruct [[6](https://arxiv.org/html/2606.15029#bib.bib106)].
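| Budget | Est. Error Alpha | Est. Error ICC | Est. Error Rho | Est. Error Tau | Est. Error MSE | Thresholding Alpha | Thresholding ICC | Thresholding Rho | Thresholding Tau | Thresholding MSE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5 | 0.92 | 0.90 | 0.91 | 0.95 | 0.85 | 0.74 | 0.48 | 0.77 | 0.81 | 0.68 |
| 10 | 0.80 | 0.92 | 0.81 | 0.84 | 0.72 | 0.92 | 0.59 | 0.89 | 0.91 | 0.79 |
| 15 | 0.80 | 0.88 | 0.82 | 0.84 | 0.68 | 0.84 | 0.62 | 0.94 | 0.96 | 0.79 |
| 20 | 0.78 | 0.85 | 0.83 | 0.81 | 0.62 | 0.88 | 0.61 | 0.83 | 0.96 | 0.66 |
| 25 | 0.78 | 0.83 | 0.88 | 0.86 | 0.62 | 1.00 | 0.58 | 0.95 | 1.00 | 0.80 |
| 30 | 0.78 | 0.85 | 0.86 | 0.84 | 0.53 | 0.92 | 0.85 | 0.92 | 1.00 | 0.73 |
| 35 | 0.82 | 0.87 | 0.84 | 0.79 | 0.55 | 1.00 | 0.77 | 0.88 | 1.00 | 0.81 |
| 40 | 0.73 | 0.92 | 0.84 | 0.79 | 0.50 | 1.00 | 0.76 | 0.94 | 1.00 | 0.76 |
| 45 | 0.70 | 0.90 | 0.78 | 0.79 | 0.47 | 1.00 | 0.84 | 0.67 | 1.00 | 0.73 |
| 50 | 0.73 | 0.85 | 0.74 | 0.74 | 0.45 | 1.00 | 0.75 | 1.00 | — | 0.81 |
| Average | 0.78 | 0.88 | 0.83 | 0.83 | 0.60 | 0.93 | 0.69 | 0.88 | 0.96 | 0.76 |

Table 14: Estimation Error and Thresholding by Budget w/ small ensemble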



Figure 10: Small models show similar performance gains across metrics and budgets. We report results from a suite of small models to highlight that this approach generalizes across model scale.
### I.2 Single Model Ensemble Ablations
We provide additional ablations that change the ensemble providing the synthetic labels used to select points for annotation. We see that the largest gains over random are present when an ensemble of models is used as opposed to individual models, which are more prone to reliance on the correlation of any individual model with human feedback for a given dataset. Ensembling in this setting helps improve robustness. Further, we identify that the target models must be roughly aligned capability-wise with their ensemble, and having an ensemble that is miscalibrated to the target results in worse performance.
Table 15: Percent Relative Improvement over Random by Method and Metric. These results highlight that single model ensembles don't generalize as well as multi-model ensembles.

| Metric | Strong Ens. + Strong Target | Claude-Only | GPT-5-Only | Weak Ens. + Strong Target | Weak Ens. + Weak Target |
| --- | --- | --- | --- | --- | --- |
| ICC | 26.65% | 15.71% | 23.56% | -34.49% | 33.09% |
| Alpha | 9.05% | -2.88% | 0.80% | -46.87% | 21.56% |
| Rho | 20.95% | 9.61% | 13.57% | -19.26% | 21.67% |
| Tau | 17.93% | 5.74% | 11.12% | -18.36% | 20.11% |