Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?
Summary
This paper proposes a two-stage sampling design where LLM evaluations are used to augment, rather than replace, human ratings, and provides guidance on determining sample sizes for human and LLM reviews using a doubly robust estimator from missing data literature.
# Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?
Source: [https://arxiv.org/html/2605.16354](https://arxiv.org/html/2605.16354)
Jane Paik Kim Department of Psychiatry and Behavioral Sciences Stanford University Stanford, CA 94304 janepkim@stanford\.edu
###### Abstract
Large language models \(LLMs\) are increasingly used as automated evaluators of AI systems, including in high\-stakes applications\. In this role, LLMs are used to generate judgments about the quality, appropriateness, or even safety of model outputs\. This approach is motivated by practical constraints\. Expert human ratings are costly and difficult to scale, whereas LLM ratings can be produced quickly at low cost\. However, current approaches to deploying LLM evaluators are ad hoc, typically limited to reporting agreement metrics between human and LLM judges as a justification for substitution of human ratings, and lack a formal basis for study design\. This paper \(1\) shifts the role of the LLM judge from substitutive to auxiliary, and \(2\) formulates the LLM\-as\-a\-judge paradigm as one of augmenting human evaluation through a two\-stage sampling design, where LLM evaluations are measured for all observations at the first stage and human ratings are partially observed for a subsample at the second stage\. We propose to use a doubly robust estimator from the missing data literature, which takes advantage of the robustness property against the prediction model, since the missingness model is known by design\. Using the asymptotic variance of this estimator, we propose how sample sizes of human and LLM ratings can be determined to achieve a targeted level of power\. We also show that a study can be efficiently designed by allocating more human ratings for types of evaluations where the predictability of LLM ratings is not high\. To the best of our knowledge, there is very little guidance on how much human oversight should be retained when validating benchmarks\.
## 1 Introduction
LLMs are increasingly evaluated on specialized tasks such as clinical summarization, diagnostic interpretation, and patient-facing communication (Bean et al. ([2026](https://arxiv.org/html/2605.16354#bib.bib1)); Croxford et al. ([2025](https://arxiv.org/html/2605.16354#bib.bib4)); Kumar et al. ([2026](https://arxiv.org/html/2605.16354#bib.bib8))). Evaluating the adequacy of LLM-generated outputs requires domain expertise, as adequacy depends on accuracy, context-sensitive interpretation, and appropriateness. However, expert review is expensive and slow to obtain. To address this bottleneck, researchers have adopted LLM-as-judge approaches, in which LLM outputs are evaluated by other LLMs rather than by human experts (Gu et al. ([2024](https://arxiv.org/html/2605.16354#bib.bib12))). These approaches scale at negligible cost but introduce a new problem.
A critical problem is that these approaches rely solely on LLM ratings after ad\-hoc justification\. In practice, this substitution is justified through agreement\-based validation\. In a common approach, expert ratings are collected on a convenience sample of benchmark instances and compared to LLM\-generated ratings using agreement metrics\. If agreement meets a chosen threshold, the LLM is treated as a valid replacement, and subsequent evaluation relies on LLM ratings alone \(Liet al\.\([2024](https://arxiv.org/html/2605.16354#bib.bib9)\)\)\. A similar pattern appears in automated evaluator pipelines, where expert\-labeled test sets are used to validate a trained scoring model by demonstrating agreement with human raters, after which the model is deployed without incorporating expert labels\. In both cases, the role of human oversight is solely to gauge the performance of the LLM judge\. Once validation is complete, the human ratings are discarded and not used\. This strategy may be adequate for low\-stakes benchmarking, but in clinical quality monitoring, mental health safety auditing, and regulatory assessment, undetected errors carry meaningful consequences\.
There are several limitations to the ad hoc approach of using agreement as the primary justification for replacing human review. First, high agreement does not establish that LLM judges are rating the same constructs as humans (Chehbouni et al. ([2025](https://arxiv.org/html/2605.16354#bib.bib2))). Second, agreement is typically heterogeneous across item types, content domains, or evaluation dimensions. Kumar et al. ([2026](https://arxiv.org/html/2605.16354#bib.bib8)) demonstrated this by showing that expert-LLM agreement, measured by weighted kappa, ranged from 0.17 to 0.86 across 21 evaluation dimensions of empathic communication. Third, validation sample sizes are not typically justified.
#### Addressing the gap in rigor\.
This paper shifts the dominant framing in LLM evaluation from replacement to augmentation. We propose using LLM-generated ratings as auxiliary data to complement a carefully designed subsample of human evaluations. We propose framing the LLM-as-a-judge paradigm as a two-stage sampling problem (Zhao and Lipsitz ([1992](https://arxiv.org/html/2605.16354#bib.bib18))), where inexpensive ratings can be measured for the whole sample and expensive ratings can be measured for only a subset of the whole sample. Under this design, expensive human ratings are partially observed due to cost. To handle incomplete data, one can either use auxiliary data to predict and impute missing values, or weight the observed data by the inverse of the response probability. In typical missing data problems, the validity of these methods depends on the correct specification of the prediction model or the response probability model. The doubly robust (DR) estimator combines the two and requires only that *either* model be correct for the inference to be valid (Robins and Rotnitzky ([1994](https://arxiv.org/html/2605.16354#bib.bib11))). However, in our case, the response probability is dictated by design and is known, and thus the validity of inferences based on the DR estimator is guaranteed.
#### Our contributions\.
A critical direct consequence of this framework is the allowance of prospective designs for LLM\-as\-a\-judge evaluations with formal design components\. The form of the asymptotic variance of the DR estimator allows us to determine how many human and LLM samples are needed\. This reframing shifts the emphasis from post\-hoc comparison to prospective study design, and provides the sample size formulas and allocation guidelines\. It enables evaluation studies to be designed with the same inferential rigor applied to clinical trials\. To the best of our knowledge, there is very little guidance on how much human oversight should be retained when validating benchmarks\.
Our primary contributions are (i) to frame the LLM-as-a-judge paradigm as a two-stage design, treating human expert ratings as the primary quantity of inferential interest and LLM-generated ratings as auxiliary measurements that are inexpensive and scalable to obtain, and to apply missing data methodology to this design, (ii) to provide a sample size calculation for LLM and human ratings given how well LLM ratings predict human ratings, and (iii) to provide an efficient allocation strategy for designing an LLM-as-a-judge study.
## 2 Methods
### 2.1 Two-stage sampling
Our framework first assumes that human ratings are the gold standard of some well-defined construct and are of primary inferential interest, and that LLM ratings are available as auxiliary data. We propose a two-stage sampling design in which, at the first stage, LLM ratings are available for all evaluative units. At the second stage, human ratings are available for only a subset of the first-stage units. The role of LLM ratings is to serve as auxiliary data that supplement an incomplete set of human evaluations. This is in contrast with the widely adopted approach of using human ratings to justify the use of LLM ratings and then discarding them in the main analysis. In a two-stage design, missing data methodology is used to handle incomplete human ratings (Figure [1](https://arxiv.org/html/2605.16354#S2.F1)). One distinctive feature of two-stage sampling is that the missingness mechanism is known and determined by design.
Figure 1: Proposed formulation of a two-stage design. At the first stage, LLM ratings are collected for all evaluation units. At the second stage, human ratings are collected for a subsample of units, and the remaining units have no observed human rating. The outlined box in green represents the target of inferential interest, the full distribution of human ratings across all units, including incomplete observations.
### 2.2 The estimator
To motivate the estimator, let us start with some simple intuition. One way to handle missing data is to build a prediction model for human ratings using LLM ratings as a predictor, and fill in the predicted values in place of missing data. This is known as a prediction approach (Kim and Shao ([2021](https://arxiv.org/html/2605.16354#bib.bib7))), and its validity depends on the correct specification of the prediction model. Another way to handle missing data is to use the observed data only but weight the observations by the inverse of the probability of observation or response. This approach is called inverse probability weighting, or propensity score methods, first proposed by Horvitz and Thompson ([1952](https://arxiv.org/html/2605.16354#bib.bib13)), and its validity depends on the correct specification of the response model. The prediction and response models are nuisance models that are not needed if we do not have missing data. The propensity score approach is known to be inefficient (Hahn ([1998](https://arxiv.org/html/2605.16354#bib.bib5)); Heckman et al. ([1997](https://arxiv.org/html/2605.16354#bib.bib6))), but can be made efficient by constructing estimating equations with the subtraction of an unbiased term that renders the resulting estimating function orthogonal to the nuisance model estimating function (Kim and Shao ([2021](https://arxiv.org/html/2605.16354#bib.bib7))). The term that is subtracted involves the predictor of the outcome. We refer to the resulting estimating approach as the doubly robust approach.
A doubly robust estimator is a weighted average of both estimators but its validity depends on the correctness of either the prediction or response probability model\. As mentioned earlier, the incompleteness of human ratings is a consequence of design so that the probability of observation or response is known and will always be correctly specified\. That is, the DR estimator will guarantee that the parameter estimate for the expert human rating population is still valid even if the prediction model is incorrect\.
Let $N$ denote the total number of items requiring evaluation. Let $\delta_i = 1$ if a human response is observed and $\delta_i = 0$ otherwise. Let $Y_i$ be the human rating, $X_i$ the predictors of $Y_i$, including the LLM ratings, and $\pi_i = P(\delta_i = 1 \mid X_i)$ the sampling probabilities. Let $\sum_{i=1}^{N} \delta_i = n$. LLM ratings are observed for all $N$ items, and human ratings are observed only for the $n$ reviewed items. The response probabilities $\pi_i$ are controlled by the investigator.
Let $U(\theta; X, Y)$ be any estimating function. A simple example is $U(\theta; X_i, Y_i) = Y_i - \theta$.
Let
$$W(\theta) = \sum_{i=1}^{N} \left[ \frac{\delta_i}{\pi_i} U(\theta; X_i, Y_i) \right] - \sum_{i=1}^{N} \left( \frac{\delta_i}{\pi_i} - 1 \right) E\left[ U(\theta; X_i, Y_i) \mid X_i \right] \tag{1}$$
The solution of $W(\theta) = 0$ gives the DR estimator. The underlying population parameter $\theta$ is estimated using a doubly robust estimator (Robins and Rotnitzky ([1994](https://arxiv.org/html/2605.16354#bib.bib11)); Kim and Shao ([2021](https://arxiv.org/html/2605.16354#bib.bib7))). When we substitute LLM ratings for human ratings, we have $E(Y_i \mid X_i) = X_i$. Equation (1) shows that for $\delta_i = 0$, the LLM rating replaces the human rating. When $\delta_i = 1$, a human rating is used but is adjusted by a weighted residual of human and LLM ratings.
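For the mean, where $U(\theta; X_i, Y_i) = Y_i - \theta$, setting $W(\theta) = 0$ in equation (1) and solving for $\theta$ gives $\hat{\theta} = \frac{1}{N}\sum_{i=1}^{N}\left[m(X_i) + \frac{\delta_i}{\pi_i}\{Y_i - m(X_i)\}\right]$, where $m(X_i)$ denotes the working prediction of $E(Y_i \mid X_i)$. The following is a minimal Python sketch of that special case. It assumes the LLM rating itself is used as the prediction; the function name `dr_mean` and the toy data are illustrative and do not come from the source.

```python
from typing import Optional, Sequence


def dr_mean(
    llm: Sequence[float],
    human: Sequence[Optional[float]],
    pi: Sequence[float],
) -> float:
    """Doubly robust estimate of the mean human rating.

    llm[i]   : working prediction m(X_i); here simply the LLM rating (assumption).
    human[i] : human rating Y_i, or None when item i was not sampled (delta_i = 0).
    pi[i]    : known, design-based probability that item i receives a human review.
    """
    n_items = len(llm)
    total = 0.0
    for m_i, y_i, pi_i in zip(llm, human, pi):
        total += m_i  # the prediction contributes for every item
        if y_i is not None:
            total += (y_i - m_i) / pi_i  # inverse-probability-weighted residual
    return total / n_items


# Toy data (hypothetical): 6 items, 3 of them human-reviewed with pi = 0.5.
llm_ratings = [4.0, 3.0, 5.0, 2.0, 4.0, 3.0]
human_ratings = [3.0, None, 5.0, None, None, 2.0]
probs = [0.5] * 6
estimate = dr_mean(llm_ratings, human_ratings, probs)
print(round(estimate, 4))
```

Because the residual term has mean zero whenever the $\pi_i$ are correct, a poor prediction $m(X_i)$ costs precision but does not bias the estimate.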
### 2.3 Variance of the Doubly Robust Estimator
The variance of the doubly robust estimator is shown in (2) and depends on two sources of uncertainty. The first term of the variance represents the variance as if all $N$ values are from human ratings. The second is the penalty for using LLM ratings in place of human ratings, and depends on two quantities: the residual error, measuring how well the LLM predicts human ratings, and the response or observation probability $\pi_i$.
The variance of the DR estimator is given by the following:
$$V\left[W(\theta)\right] = V\left[\frac{1}{N}\sum_{i=1}^{N} U(\theta; X_i, Y_i)\right] + E\left[\frac{1}{N^{2}}\sum_{i=1}^{N}\left(\frac{1}{\pi_i} - 1\right)\left\{U(\theta; X_i, Y_i) - E[U(\theta; X_i, Y_i) \mid X_i]\right\}^{2}\right] \tag{2}$$
This variance formula will be the basis for the sample size calculations\.
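A quick simulation can make the two variance components concrete. The sketch below checks the constant-$\pi$ special case for the mean, $\frac{1}{N}\left[\sigma^{2} + \frac{1-\pi}{\pi}\sigma_e^{2}\right]$, which Section 3.1 writes as equation (3). The data-generating model (a Gaussian LLM rating plus independent Gaussian noise, with Bernoulli sampling for human review) and all parameter values are assumptions chosen only for illustration.

```python
import random
import statistics


def simulate_dr_variance(n_items=400, pi=0.25, sd_x=1.0, sd_e=0.5, reps=1000, seed=0):
    """Empirical variance of the DR mean under Bernoulli(pi) sampling.

    Assumed toy model: X ~ N(0, sd_x^2) is the LLM rating and
    Y = X + e with e ~ N(0, sd_e^2) is the human rating, so E[Y | X] = X.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        total = 0.0
        for _ in range(n_items):
            x = rng.gauss(0.0, sd_x)
            y = x + rng.gauss(0.0, sd_e)
            total += x  # prediction term, available for every item
            if rng.random() < pi:  # delta_i = 1: human review observed
                total += (y - x) / pi
        estimates.append(total / n_items)
    return statistics.variance(estimates)


def formula_variance(n_items=400, pi=0.25, sd_x=1.0, sd_e=0.5):
    """(1/N) * [sigma^2 + (1 - pi)/pi * sigma_e^2] for the same toy model."""
    sigma2 = sd_x**2 + sd_e**2  # marginal variance of Y
    return (sigma2 + (1 - pi) / pi * sd_e**2) / n_items


empirical = simulate_dr_variance()
theoretical = formula_variance()
print(f"empirical={empirical:.5f}  formula={theoretical:.5f}")
```

The two numbers should agree up to Monte Carlo error; increasing `reps` tightens the comparison at the cost of run time.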
## 3 Main Results
### 3.1 Sample Size Calculations
Let us begin from the point where the usual sample size calculation is completed, assuming that human ratings are obtainable for all units of assessment (Cohen ([2013](https://arxiv.org/html/2605.16354#bib.bib3))). Let that sample size be $n^{*}$. It is assumed that we need $n^{*}$ human rating samples to achieve either a desired power or a pre-specified level of precision. Consider the simple case of estimating the mean of $Y$, where the variance of $Y$ is $\sigma^{2}$. Then the target variance of the sample mean is $\sigma^{2}/n^{*}$.
In practice, obtaining the target sample size $n^{*}$ of human ratings may not be feasible. Our goal is to find a combination of a human sample of size $n \leq n^{*}$, an LLM sample of size $N \geq n^{*}$, and response probabilities $\pi_i$ that results in the same target variance $\sigma^{2}/n^{*}$. Sometimes, the size of the human sample may be constrained due to feasibility or a given budget. In that case, given a desired $n^{*}$ and $n$, we can find $N$. To use the sample size formula, we need some external information on the conditional variance,
$$E\left[\left\{U(\theta; X_i, Y_i) - E[U(\theta; X_i, Y_i) \mid X_i]\right\}^{2} \mid X_i\right].$$
We use the relationship
$$\frac{\sigma_e^{2}}{\sigma^{2}} = 1 - \rho^{2}$$
where
$$\sigma_e^{2} = E\left[\left\{U(\theta; X_i, Y_i) - E[U(\theta; X_i, Y_i) \mid X_i]\right\}^{2} \mid X_i\right], \qquad \sigma^{2} = E\left[\left\{U(\theta; X_i, Y_i) - E[U(\theta; X_i, Y_i)]\right\}^{2}\right],$$
and
$$\rho^{2} = \mathrm{corr}\left(U(\theta; X_i, Y_i),\ \hat{E}[U(\theta; X_i, Y_i) \mid X_i]\right)^{2}.$$
The conditional variance is related to the correlation between human and LLM ratings. We denote an estimate of the squared correlation $\rho^{2}$ by $R^{2}$.
For $U(\theta; X_i, Y_i) = Y_i - \theta$ and $\pi_i = \pi$ for all $i$, the variance equation (2) reduces to
$$\frac{\sigma^{2}}{n^{*}} \approx \frac{1}{N}\left[\sigma^{2} + \frac{1-\pi}{\pi} \cdot \sigma^{2}(1-\rho^{2})\right]. \tag{3}$$
Given $(n^{*}, R^{2})$ we can obtain multiple pairs of $(N, \pi)$, or equivalently, $(N, n)$, that satisfy (3).
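To make the trade-off between $N$ and $n$ explicit, equation (3) can be rearranged into a closed form for $n$. The algebra below is a worked step that is not written out in the source; it assumes a constant sampling probability $\pi = n/N$, treats $\rho^{2} < 1$ as known, and ignores rounding to whole items.

```latex
\begin{align*}
\frac{N}{n^{*}} &= 1 + \left(\frac{N}{n} - 1\right)(1-\rho^{2})
  && \text{multiply (3) by } N/\sigma^{2},\ \text{use } \tfrac{1-\pi}{\pi} = \tfrac{N}{n} - 1 \\
\frac{N}{n} &= \frac{N/n^{*} - \rho^{2}}{1-\rho^{2}}
  && \text{isolate } N/n \\
n &= \frac{N\, n^{*}\,(1-\rho^{2})}{N - n^{*}\rho^{2}}
  && \text{invert and multiply by } N
\end{align*}
```

Setting $\rho^{2} = 0$ in the last line returns $n = n^{*}$ for any $N$, which is the case where the LLM ratings carry no information about the human ratings.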
Figure [2](https://arxiv.org/html/2605.16354#S3.F2) presents the required human sample size, $n$, given the LLM sample size $N$, stratified by the range of $R^{2}$ for effective sample sizes $n^{*} = 50, 100, 200, 500$. The required number of human reviews decreases as the LLM sample size increases for fixed $R^{2}$, and decreases as $R^{2}$ increases for fixed $N$. The one exception is when $R^{2} = 0$, where any increase in LLM sample size does not decrease the required number of human reviews, since the LLM does not provide any predictive ability.
Consider an investigator with a target effective sample size of $n^{*} = 200$ and a human annotation budget of 100. If prior data between LLM and human ratings yields an estimate of $R^{2} = 0.70$, the investigator can work backward to select an LLM sample size $N$ based on budget constraints on the number of human reviews. With an LLM sample size of $N = 2000$, only $65$ human samples are needed. If the budget can allow for 100 human ratings, then the LLM sample size $N$ can be reduced to $400$ while still achieving the target power.
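The worked example above can be reproduced with a small calculator. This is a sketch that solves equation (3) for $n$ given $N$, and for $N$ given $n$, under a constant sampling probability $\pi = n/N$; the function names `required_human_n` and `required_llm_n` are hypothetical, and the closed forms are rearrangements of (3) rather than formulas quoted from the source.

```python
import math


def required_human_n(n_star: float, n_llm: float, r2: float) -> float:
    """Human sample size n solving equation (3) for constant pi = n / N."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must lie in [0, 1)")
    if n_llm < n_star:
        raise ValueError("the LLM sample N must be at least n*")
    return n_llm * n_star * (1.0 - r2) / (n_llm - n_star * r2)


def required_llm_n(n_star: float, n_human: float, r2: float) -> float:
    """Smallest LLM sample size N solving equation (3) for a fixed human budget n."""
    floor = n_star * (1.0 - r2)  # limiting value of n as N grows without bound
    if n_human <= floor:
        raise ValueError("n is at or below n*(1 - r2); no finite N reaches the target")
    n_llm = n_star * n_human * r2 / (n_human - floor)
    return max(n_llm, n_human)  # N can never be smaller than n


# Values from the worked example: n* = 200, R^2 = 0.70.
n_needed = required_human_n(200, 2000, 0.70)
print(math.ceil(n_needed))  # 65 human reviews when N = 2000
n_llm_min = required_llm_n(200, 100, 0.70)
print(f"{n_llm_min:.1f}")  # smallest N that works with 100 human reviews
```

Under these assumptions a budget of 100 human ratings needs an LLM sample of roughly 350, so the choice of $N = 400$ in the example leaves some slack: `required_human_n(200, 400, 0.70)` is about 92.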
Figure 2: Number of human reviews as a function of total LLM evaluations, for a given target effective sample size. The total sample is expressed as a multiple of the target. At $R^{2} = 0$, the LLM judge does not contribute and the total target sample size is equivalent to the required number of human samples.
#### Behavior of sample size.
Prediction quality $R^{2}$, rather than the total LLM sample size $N$, is the main determinant of the required number of human reviews. This follows from the form of the sample size formula. As $N$ increases, the required number of human reviews $n$ converges to a floor of $n^{*}(1-\rho^{2})$. Two implications follow. First, the benefit of increasing $N$ depends on $R^{2}$. For $n^{*} = 100$, when $R^{2} = 0.1$, doubling the LLM pool from $N = 200$ to $N = 400$ reduces the required human reviews from 95 to 92, a negligible gain. When $R^{2} = 0.8$, the same doubling reduces the requirement from 33 to 25, a 25% reduction. Second, each successive increase in $N$ produces diminishing reductions in $n$ (Figure [3](https://arxiv.org/html/2605.16354#S3.F3)). At $n^{*} = 200$ and $R^{2} = 0.7$, a two-fold increase in LLM evaluations reduces the human budget by 20%, though this reduction shrinks with each successive increase.
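The figures quoted in this paragraph can be tabulated directly. The short sketch below assumes the closed form $n = N n^{*}(1-R^{2})/(N - n^{*}R^{2})$ obtained by rearranging equation (3) with $\pi = n/N$, and rounds to the nearest whole review; the grid of $N$ values beyond 400 is an illustrative choice.

```python
def required_human_n(n_star, n_llm, r2):
    """Closed-form solution of equation (3) for n with constant pi = n / N."""
    return n_llm * n_star * (1.0 - r2) / (n_llm - n_star * r2)


n_star = 100
llm_sizes = (200, 400, 800, 1600)  # each step doubles the LLM pool
table = {}
for r2 in (0.1, 0.8):
    floor = n_star * (1.0 - r2)  # limit of n as N grows without bound
    table[r2] = [round(required_human_n(n_star, n_llm, r2)) for n_llm in llm_sizes]
    print(f"R2={r2}: n at N={llm_sizes} -> {table[r2]}; floor={floor:.0f}")
```

Each doubling of $N$ moves $n$ a smaller distance toward the floor, and the distance available to close is itself small when $R^{2}$ is low.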
Figure 3: (Left) Number of human reviews as a function of $R^{2}$, for a given target effective sample size. An increase in LLM reviews reduces human reviews for larger values of $R^{2}$. The minimum number of human samples for a given $R^{2}$, as $N$ increases to $\infty$, is given by the red dotted line. (Right) The percent reduction in the required number of human samples increases as $R^{2}$ increases, but the marginal reduction decreases as the LLM ratings scale.
### 3.2 Stratified sampling
Now we can consider the case where the response probabilities $\pi_i$ differ across units $i$. That is, in the second stage, instead of applying a single sampling probability to the whole sample, one can sample units for human review through stratified sampling. The motivation comes from Kumar et al. ([2026](https://arxiv.org/html/2605.16354#bib.bib8)), who demonstrated the heterogeneity of expert-LLM agreement across various evaluation dimensions of empathic communication. When $R^{2}$ varies across strata, it is natural to allow $\pi$ to be different. The strata for the case of Kumar et al. ([2026](https://arxiv.org/html/2605.16354#bib.bib8)) would be the various dimensions of empathic communication.
Consider the case where there are two strata, $s = 1, 2$. Let $\pi_i$ be $p_1$ if the $i$th unit belongs to $s = 1$ and $p_2$ if it belongs to $s = 2$, and let $R_i^{2}$ be $r_1^{2}$ if $s = 1$ and $r_2^{2}$ if $s = 2$. The design task is to find the stratum-specific human evaluation sample sizes $n_1$ and $n_2$, given the stratum-specific LLM sample sizes $N_1, N_2$ and the stratum-specific correlations $r_1^{2}, r_2^{2}$.
In this two-strata case, the variance formula becomes:
$$\frac{\sigma^{2}}{n^{*}} \approx \frac{1}{N}\left[\sigma^{2} + \frac{N_1}{N}\frac{1-p_1}{p_1} \cdot \sigma^{2}(1-r_1^{2}) + \frac{N_2}{N}\frac{1-p_2}{p_2} \cdot \sigma^{2}(1-r_2^{2})\right]. \tag{4}$$
Multiple pairs $(p_1, p_2)$, and thus $(n_1, n_2)$, satisfy equation (4) given $(N_1, N_2, r_1^{2}, r_2^{2})$, each providing a valid design equivalent to achieving at least $n^{*}$. Figure [4](https://arxiv.org/html/2605.16354#S3.F4) gives an example of an allocation curve, which shows that there is a minimum number of human ratings for a given setup. If the human review budget is above this minimum, the investigator can choose the optimal design. If the budget falls below it, an increase in $(N_1, N_2)$ is necessary, since $(r_1^{2}, r_2^{2})$ is usually given, and this generates a new allocation curve.
Figure 4: Valid $(p_1, p_2)$ pairs satisfying the variance equation at $n^{*} = 200$, for $R_1^{2} = 0.8$, $R_2^{2} = 0.3$, $N = 1000$. The budget line at $n_{\text{budget}} = 100$ reflects the maximum budget allowed by the investigator. The minimum-cost design is shown by the green line.
Figure [4](https://arxiv.org/html/2605.16354#S3.F4) shows the allocation curve, the set of sampling probabilities $(p_1, p_2)$ that achieve a power equivalent to $n^{*} = 200$, given $N_1 = N_2 = 500$, $r_1^{2} = 0.8$ and $r_2^{2} = 0.3$. Any point on this curve is a valid design. The blue line, the allocation curve, shows the pairs of $(p_1, p_2)$ satisfying $n^{*} = 200$. The red dotted line shows the budget constraint of $500p_1 + 500p_2 = 100$. The green dotted line shows the minimum-cost design, where the budget line is tangent to the allocation curve at $p_1 = 0.065$ and $p_2 = 0.121$, requiring at least $n_1 = 33$ and $n_2 = 61$, for a total of 94 human ratings. By contrast, the design at $p_1 = 0.3$ and $p_2 = 0.09$ also satisfies $n^{*} = 200$, but requires $n_1 = 150$ and $n_2 = 45$, a total $n$ of 195, more than double the minimal budget. The difference arises because stratum 1 has high predictive power, $r_1^{2} = 0.8$, and the minimum-cost design exploits this by allocating fewer human reviews where the LLM is most reliable.
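The tangency point described above can also be computed directly. The sketch below minimises the total number of human reviews $\sum_s N_s p_s$ subject to equation (4) using a Lagrange-multiplier argument, which gives $p_s \propto \sqrt{1 - r_s^{2}}$. That closed form is an assumption of this sketch rather than a result stated in the source, and the function name `min_cost_allocation` is hypothetical.

```python
import math
from typing import List, Sequence


def min_cost_allocation(
    n_star: float, n_llm: Sequence[float], r2: Sequence[float]
) -> List[float]:
    """Sampling probabilities p_s minimising sum_s N_s * p_s subject to equation (4).

    Closed form from a Lagrange-multiplier argument (assumption of this sketch):
    p_s is proportional to sqrt(1 - r_s^2).
    """
    total = sum(n_llm)
    weights = [n_s / total for n_s in n_llm]  # N_s / N
    resid = [1.0 - r for r in r2]  # 1 - r_s^2, unexplained share in stratum s
    # Equation (4) rearranged:
    #   sum_s w_s * resid_s / p_s = N / n* - 1 + sum_s w_s * resid_s
    rhs = total / n_star - 1.0 + sum(w * a for w, a in zip(weights, resid))
    scale = sum(w * math.sqrt(a) for w, a in zip(weights, resid)) / rhs
    probs = [scale * math.sqrt(a) for a in resid]
    if any(p > 1.0 for p in probs):
        raise ValueError("closed form gives p_s > 1; a constrained solver is needed")
    return probs


# Setup of Figure 4: n* = 200, N_1 = N_2 = 500, r_1^2 = 0.8, r_2^2 = 0.3.
p1, p2 = min_cost_allocation(200, [500, 500], [0.8, 0.3])
n1, n2 = math.ceil(500 * p1), math.ceil(500 * p2)
print(round(p1, 3), round(p2, 3), n1, n2, n1 + n2)
```

With these inputs the sketch recovers the reported design: $p_1 \approx 0.065$, $p_2 \approx 0.121$, and 33 plus 61 human ratings after rounding each stratum up.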
#### Comparison of designs\.
We next compare stratified allocation, where each stratum receives a different sampling probability, to uniform allocation, where a single sampling probability is applied across all strata. Using a two-stratum design with stratum-specific $R^{2}$ values, we examine the reduction in total human reviews achieved by allowing the sampling probabilities to differ. Figure [5](https://arxiv.org/html/2605.16354#S3.F5) shows the savings from stratification across a range of stratum configurations.
Stratified sampling yields the greatest savings when two conditions hold. First, the LLM predicts human ratings well on some dimensions but poorly on others, such that the difference in $R^{2}$ is large. When the gap is large ($R_1^{2} = 0.8$ and $R_2^{2} = 0.1$), stratification reduces the human budget by up to 12.9% relative to uniform allocation. When the gap is smaller but moderate ($R_1^{2} = 0.6$ and $R_2^{2} = 0.3$), the savings barely exceed 2%. Second, savings are greater when the stratum where the LLM is most predictive accounts for a large proportion of the total evaluations. Under uniform allocation, this stratum receives a correspondingly large share of human reviews despite contributing little additional precision, and stratification reclaims that wasted effort.
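The size of these savings can be approximated in closed form. The sketch below compares the minimum-cost stratified design with a uniform design that sets $p_1 = p_2$ in equation (4). It assumes both designs meet the variance target exactly, that all sampling probabilities stay below 1, and that sample sizes are not rounded, so its output may differ slightly from the figure. Under those assumptions the ratio of required human reviews is $\left(\sum_s w_s\sqrt{1-r_s^{2}}\right)^{2} / \sum_s w_s(1-r_s^{2})$ with $w_s = N_s/N$. The function name `stratification_savings` and the stratum shares used in the example are illustrative.

```python
import math


def stratification_savings(r2_1: float, r2_2: float, w1: float) -> float:
    """Fractional reduction in human reviews, stratified vs. uniform sampling.

    w1 is the share N_1 / N of items in stratum 1. Both designs are taken to
    satisfy equation (4) exactly with sampling probabilities below 1 (assumption).
    """
    w = (w1, 1.0 - w1)
    resid = (1.0 - r2_1, 1.0 - r2_2)  # 1 - r_s^2 per stratum
    s = sum(wi * math.sqrt(a) for wi, a in zip(w, resid))
    a_bar = sum(wi * a for wi, a in zip(w, resid))
    if a_bar == 0.0:
        return 0.0  # LLM predicts perfectly everywhere; nothing to reallocate
    return 1.0 - s * s / a_bar


large_gap = stratification_savings(0.8, 0.1, w1=0.7)  # well-predicted stratum dominates
moderate_gap = stratification_savings(0.6, 0.3, w1=0.5)
print(f"large gap: {large_gap:.1%}, moderate gap: {moderate_gap:.1%}")
```

In this approximation the savings depend only on the two $R^{2}$ values and the stratum shares, not on $N$ or $n^{*}$. The large-gap case lands near the 12.9% reported above and the moderate-gap case near 2%.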
Figure 5: Reduction in total human samples when comparing two-stratum vs. uniform sampling. The single-stratum design assumes a uniform LLM-human correlation of $R^{2}$ across all items, with a single sampling rate $\pi = n_{\text{budget}}/N$. The two-stratum design assumes heterogeneous LLM reliability, with $R_1^{2}$ in stratum 1 and $R_2^{2}$ in stratum 2. Both designs assume $N = 1000$, and the ratios of $N_1$ and $N_2$ are varied.
## 4 Discussion
This paper reformulates LLM evaluation as a statistical estimation problem in which human ratings are the primary quantity of inferential interest and LLM\-generated ratings serve as auxiliary measurements\. At present, no widely adopted methodology exists for determining how many items require human review\.
#### Related Work
The two prior works most closely related to ours examine allocation strategies under fixed annotation budgets, though neither proposes sample size formulas. Mozer et al. ([2026](https://arxiv.org/html/2605.16354#bib.bib10)) derive a stratified model-assisted estimator and variance formula. Their approach is to take the sample size as given and then allocate the given budget, focusing on variance reduction from stratification.
Our work addresses the prior step in the design process, which is to determine how large the human coding budget must be to achieve a target effective sample size. In our framework, validity comes from the design, because the investigator controls the response probabilities. Unell et al. ([2025](https://arxiv.org/html/2605.16354#bib.bib16)) derive a Chernoff bound for the sample size required to estimate the intraclass correlation coefficient (ICC) between human and LLM judges within a specified tolerance. They also show that cluster-based selection of items for human annotation reduces ICC estimation error.
Another related but distinct area of work is active learning \(Settles and Craven \([2008](https://arxiv.org/html/2605.16354#bib.bib15)\)\), which similarly recognizes that annotation is expensive and seeks to sequentially label instances based on current model uncertainty\. Our approach differs in that human labels are the quantity of inferential interest, whereas in active learning human labels are used to train a model and improve predictive performance on future instances\.
#### Limitations
The sample size formula requires an $R^{2}$ value from a pilot study. If the pilot overestimates $R^{2}$ due to small sample size, overfitting, or a pilot corpus that is not representative of the full evaluation, the resulting design will underallocate human reviews. The effective sample size achieved will be lower than the target $n^{*}$ and the precision guarantees will not hold. This risk is not unique to this framework. Conventional sample size calculations face the same sensitivity to variance estimates from pilot data. Investigators can guard against this by computing $n$ at a conservatively low $R^{2}$ or by conducting a sensitivity analysis across a plausible range.
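A sensitivity analysis of this kind can be sketched in a few lines. The example below plans $n$ at a pilot value of $R^{2}$ and then reports the effective sample size that equation (3) implies if the true $R^{2}$ is lower. The planning values and the helper names `required_human_n` and `achieved_n_star` are hypothetical, and both formulas are rearrangements of (3) under a constant $\pi = n/N$.

```python
def required_human_n(n_star, n_llm, r2):
    """Closed-form n from equation (3) with constant pi = n / N."""
    return n_llm * n_star * (1.0 - r2) / (n_llm - n_star * r2)


def achieved_n_star(n_human, n_llm, r2_true):
    """Effective sample size implied by equation (3) for a realised (n, N)."""
    return n_llm / (1.0 + (n_llm / n_human - 1.0) * (1.0 - r2_true))


n_star, n_llm, r2_pilot = 200, 2000, 0.70  # hypothetical planning values
n_planned = required_human_n(n_star, n_llm, r2_pilot)
for r2_true in (0.70, 0.60, 0.50):
    n_eff = achieved_n_star(n_planned, n_llm, r2_true)
    print(f"true R2={r2_true:.2f}: effective n* ~ {n_eff:.0f} (target {n_star})")
```

Running the same loop with `n_planned` computed at the lowest plausible $R^{2}$ shows how many extra human reviews are needed to protect the target.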
This work assumes that human ratings constitute the gold standard against which LLM predictions are evaluated\. This assumption is reasonable in settings where the construct being measured is well defined and disagreement among human raters reflects measurement noise rather than ambiguity in the construct itself\. In many clinical evaluation contexts, however, the reliability of human ratings is itself uncertain, and the designation of human judgment as the gold standard is a pragmatic choice rather than a settled fact\. The framework does not address the problem of unreliable or inconsistent human raters\.
#### Implications
An implication of this framework is that high agreement between human and LLM ratings is not a prerequisite for the LLM to be useful in evaluation. Even when $R^{2}$ is modest, the LLM reduces the required number of human reviews relative to full human evaluation, and the sample size formula quantifies the magnitude of this reduction for any level of predictive quality. The framework directs human effort to the strata where it is most needed, rather than requiring that all strata meet a uniform threshold of agreement. Efforts to improve human-LLM agreement through prompt engineering or model selection, while potentially valuable, are not well motivated without reference to a target $n^{*}$ derived from the sample size calculation.
As regulatory agencies develop standards for AI evaluation in healthcare, it is likely they will look to evidentiary standards that, while not necessarily identical, resemble gold standard practices in clinical studies. Agreement metrics will likely not suffice for this, but the specification of sample sizes, evaluation contexts, and target precision provides a path forward in closing the gap between evidence and policy. The variance expressions and sample size formulas derived here provide a basis for such standards, analogous to the power calculations required for clinical trial protocols. The pace of model development in clinical AI may seem to provide fertile ground for relaxing evidentiary standards, but we argue that the rapid pace of this evolving field demands principled validation design. Researchers who must continually monitor and re-evaluate evolving models will benefit most from prospective designs that allocate human review efficiently.
#### Future Work
The case we described refers to one human rater\. Future work can explore extending this case to multiple human raters\. The examples covered in this paper are concerned primarily with discrete, single\-turn evaluative tasks, and extending it to multi\-turn dialogues and sequential clinical interactions will require additional considerations\.
#### Conclusions
The framework presented in this paper provides the first principled basis for determining how many human labels are required to achieve a target level of inferential precision when LLM-generated ratings are combined with human ratings as auxiliary measurements. It applies wherever the evaluation target is human expert judgment and annotation is scarce. Substituting LLM ratings for human judgment may be appropriate in low-stakes settings where the construct is well-defined and the consequences of error are limited. In domains where the constructs being evaluated involve normative judgment, such as safety assessment and mental health evaluation, expert ratings remain the standard against which LLM ratings are assessed, and the need for principled human oversight is most acute.
## References
- A\. M\. Bean, R\. E\. Payne, G\. Parsons, H\. R\. Kirk, J\. Ciro, R\. Mosquera\-Gómez, S\. Hincapié M, A\. S\. Ekanayaka, L\. Tarassenko, L\. Rocher, et al\. \(2026\) Reliability of LLMs as medical assistants for the general public: a randomized preregistered study\. Nature Medicine, pp\. 1–7\. Cited by: [§1](https://arxiv.org/html/2605.16354#S1.p1.1)\.
- Neither valid nor reliable? Investigating the use of LLMs as judges\. arXiv preprint arXiv:2508\.18076\. Cited by: [§1](https://arxiv.org/html/2605.16354#S1.p3.1)\.
- J\. Cohen \(2013\) Statistical power analysis for the behavioral sciences\. Routledge\. Cited by: [§3\.1](https://arxiv.org/html/2605.16354#S3.SS1.p1.6)\.
- E\. Croxford, Y\. Gao, N\. Pellegrino, K\. Wong, G\. Wills, E\. First, M\. Schnier, K\. Burton, C\. Ebby, J\. Gorski, et al\. \(2025\) Development and validation of the provider documentation summarization quality instrument for large language models\. Journal of the American Medical Informatics Association 32\(6\), pp\. 1050–1060\. Cited by: [§1](https://arxiv.org/html/2605.16354#S1.p1.1)\.
- J\. Gu, X\. Jiang, Z\. Shi, H\. Tan, X\. Zhai, C\. Xu, W\. Li, Y\. Shen, S\. Ma, H\. Liu, et al\. \(2024\) A survey on LLM\-as\-a\-judge\. The Innovation\. Cited by: [§1](https://arxiv.org/html/2605.16354#S1.p1.1)\.
- J\. Hahn \(1998\) On the role of the propensity score in efficient semiparametric estimation of average treatment effects\. Econometrica, pp\. 315–331\. Cited by: [§2\.2](https://arxiv.org/html/2605.16354#S2.SS2.p1.1)\.
- J\. J\. Heckman, H\. Ichimura, and P\. E\. Todd \(1997\) Matching as an econometric evaluation estimator: evidence from evaluating a job training programme\. The Review of Economic Studies 64\(4\), pp\. 605–654\. Cited by: [§2\.2](https://arxiv.org/html/2605.16354#S2.SS2.p1.1)\.
- D\. G\. Horvitz and D\. J\. Thompson \(1952\) A generalization of sampling without replacement from a finite universe\. Journal of the American Statistical Association 47\(260\), pp\. 663–685\. Cited by: [§2\.2](https://arxiv.org/html/2605.16354#S2.SS2.p1.1)\.
- J\. K\. Kim and J\. Shao \(2021\) Statistical methods for handling incomplete data\. Chapman and Hall/CRC\. Cited by: [§2\.2](https://arxiv.org/html/2605.16354#S2.SS2.p1.1), [§2\.2](https://arxiv.org/html/2605.16354#S2.SS2.p7.5)\.
- A\. Kumar, N\. Poungpeth, D\. Yang, E\. Farrell, B\. L\. Lambert, and M\. Groh \(2026\) When large language models are reliable for judging empathic communication\. Nature Machine Intelligence, pp\. 1–13\. Cited by: [§1](https://arxiv.org/html/2605.16354#S1.p1.1), [§1](https://arxiv.org/html/2605.16354#S1.p3.1), [§3\.2](https://arxiv.org/html/2605.16354#S3.SS2.p1.4)\.
- H\. Li, Q\. Dong, J\. Chen, H\. Su, Y\. Zhou, Q\. Ai, Z\. Ye, and Y\. Liu \(2024\) LLMs\-as\-judges: a comprehensive survey on LLM\-based evaluation methods\. arXiv preprint arXiv:2412\.05579\. Cited by: [§1](https://arxiv.org/html/2605.16354#S1.p2.1)\.
- R\. Mozer, N\. E\. Pashley, and L\. Miratrix \(2026\) Stratified sampling for model\-assisted estimation with surrogate outcomes\. arXiv preprint arXiv:2602\.12992\. Cited by: [§4](https://arxiv.org/html/2605.16354#S4.SS0.SSS0.Px1.p1.1)\.
- J\. M\. Robins and A\. Rotnitzky \(1994\) Estimation of regression coefficients when some regressors are not always observed\. Journal of the American Statistical Association 89, pp\. 846–866\. Cited by: [§1](https://arxiv.org/html/2605.16354#S1.SS0.SSS0.Px1.p1.1), [§2\.2](https://arxiv.org/html/2605.16354#S2.SS2.p7.5)\.
- B\. Settles and M\. Craven \(2008\) An analysis of active learning strategies for sequence labeling tasks\. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pp\. 1070–1079\. Cited by: [§4](https://arxiv.org/html/2605.16354#S4.SS0.SSS0.Px1.p3.1)\.
- A\. Unell, N\. Dullerud, N\. Shah, and S\. Koyejo \(2025\) Smarter sampling for LLM judges: reliable evaluation on a budget\. In NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling\. Cited by: [§4](https://arxiv.org/html/2605.16354#S4.SS0.SSS0.Px1.p2.1)\.
- L\. Zhao and S\. Lipsitz \(1992\) Designs and analysis of two\-stage studies\. Statistics in Medicine 11\(6\), pp\. 769–782\. Cited by: [§1](https://arxiv.org/html/2605.16354#S1.SS0.SSS0.Px1.p1.1)\.