The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation

arXiv cs.CL 06/15/26, 04:00 AM Papers
Summary
This paper investigates the run-to-run reliability of LLM-as-a-Judge evaluations, finding that pairwise preferences flip 13.6% of the time on average, with significant first-position bias in GPT-4o-mini, and recommends multi-trial aggregation and position randomization.
arXiv:2606.13685v1 Announce Type: new
# The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation
Source: [https://arxiv.org/html/2606.13685](https://arxiv.org/html/2606.13685)
###### Abstract

LLM-as-a-Judge is now widely used to rank model outputs, train reward models, and populate public leaderboards, but its run-to-run reliability remains under-characterized. We study repeated identical evaluations on 29 tasks spanning 10 categories using two OpenAI judge models (GPT-4o-mini and GPT-4.1-mini), with 50 pairwise trials and 50 pointwise trials per question, supplemented by temperature and prompt-sensitivity ablations. Across judges, pairwise preferences flip on average 13.6% of the time, with 28% of questions exceeding a 20% flip rate and one question reaching 56%. GPT-4o-mini also exhibits a significant first-position bias (72% A-majority, $p = 0.024$). At the same time, mean pointwise score gaps are small (0.19–0.36 on a 10-point scale) and not statistically significant in aggregate, producing a pairwise–pointwise gap: judges frequently choose a winner even when their own scalar scores provide little evidence of a meaningful quality difference. Beyond within-judge instability, cross-judge agreement is only 76% ($\kappa = 0.51$), semantically equivalent prompt templates change majority outcomes in 25% of tested cases, and deterministic decoding reduces but does not eliminate inconsistency. A reliability-curve analysis shows that, in our dataset, 11 repeated trials are needed for a majority vote to recover the 50-trial reference verdict with 95% probability on average, rising to 15 for high-variance questions. These findings suggest that single-trial LLM judging is often too noisy for high-stakes evaluation, and that multi-trial aggregation, position randomization, and explicit uncertainty reporting should be standard practice. Because both judges are from a single provider, cross-provider replication remains an important next step.

Keywords: LLM evaluation, LLM-as-a-Judge, intra-judge consistency, position bias, pairwise-pointwise gap, evaluation reliability, intraclass correlation, benchmark design

## 1 Introduction

Large language model (LLM) judges are now central to modern evaluation pipelines. They are used to rank model outputs, approximate human preferences in benchmark construction, and serve as reward-model proxies in RLHF systems (Zheng et al., [2023](https://arxiv.org/html/2606.13685#bib.bib1); Dubois et al., [2024](https://arxiv.org/html/2606.13685#bib.bib3); Ouyang et al., [2022](https://arxiv.org/html/2606.13685#bib.bib4); Bai et al., [2022](https://arxiv.org/html/2606.13685#bib.bib5)). This shift has made automated evaluation dramatically cheaper and more scalable, but it also raises a basic measurement question: *if we ask the same judge the same question multiple times, do we get the same answer?*

That question is distinct from the better-studied issue of whether LLM judges are biased. Prior work has documented position bias, verbosity bias, and self-enhancement effects in single-trial settings (Wang et al., [2023](https://arxiv.org/html/2606.13685#bib.bib6); Zheng et al., [2023](https://arxiv.org/html/2606.13685#bib.bib1); Stureborg et al., [2024](https://arxiv.org/html/2606.13685#bib.bib8)). Our focus is complementary. We study *repeated-trial reliability*: the extent to which a fixed judge, given identical candidate responses under nominally identical conditions, returns the same verdict across runs. This matters directly for benchmark validity. If pairwise verdicts fluctuate across repeated trials, then single-trial leaderboards and paper tables can be unstable even when the underlying model outputs are held fixed.

We investigate this question on 29 tasks spanning 10 categories using two OpenAI judge models, with 50 pairwise trials and 50 pointwise trials per question, plus temperature and prompt\-sensitivity ablations\. The paper has three central messages\. First, pairwise judging is often noisier than it appears: mean flip rate is 13\.6%, and 28% of questions exceed 20% flip rate\. Second, pairwise verdicts and pointwise scores can come apart: judges often select a winner even when their own average scalar scores show little evidence of a meaningful quality gap\. Third, reliability improves predictably but nonlinearly with repeated voting: in our dataset, 11 trials are needed on average for a majority vote to recover the 50\-trial reference verdict with 95% probability\.

More broadly, we argue that LLM\-judge reliability is not a single property\. It decomposes into at least four layers: stochastic instability within a judge, systematic bias such as first\-position preference, protocol sensitivity to prompt wording and temperature, and disagreement across judge models\. Treating these as separate layers clarifies why apparently reasonable evaluation pipelines can still produce brittle conclusions\.

Our contributions are therefore: \(i\) a formal framework for separating pairwise verdicts, pointwise scores, intra\-judge consistency, and cross\-judge agreement; \(ii\) a repeated\-trial empirical study of LLM judge reliability across 29 tasks and 10 categories; \(iii\) an analysis of the pairwise–pointwise gap, showing that forced\-choice verdicts can overstate evidence for quality differences; \(iv\) a reliability\-curve analysis that translates flip\-rate estimates into concrete trial\-count recommendations; and \(v\) practical guidance for reporting uncertainty in LLM\-as\-a\-Judge evaluation\.

## 2 Related Work

### 2.1 LLM-as-a-Judge Frameworks

Zheng et al. ([2023](https://arxiv.org/html/2606.13685#bib.bib1)) introduced MT-Bench and the LLM-as-a-Judge paradigm, demonstrating that GPT-4 judgments correlate well with human preferences in aggregate while identifying position and verbosity biases. Dubois et al. ([2024](https://arxiv.org/html/2606.13685#bib.bib3)) proposed AlpacaFarm for instruction-following evaluation, showing that LLM judges can approximate human annotators at lower cost, though length bias remained a concern. Liu et al. ([2023](https://arxiv.org/html/2606.13685#bib.bib7)) proposed G-Eval, using chain-of-thought prompting and probability calibration to improve judge alignment with human judgments; they showed CoT substantially helps but did not study run-to-run variance. Zhu et al. ([2023](https://arxiv.org/html/2606.13685#bib.bib9)) proposed JudgeLM, a fine-tuned judge model optimized for consistency, and reported improved agreement scores over prompted GPT-4, although single-trial evaluation was used throughout.

### 2.2 Systematic Bias in LLM Evaluation

Wang et al. ([2023](https://arxiv.org/html/2606.13685#bib.bib6)) conducted a comprehensive study of LLM judge biases, cataloguing position bias, verbosity bias, and self-enhancement tendencies, and proposed a calibration approach (swap-augmented evaluation) to mitigate position effects. Their work tests a single trial per bias condition, not repeated sampling; our study is the first to quantify these biases at the 50-trial scale. Stureborg et al. ([2024](https://arxiv.org/html/2606.13685#bib.bib8)) showed that LLM judges are systematically influenced by superficial features including response length, formatting, and bullet-point density, findings consistent with verbosity bias, but did not measure within-judge run-to-run variance. Shankar et al. ([2024](https://arxiv.org/html/2606.13685#bib.bib10)) raised the meta-question of who validates the validators, arguing that LLM-judge frameworks require empirical calibration against human annotators on each target task, a position strongly supported by our findings.

### 2.3 Judge Reliability and Calibration

The reliability of automated evaluation metrics has been studied in the pre-LLM era: Amidei et al. ([2019](https://arxiv.org/html/2606.13685#bib.bib11)) surveyed human inter-annotator agreement for NLG evaluation tasks, reporting $\kappa = 0.3$–$0.6$ for subjective tasks, a range that contains our cross-judge $\kappa = 0.51$. Clark et al. ([2021](https://arxiv.org/html/2606.13685#bib.bib12)) showed that human evaluators of generated text exhibit substantial disagreement ($\sim$20% pairwise inconsistency), situating LLM judge inconsistency in a broader human-evaluation context.

Within the LLM-judge literature, calibration work has focused primarily on reducing *systematic* bias rather than stochastic variance. Our work is orthogonal: we study *random* variance (what changes if you re-run the same evaluation), which is irreducible by bias correction but addressable through multi-trial aggregation. Concurrent and complementary work by Shankar et al. ([2024](https://arxiv.org/html/2606.13685#bib.bib10)) and others highlights that the identity of the judge model substantially affects outcomes, consistent with our $\kappa = 0.51$ inter-judge finding.

### 2.4 Reliability Measurement in Psychometrics

Intraclass correlation (ICC) is the standard psychometric reliability coefficient for repeated-measures designs (Amidei et al., [2019](https://arxiv.org/html/2606.13685#bib.bib11)). ICC values below 0.60 are conventionally classified as “poor to moderate” reliability (Clark et al., [2021](https://arxiv.org/html/2606.13685#bib.bib12)), providing a principled interpretation framework for our ICC(2,1) estimates. The use of majority voting to aggregate stochastic classifiers is well-studied in ensemble learning; our reliability curve analysis (Section [5.9](https://arxiv.org/html/2606.13685#S5.SS9)) provides the first such analysis for LLM judge aggregation, showing that the gain from additional trials follows a concave curve with diminishing returns beyond $\sim$20 trials.

### 2.5 Positioning This Work

Our study is most directly comparable to Stureborg et al. ([2024](https://arxiv.org/html/2606.13685#bib.bib8)), who also measure LLM judge inconsistency. Key differences: (i) we use 50 trials per question (vs. $\leq 5$ in most prior work), enabling high-precision flip rate estimates with bootstrap confidence intervals; (ii) we introduce the pairwise-pointwise paradox as a distinct failure mode, showing that pairwise forced-choice amplifies non-existent quality differences; (iii) we provide a reliability curve and ICC analysis grounded in psychometric methodology; (iv) we quantify the downstream impact via a leaderboard noise budget. Our work has complementary scope to Wang et al. ([2023](https://arxiv.org/html/2606.13685#bib.bib6)) (who study systematic bias with response swapping) and extends it with stochastic variance analysis.

## 3 Formal Framework

We distinguish four related but non-identical layers of LLM-as-a-Judge behavior: (i) the *pairwise verdict* produced when a judge is forced to choose between two responses, (ii) the *pointwise score* assigned when each response is evaluated independently, (iii) the *intra-judge consistency* of repeated evaluations by the same judge under fixed conditions, and (iv) the *cross-judge agreement* between different judge models on the same items. This separation is important because instability in any one layer can undermine benchmark validity even when the others appear well-behaved.

Definition 1 (Judge Evaluation Trial). A judge evaluation trial is a stochastic mapping from an input tuple $(q, r_A, r_B, p, \theta)$ to an output, where $q$ is the prompt or question, $r_A$ and $r_B$ are candidate responses, $p$ is an evaluation prompt template, and $\theta$ denotes judge-side settings such as model choice, decoding temperature, and response order. In pairwise mode the output is $y \in \{A, B, \mathrm{tie}\}$; in pointwise mode the output is a scalar score $s \in [1, 10]$.

Definition 2 (Intra-Judge Consistency). For a fixed tuple $(q, r_A, r_B, p, \theta)$ and judge model $J$, intra-judge consistency is the stability of the output distribution across repeated trials. Perfect consistency means all repeated trials yield the same verdict (pairwise) or the same score (pointwise); lower consistency corresponds to a wider repeated-trial distribution.

Definition 3 (Flip Rate). For $N$ repeated pairwise trials with outcome counts $(n_A, n_B, n_{\mathrm{tie}})$, the flip rate is

$$\mathrm{FR} = 1 - \frac{\max(n_A, n_B, n_{\mathrm{tie}})}{N}. \tag{1}$$

This measures the fraction of trials not supporting the majority outcome. Higher flip rate indicates greater pairwise instability.
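As a minimal sketch of Definition 3, the flip rate can be computed directly from a list of per-trial verdicts. The function name `flip_rate` and the example vote counts are illustrative assumptions, not taken from the paper's code or data.

```python
from collections import Counter


def flip_rate(verdicts):
    """Fraction of trials that do not support the most frequent outcome."""
    if not verdicts:
        raise ValueError("need at least one trial")
    counts = Counter(verdicts)
    return 1.0 - max(counts.values()) / len(verdicts)


# Hypothetical 50-trial record: 43 votes for A, 5 for B, 2 ties.
trials = ["A"] * 43 + ["B"] * 5 + ["tie"] * 2
print(f"FR = {flip_rate(trials):.2f}")
```

Because ties count as a third outcome, the most frequent verdict can hold fewer than half of the trials, so this quantity can exceed 0.5.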

Definition 4 (Pairwise–Pointwise Gap). Let $\bar{s}_A$ and $\bar{s}_B$ denote mean pointwise scores across repeated trials. The pairwise–pointwise gap refers to the empirical situation in which pairwise verdicts appear decisive while the corresponding pointwise score gap $|\bar{s}_A - \bar{s}_B|$ is small or statistically indistinguishable from zero.

This framework motivates three hypotheses tested in the paper:

1. H1: Pairwise instability exceeds what pointwise score gaps alone would predict. In particular, many questions with small mean score gaps will still exhibit non-trivial pairwise flip rates.
2. H2: Position bias varies systematically across judges. Even under randomized presentation, some judges will exhibit stronger first-position preference than others.
3. H3: Consensus reliability follows a concave saturation curve. Additional trials improve majority-vote reliability quickly at first, then with diminishing returns.

## 4 Methodology

### 4.1 Evaluation Dataset

We construct a diverse evaluation set of 29 question\-response pairs spanning 10 categories: writing \(3\), reasoning \(3\), coding \(3\), knowledge \(3\), math \(3\), roleplay \(2\), extraction \(3\), ethics \(2\), instruction\-following \(3\), and hard/ambiguous tasks \(4\)\. For each question, we use two high\-quality responses from different model tiers \(GPT\-4o\-mini and GPT\-4o\) to ensure meaningful comparison targets\.

Response pairs are deliberately chosen to be competitive, with both responses high-quality but differing in style, structure, or approach. Pointwise evaluation confirms this: across both judges, Response A averages $9.3/10$ ($\sigma = 0.9$) and Response B averages $9.4/10$ ($\sigma = 0.6$), indicating that both responses are consistently rated as high-quality. This design maximizes the sensitivity of our consistency measurements; trivially different responses would yield artificially high consistency. We note that this represents a stress test: real-world evaluation often involves more diverse quality levels, and consistency may be higher for response pairs with obvious quality differences.

### 4.2 Judge Models

We evaluate two judge models from the GPT\-4 family:

- GPT-4o-mini: A cost-efficient model commonly used for large-scale evaluation
- GPT-4.1-mini: A newer variant from the GPT-4.1 family

Both models are accessed through the OpenAI API. The main experiment uses the default temperature ($t = 1.0$) to reflect real-world usage; a supplementary ablation study evaluates $t = 0$ (deterministic decoding).

### 4.3 Evaluation Protocol

Experiment 1 (Main). For each (judge, question) pair, we conduct:

1. Pairwise comparison ($\times 50$): The judge is asked “Which response is better?” with randomized A/B presentation order across trials.
2. Pointwise scoring ($\times 50$ per response): Each response is independently scored on a 1–10 scale.

This yields $29 \times 2 \times (50 + 50 + 50) = 8{,}700$ total API calls. The 50-trial design provides sufficient statistical power to distinguish genuine preferences from noise (binomial test $p < 0.05$ requires $\geq 33/50$ for significance).
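The call count and the 33-of-50 threshold can be checked by exact enumeration. The sketch below assumes a two-sided exact binomial test against a fair-coin null; the text does not state the sidedness, and the function names are illustrative.

```python
from math import comb


def two_sided_binomial_p(k, n):
    """Two-sided exact p-value for k wins out of n under a fair coin (assumed null)."""
    k = max(k, n - k)  # fold onto the upper tail; the null is symmetric
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)


def min_significant_wins(n, alpha=0.05):
    """Smallest majority count whose two-sided p-value falls below alpha."""
    for k in range((n + 1) // 2, n + 1):
        if two_sided_binomial_p(k, n) < alpha:
            return k
    return None


# 29 questions x 2 judges x (50 pairwise + 50 + 50 pointwise) calls.
total_calls = 29 * 2 * (50 + 50 + 50)
print(total_calls, min_significant_wins(50))
```

Under this two-sided assumption the enumeration reproduces the stated threshold; a one-sided test would reach significance at a lower count.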

Experiment 2 (Temperature Ablation). We repeat the pairwise comparison with $t = 0$ for 10 trials per (judge, question) pair, yielding an additional $29 \times 2 \times 10 = 580$ API calls. While $t = 0$ should theoretically produce deterministic outputs, API-level factors (batching, quantization, floating-point nondeterminism) may introduce variation.

Experiment 3 (Prompt Sensitivity). We design a second, semantically equivalent prompt template with different framing and structure, and evaluate 10 diverse questions (one per category) with 20 trials per (judge, prompt, question) combination, yielding $10 \times 2 \times 2 \times 20 = 800$ additional API calls.

### 4.4 Metrics

Flip Rate. For each question, we define the flip rate as:

$$\text{FR} = 1 - \frac{\max(n_A, n_B, n_{\text{tie}})}{N} \tag{2}$$

where $n_A$, $n_B$, $n_{\text{tie}}$ are the counts of each outcome across $N$ trials. A flip rate of 0% indicates perfect consistency; 50% indicates random behavior (i.e., the minority vote approaches the majority vote). This measures outcome uncertainty: a question with FR $= 0.14$ is one where 14% of trials would yield the non-majority verdict.

Outcome Entropy. As a complementary measure, we report the Shannon entropy of the outcome distribution:

$$H = -\sum_{o \in \{A, B, \text{tie}\}} p_o \log_2 p_o \tag{3}$$

Entropy $H = 0$ bits indicates a deterministic outcome; $H = \log_2 3 \approx 1.58$ bits indicates a uniform distribution over all three outcomes. Entropy captures outcome spread more fully than flip rate, which ignores tie frequency.
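The entropy in Eq. (3) can be sketched in a few lines. `outcome_entropy` is an illustrative name; outcomes with zero count are skipped, following the usual convention that $0 \log 0 = 0$.

```python
from collections import Counter
from math import log2


def outcome_entropy(verdicts):
    """Shannon entropy (bits) of the empirical outcome distribution."""
    if not verdicts:
        raise ValueError("need at least one trial")
    n = len(verdicts)
    # p * log2(1/p) summed over observed outcomes; unobserved outcomes add 0.
    return sum((c / n) * log2(n / c) for c in Counter(verdicts).values())


# Hypothetical record: same majority share as a 14% flip rate,
# with the minority split between B and ties.
print(round(outcome_entropy(["A"] * 43 + ["B"] * 5 + ["tie"] * 2), 3))
```

Two questions with the same flip rate can have different entropies depending on how the minority trials split between the other two outcomes.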

Position Bias Index. We measure the fraction of questions where response A (presented first) wins the majority vote:

$$\text{PBI} = \frac{|\{q : \text{majority}(q) = A\}|}{|\mathcal{Q}|} \tag{4}$$

An unbiased judge would yield PBI $\approx 0.5$. We test significance using a sign test.
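A minimal sketch of Eq. (4) and an exact sign test follows. It assumes a two-sided test against a 50/50 null and keeps every question in the denominator; how the paper treats questions whose majority is a tie is not stated, so that handling is an assumption. The 21-of-29 count is the GPT-4o-mini figure reported in Section 5.2.

```python
from math import comb


def position_bias_index(majorities):
    """Share of questions whose majority verdict is 'A' (first-presented)."""
    return sum(m == "A" for m in majorities) / len(majorities)


def sign_test_p(n_first, n_questions):
    """Two-sided exact sign test p-value against a 50/50 null (assumed)."""
    k = max(n_first, n_questions - n_first)
    tail = sum(comb(n_questions, i) for i in range(k, n_questions + 1))
    return min(1.0, 2 * tail / 2 ** n_questions)


# 21 of 29 question-level majorities favour the first-presented response.
majorities = ["A"] * 21 + ["B"] * 8
print(f"PBI = {position_bias_index(majorities):.2f}, "
      f"p = {sign_test_p(21, 29):.3f}")
```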

Pairwise-Pointwise Gap. For each question, we compute:

$$\text{PPG} = |\bar{s}_A - \bar{s}_B| \tag{5}$$

where $\bar{s}_A$ and $\bar{s}_B$ are mean pointwise scores across 50 trials. We test whether aggregate pointwise scores differ using the Wilcoxon signed-rank test.

Intraclass Correlation. We compute ICC(2,1) (two-way random effects, absolute agreement, single measures) from the 50 repeated pointwise scores per response per judge. ICC(2,1) treats each trial as a “rater” and each question-response pair as a “subject,” measuring the proportion of total variance attributable to genuine quality differences between subjects versus stochastic noise. ICC $< 0.60$ is conventionally classified as poor to moderate reliability.
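For readers unfamiliar with ICC(2,1), the sketch below computes it from a subjects-by-raters table using the standard two-way ANOVA mean squares (the Shrout and Fleiss form). The function name and the small score table are illustrative assumptions, not the paper's data or code; the function divides by zero if every score is identical.

```python
def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single measures.

    `scores` is a list of rows, one per subject (question-response pair),
    each holding one score per rater (here: per repeated trial).
    """
    n, k = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # between-subject mean square
    msc = ss_cols / (k - 1)              # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical table: 4 responses, each scored in 3 repeated trials.
table = [[9, 9, 8], [10, 9, 9], [7, 8, 8], [9, 10, 9]]
print(round(icc_2_1(table), 3))
```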

Cross-Judge Agreement. We report both raw agreement percentage and Cohen’s $\kappa$ to account for chance agreement.
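Cohen's $\kappa$ compares observed agreement with the agreement expected from each judge's marginal label frequencies. A minimal sketch, with illustrative names and made-up labels:

```python
from collections import Counter


def cohens_kappa(labels_1, labels_2):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_1) != len(labels_2) or not labels_1:
        raise ValueError("label lists must be non-empty and equal length")
    n = len(labels_1)
    p_obs = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    c1, c2 = Counter(labels_1), Counter(labels_2)
    # Chance agreement from the product of the two marginal distributions.
    p_exp = sum(c1[c] * c2[c] for c in c1.keys() | c2.keys()) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical question-level majority verdicts from two judges.
judge_1 = ["A", "A", "B", "B"]
judge_2 = ["A", "B", "B", "B"]
print(cohens_kappa(judge_1, judge_2))
```

The function is undefined (division by zero) when both judges give the same single label to every item, since chance agreement is then 1.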

Reliability Curve. We simulate using only $K$ randomly sampled trials (Monte Carlo, 500 repetitions per $K$) and compute $P(\text{majority}(K \text{ trials}) = \text{majority}(50 \text{ trials}))$ for $K = 1, \ldots, 50$. This characterizes how quickly the majority verdict stabilizes, and provides a principled basis for trial count recommendations.
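The simulation described above can be sketched as follows. Two details are assumptions because the text does not specify them: subsamples are drawn without replacement, and a subsample whose top count is shared is treated as having no majority, so it counts as a miss whenever the reference has a clear majority. All names are illustrative.

```python
import random
from collections import Counter


def majority(verdicts):
    """Most frequent outcome, or None when the top count is shared."""
    ranked = Counter(verdicts).most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def reliability_curve(verdicts, ks, reps=500, seed=0):
    """Estimate P(majority of K sampled trials == majority of all trials)."""
    rng = random.Random(seed)
    reference = majority(verdicts)
    curve = {}
    for k in ks:
        hits = sum(
            majority(rng.sample(verdicts, k)) == reference
            for _ in range(reps)
        )
        curve[k] = hits / reps
    return curve


# Hypothetical question with a 30% flip rate over 50 trials.
trials = ["A"] * 35 + ["B"] * 15
curve = reliability_curve(trials, ks=[1, 5, 11, 50])
print(curve)
```

Reading off the smallest $K$ whose estimate reaches a target such as 0.95 gives a per-question trial-count recommendation.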

All confidence intervals are computed via bootstrap resampling \(10,000 iterations\)\.
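A percentile bootstrap for a mean can be sketched as below. The paper does not say which bootstrap variant or resampling unit it uses, so the percentile method and resampling of per-question values are assumptions, as are the names and the example flip rates.

```python
import random


def bootstrap_ci(values, n_boot=10_000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical per-question flip rates for one judge.
flip_rates = [0.0, 0.02, 0.04, 0.10, 0.14, 0.22, 0.30, 0.56]
low, high = bootstrap_ci(flip_rates, n_boot=2_000)
print(f"95% CI for mean FR: [{low:.3f}, {high:.3f}]")
```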

### 4.5 Statistical Analysis

We report nonparametric tests whenever the relevant distributions are small-sample, skewed, or clearly non-Gaussian. Category comparisons use Kruskal–Wallis tests; aggregate pointwise comparisons use the Wilcoxon signed-rank test; position-bias significance is assessed with a sign test on question-level majorities; and cross-judge agreement is summarized with both raw agreement and Cohen’s $\kappa$. Because this is an exploratory reliability study with several related outcome measures, we emphasize effect magnitudes and confidence intervals alongside $p$-values rather than treating binary significance as the only decision criterion. ICC(2,1) is used for repeated pointwise scores because it measures absolute agreement under a two-way random-effects design, matching our interpretation of repeated judge calls as interchangeable raters. A linear mixed-effects formulation would be a natural extension for larger future datasets, but is unnecessary for the current descriptive analysis.

## 5 Results

### 5.1 Intra-Judge Consistency

We begin with the most direct repeated\-trial question: how often does the same judge change its pairwise verdict when nothing about the evaluated responses changes? At the aggregate level, both judges exhibit mean flip rates near 14%, but this average conceals substantial heterogeneity across questions\.

Table 1: Summary of judge consistency metrics across 29 evaluation tasks ($t = 1.0$). FR = flip rate, PBI = position bias index, PPG = pairwise–pointwise gap. 95% confidence intervals are bootstrap intervals.

Table [1](https://arxiv.org/html/2606.13685#S5.T1) presents aggregate consistency statistics. Both judges exhibit mean flip rates of approximately 14%, indicating that roughly one in seven evaluations would change if re-run. Relative to the random baseline of 50%, the observed mean flip rate is much lower (Cohen’s $d = 3.07$ in magnitude), so the judges are clearly not random overall. But the distribution is highly bimodal (Figure [1](https://arxiv.org/html/2606.13685#S5.F1)): many questions show near-perfect consistency (FR = 0%), while others approach coin-flip territory. The two judges do not differ meaningfully in overall flip rate (Mann-Whitney $U = 438$, $p = 0.783$; $\eta^2 \approx 0.0003$ for judge identity on item-level flip rates).

![Refer to caption](https://arxiv.org/html/2606.13685v1/x1.png)

Figure 1: Pairwise preference flip rates across all 29 questions, sorted by mean instability across judges. Each question appears as a paired horizontal bar (one per judge), making it easier to compare where the two judges agree on stability and where they diverge. The shaded band marks the 40–50% “coin-flip danger zone,” where repeated evaluation becomes highly unstable.

The most notable finding is the existence of extreme inconsistency: GPT-4.1-mini reaches a 56% flip rate on q004 (a reasoning task). By Eq. (1), the most frequent verdict on this item accounts for only 44% of trials, which is possible only because ties form a third outcome, so the majority verdict is less stable than a fair 50/50 split. Eight questions per judge exceed 20% flip rates, concentrated in coding, writing, and reasoning. This already supports H1: instability is not a marginal phenomenon confined to a few pathological cases, but a recurring feature of competitively matched evaluation items.

### 5.2 Position Bias

Table 2: Position-bias summary by judge. “A” denotes the first-presented response.

Table [2](https://arxiv.org/html/2606.13685#S5.T2) shows substantial position bias. GPT-4o-mini displays a strong primacy effect, with the first-presented response (A) winning majority preference in 21 of 29 questions (72%, sign test $p = 0.024$). GPT-4.1-mini is more balanced at 59% ($p = 0.458$, not significant). This between-judge difference is substantively important even though the study includes only two judges, and supports H2: position bias is not a fixed property of the evaluation protocol alone, but also of the judge model. At the individual question level, 24 of 29 questions show significant position bias ($p < 0.05$, binomial test) for *both* judges, indicating that position effects are pervasive even when aggregate bias appears moderate.

This position bias has direct implications for evaluation fairness: a model whose response happens to appear first may systematically receive more favorable evaluations\.

### 5.3 The Pairwise-Pointwise Paradox

![Refer to caption](https://arxiv.org/html/2606.13685v1/x2.png)

Figure 2: Pairwise–pointwise gap. Many questions with very small mean pointwise score gaps still exhibit substantial pairwise flip rates. The shaded lower-left region highlights the paradox zone: little scalar evidence of a quality difference, yet unstable forced-choice verdicts.

Perhaps the most notable finding is the disconnect between pairwise preferences and pointwise scores. The mean pointwise score gap is only 0.19 points for GPT-4o-mini and 0.36 for GPT-4.1-mini on a 10-point scale. Aggregate pointwise scores do *not* significantly differ between responses A and B for either judge (Wilcoxon signed-rank: GPT-4o-mini $W = 42$, $p = 0.827$; GPT-4.1-mini $W = 96.5$, $p = 0.126$). Yet in pairwise mode, the same judges still pick “winners.”

Figure [2](https://arxiv.org/html/2606.13685#S5.F2) illustrates this pairwise–pointwise gap. Questions with near-zero score differences often exhibit high flip rates, suggesting that forced-choice evaluation can amplify weak or nonexistent scalar preferences into unstable ordinal verdicts. This supports H1 directly: repeated pairwise behavior is often more decisive in form than in evidential content. In practical terms, a pairwise winner should not automatically be interpreted as evidence of a robust underlying quality difference.

### 5.4 Category Analysis

![Refer to caption](https://arxiv.org/html/2606.13685v1/x3.png)

Figure 3: Mean flip rate by task category, shown as point estimates with standard-error bars and ordered from highest to lowest mean instability. Category effects are visually large even though judge-specific significance tests are underpowered.

Figure [3](https://arxiv.org/html/2606.13685#S5.F3) shows flip rates broken down by task category. There is a visible trend toward higher inconsistency in subjective categories (coding, writing, reasoning) compared to factual ones (knowledge, roleplay), although the judge-specific Kruskal–Wallis tests do not reach conventional significance thresholds (GPT-4o-mini $H = 15.8$, $p = 0.071$; GPT-4.1-mini $H = 11.4$, $p = 0.252$), likely due to the small number of questions per category. Still, category explains a non-trivial share of item-level flip-rate variance overall ($\eta^2 \approx 0.31$), indicating that task type matters much more than judge identity in this dataset.

The pattern differs substantially between judges:

- Coding: High inconsistency for GPT-4o-mini (39%) but moderate for GPT-4.1-mini (22%)
- Reasoning: Moderate for GPT-4o-mini (11%) but high for GPT-4.1-mini (32%)
- Knowledge/Roleplay: Consistently low flip rates (<5%) for both judges
- Ethics: Stable for GPT-4o-mini (0%) but variable for GPT-4.1-mini (27%)

This category-dependent inconsistency suggests that judge reliability varies substantially by task type, and that the *pattern* of unreliability differs across judges, a practically important point for evaluation pipelines that rely on a single judge.

### 5.5 Difficulty-Stratified Analysis

To assess whether inconsistency is primarily driven by ambiguous questions, we stratify questions by difficulty, defined as the mean flip rate across both judges. Questions with mean FR $< 10\%$ are classified as “easy” (clear winner), while those with FR $\geq 10\%$ are “hard” (ambiguous).
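The stratification rule is simple enough to state as code. The sketch below uses hypothetical question ids and flip rates; only the 10% threshold and the averaging across judges come from the text.

```python
def stratify_by_difficulty(flip_rates_by_judge, threshold=0.10):
    """Split question ids into (easy, hard) by mean flip rate across judges."""
    easy, hard = [], []
    for qid, rates in flip_rates_by_judge.items():
        mean_fr = sum(rates) / len(rates)
        (easy if mean_fr < threshold else hard).append(qid)
    return easy, hard


# Hypothetical per-judge flip rates: (judge 1, judge 2) for each question.
example = {"q_a": [0.00, 0.04], "q_b": [0.10, 0.10], "q_c": [0.56, 0.20]}
easy, hard = stratify_by_difficulty(example)
print(easy, hard)
```

Note that difficulty here is defined from the same flip rates it is later used to summarize, so the easy/hard contrast is descriptive rather than an independent test.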

Table 3: Flip rates stratified by question difficulty. Easy questions show near-deterministic behavior; hard questions exhibit substantial instability.

Table [3](https://arxiv.org/html/2606.13685#S5.T3) reveals a strongly bimodal distribution: nearly half of questions (14/29) are judged with high consistency (mean FR = 2.9%), while the remaining 15 questions exhibit substantial instability (mean FR = 23.6%). The easy–hard contrast is large (Cohen’s $d = 2.96$), indicating that instability is concentrated rather than diffuse. Easy questions cluster in factual and well-defined categories (knowledge, roleplay, extraction), while hard questions concentrate in subjective or open-ended categories (coding, writing, reasoning). This suggests that LLM judge inconsistency is not uniformly distributed but rather concentrated in task types where evaluation criteria are inherently more subjective, a pattern also observed in human annotation studies (Amidei et al., [2019](https://arxiv.org/html/2606.13685#bib.bib11)).

### 5.6 Cross-Judge Agreement

Within\-judge stability is only one part of evaluation reliability\. Even if each judge were internally stable, benchmark conclusions could still vary if different judge models systematically disagree\. We therefore analyze cross\-judge agreement separately from intra\-judge inconsistency\.

![Refer to caption](https://arxiv.org/html/2606.13685v1/x4.png)

Figure 4: Judge $\times$ category heatmap of mean flip rates. Compared with per-question bar charts, this view emphasizes a cleaner structural point: task type explains substantially more variation in instability than judge identity does in this dataset.

The two judges agree on the majority-preferred response for only 22 of 29 questions (76%), yielding Cohen’s $\kappa = 0.51$ (moderate agreement). Disagreements are concentrated in writing (q002, q003), coding (q007, q008, q009), and hard tasks (q028). In three cases, GPT-4.1-mini declares a tie while GPT-4o-mini picks a winner, suggesting different decision thresholds.

For context, this $\kappa = 0.51$ is comparable to the lower end of human inter-annotator agreement on subjective NLG tasks ($\kappa = 0.3$–$0.6$; Amidei et al. [2019](https://arxiv.org/html/2606.13685#bib.bib11)) and notably below the 81% agreement reported for MT-Bench human evaluators (Zheng et al., [2023](https://arxiv.org/html/2606.13685#bib.bib1)). The 76% inter-judge agreement rate means that approximately one in four evaluation outcomes depends on which judge model is selected.

### 5.7 Temperature Ablation

One obvious mitigation for stochastic inconsistency is to reduce decoding randomness\. We therefore test whether setting temperature to zero removes repeated\-trial variance, or merely attenuates it\.

![Refer to caption](https://arxiv.org/html/2606.13685v1/x5.png)

Figure 5: Temperature ablation as a slopegraph from $t=1.0$ to $t=0$. Most questions move downward, showing that deterministic decoding reduces instability, but several retain non-zero flip rates even at $t=0$, especially for GPT-4.1-mini.

Table 4: Temperature ablation results. Flip rates at $t=0$ vs. $t=1.0$.

Figure [5](https://arxiv.org/html/2606.13685#S5.F5) and Table [4](https://arxiv.org/html/2606.13685#S5.T4) present the temperature ablation results. Setting $t=0$ substantially reduces flip rates for GPT-4o-mini (79% reduction, from 13.3% to 2.8%) but is less effective for GPT-4.1-mini (43% reduction, to 7.9%). Even at $t=0$, GPT-4.1-mini exhibits non-zero flip rates on 7 of 29 questions, with one reaching 50%.

This residual inconsistency at $t=0$ likely reflects API-level nondeterminism (floating-point variation, batch-processing effects) rather than intentional sampling. It demonstrates that *deterministic decoding is necessary but not sufficient* for consistent evaluation, and that additional strategies (multi-trial voting, multi-judge panels) remain useful even when temperature is controlled.
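To make the quantities in this ablation concrete, the sketch below computes a per-question flip rate and the relative reduction between two temperatures. It assumes the flip rate is the fraction of trials whose verdict differs from the most common verdict, which is consistent with the per-question values in Appendix C being multiples of 2% over 50 trials; the function names are illustrative rather than taken from the paper.

```python
from collections import Counter


def flip_rate(verdicts):
    """Fraction of trials whose verdict differs from the most common verdict."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    _, top_count = Counter(verdicts).most_common(1)[0]
    return 1 - top_count / len(verdicts)


def relative_reduction(before, after):
    """Relative drop from `before` to `after`, as a fraction of `before`."""
    return (before - after) / before


# Hypothetical 50-trial run: 43 votes for A, 7 for B.
print(f"{flip_rate(['A'] * 43 + ['B'] * 7):.0%}")
# GPT-4o-mini mean flip rates reported above: 13.3% at t=1.0, 2.8% at t=0.
print(f"{relative_reduction(13.3, 2.8):.0%}")
```

The second line reproduces the 79% reduction quoted for GPT-4o-mini. Running `flip_rate` on repeated calls at $t=0$ is a cheap way to check whether a given API endpoint is in fact deterministic for a particular item.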

### 5.8 Prompt Template Sensitivity

Prompt wording is often treated as a minor implementation detail in LLM\-as\-a\-Judge studies\. Here we treat it as part of the evaluation protocol itself\. If semantically equivalent prompt templates yield meaningfully different verdicts, then “the judge” is not just the model, but the model\-prompt pair\.

![Refer to caption](https://arxiv.org/html/2606.13685v1/x6.png)

Figure 6: Prompt sensitivity shown as prompt-to-prompt slopegraphs on a 10-question subset. Changes in wording affect both flip rate and, for several items, the majority verdict itself, reinforcing that prompt template is a real experimental variable rather than a cosmetic choice.

To assess sensitivity to prompt wording, we designed two semantically equivalent but stylistically different evaluation prompts and tested them on a 10-question subset (20 trials each, both judges). Prompt A uses our standard format (“You are an impartial judge…”), while Prompt B uses an alternative framing (“Please act as a fair and unbiased evaluator…” with step-by-step instructions).

Table 5: Prompt template sensitivity on a 10-question subset. Cross-prompt agreement measures whether the majority-preferred response is the same under both prompts.

As shown in Figure [6](https://arxiv.org/html/2606.13685#S5.F6) and Table [5](https://arxiv.org/html/2606.13685#S5.T5), changing the prompt template flips the majority-preferred response in 25% of cases (5/20), with an average absolute change in flip rate of 13.4 percentage points. Prompt B generally induces higher inconsistency, particularly for questions that were already borderline under Prompt A. This finding shows that evaluation outcomes are sensitive not only to temperature and judge model, but also to the specific wording of the evaluation prompt, an often-overlooked source of variance.

### 5.9 Reliability as a Function of Trial Count

![Refer to caption](https://arxiv.org/html/2606.13685v1/x7.png)

Figure 7: Reliability as a function of repeated voting. Left: probability that a $K$-trial majority vote matches the 50-trial reference verdict, shown overall and for easy vs. hard questions. Right: per-item minimum trial counts needed to reach 90% fidelity, showing that high-flip items are disproportionately costly to stabilize.

The flip rate analysis quantifies *how inconsistent* a judge is; a complementary question is *how many trials are needed* to reach a stable verdict. Figure [7](https://arxiv.org/html/2606.13685#S5.F7) shows the reliability curve: $P(\text{majority}(K)=\text{majority}(50))$ as a function of $K$. As predicted by H3, the curve is sharply concave: repeated voting improves reliability quickly in the first few trials, then exhibits diminishing returns.

Table 6: Minimum number of trials $K$ to reach 90% and 95% consensus reliability, overall and by difficulty stratum.

A single trial achieves only 86.6% consensus fidelity. Reaching 90% requires approximately 3 trials on average, and 95% requires 11 trials. Critically, these averages mask substantial stratification: for the 15 high-flip-rate questions (FR $\geq 10\%$), 15 trials are needed for 90% fidelity, and 50 trials are still insufficient for 95% on the hardest questions. The practical implication is direct: *single-trial LLM judge evaluations should be treated as preliminary estimates, not definitive verdicts*, particularly for questions in subjective or open-ended categories. For evaluation pipelines requiring high confidence, we recommend a minimum of 10–20 trials with majority voting; for adversarial or high-stakes comparisons (e.g., model release decisions), 50 trials may be warranted for borderline questions.
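One way to estimate such a reliability curve for a single item is to subsample $K$ of the recorded trials many times and check how often the subsample majority matches the full-sample majority. The sketch below follows that idea under stated assumptions: sampling is without replacement, plurality ties are broken arbitrarily, and the resample count and seed are arbitrary. The paper does not specify its estimator in this section, so this should be read as an illustration rather than a reproduction.

```python
import random
from collections import Counter


def majority(verdicts):
    """Most common verdict; equal counts are broken by first occurrence."""
    return Counter(verdicts).most_common(1)[0][0]


def fidelity(verdicts, k, n_resamples=2000, seed=0):
    """Estimate P(majority of k subsampled trials == majority of all trials)."""
    rng = random.Random(seed)
    reference = majority(verdicts)
    hits = sum(
        majority(rng.sample(verdicts, k)) == reference
        for _ in range(n_resamples)
    )
    return hits / n_resamples


def min_trials(verdicts, target, k_values=range(1, 51, 2)):
    """Smallest k in k_values whose estimated fidelity reaches target, else None."""
    for k in k_values:
        if k <= len(verdicts) and fidelity(verdicts, k) >= target:
            return k
    return None


# Hypothetical item with a 30% flip rate over 50 trials.
trials = ["A"] * 35 + ["B"] * 15
for k in (1, 3, 11):
    print(k, round(fidelity(trials, k), 3))
print("min trials for 90%:", min_trials(trials, 0.90))
```

Averaging `fidelity` over all items at each `k` would give an overall curve of the kind shown in Figure 7. Restricting `k_values` to odd numbers avoids two-way ties in binary comparisons.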

### 5.10 Intraclass Correlation and Score Variance Decomposition

Table 7: ICC(2,1) (absolute agreement, single measures) for pointwise scores across 50 trials. ICC $<0.60$ is conventionally “poor to moderate” reliability; ICC $0.60$–$0.75$ is “moderate to good.”

Table [7](https://arxiv.org/html/2606.13685#S5.T7) reports ICC(2,1) for the 50-trial pointwise score sequences. GPT-4.1-mini achieves moderate-to-good reliability (ICC $=0.77$), while GPT-4o-mini falls in the poor-to-moderate range (ICC $=0.58$). Both values are substantially below the $\kappa>0.80$ threshold typically required for high-stakes annotation tasks in clinical or legal settings.
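For readers who want to compute the same statistic on their own score matrices, a minimal sketch of ICC(2,1) follows. It uses the standard two-way ANOVA mean-square formula for the two-way random-effects, absolute-agreement, single-measures form, with rows as evaluated items and columns as repeated trials treated as raters. The function name and the example matrix are hypothetical, and the code assumes a complete matrix with no missing scores.

```python
def icc_2_1(scores):
    """ICC(2,1) for a complete items x raters matrix (list of equal-length rows)."""
    n, k = len(scores), len(scores[0])
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need a complete matrix with >= 2 items and >= 2 raters")
    grand = sum(map(sum, scores)) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ms_rows = ss_rows / (n - 1)  # between-item mean square
    ms_cols = ss_cols / (k - 1)  # between-rater (trial) mean square
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Hypothetical pointwise scores: 4 items, 3 repeated trials each.
example = [[8, 8, 9], [6, 7, 6], [9, 9, 9], [5, 6, 5]]
print(round(icc_2_1(example), 2))
```

The absolute-agreement form penalizes systematic shifts between trials as well as random scatter, which is why it suits a setting where any single trial might be the one that gets reported.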

To further characterize the nature of score variance, we decompose total pointwise score variance into between-question and within-question components. Across all 29 questions and both judges, 55.3% of variance is between-question (reflecting genuine quality differences between the evaluated responses) while *44.7% is within-question noise*, that is, variance attributable to the judge's stochastic response generation rather than to the quality of the responses being evaluated. This near-equal split is striking: for every point of meaningful signal in a pointwise score, there is almost an equal point of random noise. The ICC values are consistent with this decomposition: ICC $=0.58$–$0.77$ means that 23–42% of observed score variance is measurement error.
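A between/within split of this kind can be sketched with the law of total variance. The example below is an assumption about the general approach, not the authors' exact estimator: it uses population variances and requires the same number of trials per question, in which case the two components sum exactly to the total variance. The input data are hypothetical.

```python
from statistics import fmean, pvariance


def variance_shares(scores_by_question):
    """Split total score variance into (between, within) question shares.

    Assumes every question has the same number of trials, so that
    total variance = variance of question means + mean within-question variance.
    """
    groups = list(scores_by_question.values())
    if len({len(g) for g in groups}) != 1:
        raise ValueError("expected the same number of trials per question")
    between = pvariance([fmean(g) for g in groups])
    within = fmean(pvariance(g) for g in groups)
    total = between + within
    return between / total, within / total


# Hypothetical scores: two questions, four trials each.
scores = {"q_demo_1": [8, 9, 8, 7], "q_demo_2": [6, 6, 7, 5]}
between_share, within_share = variance_shares(scores)
print(f"between = {between_share:.1%}, within = {within_share:.1%}")
```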

This variance decomposition has a direct implication for pointwise score interpretation: a single pointwise score of “8 vs. 9” cannot be reliably interpreted as evidence of a quality difference. With a within-question standard deviation of approximately $\sigma_{w}=\sqrt{0.359}\approx 0.60$ points, the 95% margin of error for a single pointwise observation is $\pm 1.2$ points, large relative to the 0.19–0.36 mean score gaps observed between competitive responses.
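The margin-of-error arithmetic can be checked directly. The sketch below assumes a normal approximation ($z=1.96$) and, for the multi-trial case, independent trials with the same within-question variance; both are simplifying assumptions introduced here, and the 11-trial case is included only to show how averaging narrows the interval.

```python
import math


def margin_of_error(within_variance, n_trials=1, z=1.96):
    """Half-width of a normal-approximation interval for a mean of n_trials scores."""
    return z * math.sqrt(within_variance / n_trials)


print(round(margin_of_error(0.359), 2))      # single score
print(round(margin_of_error(0.359, 11), 2))  # mean of 11 independent trials
```

This prints `1.17` and `0.35`. The first value is the $\pm 1.2$ quoted above after rounding; under these assumptions, averaging 11 trials brings the margin down to roughly the size of the largest mean score gap (0.36).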

## 6 Discussion

The results are easiest to interpret if we distinguish four layers of evaluation uncertainty. First, *stochastic instability*: the same judge can change its verdict across repeated trials. Second, *systematic bias*: for example, first-position preference can skew majority outcomes even when repeated trials are averaged. Third, *protocol dependence*: temperature and prompt template alter the effective evaluator. Fourth, *judge-identity dependence*: different judge models disagree on a non-trivial fraction of items. This layered view explains why single-trial LLM-as-a-Judge evaluations can appear deceptively crisp despite multiple underlying sources of variance.

### 6.1 Comparison to Human Annotators

A natural question is whether LLM judges are more or less consistent than human evaluators. Table [8](https://arxiv.org/html/2606.13685#S6.T8) contextualizes our findings against reported human baselines.

Table 8: LLM judge consistency vs. reported human baselines.

| Evaluator | Agreement | Source |
| --- | --- | --- |
| Human (MT-Bench pairwise) | 81% | Zheng et al. ([2023](https://arxiv.org/html/2606.13685#bib.bib1)) |
| Human (Chatbot Arena) | 66% | Chiang et al. ([2024](https://arxiv.org/html/2606.13685#bib.bib2)) |
| Human (NLG subjective, $\kappa$) | 0.3–0.6 | Amidei et al. ([2019](https://arxiv.org/html/2606.13685#bib.bib11)) |
| LLM intra-judge ($t=1.0$) | 86%∗ | This work |
| LLM intra-judge ($t=0$) | 95%∗ | This work |
| LLM inter-judge | 76% ($\kappa=0.51$) | This work |

∗Computed as $1-\text{mean FR}$, averaged across both judges.

At $t=1.0$, LLM intra-judge consistency (86%) is comparable to reported human agreement ranges in some prior settings. However, inter-judge agreement (76%) is lower than stronger human baselines from controlled settings. This nuance, namely that LLM judges can be individually consistent yet mutually inconsistent, is important: systematic error from judge choice may be more damaging to benchmark validity than stochastic error within a single judge.

### 6.2 The Leaderboard Noise Budget

A useful way to translate reliability statistics into benchmark design language is to ask how much avoidable label noise a judging protocol injects into a leaderboard. We call this quantity the *noise budget*: the expected number of question-level evaluation outcomes that would change if the benchmark were re-run with a different random seed (or a different judge model).

This gives our findings a direct quantitative implication for benchmark validity. Under single-trial judging at $t=1.0$ with a mean flip rate of 13.6%, a 100-question benchmark has an expected noise budget of 13.6 incorrect outcomes per run. For benchmarks with a score gap of $\leq 10$ points between adjacent-ranked models, this is sufficient to reverse rankings with non-trivial probability.

This framing connects our findings to the practical benchmark design question\. The noise budget shrinks substantially with multi\-trial aggregation:

- 1 trial: $\sim$13.6% of outcomes incorrect (flip rate = mean FR)
- 3 trials: $\sim$10% of outcomes incorrect (P(correct) = 0.90 overall)
- 11 trials: $\sim$5% of outcomes incorrect (P(correct) = 0.95 overall)
- 20 trials: $\sim$3% of outcomes incorrect (P(correct) = 0.97)

For high\-flip\-rate questions \(28% of our dataset\), the noise budget is much larger and requires 15\+ trials for 90% fidelity\. Benchmark designers should therefore adopt differentiated trial counts: a quick initial screen with 5–10 trials to identify borderline comparisons, followed by targeted 20–50 trial evaluation for ambiguous cases\.
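The noise budget arithmetic is simple enough to script when sizing a benchmark. The sketch below takes the overall fidelity figures from the list above as inputs (with single-trial fidelity taken as one minus the 13.6% mean flip rate); the function and constant names are illustrative only.

```python
def noise_budget(n_questions, fidelity):
    """Expected number of question-level outcomes that differ from the reference verdict."""
    if not 0.0 <= fidelity <= 1.0:
        raise ValueError("fidelity must be a probability")
    return n_questions * (1.0 - fidelity)


# Overall fidelity by trial count, taken from the list above.
FIDELITY_BY_TRIALS = {1: 0.864, 3: 0.90, 11: 0.95, 20: 0.97}

for trials, fidelity in FIDELITY_BY_TRIALS.items():
    print(f"{trials:>2} trials: ~{noise_budget(100, fidelity):.1f} of 100 outcomes")
```

For a stratified benchmark, the same function can be applied per stratum with stratum-specific fidelities and the results summed, which reflects the point that high-flip-rate questions dominate the budget.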

### 6.3 Implications for Evaluation Practice

The main practical lesson is not merely that LLM judges are noisy, but that different forms of noise call for different countermeasures\. Multi\-trial voting addresses stochastic instability; position randomization addresses systematic order effects; multi\-judge panels address judge\-identity dependence; and prompt audits address protocol dependence\.

Our findings have several concrete implications:

*Single-trial evaluations should be deprecated for publication-quality comparisons.* With a 14% mean flip rate, roughly one in seven pairwise comparisons changes outcome on re-run. The 86.6% single-trial consensus fidelity is only modestly above the 81% MT-Bench human agreement benchmark, which itself is considered marginal. Leaderboards and benchmarks should mandate multi-trial evaluation and report confidence intervals.

*Position must be randomized and the randomization reported.* The 72% A-wins position bias for GPT-4o-mini ($p=0.024$) means fixed-position evaluation introduces systematic error. Even with randomization, the *number* of position-randomized trials should be reported to allow variance estimation.
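The reported significance level can be checked with an exact two-sided sign test. The sketch below assumes the test is applied to per-question majorities, using the count of 21 A-majorities out of 29 questions read off Table 10 in Appendix C (21/29 is approximately 72%); the function name is hypothetical.

```python
from math import comb


def sign_test_two_sided(successes, n):
    """Exact two-sided binomial sign test p-value against p = 0.5."""
    tail = max(successes, n - successes)
    one_sided = sum(comb(n, i) for i in range(tail, n + 1)) / 2 ** n
    return min(1.0, 2 * one_sided)


# 21 of 29 questions have an A-majority for GPT-4o-mini (Appendix C, Table 10).
print(round(sign_test_two_sided(21, 29), 3))
```

This prints `0.024`, in line with the reported $p=0.024$. The same function can be reused after position randomization to confirm that the A/B split no longer departs from 50/50.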

*Pairwise preferences should always accompany pointwise scores.* The pairwise-pointwise paradox (Section [5.3](https://arxiv.org/html/2606.13685#S5.SS3)) is a fundamental problem with forced-choice formats: they generate spurious certainty when the underlying quality difference is below the judge's discrimination threshold. Dual reporting allows readers to assess whether pairwise differences are grounded in genuine quality gaps.

*Score variance is not negligible and should be reported.* ICC $=0.58$–$0.77$ and 44.7% within-question noise mean that a single pointwise score carries a 95% margin of error of $\pm 1.2$ points on a 10-point scale. Reporting scores without confidence intervals misrepresents the precision of LLM judge evaluations.

*Judge selection is a confound, not a free choice.* The 76% cross-judge agreement ($\kappa=0.51$) means that roughly one in four evaluation outcomes depends on which judge is selected. Papers should report results across multiple judge models, or explicitly acknowledge judge selection as a potential confound.

*Prompt wording is a hidden experimental variable.* Semantically equivalent prompts change majority outcomes 25% of the time. Evaluation papers should either standardize to community-accepted prompt templates or conduct prompt sensitivity analyses.

*Deterministic decoding ($t=0$) is necessary but not sufficient.* While $t=0$ reduces flip rates by 43–79%, residual non-determinism at the API level means that even $t=0$ evaluations benefit from 3–5 trial repetitions.

### 6.4 Recommendations

Based on our findings, we propose a tiered evaluation protocol:

1. *Minimum standard (reproducibility):* $\geq 10$ trials at $t=0$ with randomized response order; report majority vote, flip rate per question, and question-level confidence intervals. This recommendation is grounded in the reliability curve: single-trial judging achieves only 86.6% consensus fidelity, whereas 11 trials reach 95% on average.
2. *Standard practice (publication):* 20 trials at $t=0$; dual-mode evaluation (pairwise + pointwise); multi-judge panel ($\geq 2$ judges); report Cohen's $\kappa$ and ICC. This is motivated by the pairwise–pointwise gap and the cross-judge agreement result ($\kappa=0.51$), which show that neither pairwise winners nor single-judge evaluations are sufficient on their own.
3. *High-stakes evaluation (leaderboard / model release):* 50 trials; identify high-flip-rate questions (FR $>20\%$) and flag them as “uncertain”; use at least two judges from different providers; report noise budget alongside final scores. This follows from the hard-question regime, where FR rises to 23.6% on average and 15 trials are needed just to reach 90% fidelity.
4. *Category-stratified reporting:* Report consistency metrics by task category, as reliability varies substantially across task types ($\eta^{2}\approx 0.31$ for category on item-level flip rates), with coding and reasoning much less stable than knowledge and roleplay.
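The core of the tiered protocol, repeated trials with randomized response order, majority voting, and an “uncertain” flag, can be sketched in a few lines. Everything below is illustrative: `judge` is a hypothetical callable standing in for an actual model call, the default of 11 trials and the 20% flag threshold are taken from the recommendations above, and the stub judges exist only to exercise the logic.

```python
import random
from collections import Counter

SWAP = {"A": "B", "B": "A", "tie": "tie"}


def judge_pair(judge, question, resp_a, resp_b, n_trials=11, seed=0):
    """Run n_trials position-randomized comparisons and aggregate by majority vote.

    `judge(question, first, second)` is a hypothetical callable returning
    "A" (first shown wins), "B" (second shown wins), or "tie".
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_trials):
        swapped = rng.random() < 0.5
        first, second = (resp_b, resp_a) if swapped else (resp_a, resp_b)
        label = judge(question, first, second)
        # Map the positional label back to the original A/B identities.
        votes[SWAP[label] if swapped else label] += 1
    verdict, top_count = votes.most_common(1)[0]
    flip = 1 - top_count / n_trials
    return {"verdict": verdict, "flip_rate": flip,
            "uncertain": flip > 0.20, "votes": dict(votes)}


def first_position_judge(question, first, second):
    """Stub with pure position bias: always prefers whatever is shown first."""
    return "A"


def content_judge(question, first, second):
    """Stub that prefers the longer response regardless of position."""
    return "A" if len(first) >= len(second) else "B"


print(judge_pair(first_position_judge, "demo", "short", "a longer answer", n_trials=50))
print(judge_pair(content_judge, "demo", "short", "a longer answer"))
```

With randomization, the purely position-biased stub splits its votes between A and B and is flagged as uncertain, whereas the content-based stub returns a stable verdict. In a fixed-order protocol the position-biased stub would instead look perfectly consistent, which is the failure mode that randomization is meant to expose.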

### 6.5 Limitations

Several limitations should shape how broadly these results are interpreted\. Most importantly, this is a careful repeated\-trial study of a narrow slice of the LLM judge design space, not a universal audit of all judge models or all evaluation protocols\.

First, and most importantly, we evaluate only two judge models from the same provider (OpenAI). While they represent different model generations (GPT-4o vs. GPT-4.1 family), all findings are potentially artifacts of OpenAI's specific RLHF and decoding pipeline. Extending to other providers (Anthropic Claude, Google Gemini, open-source Llama/Mistral models) is necessary to establish generalizability, and we consider this the primary direction for future work. Second, our 29-question dataset, while diverse across 10 categories, is relatively small; the lack of statistical significance in category-level comparisons reflects this limitation. Third, while we test two prompt templates, the space of possible prompt designs is vast; further systematic exploration may reveal additional sensitivity patterns. Fourth, our competitive response pairs (both high-quality) represent a stress test, so consistency may be higher for response pairs with more obvious quality differences. Fifth, our response pairs use GPT-4o-mini (Response A) and GPT-4o (Response B), meaning GPT-4o-mini judges evaluate responses from their own model family. This introduces a potential self-preference confound that could partially explain the 72% A-wins position bias observed for GPT-4o-mini; disentangling self-preference from genuine position effects requires a future controlled study using responses from out-of-family models. Finally, we do not directly compare to human annotators on the same task instances; our human baseline comparisons rely on reported values from prior work.

## 7 Conclusion

We present a repeated\-trial study of LLM\-as\-a\-Judge reliability across over 10,000 judgments\. The core conclusion is not that LLM judges are useless, nor that they behave like random coin flips overall\. Rather, it is that their reliability is layered and uneven: many items are judged stably, but a substantial minority remain noisy enough that single\-trial evaluation is hard to justify\.

Our results support three main takeaways\. First, pairwise judgments are often unstable in precisely the kinds of subjective or competitive cases that matter most for benchmarking\. Second, pairwise verdicts can overstate evidence for quality differences when pointwise scores remain nearly indistinguishable\. Third, repeated majority voting improves reliability quickly but with diminishing returns, making multi\-trial aggregation a practical and principled remedy\.

In more detail, the study contributes six interconnected findings from over 10,000 judgments at 50 trials per question:

1. *14% mean flip rate* (max 56%), with 28% of questions exceeding 20%, enough noise to reverse close benchmark rankings.
2. *Significant position bias* ($p=0.024$, sign test), with 72% of questions showing first-position preference for GPT-4o-mini.
3. *Pairwise-pointwise paradox:* judges declare confident winners when pointwise scores are statistically indistinguishable ($p>0.1$), revealing that pairwise forced-choice amplifies non-existent quality differences.
4. *ICC(2,1) = 0.58–0.77* with 44.7% of score variance attributable to within-judge noise, establishing that LLM judge pointwise scores have “poor to moderate” reliability by psychometric standards.
5. *Reliability curve:* 11 trials are needed to reach 95% consensus fidelity overall; 15+ for high-variance questions. Single-trial evaluations achieve only 86.6% fidelity.
6. *Multi-source variance:* temperature ($t=0$ reduces but does not eliminate inconsistency), prompt wording (25% outcome flips), and judge selection ($\kappa=0.51$) independently contribute to evaluation noise.

These findings collectively establish a *noise budget* for LLM-judge evaluation: a 100-question single-trial benchmark is expected to contain $\sim$14 incorrect pairwise outcomes, which shrinks to $\sim$5 with 11-trial majority voting. We propose a tiered evaluation protocol (Section [6.4](https://arxiv.org/html/2606.13685#S6.SS4)) calibrated to these reliability levels.

An important limitation is that both judges are from a single provider \(OpenAI\); replicating this analysis with Anthropic Claude, Google Gemini, and open\-source models \(Llama, Mistral\) is the primary future work direction, as findings may not generalize across providers\. Scaling to 100\+ questions per category would also enable statistically powered category\-level comparisons\.

As the AI community increasingly relies on automated evaluation for model development, RLHF training, and public leaderboards, the reliability of the evaluator is not a secondary concern; it is a first\-order validity question\. We hope this work provides both the empirical grounding and the practical tools to elevate evaluation standards\.

## Appendix A Experiment Summary

Table [9](https://arxiv.org/html/2606.13685#A1.T9) summarizes the experimental components used in this paper.

Table 9: Summary of experimental components.

## Appendix B Prompt Templates

The prompt\-sensitivity experiment compares two semantically equivalent templates\.

### B.1 Prompt A

> You are an impartial judge evaluating two responses to the same user request\. Compare Response A and Response B on overall quality, helpfulness, correctness, and completeness\. Return exactly one label: A, B, or tie\.

### B.2 Prompt B

> Please act as a fair and unbiased evaluator\. Review both candidate responses carefully, think step by step about which one better satisfies the user request, and then output exactly one verdict: A, B, or tie\.

Both prompts ask for the same pairwise decision, but they differ in framing, tone, and explicitness of instruction. The prompt-sensitivity analysis in Section 5.8 (Figure [6](https://arxiv.org/html/2606.13685#S5.F6)) measures how much those wording changes affect instability and majority outcomes.

## Appendix C Per-Question Flip Rates and Majorities

Tables 10 and 11 list per-question flip rates and majority outcomes for the two judges.

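Table 10: Per-question pairwise outcomes for GPT-4o-mini.

| Question | Category | Flip rate | Majority |
| --- | --- | --- | --- |
| q001 | writing | 22% | B |
| q002 | writing | 18% | A |
| q003 | writing | 44% | B |
| q004 | reasoning | 26% | B |
| q005 | reasoning | 0% | B |
| q006 | reasoning | 6% | B |
| q007 | coding | 46% | A |
| q008 | coding | 44% | A |
| q009 | coding | 28% | A |
| q010 | knowledge | 0% | A |
| q011 | knowledge | 0% | A |
| q012 | knowledge | 4% | A |
| q013 | math | 0% | A |
| q014 | math | 46% | B |
| q015 | math | 14% | A |
| q016 | roleplay | 0% | A |
| q017 | roleplay | 0% | A |
| q018 | extraction | 6% | A |
| q019 | extraction | 2% | B |
| q020 | extraction | 4% | A |
| q021 | ethics | 0% | A |
| q022 | ethics | 0% | A |
| q023 | instruction | 18% | A |
| q024 | instruction | 36% | A |
| q025 | instruction | 0% | A |
| q026 | hard | 0% | A |
| q027 | hard | 20% | B |
| q028 | hard | 0% | A |
| q029 | hard | 2% | A |

Table 11: Per-question pairwise outcomes for GPT-4.1-mini.

| Question | Category | Flip rate | Majority |
| --- | --- | --- | --- |
| q001 | writing | 0% | B |
| q002 | writing | 40% | B |
| q003 | writing | 0% | A |
| q004 | reasoning | 56% | tie |
| q005 | reasoning | 0% | B |
| q006 | reasoning | 40% | B |
| q007 | coding | 20% | tie |
| q008 | coding | 38% | B |
| q009 | coding | 8% | tie |
| q010 | knowledge | 0% | A |
| q011 | knowledge | 0% | A |
| q012 | knowledge | 0% | A |
| q013 | math | 18% | A |
| q014 | math | 0% | B |
| q015 | math | 42% | A |
| q016 | roleplay | 0% | A |
| q017 | roleplay | 0% | A |
| q018 | extraction | 0% | A |
| q019 | extraction | 32% | B |
| q020 | extraction | 28% | A |
| q021 | ethics | 16% | A |
| q022 | ethics | 38% | A |
| q023 | instruction | 0% | A |
| q024 | instruction | 0% | A |
| q025 | instruction | 6% | A |
| q026 | hard | 0% | A |
| q027 | hard | 10% | B |
| q028 | hard | 10% | B |
| q029 | hard | 0% | A |

## Appendix D ICC and Variance Decomposition Details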

Our pointwise\-score reliability analysis uses ICC\(2,1\), the two\-way random\-effects, absolute\-agreement, single\-measures form of the intraclass correlation coefficient\. This choice treats repeated judge calls as interchangeable raters and asks how much of the observed variance is attributable to stable between\-item differences rather than within\-item stochasticity\.

In the combined variance decomposition, 55\.3% of pointwise\-score variance is between\-question signal and 44\.7% is within\-question noise\. This near\-even split is the main reason single pointwise scores should be interpreted cautiously: the stochastic component is too large to be ignored when score differences are small\.

## References

- J. Amidei, P. Piwek, and A. Willis (2019). The use of rating and Likert scales in natural language generation human evaluation tasks: a review and some recommendations. Proceedings of the 12th International Conference on Natural Language Generation, pp. 397–402.
- Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, et al. (2022). Constitutional AI: harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
- W. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, H. Zhang, B. Zhu, M. Jordan, J. E. Gonzalez, et al. (2024). Chatbot arena: an open platform for evaluating LLMs by human preference. arXiv preprint arXiv:2403.04132.
- E. Clark, T. August, S. Serber, N. Haduong, S. Gururangan, and N. A. Smith (2021). All that's ‘human’ is not gold: evaluating human evaluation of generated text. arXiv preprint arXiv:2107.00061.
- Y. Dubois, X. Li, R. Taori, T. Zhang, I. Gulrajani, J. Ba, C. Guestrin, P. Liang, and T. B. Hashimoto (2024). AlpacaFarm: a simulation framework for methods that learn from human feedback. Advances in Neural Information Processing Systems 36.
- Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu (2023). G-Eval: NLG evaluation using GPT-4 with better human alignment. arXiv preprint arXiv:2303.16634.
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35.
- S. Shankar, J.D. Zamfirescu-Pereira, B. Hartmann, A. G. Parameswaran, and I. Arawjo (2024). Who validates the validators? aligning LLM-assisted evaluation of LLM outputs with human preferences. arXiv preprint arXiv:2404.12272.
- R. Stureborg, D. Alikaniotis, and Y. Suhara (2024). Large language models are inconsistent and biased evaluators. arXiv preprint arXiv:2405.01724.
- P. Wang, L. Li, L. Chen, Z. Cai, D. Zhu, B. Lin, Y. Cao, Q. Liu, T. Liu, and Z. Sui (2023). Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926.
- L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, et al. (2023). Judging LLM-as-a-judge with MT-Bench and chatbot arena. In Advances in Neural Information Processing Systems, Vol. 36.
- L. Zhu, X. Wang, and X. Wang (2023). JudgeLM: fine-tuned large language models are scalable judges. arXiv preprint arXiv:2310.17631.