LLM Judges Have Dark Current: A Psychometric Datasheet for LLM-as-a-Judge Evaluation

arXiv cs.CL Papers

Summary

This paper introduces a psychometric datasheet protocol for evaluating LLM judges as measurement instruments, measuring dark current, positional false preference, stable cross-sensitivity, and target sensitivity. A case study on three open-weight models reveals significant differences in judge quality and behavior.

arXiv:2606.15610v1 Announce Type: new Abstract: LLM-as-a-judge systems are now routinely used for open-ended model evaluation, where human preference annotation is costly, slow, and difficult to reproduce. Yet these judges are often reported as scalar accuracy, win-rate, or agreement devices. We argue that a judge should instead be reported as a measurement instrument. We introduce a Judge Datasheet protocol that measures dark current under true-vacuum inputs, stable cross-sensitivity to same-quality surface variation, positional false preference, target sensitivity on a controlled quality ladder, and the criterion or operating point induced by tie instructions. The direction-stability decomposition reveals that apparent Delta0 preference can be stable surface response or disguised position bias. In a three-judge open-weight case study, Llama-3.1-8B shows high dark current and presentation-conflicted Delta0 behavior, Qwen2.5-14B is vacuum-clean and target-sensitive but mixes stable and positional over-discrimination, and Qwen2.5-32B is vacuum-clean with low stable cross-sensitivity and low positional false preference. A strict tie criterion eliminates Qwen32B Delta0 false preference but absorbs marginal Delta1 target signals into ties while preserving Delta5 sensitivity. The results show that prompting moves the criterion, not the resolution. We do not claim that the downstream mechanism hypothesis that motivated this work is confirmed; the contribution is a metrological protocol for measuring the measuring device before downstream claims are made.

# LLM Judges Have Dark Current: A Psychometric Datasheet for LLM-as-a-Judge Evaluation
Source: [https://arxiv.org/html/2606.15610](https://arxiv.org/html/2606.15610)
Hiroyasu Usami ([ORCID 0000-0003-4161-4239](https://orcid.org/0000-0003-4161-4239))¹, Keisuke Hara¹, Ayato Tsuboi¹, and Naohiko Matsuda²

¹ Department of Computer Science, Graduate School of Engineering, Chubu University, Kasugai, Aichi 487-8501, Japan

² Mitsubishi Heavy Industries, Ltd., Research & Innovation Center, Heat Transfer Research Department, Takasago, Hyogo 676-8686, Japan

usami@fsc.chubu.ac.jp, [https://usamilab.org](https://usamilab.org/)

(June 12, 2026)

###### Abstract

LLM-as-a-judge systems are now routinely used for open-ended model evaluation, where human preference annotation is costly, slow, and difficult to reproduce. Yet these judges are often reported as scalar accuracy, win-rate, or agreement devices. We argue that a judge should instead be reported as a measurement instrument. We introduce a Judge Datasheet protocol that measures dark current under true-vacuum inputs, stable cross-sensitivity to same-quality surface variation, positional false preference, target sensitivity on a controlled quality ladder, and the criterion or operating point induced by tie instructions. The direction-stability decomposition reveals that apparent $\Delta 0$ preference can be stable surface response or disguised position bias. In a three-judge open-weight case study, Llama-3.1-8B shows high dark current and presentation-conflicted $\Delta 0$ behavior, Qwen2.5-14B is vacuum-clean and target-sensitive but mixes stable and positional over-discrimination, and Qwen2.5-32B is vacuum-clean with low stable cross-sensitivity and low positional false preference. A strict tie criterion eliminates Qwen32B $\Delta 0$ false preference but absorbs marginal $\Delta 1$ target signals into ties while preserving $\Delta 5$ sensitivity. The results show that prompting moves the criterion, not the resolution. We do not claim that the downstream mechanism hypothesis that motivated this work is confirmed; the contribution is a metrological protocol for measuring the measuring device before downstream claims are made.

## 1 Introduction

LLM-as-a-judge evaluation has become a practical default for comparing open-ended model outputs, largely because human evaluation is expensive, slow, and hard to reproduce at benchmark scale. This makes the judge part of the evaluation infrastructure, not merely a convenient scorer. The attraction is clear: a judge can read natural language, apply task-specific criteria, and produce a preference without first reducing the answer to a narrow automatic metric. But once a judge is used to validate another system, the judge itself becomes a measurement instrument. It can have background response in the absence of signal, sensitivity to the intended construct, cross-sensitivity to nuisance variation, position bias, and an operating criterion that determines whether a weak signal is reported as a preference or as no preference.

This work originated from a downstream study of orientation in LLM evaluation\. That downstream mechanism is not tested here; the present paper measures whether the judge is calibrated enough to support such claims\. We use ChiralityEval as the motivating project name for that downstream line, but the contribution here is judge metrology rather than mechanism validation\.

We define *dark current* as false preference under true-vacuum inputs, including empty answers, whitespace, and identical non-empty answers. We define *positional false preference* as an apparent preference driven by the presentation slot rather than candidate content. We define *stable cross-sensitivity* as stable response to non-target but real surface-form variation under same-quality $\Delta 0$ comparisons after presentation order has been canonicalized. We define *target sensitivity* as detection of intended quality differences on a constructively controlled $\Delta Q$ ladder. Finally, we define *criterion* as the tie/preference operating point induced by the instruction and prompt.

Our contributions are fivefold:

- Judge Datasheet protocol. We introduce a Judge Datasheet protocol for LLM-as-a-judge systems, combining A0 true-vacuum tests, A1 controlled quality ladders, and criterion-shift probing.
- Direction-stability decomposition. We make $\Delta 0$ direction-stability decomposition a first-class measurement: raw same-quality false preference is separated into stable cross-sensitivity, positional false preference, one-sided commit, other conflict, and no-preference at the canonical-pair level.
- Controlled stimulus ladder. We construct a prefix-chain checklist stimulus ladder with Pareto dominance, $\Delta 0$ same-subset and different-subset controls, filler controls, and validity gates.
- Three-judge case study. We present a case study of Llama-3.1-8B, Qwen2.5-14B, and Qwen2.5-32B, showing that they occupy different metrological profiles.
- Criterion-shift probe. We show that strict tie prompting moves the criterion: it eliminates Qwen32B $\Delta 0$ false preference but absorbs marginal $\Delta 1$ target signals into ties while preserving $\Delta 5$ sensitivity.

The central claim is deliberately narrow\. We do not claim that the downstream mechanism hypothesis or an orientation mechanism is confirmed\. We do not claim a broad size\-family trend, a universal judge, or a human\-ground\-truth result\. We claim that LLM judges require multi\-axis measurement before they are used as evidence\-bearing instruments\.

## 2 Related Work

#### LLM\-as\-a\-judge reliability and IRT\.

Prior work has diagnosed LLM judges through observational latent\-trait modeling and benchmark\-level reliability\. Choi et al\. use item\-response theory to diagnose LLM\-as\-a\-judge reliability from observational response patterns\[[1](https://arxiv.org/html/2606.15610#bib.bib1)\]\. We complement this line with an experimental psychophysics protocol: constructively controlled stimulus strength, true\-vacuum inputs, direction\-stability tests, and direct criterion manipulation\.

#### Evaluation infrastructure and automatic judges\.

Large\-scale evaluation frameworks such as HELM and BIG\-Bench frame evaluation as infrastructure for measuring language\-model capabilities across tasks and risks\[[2](https://arxiv.org/html/2606.15610#bib.bib2),[3](https://arxiv.org/html/2606.15610#bib.bib3)\]\. In parallel, LLM\-based automatic evaluators such as MT\-Bench/Chatbot Arena and Length\-Controlled AlpacaEval made open\-ended evaluation cheaper and faster, while also exposing judge\-specific biases such as position, verbosity, and length sensitivity\[[4](https://arxiv.org/html/2606.15610#bib.bib4),[5](https://arxiv.org/html/2606.15610#bib.bib5)\]\. Practical evaluation frameworks such as OpenAI Evals further normalize evaluation loops as part of model development\[[6](https://arxiv.org/html/2606.15610#bib.bib6)\]\. Our work focuses on the measuring side of this infrastructure: before a judge score is used as evidence, the judge itself should have a datasheet\.

#### Signal detection and criterion\.

Signal detection theory separates sensitivity from criterion\. Cacioli frames LLM decisions in SDT terms and develops a temperature\-criterion analogy\[[7](https://arxiv.org/html/2606.15610#bib.bib7)\]\. We use SDT framing operationally for pairwise judge decisions and prompt\-induced criterion shifts: a prompt can shift the tie/preference operating point without increasing the resolution of the underlying measurement\.

#### Biases in judge preferences\.

LLM judges are known to exhibit position, verbosity, self-preference, and order effects. MT-Bench and Chatbot Arena popularized LLM-as-a-judge evaluation while documenting position, verbosity, and self-enhancement biases \[[4](https://arxiv.org/html/2606.15610#bib.bib4)\]. Shi et al. systematically study position bias and its dependence on comparison conditions \[[8](https://arxiv.org/html/2606.15610#bib.bib8)\]. Yang et al. study self-preference bias with equal-quality comparisons and mitigation strategies \[[9](https://arxiv.org/html/2606.15610#bib.bib9)\]. Prior position-bias work often measures aggregate order effects or marginal slot preferences. We complement it with an operational $\Delta 0$ direction-stability test that separates content-stable preference from slot-stable preference at the canonical-pair level.

#### Documentation and datasheets\.

Datasheets for Datasets and Model Cards established structured disclosure practices for datasets and models\[[10](https://arxiv.org/html/2606.15610#bib.bib10),[11](https://arxiv.org/html/2606.15610#bib.bib11)\]\. Our unit of documentation is the evaluator itself\. A Judge Datasheet separates dark current, position\-driven false preference, stable cross\-sensitivity, target sensitivity, and criterion so that downstream claims do not silently inherit unmeasured judge behavior\.

## 3 Judge Datasheet Protocol

Table 1: Judge Datasheet protocol components.

### 3.1 Notation

Let $z=\{u,v\}$ be a canonical unordered content pair. A presentation order is $o\in\{(u,v),(v,u)\}$, where the two contents are assigned to slots 1 and 2. The judge returns a slot-level output $J(o)\in\{1,2,\mathrm{tie}\}$. We map the slot-level output back to canonical content identity as $W_J(o)\in\{u,v,\mathrm{tie}\}$. The order-reversal operator is $\pi(u,v)=(v,u)$. Slot identity and content identity are not the same: all direction-stability metrics are computed only after mapping slot winners back to canonical content identities, because slot-level equality and content-level equality mean opposite things under order reversal.
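The slot-to-content mapping can be sketched in a few lines of Python. This is a minimal illustration of the notation, not the authors' implementation; the function names `to_content` and `reverse` and the string labels `"u"`, `"v"`, and `"tie"` are assumptions made for the example.

```python
def to_content(slot_output, order):
    """Map a slot-level output J(o) to canonical content identity W_J(o).

    `order` is the presentation order as a (slot 1 content, slot 2 content)
    tuple; `slot_output` is 1, 2, or "tie". Labels are hypothetical.
    """
    if slot_output == "tie":
        return "tie"
    if slot_output not in (1, 2):
        raise ValueError(f"unexpected slot output: {slot_output!r}")
    return order[slot_output - 1]


def reverse(order):
    """Order-reversal operator pi(u, v) = (v, u)."""
    return (order[1], order[0])


order = ("u", "v")
# The judge picks slot 1 under both presentation orders.
first = to_content(1, order)            # content "u"
second = to_content(1, reverse(order))  # content "v"
```

Here the slot-level outputs are equal while the content-level winners differ, which is the sense in which the two notions of equality point in opposite directions under order reversal.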

### 3.2 Metrics at a glance

Table 2: Metrics at a glance. This compact main-text view gives the metric level and direction of concern; protocol and denominator details appear in the full metrics table in the Appendix. Bad direction is relative to the controlled checklist construct rather than a moral judgment. Some axes are construct-relative: stable cross-sensitivity may be useful if style or surface form is part of the target construct.

How to read the metrics. Raw $\Delta 0$ FP is not cross-sensitivity. Stable cross-sensitivity requires content-stable direction under order reversal. Positional false preference requires slot-stable choice under order reversal. $\mathrm{RFP}_0$ averages two presentation-order calls per canonical pair; SCS, PFP, OSC, and Other are canonical-pair decomposition terms combined by Eq. [7](https://arxiv.org/html/2606.15610#S3.E7). A high tie rate is not always bad; in true vacuum it is desirable. $\Delta^{\star}_{\mathbf{75}}\leq 1$ is not exactly 1; it is left-censored by the ladder granularity. The strict criterion arm is not a new judge and does not have a dark-current measurement unless A0 is rerun under that prompt. Reference/API judges are external comparators, not ground truth.

### 3.3 Metric definitions

Let $\mathcal{V}$ be the true-vacuum set and let $J_{\mathrm{tie}}$ denote the tie-allowed judge protocol. True-vacuum inputs include empty, whitespace, and identical non-empty pairs. Dark current is the non-abstaining false-preference rate under the tie-allowed protocol:

$$\mathrm{DC}(J)=\mathbb{E}_{(u,v)\in\mathcal{V}}\left[\mathbf{1}\left\{W_J(u,v)\neq\mathrm{tie}\ \land\ W_J(u,v)\neq\mathrm{abstain}\right\}\right]. \tag{1}$$

In this paper, valid abstention is treated as no-preference for this axis.
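Eq. 1 reduces to a simple count over true-vacuum calls. The sketch below assumes each call has already been validated and mapped to a content label, `"tie"`, or `"abstain"`; the function name and the example outputs are hypothetical and are not data from the paper.

```python
def dark_current(outputs):
    """Fraction of true-vacuum calls that commit to a preference.

    Both "tie" and "abstain" count as no-preference, as in Eq. 1.
    """
    if not outputs:
        raise ValueError("true-vacuum set is empty")
    committed = sum(1 for w in outputs if w not in ("tie", "abstain"))
    return committed / len(outputs)


# Hypothetical outputs for four vacuum calls: one false preference.
example = ["u", "tie", "abstain", "tie"]
rate = dark_current(example)  # 0.25
```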

Let $\mathcal{D}_0$ be the $\Delta 0$ same-subset canonical-pair set. Raw $\Delta 0$ false preference is a call-level non-tie rate, written as the average of the two presentation-order calls for each canonical pair:

$$\mathrm{RFP}_0(J)=\mathbb{E}_{(u,v)\in\mathcal{D}_0}\left[\frac{1}{2}\left(\mathbf{1}\{W_J(u,v)\neq\mathrm{tie}\}+\mathbf{1}\{W_J(v,u)\neq\mathrm{tie}\}\right)\right]. \tag{2}$$

This quantity includes stable, positional, one-sided, and conflict components. We therefore do not call raw $\Delta 0$ false preference cross-sensitivity. Although Eq. [2](https://arxiv.org/html/2606.15610#S3.E2) averages over canonical pairs, the inner average makes it a two-call rate.

Stable cross-sensitivity is the canonical-pair rate of content-stable non-tie choices under order reversal:

$$\mathrm{SCS}(J)=\mathbb{E}_{(u,v)\in\mathcal{D}_0}\left[\mathbf{1}\{W_J(u,v)=W_J(v,u)\in\{u,v\}\}\right]. \tag{3}$$

A positional flip occurs when the judge chooses the same slot in both presentation orders, which reverses canonical content identity. Positional false preference is

$$\mathrm{PFP}(J)=\mathbb{E}_{(u,v)\in\mathcal{D}_0}\left[\mathbf{1}\{J(u,v)=J(v,u)\in\{1,2\}\}\right]. \tag{4}$$

This is not a content-stable choice; it indicates presentation-slot driven preference.

The remaining named component is one-sided commit, where exactly one of the two presentation orders yields a non-tie:

$$\mathrm{OSC}(J)=\mathbb{E}_{(u,v)\in\mathcal{D}_0}\left[\mathbf{1}\left\{\mathbf{1}\{W_J(u,v)\neq\mathrm{tie}\}+\mathbf{1}\{W_J(v,u)\neq\mathrm{tie}\}=1\right\}\right]. \tag{5}$$

We define the residual other-conflict contribution by subtracting the named components from the pair-level non-tie contribution:

$$\mathrm{Other}(J)=\mathrm{RFP}_0(J)-\mathrm{SCS}(J)-\mathrm{PFP}(J)-\mathrm{OSC}(J)/2. \tag{6}$$

Therefore, under the mutually exclusive canonical-pair classification used for Fig. [3](https://arxiv.org/html/2606.15610#S4.F3),

$$\mathrm{RFP}_0(J)=\mathrm{SCS}(J)+\mathrm{PFP}(J)+\mathrm{OSC}(J)/2+\mathrm{Other}(J). \tag{7}$$

The one-half coefficient is fixed by the fact that one-sided commit contributes one non-tie decision across two presentation-order calls. No-preference contributes zero to Eq. [7](https://arxiv.org/html/2606.15610#S3.E7). For clean outputs restricted to $\{u,v,\mathrm{tie}\}$, stable, positional, one-sided commit, and no-preference exhaust the two-order cases, so Other is zero by construction, as it is in the current clean runs. We retain Other as a guard category for schema-invalid calls, valid abstentions not mapped to tie, or future protocols with additional non-tie states.
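The pair-level classification and the identity in Eq. 7 can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: outputs are assumed to be already mapped to content identity with the hypothetical labels `"u"`, `"v"`, and `"tie"`, and the example pairs are invented.

```python
from collections import Counter


def classify_pair(w_uv, w_vu):
    """Classify one canonical pair from W_J(u,v) and W_J(v,u)."""
    valid = {"u", "v", "tie"}
    if w_uv not in valid or w_vu not in valid:
        return "other"  # guard category for schema-invalid outputs
    non_tie = (w_uv != "tie") + (w_vu != "tie")
    if non_tie == 0:
        return "no_preference"
    if non_tie == 1:
        return "one_sided"
    # Two non-ties: the same content wins twice (stable), or the same
    # slot wins twice, which flips the content winner (positional).
    return "stable" if w_uv == w_vu else "positional"


def decompose(pairs):
    """Return RFP0 and its canonical-pair components (Eqs. 2 to 7)."""
    n = len(pairs)
    counts = Counter(classify_pair(a, b) for a, b in pairs)
    rfp0 = sum(((a != "tie") + (b != "tie")) / 2 for a, b in pairs) / n
    scs = counts["stable"] / n
    pfp = counts["positional"] / n
    osc = counts["one_sided"] / n
    other = rfp0 - scs - pfp - osc / 2  # residual, Eq. 6
    return {"RFP0": rfp0, "SCS": scs, "PFP": pfp, "OSC": osc,
            "Other": other, "NoPref": counts["no_preference"] / n}


# Hypothetical pairs: one stable, one positional, one one-sided, one tie-tie.
d = decompose([("u", "u"), ("u", "v"), ("tie", "v"), ("tie", "tie")])
```

For these four invented pairs, RFP0 is 0.625, each named component is 0.25, and the residual Other is zero, so the one-half weight on one-sided commit is what makes the two sides of Eq. 7 agree.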

For nonzero ladder pairs $\mathcal{D}_\delta$, let $y^{\star}(u,v)$ be the higher-quality content under the prefix-chain construct. The main tables report the all-call target sensitivity, with ties counted as not correct:

$$P_{\mathrm{correct}}^{\mathrm{all}}(\delta;J)=\mathbb{E}_{(u,v,o)\in\mathcal{D}_\delta}\left[\mathbf{1}\{W_J(o)=y^{\star}(u,v)\}\right]. \tag{8}$$

When needed, conditional non-tie accuracy is reported separately as

$$P_{\mathrm{correct}}^{\mathrm{non\text{-}tie}}(\delta;J)=\Pr\left[W_J(o)=y^{\star}(u,v)\mid W_J(o)\neq\mathrm{tie}\right]. \tag{9}$$

Using an isotonic fit $\widehat{P}_{\mathrm{correct}}(\delta)$, the 75 percent detection threshold is

$$\Delta^{\star}_{\mathbf{75}}(J)=\inf\left\{\delta\geq 0:\widehat{P}_{\mathrm{correct}}(\delta)\geq 0.75\right\}. \tag{10}$$

If the threshold is reached at the smallest measured nonzero step $\delta=1$, we report $\Delta^{\star}_{\mathbf{75}}\leq 1$ and mark it as left-censored.
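One way to realize the isotonic fit and the threshold in Eq. 10 is a weighted pool-adjacent-violators pass followed by a scan over the measured ladder steps. The sketch below makes two assumptions that the text does not confirm: the fit is weighted by call counts, and the threshold is read off the measured steps without interpolation. The per-step accuracies are hypothetical.

```python
def isotonic_fit(values, weights):
    """Weighted pool-adjacent-violators fit to a non-decreasing sequence."""
    blocks = []  # each block holds [weighted mean, total weight, length]
    for v, w in zip(values, weights):
        blocks.append([v, w, 1])
        # Merge backwards while the monotonicity constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


def delta75(deltas, p_correct, n_calls, level=0.75):
    """Return (threshold, left_censored); threshold is None if never reached."""
    fitted = isotonic_fit(p_correct, n_calls)
    for d, p in zip(deltas, fitted):
        if p >= level:
            return d, d == deltas[0]
    return None, False


steps = [1, 2, 3, 4, 5]
calls = [100, 80, 60, 40, 20]
# Hypothetical accuracies with one monotonicity violation at step 3.
late = delta75(steps, [0.40, 0.55, 0.50, 0.80, 0.95], calls)
# Hypothetical judge that already clears 0.75 at the first step.
early = delta75(steps, [0.90, 0.95, 1.00, 1.00, 1.00], calls)
```

In the first case the violation at steps 2 and 3 is pooled and the threshold lands on step 4; in the second the threshold is met at the first step and is flagged as left-censored, which corresponds to reporting it as at most 1.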

We define criterion operationally as the tie operating point induced by prompt $p$ on condition $\mathcal{D}$:

$$C_J(p;\mathcal{D})=\Pr\left[W_J(o;p)=\mathrm{tie}\mid o\in\mathcal{D}\right],\qquad \Delta C=C_J(p_{\mathrm{strict}};\mathcal{D})-C_J(p_{\mathrm{base}};\mathcal{D}). \tag{11}$$

This is analogous to shifting the criterion in signal detection theory, but we do not fit a full parametric SDT model.

Miss-by-tie is the rate at which a target signal exists but the judge returns tie:

$$\mathrm{MBT}(\delta;J)=\Pr\left[W_J(o)=\mathrm{tie},\;y^{\star}(u,v)\ \mathrm{exists}\mid(u,v,o)\in\mathcal{D}_\delta\right]. \tag{12}$$
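The call-level quantities in Eqs. 8, 9, 11, and 12 can be computed from a list of (winner, target) records. The sketch below is illustrative only; the record layout, function names, and the four example calls are assumptions.

```python
def target_metrics(calls):
    """Summarize nonzero-ladder calls given as (winner, target) tuples.

    `winner` is a content label or "tie"; `target` is the higher-quality
    content y*, which always exists on nonzero ladder pairs.
    """
    n = len(calls)
    ties = sum(1 for w, _ in calls if w == "tie")
    correct = sum(1 for w, y in calls if w == y)
    non_tie = n - ties
    return {
        "p_correct_all": correct / n,                  # Eq. 8
        "p_correct_non_tie": (correct / non_tie) if non_tie else None,  # Eq. 9
        "miss_by_tie": ties / n,                       # Eq. 12
        "wrong_choice": (non_tie - correct) / n,
    }


def criterion_shift(tie_rate_strict, tie_rate_base):
    """Operating-point shift Delta C from Eq. 11."""
    return tie_rate_strict - tie_rate_base


# Hypothetical calls: half are absorbed into ties, none are wrong choices.
m = target_metrics([("u", "u"), ("tie", "u"), ("v", "v"), ("tie", "v")])
```

In this invented example, all-call accuracy falls to 0.5 while conditional non-tie accuracy stays at 1.0, so the entire loss is miss-by-tie. That is the pattern one would expect from a criterion shift rather than from a loss of resolution.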
Algorithm 1: Judge Datasheet Measurement Protocol.

Input: judge $J$, task set $\mathcal{T}$, ladder depth $L$, prompt set $\mathcal{P}=\{p_{\mathrm{base}},\ \text{optional}\ p_{\mathrm{strict}}\}$.

Output: datasheet $d=(\mathrm{DC},\mathrm{RFP}_0,\mathrm{SCS},\mathrm{PFP},\mathrm{OSC},\mathrm{Other},P_{\mathrm{correct}},\Delta^{\star}_{\mathbf{75}},\ \text{criterion metrics})$.

1. Build the true-vacuum set $\mathcal{V}$.
2. Measure dark current on $\mathcal{V}$ under the tie-allowed protocol.
3. For each task $t\in\mathcal{T}$, draw a prefix-chain requirement order and construct $Q_0,\ldots,Q_L$ with nested requirement sets.
4. Build nonzero ladder pairs $\mathcal{D}_\delta$ for $\delta=1,\ldots,L$.
5. Build $\Delta 0$ same-subset and $\Delta 0$ different-subset controls.
6. For each pair, evaluate both presentation orders and map slot outputs back to canonical content identity.
7. Compute raw $\Delta 0$ false preference.
8. Decompose $\Delta 0$ responses into stable cross-sensitivity, positional false preference, one-sided commit, conflict, and no-preference.
9. Compute $P_{\mathrm{correct}}(\delta)$ and $\Delta^{\star}_{\mathbf{75}}$.
10. If a strict criterion arm is run, re-evaluate selected pairs under the strict prompt and compute operating-point shift and miss-by-tie.
11. Return the datasheet.

The algorithm is schematic\. Exact prompts and JSON schemas are in the appendix and artifacts\. This is a measurement protocol, not a model\-training algorithm\.

#### A0 true vacuum\.

The A0 arm measures response in the absence of an evaluative signal according to Eq\.[1](https://arxiv.org/html/2606.15610#S3.E1)\. A calibrated judge should not manufacture a preference on these inputs\. The reported dark current is the false\-preference rate under the tie\-allowed protocol after schema and semantic validity checks\.

#### A1 controlled quality ladder.

The A1 arm uses a constructively controlled checklist ladder. Each task has a set of required elements. Responses are generated as a prefix chain so that higher levels include a superset of required elements. This yields Pareto-dominant nonzero $\Delta Q$ pairs for target-sensitivity measurement. The ladder is intentionally simple: it is not a claim about natural answer quality, but a controlled psychophysical stimulus for the judge.
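The prefix-chain structure can be illustrated at the level of requirement sets. This sketch covers only the nesting and the pair enumeration, not response generation or length matching. It assumes that level $k$ contains the first $k$ requirements of a randomly drawn order and that the lowest level is empty; the function names and requirement labels are hypothetical.

```python
import random
from itertools import combinations


def build_ladder(requirements, seed=0):
    """Draw a requirement order and return nested sets Q_0, ..., Q_L."""
    rng = random.Random(seed)
    order = list(requirements)
    rng.shuffle(order)
    return [frozenset(order[:k]) for k in range(len(order) + 1)]


def nonzero_pairs(ladder):
    """Group canonical (lower, higher) pairs by their level gap delta."""
    pairs = {}
    for i, j in combinations(range(len(ladder)), 2):
        pairs.setdefault(j - i, []).append((ladder[i], ladder[j]))
    return pairs


ladder = build_ladder(["r1", "r2", "r3", "r4", "r5"], seed=1)
by_gap = nonzero_pairs(ladder)
counts = {gap: len(p) for gap, p in sorted(by_gap.items())}
```

With five requirements the chain has six levels, and each task yields 5, 4, 3, 2, and 1 canonical pairs at gaps 1 through 5. Because every higher level is a strict superset of every lower one, each nonzero pair is Pareto-dominant with respect to the checklist.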

In the common A1a setting, each judge is evaluated on 10 tasks with 60 canonical $\Delta 0$ same-subset pairs and the full prefix-chain nonzero ladder: 50 canonical $\Delta 1$, 40 $\Delta 2$, 30 $\Delta 3$, 20 $\Delta 4$, and 10 $\Delta 5$ pairs, each evaluated in both presentation orders. Publication-facing confidence intervals for call-level proportions use Wilson binomial intervals from the observed numerator and denominator. Task-cluster bootstrap intervals remain in the source artifacts for exploratory diagnostics, but zero-width bootstrap intervals at boundary estimates are not used as publication intervals.

Primary ladder candidates are length-matched within task to avoid conflating checklist cardinality with verbosity. Filler-control pairs separately measure whether a judge rewards or penalizes non-informative filler, and filler variants are not pooled into the primary $\Delta Q$ curve. Filler diagnostics are retained as datasheet fields in the source artifacts; no publication-safe common three-judge filler rate is reported in the main table.
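The Wilson score interval mentioned above has a closed form and can be sketched directly. The function name is an assumption, and the example uses 120 calls because 60 canonical pairs evaluated in both orders give that many; the zero numerator is a hypothetical boundary case, not a reported result.

```python
import math


def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (default 95 percent)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Boundary estimate: 0 events in 120 calls (60 pairs in two orders).
low, high = wilson_interval(0, 120)
```

At a boundary estimate such as 0 out of 120, the interval still has nonzero width, with an upper limit of roughly 0.031. This is the practical reason to prefer it over a bootstrap interval that collapses to zero width when every resample has the same outcome.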

#### $\Delta 0$ controls.

The protocol separates $\Delta 0$ same-subset and $\Delta 0$ different-subset pairs. Same-subset pairs hold the target checklist fixed and vary surface form or presentation. Different-subset pairs keep the same cardinality but exchange checklist elements. Raw false preference on $\Delta 0$ pairs follows Eq. [2](https://arxiv.org/html/2606.15610#S3.E2) and is not automatically cross-sensitivity. For each canonical unordered pair, we collect both presentation orders and map non-tie winners back to canonical candidate identity. A pair is classified as no-preference, stable direction, positional flip, one-sided commit, or other conflict.

#### Direction\-stability decomposition\.

Stable cross-sensitivity is the stable-direction rate on $\Delta 0$ same-subset pairs in Eq. [3](https://arxiv.org/html/2606.15610#S3.E3). Positional false preference is the positional-flip rate in Eq. [4](https://arxiv.org/html/2606.15610#S3.E4): the judge chooses the candidate shown in the same slot under both presentation orders, causing canonical direction to reverse. This distinction is central. A judge can have raw false preference near one while having almost no stable cross-sensitivity if the apparent preference is driven by presentation order.

#### Target sensitivity and SDT\.

For nonzero $\Delta Q$ pairs, the protocol reports all-call target sensitivity in Eq. [8](https://arxiv.org/html/2606.15610#S3.E8), tie rate, optional conditional non-tie accuracy in Eq. [9](https://arxiv.org/html/2606.15610#S3.E9), and the detection threshold in Eq. [10](https://arxiv.org/html/2606.15610#S3.E10). When the threshold is reached at the smallest measured nonzero step, it is reported as $\leq 1$, not exactly one. The current five-requirement ladder is too coarse to resolve sub-unit thresholds.

#### Criterion\-shift probe\.

The A1c-2 probe tightens the tie criterion for Qwen32B according to Eq. [11](https://arxiv.org/html/2606.15610#S3.E11). This intervention tests whether false preference can be reduced by changing the operating point and whether doing so preserves sensitivity to target differences. Prompting moves the criterion, not the resolution.

## 4 Results: Three Judges

The main datasheet separates three quantities that are easily conflated. Raw $\Delta 0$ false preference is simply the rate at which a judge refuses to tie on same-quality pairs. Stable cross-sensitivity is the subset of those decisions that remain content-consistent after presentation order is reversed. Positional false preference is the opposite failure mode: the judge chooses the same slot under both orders, so the canonical content direction flips. Thus raw $\Delta 0$ false preference is a mixture, not a mechanism. Profile sketches are descriptive summaries, not universal taxonomic classes; they summarize the observed profile in this controlled stimulus family.

Table 3: Compact Judge Datasheet summary for the three common A1a judge runs. Compact values are shown; full confidence intervals and denominators appear in the full confidence-interval table in the Appendix. Raw $\Delta 0$ false preference is not cross-sensitivity. Stable cross-sensitivity and positional false preference are canonical-pair decomposition rates after both presentation orders are mapped back to content identity. SCS and PFP are mechanism components, not aliases for raw $\Delta 0$ false preference; the weighted relation including one-sided commit and other conflict is given in Eq. [7](https://arxiv.org/html/2606.15610#S3.E7). Profile sketches are descriptive summaries for this controlled stimulus family. $\Delta^{\star}_{\mathbf{75}}\leq 1$ is left-censored at the smallest measured nonzero ladder step.

![Figure 1](https://arxiv.org/html/2606.15610v1/figures/fig1_metrology_axes_direction_stable.png)

Figure 1: Metrology axes map for the common A1a runs. The x-axis is true-vacuum dark current from Eq. [1](https://arxiv.org/html/2606.15610#S3.E1); the y-axis is stable cross-sensitivity from Eq. [3](https://arxiv.org/html/2606.15610#S3.E3). Marker size reflects all-call target sensitivity at $\Delta 1$. The strict-criterion arm is excluded because its A0 dark-current arm was not remeasured and is shown only in the operating-point tradeoff in Fig. [4](https://arxiv.org/html/2606.15610#S5.F4).

![Figure 2](https://arxiv.org/html/2606.15610v1/figures/fig2_psychometric_curves.png)

Figure 2: Psychometric target sensitivity curves $P_{\mathrm{correct}}^{\mathrm{all}}(\delta;J)$ from Eq. [8](https://arxiv.org/html/2606.15610#S3.E8) for the common A1a ladder. Error bars use Wilson call-binomial 95 percent intervals computed from observed correct counts and denominators. Qwen14B and Qwen32B reach the 75 percent threshold at the smallest measured nonzero step, so $\Delta^{\star}_{\mathbf{75}}$ from Eq. [10](https://arxiv.org/html/2606.15610#S3.E10) is reported as $\leq 1$. Llama8B requires a larger step, with $\Delta^{\star}_{\mathbf{75}}=4.0$ in this run.

![Figure 3](https://arxiv.org/html/2606.15610v1/figures/fig4_delta0_decomposition_stacked_bar.png)

Figure 3: $\Delta 0$ same-subset direction-stability decomposition. Raw false preference from Eq. [2](https://arxiv.org/html/2606.15610#S3.E2) is not cross-sensitivity; under Eq. [7](https://arxiv.org/html/2606.15610#S3.E7) it decomposes into stable cross-sensitivity from Eq. [3](https://arxiv.org/html/2606.15610#S3.E3), positional false preference from Eq. [4](https://arxiv.org/html/2606.15610#S3.E4), one-sided commit from Eq. [5](https://arxiv.org/html/2606.15610#S3.E5), other conflict from Eq. [6](https://arxiv.org/html/2606.15610#S3.E6), and no-preference. The stacked components are canonical-pair components, while raw $\mathrm{RFP}_0$ is a two-call average; one-sided commit enters $\mathrm{RFP}_0$ with a one-half coefficient. Llama8B's raw false preference is mostly presentation-driven, Qwen14B is mixed, and Qwen32B is low on both stable and positional components. The strict-criterion bar is included only for the $\Delta 0$ same-subset decomposition; A0 dark current was not remeasured under that prompt. Confidence intervals for the decomposed canonical-pair cells are not included in the frozen artifact package and are therefore not inferred here.

Table [3](https://arxiv.org/html/2606.15610#S4.T3) summarizes the main datasheet, with full intervals and denominators in the full confidence-interval table in the Appendix. Llama8B has dark current 0.6667 by Eq. [1](https://arxiv.org/html/2606.15610#S3.E1) and raw $\Delta 0$ false preference 1.0000 by Eq. [2](https://arxiv.org/html/2606.15610#S3.E2). However, its stable cross-sensitivity is only 0.0333, while positional false preference is 0.9667. This is a Class B / Presentation-conflicted profile: the judge is useful for pipeline debugging, but the $\Delta 0$ response is not stable evidence about surface-form sensitivity.

Qwen14B eliminates true-vacuum dark current and is target-sensitive at both $\Delta 1$ and $\Delta 5$. Its raw $\Delta 0$ false preference is 0.9917, but direction-stability decomposition revises the interpretation. Stable cross-sensitivity by Eq. [3](https://arxiv.org/html/2606.15610#S3.E3) is 0.4500 and positional false preference by Eq. [4](https://arxiv.org/html/2606.15610#S3.E4) is 0.5333. Thus, we assign it the descriptive label "Class A-delta0 / Mixed stable-positional". This highlights that high raw false preference should not be conflated with stable cross-sensitivity.

Qwen32B is vacuum-clean, target-sensitive, and has lower $\Delta 0$ artifacts. Its raw $\Delta 0$ false preference is 0.2583, stable cross-sensitivity is 0.0000, positional false preference is 0.0833, and no-preference is 0.5667. In this controlled datasheet it has the cleanest observed profile among the three open-weight judges. This is a descriptive result within this stimulus family, not a universal reliability claim.

The three profiles demonstrate why a scalar win-rate or agreement score is not enough. A high raw same-quality false preference can mean stable sensitivity to nuisance variation, positional false preference, one-sided commit, other conflict, or residual no-preference structure. Raw $\Delta 0$ false preference follows the weighted relation in Eq. [7](https://arxiv.org/html/2606.15610#S3.E7): stable and positional two-order commitments contribute fully, one-sided commit contributes half, and no-preference contributes zero. These components have different downstream interpretations and require different decisions; Appendix Fig. [5](https://arxiv.org/html/2606.15610#A1.F5) shows the same decomposition as a positional-vs-stable scatter.
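As a worked check of the weighted relation in Eq. 7 against the values quoted above, and assuming Other is zero as stated for the clean runs, the decomposition can be verified by hand. The one-sided-commit rate for Qwen14B below is implied by the reported figures and is not a value quoted from the tables.

$$
\begin{aligned}
\text{Llama8B:}\quad & \mathrm{SCS}+\mathrm{PFP}=0.0333+0.9667=1.0000=\mathrm{RFP}_0, \text{ leaving no room for one-sided commits;}\\
\text{Qwen14B:}\quad & \mathrm{RFP}_0-\mathrm{SCS}-\mathrm{PFP}=0.9917-0.4500-0.5333=0.0084=\mathrm{OSC}/2 \;\Rightarrow\; \mathrm{OSC}\approx 0.017.
\end{aligned}
$$

Both judges have raw false preference near one, yet the split between the stable and positional terms differs sharply, which is the information a scalar rate discards.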

## 5 Criterion Shift

Table 4: Criterion intervention on Qwen32B. The strict prompt eliminates raw $\Delta 0$ false preference, but the lost $\Delta Q = 1$ sensitivity is miss-by-tie rather than wrong-choice error. The $\Delta 0$ no-preference row is a canonical-pair-level decomposition quantity; tie-rate rows are call-level quantities. The strict arm did not run the full threshold protocol, so $\Delta^{\star}_{\mathbf{75}}$ is not estimated.

![Refer to caption](https://arxiv.org/html/2606.15610v1/figures/fig3_criterion_shift_tradeoff.png)

Figure 4: Operating-point tradeoff for Qwen32B under the two-prompt criterion intervention in Eq. [11](https://arxiv.org/html/2606.15610#S3.E11). This is a two-operating-point segment, not an estimated ROC curve. The strict criterion eliminates raw $\Delta 0$ false preference but absorbs $\Delta Q = 1$ into tie. The $\Delta Q = 1$ errors are miss-by-tie as defined in Eq. [12](https://arxiv.org/html/2606.15610#S3.E12), not wrong choices: tie rate is 0.5000, wrong-choice rate is 0.0000, and conditional accuracy among non-ties is 1.0000. $\Delta Q = 5$ sensitivity is preserved. Error bars use Wilson call-binomial 95 percent intervals.

The strict tie prompt changes the operating point defined in Eq. [11](https://arxiv.org/html/2606.15610#S3.E11). Under the baseline protocol, Qwen32B has a call-level $\Delta 0$ same-subset tie rate of 0.7417, corresponding to the pair-level no-preference rate of 0.5667 in Table [4](https://arxiv.org/html/2606.15610#S5.T4) and Fig. [3](https://arxiv.org/html/2606.15610#S4.F3); its call-level $\Delta Q = 1$ tie rate is 0.0600, and its call-level $\Delta Q = 5$ tie rate is 0.0000. Under the strict criterion, raw $\Delta 0$ false preference becomes 0.0000 and pair-level no-preference becomes 1.0000. This is desirable if the goal is to suppress same-quality surface-form over-discrimination.

The $\Delta 0$ tie rate and Fig. [3](https://arxiv.org/html/2606.15610#S4.F3) no-preference are related but not identical quantities. The tie rate is call-level; no-preference in the decomposition is canonical-pair-level after both orders are collapsed. In the two-order design, one-sided commits contribute one tie and one non-tie call. Thus Qwen32B baseline call-level $\Delta 0$ tie rate 0.7417 is consistent with canonical no-preference 0.5667 plus one half of the one-sided-commit mass.
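The consistency claim in the paragraph above can be checked arithmetically from the reported Qwen32B baseline values. This is a sketch under one stated assumption: the one-sided-commit mass is not quoted in this section, so it is inferred here as the remainder after stable, positional, and no-preference, which treats the residual other-conflict mass as zero.

```python
# Reported Qwen32B baseline values (pair-level unless noted).
reported = {
    "raw_fp": 0.2583,      # call-level raw Delta-0 false preference
    "tie_rate": 0.7417,    # call-level Delta-0 tie rate
    "stable": 0.0000,
    "positional": 0.0833,
    "no_pref": 0.5667,
}

# ASSUMPTION: other conflict is zero, so one-sided commit is the remainder.
one_sided = 1.0 - reported["stable"] - reported["positional"] - reported["no_pref"]

# Each one-sided pair contributes one non-tie call and one tie call,
# so it adds half its mass to each call-level rate.
raw_from_components = reported["stable"] + reported["positional"] + 0.5 * one_sided
tie_from_components = reported["no_pref"] + 0.5 * one_sided

print(f"inferred one-sided mass: {one_sided:.4f}")
print(f"raw false preference:    {raw_from_components:.4f} vs {reported['raw_fp']:.4f}")
print(f"tie rate:                {tie_from_components:.4f} vs {reported['tie_rate']:.4f}")
```

Under that assumption both call-level rates are recovered to four decimals, and the two call-level rates sum to one as they should.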

The cost appears at the margin. $\Delta Q = 1$ target sensitivity falls from 0.9400 to 0.5000. The loss is not a wrong-choice problem: at $\Delta Q = 1$, the wrong-choice rate is 0.0000 and conditional accuracy among non-ties is 1.0000. The low-strength signal is absorbed into tie. At $\Delta Q = 5$, sensitivity remains 1.0000. This pattern supports the criterion interpretation: prompting can move the tie/preference threshold, but it does not create new resolution at low $\Delta Q$.
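The distinction between miss-by-tie and wrong choice can be made concrete with a small tally. The sketch below is illustrative: the function name and outcome labels are hypothetical, and the call counts are the ones implied by the reported rates with $n = 100$ calls at $\Delta Q = 1$, not raw logs.

```python
def target_outcome_rates(outcomes):
    """Split target-pair calls into correct, tie, and wrong outcomes.

    outcomes: list of "correct", "tie", or "wrong" (hypothetical labels).
    Ties count against target sensitivity but not as wrong choices.
    """
    n = len(outcomes)
    correct = outcomes.count("correct")
    tie = outcomes.count("tie")
    wrong = outcomes.count("wrong")
    committed = correct + wrong
    return {
        "target_sensitivity": correct / n,
        "miss_by_tie": tie / n,
        "wrong_choice": wrong / n,
        # Undefined when the judge never commits.
        "conditional_accuracy": correct / committed if committed else None,
    }

# Counts implied by the reported rates at Delta Q = 1 (n = 100 calls).
baseline = target_outcome_rates(["correct"] * 94 + ["tie"] * 6)
strict = target_outcome_rates(["correct"] * 50 + ["tie"] * 50)
```

In both arms the conditional accuracy is 1.0 and the wrong-choice rate is 0.0; only the share of calls absorbed into tie changes, which is what a criterion shift without a resolution change looks like.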

## 6 Discussion

Judge datasheets should be run before downstream evaluation claims. A downstream evaluation can be syntactically valid while the judge has high dark current, position-driven false preference, or insufficient target sensitivity. The present case study shows all three possibilities. Llama8B is a strong pipeline-debug judge but a poor calibrated judge for same-quality comparisons. Qwen14B is more sensitive and vacuum-clean, but the direction-stability decomposition shows that raw $\Delta 0$ false preference mixes stable, positional, one-sided, conflict, and no-preference components. Qwen32B has the cleanest profile in this family, but it remains a controlled case-study result rather than a universal guarantee.

The result also clarifies the relationship to the motivating downstream orientation study\. If orientation capacity is later modeled as geometry times judge resolvability, the judge\-resolvability term cannot be assumed\. It must be measured\. The current paper supplies that measurement layer\. It does not validate the downstream mechanism\.

Criterion control is useful but limited\. A stricter tie prompt can suppress false positives, but it may also convert weak true positives into no\-preference\. This resembles changing a detector threshold\. The right operating point depends on the downstream use: exploratory ranking may tolerate a lower threshold, while claim\-level adjudication requires low dark current and explicit uncertainty about low\-strength signals\.

A future phase should use a stronger external comparator or reference judge as a ceiling estimate, not as ground truth. It should also refine the ladder below one requirement, because $\Delta^{\star}_{\mathbf{75}} \leq 1$ is left-censored in the present design. Finally, the protocol should be extended across model families and naturalistic answer domains before any generality claim is made.

## 7 Limitations

Table 5: Explicit non-claims maintained in the draft.

The stimulus ladder is synthetic. Its strength is constructive control, not ecological completeness. Same-quality surface differences may be genuinely visible to humans, and cross-sensitivity is not always bad. In some applications, style, specificity, or clarity are part of the target construct. The present protocol treats them as nuisance axes only because the controlled task defines the checklist as the target.

There is no human ground truth or reference/API judge in the reported runs\. Future reference\-judge work should be treated as a ceiling or external comparator, not as ground truth\. Qwen32B is not claim\-level evidence for the downstream mechanism hypothesis\. It is a cleaner measurement instrument in this controlled datasheet\.

The $\Delta^{\star}_{\mathbf{75}}$ values for Qwen14B and Qwen32B are left-censored. The current ladder measures integer requirement differences. If a judge reaches 75 percent at the smallest nonzero step, the threshold is $\leq 1$, not exactly one. Finer perturbations are required to compare sub-unit sensitivity.
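The left-censoring logic can be sketched as a threshold search over an isotonic fit. This is a minimal illustration under stated assumptions, not the authors' fitting code: the pool-adjacent-violators routine, the optional weighting, and both sensitivity curves are hypothetical and chosen only to exercise the two cases.

```python
def isotonic_nondecreasing(values, weights=None):
    """Pool-adjacent-violators fit of a nondecreasing sequence."""
    if weights is None:
        weights = [1.0] * len(values)
    blocks = []  # each block: [weighted mean, total weight, point count]
    for value, weight in zip(values, weights):
        blocks.append([value, weight, 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, c1 + c2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted

def detection_threshold(deltas, sensitivities, level=0.75):
    """Return (threshold, left_censored).

    threshold is the smallest ladder step whose fitted sensitivity reaches
    `level`, or None if no step does. left_censored is True when the very
    first measured step already reaches the level, so the true threshold
    may lie anywhere at or below that step.
    """
    fitted = isotonic_nondecreasing(sensitivities)
    for delta, value in zip(deltas, fitted):
        if value >= level:
            return delta, delta == deltas[0]
    return None, False

deltas = [1, 2, 3, 4, 5]
# Hypothetical curves, not values from the paper.
resolved = detection_threshold(deltas, [0.50, 0.60, 0.90, 0.80, 1.00])
censored = detection_threshold(deltas, [0.90, 0.95, 1.00, 1.00, 1.00])
```

The second curve returns step 1 with the censoring flag set, which is why such a value is reported as an upper bound rather than as a point estimate.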

The case study uses three open\-weight judges\. This is insufficient for broad family\-level claims\. The results should be read as a demonstration of metrological profiling and as evidence that scalar judge scores can hide distinct failure modes\.

This paper establishes the metrology base for LLM evaluation under highly controlled conditions\. Future work will extend this Judge Datasheet protocol to large\-scale, naturalistic NLP benchmarks, such as RewardBench\[[12](https://arxiv.org/html/2606.15610#bib.bib12)\], to evaluate its ecological validity and operational cost\-performance in production environments\.

## 8 Conclusion

LLM judges are measurement instruments. They can have dark current, positional false preference, stable cross-sensitivity, target sensitivity, and criterion shifts. Reporting only accuracy, win-rate, or agreement hides these axes. We introduced a Judge Datasheet protocol that measures true-vacuum response, controlled target sensitivity, direction-stability under $\Delta 0$ same-quality pairs, and criterion tradeoffs. In a three-judge case study, Llama8B, Qwen14B, and Qwen32B occupy different profiles that would be collapsed by scalar reporting. Before a judge is used to support downstream scientific claims, the measuring device itself should be measured.

## Artifact Statement

This arXiv version does not include a separate public artifact release\. The submission source package contains only the files required to build the manuscript\. The reported experiments use controlled synthetic stimuli and aggregate outputs; raw call/response logs, service endpoints, local absolute paths, and environment\-specific operational traces are not distributed\.

The prompt structure, output schema, stimulus design, denominators, confidence intervals, and numerical audit information needed to assess the reported claims are described in the manuscript and appendix\. Additional reproducibility materials may be provided under the policy of a future peer\-review venue\.

## AI Usage Statement

The authors used AI\-assisted tools in a controlled research workflow for brainstorming, implementation support, code refactoring, draft organization, language polishing, and internal critique\. The reported experimental calls were executed through logged scripts on controlled local or self\-hosted infrastructure unless explicitly stated otherwise\. All experimental designs, model outputs, numerical results, claims, citations, and final text were inspected and verified by the human authors\. AI systems were not authors and are not responsible for the content\. The human authors take full responsibility for the accuracy, integrity, and originality of the manuscript\.

## Acknowledgments

This work was conducted in part within a collaborative research context between Chubu University and Mitsubishi Heavy Industries, Ltd\., Research & Innovation Center\. We thank members of the Research & Innovation Center for discussions on AI\-cluster operation, hardware\-configuration constraints, and deployment\-oriented evaluation settings\. These discussions helped motivate the need to treat LLM judges as measurement instruments rather than scalar scorers\. The judge\-metrology protocol, stimulus construction, model\-call execution, analysis scripts, numerical audits, and manuscript claims reported in this paper were developed and verified by the authors\. The reported experiments use controlled synthetic stimuli and logged local or self\-hosted inference infrastructure; they do not rely on proprietary deployment data, confidential site logs, or non\-public hardware\-integration details\.

## Appendix A

### A.1 Protocol details

The A0 protocol uses true-vacuum pairs: empty strings, whitespace-only candidates, and identical non-empty candidates. The A1 protocol uses a prefix-chain checklist ladder. Each task has required elements; quality level is the number of included elements under a fixed ordering. Primary nonzero pairs compare levels separated by $\Delta Q$. $\Delta 0$ same-subset pairs compare surface variants of the same included set. $\Delta 0$ different-subset pairs compare equal-cardinality but different included sets.
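The ladder construction described above can be sketched in a few lines. Everything here is illustrative: the function names, the three placeholder checklist elements, and the inclusion of an empty level 0 are assumptions, since this section does not state the number of elements per task or whether the ladder starts at zero.

```python
from itertools import combinations

def prefix_chain(elements):
    """Level k holds the first k elements under the fixed ordering."""
    return [tuple(elements[:k]) for k in range(len(elements) + 1)]

def primary_pairs(levels, delta_q):
    """Index pairs (lower, higher) of ladder levels separated by delta_q."""
    return [(lo, lo + delta_q) for lo in range(len(levels) - delta_q)]

def different_subset_pairs(elements, size):
    """Equal-cardinality pairs of different included sets (Delta-0 controls)."""
    subsets = list(combinations(elements, size))
    return list(combinations(subsets, 2))

# Hypothetical checklist for one task.
elements = ["states units", "names method", "reports error"]
levels = prefix_chain(elements)
delta1 = primary_pairs(levels, 1)
delta0_diff = different_subset_pairs(elements, 2)

# Every primary pair is nested: the higher level strictly contains the lower.
nested = all(set(levels[lo]) < set(levels[hi]) for lo, hi in delta1)
```

The strict-subset check at the end is the constructive dominance property that the ladder relies on; different-subset pairs deliberately lack it because neither set contains the other.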

### A.2 Glossary

Judge Datasheet. A structured report of evaluator behavior across dark current, raw $\Delta 0$ false preference, stable cross-sensitivity, positional false preference, target sensitivity, threshold, and criterion; interpretive summary built from measured quantities.

True vacuum. Inputs with no evaluative signal, including empty, whitespace, and identical non-empty candidate pairs; constructed stimulus condition.

Valid abstention. A parser- and schema-valid no-decision output when the protocol permits abstention; measured output category.

No-preference. Tie or valid abstention depending on the protocol and table level; measured output category.

$\Delta 0$ same-subset. Same-quality pairs with the same checklist subset and surface-form variation; constructed control condition.

$\Delta 0$ different-subset. Same-cardinality pairs with different checklist elements; constructed control condition.

Prefix-chain ladder. A nested response family in which higher levels include a superset of required elements; constructed stimulus ladder.

Pareto dominance, construct-relative. The relation that higher ladder levels contain all lower-level checklist elements plus more under the defined construct; constructed relation, not a universal quality claim.

Raw $\Delta 0$ false preference. The call-level non-tie rate on same-quality pairs; measured quantity and not a mechanism by itself.

Stable cross-sensitivity. A canonical-pair-level content-stable direction under order reversal; measured decomposition quantity interpreted as nuisance in this checklist construct.

Positional false preference. A canonical-pair-level same-slot choice under order reversal; measured decomposition quantity indicating slot-driven preference.

One-sided commit. A canonical-pair-level component where exactly one presentation order yields a non-tie decision; measured decomposition quantity contributing one half to raw $\Delta 0$ false preference.

Other conflict. The residual other-conflict contribution after stable cross-sensitivity, positional false preference, and one-sided commit are accounted for; measured decomposition quantity under the frozen classification.

Target sensitivity. Correct selection of the Pareto-dominant candidate on nonzero ladder pairs; measured call-level quantity.

$\Delta^{\star}_{\mathbf{75}}$. The smallest ladder difference at which isotonic target sensitivity reaches 0.75; fitted ladder-level quantity.

Left-censoring. A threshold reported as $\leq 1$ because the first measured nonzero ladder step already reaches the threshold; interpretive flag from the ladder design.

Criterion / operating point. The prompt-induced tie/preference boundary, operationalized as tie rate by condition; measured condition-level quantity.

Miss-by-tie. A target signal not selected because the judge returns tie; measured call-level error mode.

Presentation-conflicted. A profile sketch where raw $\Delta 0$ false preference is dominated by presentation-order conflict or slot effects; interpretive label for this stimulus family.

External comparator / ceiling estimate. A stronger or reference judge used for comparison, not as ground truth; interpretive role for future work.

Downstream mechanism hypothesis. The motivating orientation hypothesis outside the present judge-metrology paper; interpretive context, not tested here.

### A.3 Length and filler controls

Primary ladder candidates are length-matched within task. Filler variants are diagnostic controls for verbosity or filler preference and are not pooled into the primary $\Delta Q$ curve. The frozen working-snapshot artifact package (frozen WS package) does not expose a common three-judge filler-clean preference value suitable for the main datasheet table.

### A.4 Prompt templates

The baseline pair-tie-allowed prompt instructs the judge that the two candidates may be equally good and allows a tie. The strict criterion prompt strengthens the instruction to choose tie whenever differences are only wording, style, fluency, verbosity, or surface form. The artifact package contains prompt templates, output JSON schemas, prompt-rendering code, stimulus manifests, and redacted rendered-prompt records sufficient to reconstruct the prompts used in the reported runs. Full unredacted service logs are retained only in the internal frozen package for audit and are not part of the public release. The prompt construction does not expose expected winners, $\Delta Q$ labels, quality levels, or canonical-pair metadata to the judge prompt.

### A.5 Prefix-chain dominance

For primary ladder pairs, higher levels include a superset of checklist elements\. Under the task construct, this yields a constructive Pareto dominance relation\. This proof is construct\-relative: it does not imply that natural answer quality is fully captured by checklist cardinality\.

### A.6 Metric definitions

Main metric definitions are given in Eqs. [1](https://arxiv.org/html/2606.15610#S3.E1)–[11](https://arxiv.org/html/2606.15610#S3.E11) and Eq. [12](https://arxiv.org/html/2606.15610#S3.E12). The important implementation detail is that slot outputs are mapped back to canonical content identity before direction-stability quantities are computed. Raw $\Delta 0$ false preference is a call-level non-tie rate and is not treated as stable cross-sensitivity.
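The slot-to-content mapping step can be sketched as follows. This is an illustration, not the authors' code: the output labels `"first"` and `"second"`, the content labels `"x"` and `"y"`, and the function name are hypothetical.

```python
TIE = "tie"

def to_canonical(slot_choice, order):
    """Map a slot-level judge output back to canonical content identity.

    slot_choice: "first", "second", or TIE (hypothetical output labels).
    order: (content shown in the first slot, content shown in the second).
    """
    if slot_choice == TIE:
        return TIE
    if slot_choice == "first":
        return order[0]
    if slot_choice == "second":
        return order[1]
    raise ValueError(f"unexpected slot choice: {slot_choice!r}")

# One canonical pair presented in both orders.
forward = ("x", "y")
reverse = ("y", "x")

# A judge that always picks the first slot selects different content
# in the two orders, which is a slot-driven pattern.
always_first = (to_canonical("first", forward), to_canonical("first", reverse))

# A judge that follows content "x" picks different slots in the two
# orders but the same content, which is a content-stable pattern.
follows_content = (to_canonical("first", forward), to_canonical("second", reverse))
```

Without this mapping the two judges above would be indistinguishable from slot outputs counted one order at a time, since each commits on every call.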

### A.7 Confidence intervals

Publication-facing call-level proportions use Wilson binomial 95 percent intervals computed from the observed numerator and denominator. This applies to dark current, raw $\Delta 0$ false preference, tie rates, and nonzero target-sensitivity proportions, including boundary estimates. Task-cluster bootstrap intervals are retained in the source artifacts for diagnostics, but zero-width bootstrap intervals at observed 0 or 1 are not printed as publication intervals. Cells without a valid design denominator are reported as N/A with a reason. No confidence intervals are invented in this draft. Stable cross-sensitivity and positional false preference CIs are N/A because the frozen package includes canonical-pair decomposition counts but not decomposition-level bootstrap intervals.
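For reference, the Wilson score interval can be computed directly from a numerator and denominator. The sketch below uses the standard formula; the function name is hypothetical, and the example counts are the ones implied by the tabulated estimates and denominators (for instance 0.6667 of 120 calls is 80 calls), not values taken from raw logs.

```python
import math

def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (default 95 percent)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)

# Counts implied by the reported estimates and denominators.
for label, k, n in [
    ("zero of 120 calls", 0, 120),
    ("80 of 120 calls", 80, 120),
    ("94 of 100 calls", 94, 100),
    ("20 of 20 calls", 20, 20),
]:
    lo, hi = wilson_interval(k, n)
    print(f"{label}: [{lo:.4f}, {hi:.4f}]")
```

Unlike a bootstrap or normal-approximation interval, the Wilson interval has nonzero width at an observed 0 or 1, which is why boundary estimates such as a tie rate of 0.0000 still carry an informative upper bound.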

Table 6: Full confidence-interval and denominator table for the reported datasheet metrics. Values marked N/A were not measured or not estimable in the frozen artifact. Raw delta0 false preference rows are call-level non-tie rates and should not be read as stable cross-sensitivity. Stable cross-sensitivity and positional false preference rows are canonical-pair decomposition rates; decomposition-level bootstrap intervals were not included in the frozen package and are not invented here. For Qwen14B and Qwen32B, $\Delta^{\star}_{\mathbf{75}}$ is displayed as $\leq 1$ [left-censored]; the underlying numeric plotting value in the frozen artifact is 1.0000.

| Judge | Metric | Estimate | 95% CI | $n$ | Method or reason |
| --- | --- | --- | --- | --- | --- |
| Llama8B | dark current false preference | 0.6667 | [0.5783, 0.7447] | 120 | Wilson binomial |
| Llama8B | raw delta0 false preference | 1.0000 | [0.9690, 1.0000] | 120 | Wilson binomial |
| Llama8B | $\Delta 0$ tie rate | 0.0000 | [0.0000, 0.0310] | 120 | Wilson binomial |
| Llama8B | target sensitivity $\Delta 1$ | 0.6100 | [0.5120, 0.6998] | 100 | Wilson binomial |
| Llama8B | $\Delta 1$ tie rate | 0.0000 | [0.0000, 0.0370] | 100 | Wilson binomial |
| Llama8B | target sensitivity $\Delta 5$ | 1.0000 | [0.8389, 1.0000] | 20 | Wilson binomial |
| Llama8B | $\Delta 5$ tie rate | 0.0000 | [0.0000, 0.1611] | 20 | Wilson binomial |
| Llama8B | $\Delta^{\star}_{\mathbf{75}}$ | 4.0000 | [2.7143, 4.0000] | 300 | source bootstrap; censor flag |
| Llama8B | $d'$ at $\Delta 5$ | 3.3812 | N/A | 20 | ceiling d-prime; CI N/A |
| Qwen14B | dark current false preference | 0.0000 | [0.0000, 0.0310] | 120 | Wilson binomial |
| Qwen14B | raw delta0 false preference | 0.9917 | [0.9543, 0.9985] | 120 | Wilson binomial |
| Qwen14B | $\Delta 0$ tie rate | 0.0083 | [0.0015, 0.0457] | 120 | Wilson binomial |
| Qwen14B | target sensitivity $\Delta 1$ | 1.0000 | [0.9630, 1.0000] | 100 | Wilson binomial |
| Qwen14B | $\Delta 1$ tie rate | 0.0000 | [0.0000, 0.0370] | 100 | Wilson binomial |
| Qwen14B | target sensitivity $\Delta 5$ | 1.0000 | [0.8389, 1.0000] | 20 | Wilson binomial |
| Qwen14B | $\Delta 5$ tie rate | 0.0000 | [0.0000, 0.1611] | 20 | Wilson binomial |
| Qwen14B | $\Delta^{\star}_{\mathbf{75}}$ | $\leq 1$ [left-censored] | N/A | 300 | source bootstrap; censor flag |
| Qwen14B | $d'$ at $\Delta 5$ | 3.3812 | N/A | 20 | ceiling d-prime; CI N/A |
| Qwen32B | dark current false preference | 0.0000 | [0.0000, 0.0310] | 120 | Wilson binomial |
| Qwen32B | raw delta0 false preference | 0.2583 | [0.1884, 0.3433] | 120 | Wilson binomial |
| Qwen32B | $\Delta 0$ tie rate | 0.7417 | [0.6567, 0.8116] | 120 | Wilson binomial |
| Qwen32B | target sensitivity $\Delta 1$ | 0.9400 | [0.8752, 0.9722] | 100 | Wilson binomial |
| Qwen32B | $\Delta 1$ tie rate | 0.0600 | [0.0278, 0.1248] | 100 | Wilson binomial |
| Qwen32B | target sensitivity $\Delta 5$ | 1.0000 | [0.8389, 1.0000] | 20 | Wilson binomial |
| Qwen32B | $\Delta 5$ tie rate | 0.0000 | [0.0000, 0.1611] | 20 | Wilson binomial |
| Qwen32B | $\Delta^{\star}_{\mathbf{75}}$ | $\leq 1$ [left-censored] | N/A | 300 | source bootstrap; censor flag |
| Qwen32B | $d'$ at $\Delta 5$ | 3.3812 | N/A | 20 | ceiling d-prime; CI N/A |
| Qwen32B strict criterion | dark current false preference | N/A | N/A | N/A | not measured in strict criterion probe |
| Qwen32B strict criterion | raw delta0 false preference | 0.0000 | [0.0000, 0.0310] | 120 | Wilson binomial |
| Qwen32B strict criterion | $\Delta 0$ tie rate | 1.0000 | [0.9690, 1.0000] | 120 | Wilson binomial |
| Qwen32B strict criterion | target sensitivity $\Delta 1$ | 0.5000 | [0.4038, 0.5962] | 100 | Wilson binomial |
| Qwen32B strict criterion | $\Delta 1$ tie rate | 0.5000 | [0.4038, 0.5962] | 100 | Wilson binomial |
| Qwen32B strict criterion | target sensitivity $\Delta 5$ | 1.0000 | [0.8389, 1.0000] | 20 | Wilson binomial |
| Qwen32B strict criterion | $\Delta 5$ tie rate | 0.0000 | [0.0000, 0.1611] | 20 | Wilson binomial |
| Qwen32B strict criterion | $\Delta^{\star}_{\mathbf{75}}$ | N/A | N/A | N/A | strict criterion probe has Delta=1 and Delta=5 only |
| Qwen32B strict criterion | $d'$ at $\Delta 5$ | 3.3812 | N/A | 20 | ceiling d-prime; CI N/A |
| Llama8B | stable cross-sensitivity decomposition | 0.0333 | N/A | 60 | canonical-pair decomposition; CI N/A |
| Llama8B | positional false preference decomposition | 0.9667 | N/A | 60 | canonical-pair decomposition; CI N/A |
| Qwen14B | stable cross-sensitivity decomposition | 0.4500 | N/A | 60 | canonical-pair decomposition; CI N/A |
| Qwen14B | positional false preference decomposition | 0.5333 | N/A | 60 | canonical-pair decomposition; CI N/A |
| Qwen32B | stable cross-sensitivity decomposition | 0.0000 | N/A | 60 | canonical-pair decomposition; CI N/A |
| Qwen32B | positional false preference decomposition | 0.0833 | N/A | 60 | canonical-pair decomposition; CI N/A |
| Qwen32B + strict criterion | stable cross-sensitivity decomposition | 0.0000 | N/A | 60 | canonical-pair decomposition; CI N/A |
| Qwen32B + strict criterion | positional false preference decomposition | 0.0000 | N/A | 60 | canonical-pair decomposition; CI N/A |

Table 7: Full metric-reference table with protocol and denominator. Bad direction is construct-relative; some axes, especially stable cross-sensitivity and criterion, require downstream interpretation rather than a universal good/bad sign.

| Term | Symbol | Level | Protocol / denominator | Measures | Bad direction |
| --- | --- | --- | --- | --- | --- |
| Dark current | DC | call | tie-allowed true vacuum; true-vacuum calls | false preference with no evaluative signal | higher |
| Raw $\Delta 0$ false preference | RFP0 | call | tie-allowed $\Delta 0$ same; both-order calls | any non-tie on same-quality same-subset pairs; raw, not cross-sensitivity | higher |
| Stable cross-sensitivity | SCS | canonical pair | tie-allowed $\Delta 0$ same, both orders; canonical $\Delta 0$ same pairs | content-stable direction under surface variation | construct-dependent |
| Positional false preference | PFP | canonical pair | tie-allowed $\Delta 0$ same, both orders; canonical $\Delta 0$ same pairs | same presentation-slot choice under order reversal | higher |
| No-preference | NP | call or pair | tie-allowed; relevant calls or pairs | tie or valid abstention | construct-dependent |
| Target sensitivity | $P_{\rm corr}$ | call | pair protocol in caption; nonzero $\Delta Q$ calls | correct selection of the Pareto-dominant prefix-chain candidate | lower |
| Detection threshold | $\Delta^{\star}_{\mathbf{75}}$ | ladder fit | target-sensitivity curve; ladder curve | smallest $\Delta Q$ where isotonic target sensitivity reaches 0.75 | higher |
| Criterion / operating point | $C_J$ / tie rate | condition | baseline vs strict prompt; condition-specific calls | prompt-induced tie/preference boundary | no universal direction |
| Miss-by-tie | MBT | call | tie-allowed target pairs; target $\Delta Q$ calls | target signal not selected because the judge returns tie | higher when signal should be detected |

### A.8 Supplementary decomposition view

![Refer to caption](https://arxiv.org/html/2606.15610v1/figures/fig3b_pfp_vs_scs_scatter.png)

Figure 5: Supplementary positional false preference versus stable cross-sensitivity scatter for $\Delta 0$ same-subset pairs. Raw $\Delta 0$ false preference decomposes into distinct mechanisms: Llama8B is mostly positional, Qwen14B is mixed, and Qwen32B is low on both stable and positional components.

### A.9 Prior mini-experiment reclassification

The previous downstream mini\-experiment is not treated as evidence for the motivating mechanism hypothesis\. It is reclassified as a motivation for measuring the judge before downstream claim testing\.

### A.10 Future plan

The next phase should introduce an external comparator or reference judge as a ceiling estimate, refine the ladder below one requirement, and maintain the separation between judge metrology and downstream mechanism claims\.

## References

- [1] Junhyuk Choi, Sohhyung Park, Chanhee Cho, Hyeonchu Park, and Bugeun Kim. Diagnosing the reliability of LLM-as-a-Judge via item response theory. arXiv preprint arXiv:2602.00521, 2026.
- [2] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, et al. Holistic Evaluation of Language Models. Transactions on Machine Learning Research, 2023. HELM.
- [3] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models. Transactions on Machine Learning Research, 2023. BIG-Bench.
- [4] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems, volume 36, pages 46595–46623. Curran Associates, Inc., 2023.
- [5] Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators. In First Conference on Language Modeling, 2024. arXiv:2404.04475.
- [6] OpenAI. Evals. [https://github.com/openai/evals](https://github.com/openai/evals), 2023. Software framework.
- [7] Jon-Paul Cacioli. LLMs as signal detectors: Sensitivity, bias, and the temperature-criterion analogy. arXiv preprint arXiv:2603.14893, 2026.
- [8] Lin Shi, Chiyu Ma, Wenhua Liang, Xingjian Diao, Weicheng Ma, and Soroush Vosoughi. Judging the judges: A systematic study of position bias in LLM-as-a-Judge. arXiv preprint arXiv:2406.07791, 2024.
- [9] Jinming Yang, Zheng Hu, Chuxian Qiu, Zhenyu Deng, Xinshan Jiao, and Tao Zhou. Quantifying and mitigating self-preference bias of LLM judges. arXiv preprint arXiv:2604.22891, 2026.
- [10] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for Datasets. Communications of the ACM, 64(12):86–92, 2021.
- [11] Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model Cards for Model Reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 220–229. ACM, 2019.
- [12] Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi. RewardBench: Evaluating Reward Models for Language Modeling. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1755–1797, Albuquerque, New Mexico, 2025. Association for Computational Linguistics.
