Faithful or Fabricated? A Causal Framework for Rationalization Bias in LLM Judges

arXiv cs.CL Papers

Summary

This paper introduces a causal framework to quantify rationalization bias in LLM judges, where verdicts and explanations are influenced by non-evidential cues rather than underlying texts. It proposes cue interventions, anchoring metrics, and the Proof-Before-Preference mitigation protocol, demonstrating improved cue invariance.

arXiv:2605.23970v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used as automatic judges for summarization and dialogue evaluation. Prior work has documented biases such as position, verbosity, and style preferences, but largely focuses on outcomes, leaving judge explanations underexplored. We instead ask whether LLM judges are cue-invariant, i.e., whether their rankings and explanations remain stable when non-evidential cues are perturbed while holding the underlying texts fixed. We introduce a suite of cue interventions (Blind, Truth, Flip, Placebo, Reveal-After) and tie-aware metrics that quantify outcome anchoring and rationale anchoring, including label-aligned rhetoric and explanation drift, alongside consistency and stereotype-intrusion checks. We design anchoring attacks using verbosity and confidence cues, and compare two mitigations: structured chain-of-thought prompting and PROOF-BEFORE-PREFERENCE (evidence lock, score, rank). Using a new dataset of 1,000 summaries from traditional extractive models and LLMs, we find substantial cue-anchored rationalization under label and placebo perturbations, while PROOF-BEFORE-PREFERENCE markedly improves cue invariance over baselines.

Cached at: 05/26/26, 08:59 AM

# Faithful or Fabricated? A Causal Framework for Rationalization Bias in LLM Judges
Source: [https://arxiv.org/html/2605.23970](https://arxiv.org/html/2605.23970)
Riya Tapwal, School of Computing and Electrical Engineering, Indian Institute of Technology (IIT) Mandi, riya@iitmandi.ac.in; London, U.K., akumar@turing.ac.uk; Carsten Maple, Warwick Manufacturing Group, U.K., CM@warwick.ac.uk

###### Abstract

Large language models (LLMs) are increasingly used as automatic judges for summarization and dialogue evaluation. Prior work has documented biases such as position, verbosity, and style preferences, but largely focuses on *outcomes*, leaving judge *explanations* underexplored. We instead ask whether LLM judges are cue-invariant, i.e., whether their rankings and explanations remain stable when *non-evidential cues* are perturbed while holding the underlying texts fixed. We introduce a suite of *cue interventions* (Blind, Truth, Flip, Placebo, Reveal-After) and tie-aware metrics that quantify *outcome anchoring* and *rationale anchoring* (label-aligned rhetoric and explanation drift), alongside consistency and stereotype-intrusion checks. We design *anchoring attacks* via verbosity and confidence cues, and compare two mitigations: structured chain-of-thought prompting and Proof-Before-Preference (evidence lock → score → rank). Using a new dataset of 1,000 summaries from traditional extractive models and LLMs, we find substantial *cue-anchored rationalization* under label/placebo perturbations, while Proof-Before-Preference markedly improves cue invariance over baselines.

###### Impact Statement

The increasing adoption of large language models \(LLMs\) as automated judges in evaluation pipelines raises critical concerns about the reliability and faithfulness of their decisions and explanations\. This work makes a timely and impactful contribution by introducing a causal framework to formally characterize and quantify rationalization bias, where LLM judges align their verdicts and explanations with non\-evidential cues rather than the underlying textual evidence\. By proposing cue\-invariance probing, anchoring metrics, and the Proof\-Before\-Preference \(PBP\) mitigation protocol, this study provides both diagnostic tools and practical solutions to improve robustness, fairness, and auditability in LLM\-based evaluation systems\. These advances are particularly significant for high\-stakes applications such as benchmarking, compliance monitoring, and automated decision support, where unreliable or cue\-sensitive judgments could undermine trust and fairness\. At the same time, this work promotes responsible deployment by exposing systemic vulnerabilities and demonstrating mitigation strategies that reduce post\-hoc rationalization, thereby contributing to the development of more transparent, accountable, and trustworthy AI systems\.

###### Index Terms

Large language models, LLM\-as\-a\-judge, rationalization bias, explanation faithfulness, cue invariance, causal probing, bias mitigation, trustworthy AI\.

## 1 Introduction

Summarization has long been a flagship task in NLP, historically benchmarked against human-authored gold summaries and judged by human annotators \[[19](https://arxiv.org/html/2605.23970#bib.bib19)\]. Yet this paradigm is rapidly disappearing. Do humans still summarize? In practice, the answer is largely no \[[6](https://arxiv.org/html/2605.23970#bib.bib21)\]. Producing human summaries at scale is prohibitively costly and inconsistent, while automated systems, both traditional extractive methods and modern large language models (LLMs), can generate summaries instantly \[[27](https://arxiv.org/html/2605.23970#bib.bib22), [6](https://arxiv.org/html/2605.23970#bib.bib21), [19](https://arxiv.org/html/2605.23970#bib.bib19)\]. Further, relying on human evaluation does not scale, leading to the widespread adoption of LLMs as judges \[[5](https://arxiv.org/html/2605.23970#bib.bib20)\]. In this new regime, the central question shifts from whether LLMs match human summarization quality to whether our *evaluation pipeline* remains reliable when decisions and explanations are delegated to LLMs. In particular, applications that still favor lightweight extractive summarization, e.g., large-scale monitoring, enterprise reporting, or compliance, depend on judges whose verdicts and rationales are grounded in the *evidence*, not in superficial artifacts.

Prior studies have documented outcome-level biases in LLM judges, including position, verbosity, stylistic preference, and self-enhancement \[[24](https://arxiv.org/html/2605.23970#bib.bib12)\]. These findings are important but incomplete: they primarily track *what* the judge decides, not *why*. A decision is trustworthy only if its explanation reflects the same evidential features of the input that drove the choice. If, instead, explanations realign to extraneous signals (labels, badges, or stylistic hints), then evaluation can be steered without changing the underlying texts, eroding both auditability and fairness.

![Refer to caption](https://arxiv.org/html/2605.23970v1/Figures/intro_diagram.png)

Figure 1: Overview of the three judging protocols and where rationalization can arise. *Baseline* (left, bottom): a single LLM judge directly chooses between an LLM and a TradML summary; explanations are post-hoc and thus cue/label-susceptible. *SCoT* (left, top): the judge reasons along predefined criteria (accuracy, completeness, conciseness, fluency) before deciding, but evidence is not locked, allowing rubric-amplified rationalization. *PBP* (right): Proof-Before-Preference; the judge first writes and *locks* criterion-wise evidence, then scores and aggregates to rank, which curbs label anchoring and explanation drift.

We frame this reliability requirement as cue invariance. Let $X$ denote the fixed textual evidence (source document and candidate summaries) and let $C$ denote *non-evidential cues* such as metadata labels. An LLM judge is cue-invariant if its ranking $r$ and explanation $e$ remain stable when $C$ is perturbed while $X$ is held constant. This perspective turns a vague notion of “faithfulness” into a precise robustness target: measure the causal effect of $C$ on $(r, e)$ under controlled interventions. To that end, we introduce a suite of *cue interventions* that manipulate $C$ while keeping $X$ fixed: Blind (no cues), Truth (ground-truth label), Flip (inverted label), Placebo (credible but non-informative badge), and Reveal-After (applies Flip, then reveals Truth). These probes expose when a judge’s decisions and explanations move toward the presented cues. We complement them with *tie-aware* metrics that separate two phenomena: *outcome anchoring* (directional shifts in rankings) and *rationale anchoring* (label-aligned rhetoric and explanation drift). The same framework also reveals a structural weakness: *anchoring attacks* that use innocuous style cues (verbosity and confidence) to sway both decisions and rationales even for identical summaries.

Finally, we study defenses. A *criterion-guided structured chain-of-thought* (SCoT) prompts judges to reason along explicit dimensions (accuracy, completeness, conciseness, fluency) before deciding. Building on this, *Proof-Before-Preference (PBP)* first *locks* criterion-wise notes with cited spans, then scores and ranks strictly from the locked evidence, reducing opportunities for post-hoc rationalization and label anchoring.

##### Motivation

The increasing use of large language models \(LLMs\) as automated judges in evaluation pipelines has introduced significant scalability and efficiency benefits, but it also raises critical concerns about the reliability and faithfulness of their decisions and explanations\. While prior work has identified outcome\-level biases such as position, verbosity, and stylistic preferences, it remains unclear whether LLM judges base their decisions on the underlying textual evidence or on superficial, non\-evidential cues such as labels, confidence signals, or formatting\. In high\-stakes applications such as benchmarking, compliance monitoring, and automated reporting, explanations are essential for transparency and auditability; however, if these explanations are generated post hoc to justify decisions influenced by external cues, the evaluation process becomes vulnerable to manipulation and loses its trustworthiness\. This problem, which we characterize as rationalization bias, highlights a fundamental gap in current evaluation methodologies: the lack of a principled framework to causally isolate and measure the influence of irrelevant cues on both judgments and explanations\. Addressing this gap is essential to ensure that LLM judges produce decisions and rationales that are grounded in evidence, thereby improving the robustness, fairness, and accountability of automated evaluation systems\.

##### Contributions

- Cue-invariance probing: We propose a controlled intervention suite (Blind/Truth/Flip/Placebo/Reveal-After) that isolates the causal effect of non-evidential cues on both decisions and explanations while holding texts fixed.
- Anchoring metrics: We present tie-aware measures for *outcome anchoring* and *rationale anchoring* (label-aligned rhetoric, explanation drift), enabling standardized comparison across judges and prompts.
- Rationalization attacks: We demonstrate that verbosity and confidence cues reliably shift outcomes and rationales even for identical summaries, exposing a practical vulnerability in LLM-as-judge pipelines.
- Mitigations: We evaluate two defenses:
  - *Criterion-guided SCoT:* require reasoning along explicit dimensions (e.g., accuracy, completeness, conciseness, fluency) before deciding.
  - *Proof-Before-Preference (PBP):* *lock* criterion-wise notes with cited spans, then score and rank strictly from the locked evidence, reducing post-hoc rationalization and label anchoring.

## 2 Related Work

### 2.1 Biases in LLMs as Judges

A growing line of work has investigated the reliability of LLMs as evaluators, often called the LLM-as-a-judge paradigm. Zheng et al. \[[28](https://arxiv.org/html/2605.23970#bib.bib1)\] introduced MT-Bench and Chatbot Arena, showing that LLM judges exhibit systematic biases, including position bias, verbosity bias, and self-enhancement bias (favoring their own family of models). Wu & Aji \[[25](https://arxiv.org/html/2605.23970#bib.bib2)\] highlighted a related fluency/style bias, where LLMs prefer eloquent but less accurate answers, rationalizing their decisions by praising surface form.

Broader studies have cataloged biases using benchmark suites. Koo et al. \[[9](https://arxiv.org/html/2605.23970#bib.bib3)\] proposed CoBBLEr, identifying implicit biases (e.g., order, egocentric, salience/length) and induced biases (bandwagon, distractor cues). Ye et al. \[[26](https://arxiv.org/html/2605.23970#bib.bib4)\] extended this work with Calm, measuring twelve bias types, including authority bias, sentiment bias, and diversity bias, and introducing metrics like robustness rate. Chen et al. \[[1](https://arxiv.org/html/2605.23970#bib.bib7)\] compared human vs. LLM judges, revealing vulnerabilities to misinformation oversight, authority cues, and formatting (“beauty bias”). Lee et al. \[[12](https://arxiv.org/html/2605.23970#bib.bib5)\] examined judgments of epistemic markers, showing that LLMs penalize expressions of uncertainty while humans do not.

Several works explore adversarial attacks on LLM evaluators. Chen et al. \[[1](https://arxiv.org/html/2605.23970#bib.bib7)\] and Raina et al. \[[21](https://arxiv.org/html/2605.23970#bib.bib6)\] design optimization-based or prompt-based attacks that reliably flip judgments, while Li et al. \[[15](https://arxiv.org/html/2605.23970#bib.bib8)\] propose bias-mitigation techniques (e.g., randomized ordering). Surveys such as Szymanski et al. \[[22](https://arxiv.org/html/2605.23970#bib.bib10)\], Croxford et al. \[[4](https://arxiv.org/html/2605.23970#bib.bib11)\], and Thakur et al. \[[23](https://arxiv.org/html/2605.23970#bib.bib9)\] emphasize both the promise and fragility of LLM judges: they can align well with human preferences, but remain vulnerable to superficial cues.

### 2.2 Rationalization and Explanation Faithfulness in LLMs

Parallel work examines the faithfulness of model-generated explanations. Turpin et al. \[[24](https://arxiv.org/html/2605.23970#bib.bib12)\] demonstrated that LLMs often rely on hidden cues to answer correctly but omit them from their chain-of-thought (CoT), generating post-hoc rationalizations. Chen et al. \[[2](https://arxiv.org/html/2605.23970#bib.bib13)\] extended this to state-of-the-art reasoning models, showing low “reveal rates” even when hints clearly influenced answers. Lanham et al. \[[11](https://arxiv.org/html/2605.23970#bib.bib14)\] introduced metrics for CoT faithfulness, such as early-answer and confidence trajectories, while Lewis-Lim et al. \[[13](https://arxiv.org/html/2605.23970#bib.bib16)\] analyzed when CoT actively guides reasoning versus narrates predetermined outcomes. Other studies propose methods for improving faithfulness. Chuang et al. \[[3](https://arxiv.org/html/2605.23970#bib.bib17)\] introduced FaithLM, which ties explanations causally to model outputs by testing against contrary rationales. Li et al. \[[14](https://arxiv.org/html/2605.23970#bib.bib18)\] proposed DRiFT, using dual rewards (accuracy + faithfulness) to guide probabilistic inference, improving rationale fidelity.

## 3 Problem Formulation

Let $D = \{d_1, d_2, \dots, d_N\}$ be a collection of source documents. For each document $d \in D$, we generate a set of candidate summaries

$$S_d = \{s^{\text{ML}}_1, \dots, s^{\text{ML}}_M,\; s^{\text{LLM}}_1, \dots, s^{\text{LLM}}_L\},$$

where $\{s^{\text{ML}}_i\}$ are produced by traditional machine-learning-based extractive systems and $\{s^{\text{LLM}}_j\}$ are produced by large language models. An LLM judge $J$ evaluates $S_d$ and returns two outputs:

1. A ranking of candidates, $r_{J,d} \in \mathcal{R}_{M+L}$, where $\mathcal{R}_{M+L}$ is the set of permutations of the $M+L$ summaries.
2. An explanation or rationale, $e_{J,d} = f_J(S_d)$, in free-text form, to justify the ranking.

### 3.1 From Faithfulness to Cue Invariance

Classically, an explanation $e_{J,d}$ is called *faithful* if it reflects exactly the evidential features $\mathcal{X}_d$ of the fixed texts (source and candidate summaries) that determine the judge’s ranking $r_{J,d}$. A standard formalization is the conditional invariance

$$\Pr\big(r_{J,d} \mid \mathcal{X}_d, e_{J,d}\big) \;=\; \Pr\big(r_{J,d} \mid \mathcal{X}_d\big), \tag{1}$$

which asserts that, given the evidence $\mathcal{X}_d$, exposing the explanation does not change the distribution of decisions, i.e., the explanation neither adds spurious signal nor reflects hidden, non-evidential influences. In practice, LLM judges can produce *rationalized* explanations: texts that are persuasive to humans yet cite features not causally responsible for the decision. In such cases, ([1](https://arxiv.org/html/2605.23970#S3.E1)) fails:

$$\Pr\big(r_{J,d} \mid \mathcal{X}_d, e_{J,d}\big) \;\neq\; \Pr\big(r_{J,d} \mid \mathcal{X}_d\big). \tag{2}$$

We refer to the systematic production of such plausible-but-unfaithful rationales as rationalization bias. Directly verifying ([1](https://arxiv.org/html/2605.23970#S3.E1)) is challenging because we cannot exhaustively observe or control all evidential factors in $\mathcal{X}_d$. We therefore operationalize reliability through *cue invariance*. Let $C$ denote *non-evidential cues* (e.g., labels, badges, stylistic hints). Holding $\mathcal{X}_d$ fixed, a judge is *cue-invariant* if perturbing $C$ leaves both the decision and the explanation (in expectation over measurable properties $\phi$ of the rationale) unchanged:

$$\Pr\big(r_{J,d} \mid \mathcal{X}_d, C\big) = \Pr\big(r_{J,d} \mid \mathcal{X}_d\big), \tag{3}$$

$$\mathbb{E}\left[\phi\left(e_{J,d}\right) \,\middle|\, \mathcal{X}_d, C\right] = \mathbb{E}\left[\phi\left(e_{J,d}\right) \,\middle|\, \mathcal{X}_d\right]. \tag{4}$$

Violations of ([3](https://arxiv.org/html/2605.23970#S3.E3)) indicate *outcome anchoring*; violations of ([4](https://arxiv.org/html/2605.23970#S3.E4)) indicate *rationale anchoring* (e.g., increased label-aligned rhetoric or greater explanation drift). In our framework, we intervene on $C$ via controlled conditions (Blind, Truth, Flip, Placebo, Reveal-After) while keeping $\mathcal{X}_d$ fixed, and we quantitatively estimate the resulting shifts in $r_{J,d}$ and $e_{J,d}$. This causal, cue-centric view provides a practical surrogate for faithfulness: when a judge is cue-invariant, its decisions and explanations are robust to non-evidential perturbations, thereby reducing opportunities for post-hoc rationalization.

## 4 Methodology

Our methodology has five components: dataset construction, cue\-based causal probing, anchoring metrics, mitigation strategies, and anchoring attacks\.

### 4.1 Dataset and Notation

For each $d \in D$, we construct two candidate summaries, one drawn from a traditional extractive system and one from an LLM. Each summary $s_i$ has observable textual properties (e.g., length, fluency, coverage, factuality), denoted collectively by $\mathcal{X}(s_i)$. We write $\mathcal{X}_d = \{\mathcal{X}(s_i)\}_{i=1}^{2}$ for the feature set associated with $S_d$. A judge $J \in \mathcal{J}$ observes $S_d$ under a probe condition $p \in \mathcal{P}$ and produces: (i) a ranking $r^{(p)}_{J,d} \in \mathcal{R}_K$ (in our experiments, $K = 2$, i.e., one LLM summary and one traditional ML summary), and (ii) a free-text explanation $e^{(p)}_{J,d} = f_J(S_d, p)$. To control trivial position effects, the presentation order of $S_d$ is randomized for each trial.

### 4.2 Cue-Based Causal Probing

We intervene on *non-evidential cues* while holding the texts fixed. Let $C$ denote the cue vector shown with $S_d$ (e.g., source labels such as LLM/ML, or Placebo labels). The ground-truth labels for $S_d$ are $C^{\ast}$. We define five probe conditions:

$$\begin{aligned}
B &: \operatorname{do}(C = \varnothing) && \text{(Blind)} \\
T &: \operatorname{do}(C = C^{\ast}) && \text{(Truth)} \\
F &: \operatorname{do}(C = \pi(C^{\ast})) && \text{(Flip)} \\
P &: \operatorname{do}(C = C^{\text{placebo}}) && \text{(Placebo)} \\
R &: \operatorname{do}(C = \pi(C^{\ast})) \to \operatorname{do}(C = C^{\ast}) && \text{(Reveal-After)}
\end{aligned}$$

Here, $B$ hides cues; $T$ reveals correct cues; $F$ reveals inverted cues; $P$ attaches credible but non-informative badges; and $R$ first applies Flip, then reveals Truth.
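As a rough illustration of the five probes in the two-candidate case, the sketch below builds the cue vectors a judge would see under each condition. The function name `build_cues`, the single-letter probe codes, the placebo badge strings, and the choice to model $\pi$ as a swap of the two labels are assumptions made for illustration; the paper does not specify an implementation.

```python
from typing import List, Optional, Sequence

# Hypothetical placebo badges: credible-looking but non-informative (assumption).
PLACEBO_BADGES = ("Badge A", "Badge B")


def flip(labels: Sequence[str]) -> List[str]:
    """pi(C*): for K=2 we assume the permutation simply swaps the two labels."""
    return list(reversed(labels))


def build_cues(true_labels: Sequence[str], probe: str) -> List[Optional[List[str]]]:
    """Return the cue vector(s) shown to the judge, one entry per turn.

    None stands for the empty cue set (no labels shown).
    """
    if probe == "B":  # Blind: do(C = empty set)
        return [None]
    if probe == "T":  # Truth: do(C = C*)
        return [list(true_labels)]
    if probe == "F":  # Flip: do(C = pi(C*))
        return [flip(true_labels)]
    if probe == "P":  # Placebo: do(C = C_placebo)
        return [list(PLACEBO_BADGES)]
    if probe == "R":  # Reveal-After: Flip first, then Truth
        return [flip(true_labels), list(true_labels)]
    raise ValueError(f"unknown probe: {probe!r}")


# The texts stay fixed across probes; only the cue vector changes.
for probe in "BTFPR":
    print(probe, build_cues(["LLM", "ML"], probe))
```

Only the Reveal-After probe yields two turns, which mirrors its definition as a Flip intervention followed by a Truth intervention.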

##### Cue invariance as our reliability target

Directly asserting explanation *faithfulness* is intractable without full control of all evidential factors in $\mathcal{X}_d$. We therefore operationalize reliability via *cue invariance*. Holding $\mathcal{X}_d$ fixed, a judge is cue-invariant if perturbing $C$ does not change (a) the decision distribution and (b) measurable properties of the explanation, as stated in Eqs. ([3](https://arxiv.org/html/2605.23970#S3.E3)) and ([4](https://arxiv.org/html/2605.23970#S3.E4)). Our probes estimate these effects by contrasting $p \in \{T, F, P, R\}$ against the blind baseline $B$ while keeping $S_d$ (and thus $\mathcal{X}_d$) fixed.

### 4.3 Anchoring Metrics

We quantify how non-evidential cues affect *outcomes* and *explanations* when texts are held fixed. Let $\mathcal{O} = \{[1,2], [2,1], \mathrm{Tie}\}$ be the outcome set for a two-candidate comparison. For probe $p \in \{\mathrm{B}, \mathrm{T}, \mathrm{F}, \mathrm{P}, \mathrm{R}\}$, define

$$p^{(p)}_{J,o} \;=\; \Pr\big(r^{(p)}_{J,d} = o\big), \quad o \in \mathcal{O}, \qquad t^{(p)}_{J} \;=\; p^{(p)}_{J,\mathrm{Tie}}.$$

We use $\mathrm{B}$ (Blind) as the reference condition.

##### Equality Detection Rate \(EDR\):

EDR measures whether the judge recognizes that two candidates are effectively equal *when they truly are*. Let $\mathcal{C}_{\text{eq}} \subseteq D$ denote comparisons labeled as equal (e.g., identical summaries or near-duplicates by a predefined content-equality heuristic). Then:

$$\boxed{\ \mathrm{EDR}(J) \;=\; \frac{1}{|\mathcal{C}_{\text{eq}}|} \sum_{d \in \mathcal{C}_{\text{eq}}} \mathbb{I}\big[r^{(\mathrm{B})}_{J,d} = \mathrm{Tie}\big]\ }$$

Here, $\mathbb{I}[\cdot]$ is the indicator function, which equals $1$ if the condition is true and $0$ otherwise. Higher EDR is better (fewer spurious preferences when cues are hidden).
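A minimal sketch of EDR as defined above, assuming Blind verdicts are encoded as the strings `"12"`, `"21"`, and `"Tie"` (an encoding chosen here for illustration) and that the input list already contains only comparisons from $\mathcal{C}_{\text{eq}}$:

```python
from typing import List, Optional


def equality_detection_rate(blind_verdicts_eq: List[str]) -> Optional[float]:
    """Fraction of truly-equal comparisons that the judge calls a Tie under Blind.

    blind_verdicts_eq: Blind verdicts for items in C_eq, each "12", "21" or "Tie".
    Returns None when C_eq is empty (the rate is undefined).
    """
    if not blind_verdicts_eq:
        return None
    ties = sum(1 for verdict in blind_verdicts_eq if verdict == "Tie")
    return ties / len(blind_verdicts_eq)


# Toy data: four equal pairs, two of which the judge recognizes as tied.
print(equality_detection_rate(["Tie", "12", "Tie", "21"]))
```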

##### Neutrality Deviation under Blind \(tie\-aware\):

Among non-ties under Blind, a neutral judge should split $[1,2]$ vs. $[2,1]$ at 50%/50%. Let

$$p^{(\mathrm{B})}_{J,12 \mid \neg\mathrm{Tie}} = \frac{p^{(\mathrm{B})}_{J,12}}{p^{(\mathrm{B})}_{J,12} + p^{(\mathrm{B})}_{J,21}}, \qquad t^{(\mathrm{B})}_{J} = \Pr\big(r^{(\mathrm{B})}_{J,d} = \mathrm{Tie}\big).$$

Then

$$\boxed{\ \mathrm{ND}_{\mathrm{B}}(J) = \bigl|\,2\,p^{(\mathrm{B})}_{J,12 \mid \neg\mathrm{Tie}} - 1\,\bigr| \cdot \bigl(1 - t^{(\mathrm{B})}_{J}\bigr)\ }$$

When ties are disallowed, $t^{(\mathrm{B})}_{J} = 0$ and $\mathrm{ND}_{\mathrm{B}} = |2p^{(\mathrm{B})}_{J,12} - 1|$. Lower is better (more neutral Blind behavior).
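The neutrality deviation can be computed from Blind verdict counts as in the sketch below. The string encoding of outcomes and the convention of returning `0.0` when every verdict is a tie (where the conditional probability is undefined) are assumptions made here, not choices stated in the paper.

```python
from typing import List, Optional


def neutrality_deviation(blind_verdicts: List[str]) -> Optional[float]:
    """ND_B = |2 * p(12 | not Tie) - 1| * (1 - t), estimated from Blind verdicts."""
    n = len(blind_verdicts)
    if n == 0:
        return None
    n12 = blind_verdicts.count("12")
    n21 = blind_verdicts.count("21")
    n_tie = blind_verdicts.count("Tie")
    non_tie = n12 + n21
    if non_tie == 0:
        # All ties: the factor (1 - t) is zero, so we report 0.0 (assumed convention).
        return 0.0
    p12_given_non_tie = n12 / non_tie
    tie_rate = n_tie / n
    return abs(2 * p12_given_non_tie - 1) * (1 - tie_rate)


# Toy data: three [1,2], one [2,1], one Tie.
print(neutrality_deviation(["12", "12", "12", "21", "Tie"]))
```

When all verdicts are `"12"`, `"21"`, or `"Tie"`, the product simplifies algebraically to $|n_{12} - n_{21}| / n$, which can serve as a quick sanity check.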

##### Tie\-aware, Directional Label Susceptibility \(LAO\):

We compare each labeled probe to *the same scheme’s* Blind and count only movement *toward* the label-favored decision, while not penalizing movement into $\mathrm{Tie}$. For $p \in \{\mathrm{T}, \mathrm{F}\}$, let the favored non-tie outcome be

$$\mathrm{fav}_{\mathrm{T}} = [1,2], \qquad \mathrm{fav}_{\mathrm{F}} = [2,1],$$

and $\mathrm{opp}_p$ the other non-tie. Let

$$\begin{aligned}
\Delta_{\mathrm{fav}}^{(p)} &= p_{J,\mathrm{fav}_p}^{(p)} - p_{J,\mathrm{fav}_p}^{(\mathrm{B})}, \\
\Delta_{\mathrm{opp}}^{(p)} &= p_{J,\mathrm{opp}_p}^{(p)} - p_{J,\mathrm{opp}_p}^{(\mathrm{B})}, \\
\Delta_{\mathrm{tie}}^{(p)} &= t_{J}^{(p)} - t_{J}^{(\mathrm{B})}.
\end{aligned}$$

Decompose positive “movement-in” mass:

$$\begin{aligned}
\mathrm{LDS}^{(p)} &= \max\bigl(0, \Delta_{\mathrm{fav}}^{(p)}\bigr), \\
\mathrm{OLS}^{(p)} &= \max\bigl(0, \Delta_{\mathrm{opp}}^{(p)}\bigr), \\
\mathrm{TS}^{(p)} &= \max\bigl(0, \Delta_{\mathrm{tie}}^{(p)}\bigr).
\end{aligned}$$

Then

$$\boxed{\ \mathrm{LAO}(J;p) = \frac{\mathrm{LDS}^{(p)}}{\mathrm{LDS}^{(p)} + \mathrm{OLS}^{(p)} + \mathrm{TS}^{(p)} + \varepsilon}\ } \qquad p \in \{\mathrm{T}, \mathrm{F}\},$$

with small $\varepsilon > 0$ for numerical stability. Lower is better (less label-directed movement). We also report the *absolute* label-directed shift

$$\boxed{\ \mathrm{LDS}^{(p)} = \max\bigl(0, \Delta_{\mathrm{fav}}^{(p)}\bigr)\ }$$

to show magnitude, not only fraction.
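The decomposition above can be sketched directly from empirical outcome frequencies. The helper names, the string encoding of outcomes, and the default `eps` value are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List, Tuple

OUTCOMES = ("12", "21", "Tie")


def outcome_probs(verdicts: List[str]) -> Dict[str, float]:
    """Empirical probability of each outcome in O = {[1,2], [2,1], Tie}."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    return {o: counts[o] / len(verdicts) for o in OUTCOMES}


def lao(blind: List[str], probed: List[str], probe: str,
        eps: float = 1e-9) -> Tuple[float, float]:
    """Return (LAO, LDS) for a labeled probe p in {"T", "F"} against Blind."""
    if probe not in ("T", "F"):
        raise ValueError("LAO is defined for the Truth and Flip probes only")
    # fav_T = [1,2], fav_F = [2,1]; opp is the other non-tie outcome.
    fav, opp = ("12", "21") if probe == "T" else ("21", "12")
    p_blind, p_probe = outcome_probs(blind), outcome_probs(probed)
    lds = max(0.0, p_probe[fav] - p_blind[fav])      # label-directed shift
    ols = max(0.0, p_probe[opp] - p_blind[opp])      # opposite-label shift
    ts = max(0.0, p_probe["Tie"] - p_blind["Tie"])   # shift into ties
    return lds / (lds + ols + ts + eps), lds


# Toy data: under Flip, mass moves toward [2,1] and toward Tie in equal parts.
blind = ["12"] * 5 + ["21"] * 5
flipped = ["12"] * 3 + ["21"] * 6 + ["Tie"] * 1
print(lao(blind, flipped, "F"))
```

Reporting LDS next to LAO matters because LAO is a ratio: a judge whose verdicts barely move can still show a high LAO if the little movement there is happens to point toward the label.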

##### Label\-Aligned Rationale on Same\-Decision \(Flip\)\.

Let $\mathcal{C}_{\text{same}}^{(\mathrm{F})} = \{d \in D :\ r^{(\mathrm{F})}_{J,d} = r^{(\mathrm{B})}_{J,d}\}$ be items whose verdict under Flip matches Blind. We quantify label-aligned rhetoric with a hard, temperature-0 embedding scorer $\alpha(e, L^{(\mathrm{F})}) \in [0,1]$: letting $\mathbf{e}$ be the explanation embedding and $\mathbf{d}_{\mathrm{fav}}, \mathbf{d}_{\mathrm{opp}}$ the embeddings of favored/opposite label descriptors under Flip, set

$$\alpha(e, L^{(\mathrm{F})}) = \begin{cases} 1, & \cos(\mathbf{e}, \mathbf{d}_{\mathrm{fav}}) > \cos(\mathbf{e}, \mathbf{d}_{\mathrm{opp}}), \\ 0, & \cos(\mathbf{e}, \mathbf{d}_{\mathrm{opp}}) > \cos(\mathbf{e}, \mathbf{d}_{\mathrm{fav}}), \\ 0.5, & \text{otherwise.} \end{cases}$$

Then

$$\boxed{\ \mathrm{LAI}_{\text{text}} \mid \mathrm{SD}(\mathrm{F}) = \frac{1}{|\mathcal{C}_{\text{same}}^{(\mathrm{F})}|} \sum_{d \in \mathcal{C}_{\text{same}}^{(\mathrm{F})}} \alpha\big(e^{(\mathrm{F})}_{J,d},\, L^{(\mathrm{F})}\big)\ }$$

Larger values indicate stronger label-aligned rhetoric even when the verdict is unchanged.
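The following sketch computes the same-decision label-alignment score from precomputed embeddings. It uses plain Python lists in place of a real embedding model, so the vectors, the tuple layout of `items`, and the function names are all illustrative assumptions. Embeddings are assumed to be non-zero.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alpha(e: Vector, d_fav: Vector, d_opp: Vector) -> float:
    """Hard scorer: 1 if closer to the favored descriptor, 0 if closer to the opposite."""
    c_fav, c_opp = cosine(e, d_fav), cosine(e, d_opp)
    if c_fav > c_opp:
        return 1.0
    if c_opp > c_fav:
        return 0.0
    return 0.5


def lai_same_decision(items: List[Tuple[str, str, Vector]],
                      d_fav: Vector, d_opp: Vector) -> Optional[float]:
    """items: (blind_verdict, flip_verdict, flip_explanation_embedding) per document."""
    scores = [alpha(emb, d_fav, d_opp)
              for blind_v, flip_v, emb in items if blind_v == flip_v]
    if not scores:
        return None  # C_same is empty: report n/a
    return sum(scores) / len(scores)


# Toy 2-D embeddings; the third item changed verdict and is excluded.
items = [("12", "12", [0.9, 0.1]), ("12", "12", [0.1, 0.9]), ("12", "21", [1.0, 0.0])]
print(lai_same_decision(items, d_fav=[1.0, 0.0], d_opp=[0.0, 1.0]))
```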

##### Explanation Shift on Same\-Decision \(Flip\)\.

Let $\mathcal{C}_{\text{same}}^{(\mathrm{F})} = \{d \in D : r^{(\mathrm{F})}_{J,d} = r^{(\mathrm{B})}_{J,d}\}$. We measure how much the *textual explanation itself* changes when misleading labels are shown, while the verdict remains fixed.

For two explanations $x$ and $y$, we define a normalized cosine distance:

$$\delta(x, y) = \frac{1 - \cos(x, y)}{2} \in [0,1],$$

where $0$ indicates identical explanations and $1$ indicates maximal dissimilarity.

The average explanation drift is

$$\boxed{\ \Delta e \mid \mathrm{SD}(\mathrm{F}) = \frac{1}{|\mathcal{C}_{\text{same}}^{(\mathrm{F})}|} \sum_{d \in \mathcal{C}_{\text{same}}^{(\mathrm{F})}} \delta\big(e^{(\mathrm{F})}_{J,d},\, e^{(\mathrm{B})}_{J,d}\big)\ }$$

and we report a thresholded change rate

$$\boxed{\ \delta^{(\mathrm{F})}_{J}(\tau) = \frac{1}{|\mathcal{C}_{\text{same}}^{(\mathrm{F})}|} \sum_{d \in \mathcal{C}_{\text{same}}^{(\mathrm{F})}} \mathbb{I}\Big[\delta\big(e^{(\mathrm{F})}_{J,d},\, e^{(\mathrm{B})}_{J,d}\big) > \tau\Big]\ }$$

with $\tau \in [0,1)$ fixed *a priori*. If $|\mathcal{C}_{\text{same}}^{(\mathrm{F})}| = 0$, both metrics are reported as n/a.
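A small sketch of the two drift quantities, again using toy list-based embeddings rather than a real encoder. The tuple layout of `pairs` and the function names are assumptions for illustration; the threshold `tau` is supplied by the caller, as in the text.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def delta(x: Vector, y: Vector) -> float:
    """Normalized cosine distance in [0, 1]: 0 for identical direction, 1 for opposite."""
    return (1.0 - cosine(x, y)) / 2.0


def explanation_drift(pairs: List[Tuple[str, str, Vector, Vector]],
                      tau: float) -> Tuple[Optional[float], Optional[float]]:
    """pairs: (blind_verdict, flip_verdict, blind_embedding, flip_embedding).

    Returns (mean drift, fraction of same-decision items with drift > tau),
    or (None, None) when no item keeps its Blind verdict under Flip.
    """
    distances = [delta(flip_emb, blind_emb)
                 for blind_v, flip_v, blind_emb, flip_emb in pairs
                 if blind_v == flip_v]
    if not distances:
        return None, None
    mean_drift = sum(distances) / len(distances)
    change_rate = sum(1 for d in distances if d > tau) / len(distances)
    return mean_drift, change_rate


# Toy data: one unchanged explanation, one orthogonal one, one excluded item.
pairs = [
    ("12", "12", [1.0, 0.0], [1.0, 0.0]),
    ("21", "21", [1.0, 0.0], [0.0, 1.0]),
    ("12", "21", [1.0, 0.0], [-1.0, 0.0]),
]
print(explanation_drift(pairs, tau=0.2))
```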

### 4.4 Mitigation Strategies

We study two complementary defenses:

Structured Chain-of-Thought (SCoT): Judges are instructed to evaluate along predefined criteria $\mathcal{C} = \{\text{accuracy}, \text{completeness}, \text{conciseness}, \text{fluency}\}$. For each $s_i$, the judge outputs criterion-specific scores (and optionally cited spans), and we aggregate with nonnegative weights $\{w_c\}_{c \in \mathcal{C}}$ (typically $\sum_c w_c = 1$):

$$r^{(p)}_{J,d} = \mathrm{rank}\Big(\big\{\textstyle\sum_{c \in \mathcal{C}} w_c \cdot \text{score}_c(s_i)\big\}_{i=1}^{K}\Big).$$

Structuring the rationale can reduce scope for free-form post-hoc justifications.

Proof-Before-Preference (PBP): PBP enforces *evidence lock* before any preference is stated. Given candidates $\{s_i\}_{i=1}^{K}$ and criteria $\mathcal{C}$, Turn 1 collects for each $(i, c)$ a short note $n_{i,c}$ with cited spans from the source; these notes are then locked, $\mathrm{lock}(\{n_{i,c}\})$, prohibiting edits. Turn 2 assigns scores using only the locked notes:

$$\text{score}_c(s_i) \;=\; f_c\big(\mathrm{lock}(\{n_{i,c}\})\big).$$

Turn 3 aggregates scores into a final ranking,

$$r^{(p)}_{J,d} \;=\; \mathrm{rank}\Big(\big\{\textstyle\sum_{c \in \mathcal{C}} w_c \cdot \text{score}_c(s_i)\big\}_{i=1}^{K}\Big),$$

and the narrative justification must reference the locked evidence. By forcing *evidence before preference*, PBP reduces post-hoc rationalization and label anchoring.
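The three PBP turns can be sketched as a small data flow. In this sketch the notes are frozen into a read-only mapping, the scoring function sees only those locked notes, and the weighted sums are ranked. The `Note` class, the use of `MappingProxyType` as the lock, the tie tolerance, and the toy `score_fn` (which stands in for the judge's scoring call $f_c$) are all hypothetical; the paper does not describe an implementation.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Callable, Dict, Iterable, List, Mapping, Tuple, Union

CRITERIA = ("accuracy", "completeness", "conciseness", "fluency")


@dataclass(frozen=True)
class Note:
    """Turn 1 output: one short note per (candidate, criterion) with cited spans."""
    candidate: int
    criterion: str
    text: str
    cited_spans: Tuple[str, ...]


def lock(notes: Iterable[Note]) -> Mapping[Tuple[int, str], Note]:
    """Freeze the notes into a read-only mapping; later turns cannot edit them."""
    return MappingProxyType({(n.candidate, n.criterion): n for n in notes})


def pbp_rank(locked: Mapping[Tuple[int, str], Note],
             score_fn: Callable[[Note], float],
             weights: Dict[str, float],
             tol: float = 1e-9) -> Tuple[Union[str, List[int]], Dict[int, float]]:
    """Turn 2 scores from locked notes only; Turn 3 aggregates and ranks."""
    candidates = sorted({cand for cand, _ in locked})
    totals = {
        cand: sum(w * score_fn(locked[(cand, crit)]) for crit, w in weights.items())
        for cand in candidates
    }
    ordered = sorted(candidates, key=lambda cand: totals[cand], reverse=True)
    if len(ordered) == 2 and abs(totals[ordered[0]] - totals[ordered[1]]) <= tol:
        return "Tie", totals
    return ordered, totals


# Toy run: candidate 1 cites two spans per criterion, candidate 2 cites one.
notes = [Note(i, c, f"note on {c}", spans)
         for i, spans in ((1, ("span a", "span b")), (2, ("span a",)))
         for c in CRITERIA]
weights = {c: 0.25 for c in CRITERIA}
toy_score = lambda note: float(len(note.cited_spans))  # placeholder for f_c
print(pbp_rank(lock(notes), toy_score, weights))
```

Because `score_fn` receives only a locked `Note`, it has no access to the cue vector, which is the structural property the evidence lock is meant to provide. Dropping the `lock` step and letting the scorer see the full prompt would correspond more closely to the SCoT aggregation.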

### 4.5 Anchoring Attacks

To further stress-test robustness, we apply semantic-preserving style transformations $\mathcal{A}:S_d\mapsto S_d^{\prime}$ that inject known cues.

Verbosity Attack: $\mathcal{A}_{\text{verb}}$ appends redundant but content-preserving text to target summaries, probing whether judges reward length and then rationalize it as "detail."

Confidence Attack: $\mathcal{A}_{\text{conf}}$ rewrites tone to be more assertive without altering factual content, probing whether judges reward certainty and rationalize it as "precision."
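To make the verbosity transformation concrete, here is a toy sketch. It assumes that restating the first sentence counts as "redundant but content-preserving" text; the exact padding used in the experiments is not described in this section, and `verbosity_attack` is a hypothetical name.

```python
def verbosity_attack(summary: str, repeats: int = 1) -> str:
    """Append a restatement of the first sentence `repeats` times.
    The added text introduces no new facts; it only lengthens the summary."""
    if not summary.strip():
        return summary
    first = summary.split(". ")[0].rstrip(".")
    filler = f" To restate: {first}."
    return summary + filler * repeats
```

The original summary is always a prefix of the attacked one, so any change in verdict can be attributed to the appended length rather than to edited content.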

## 5 Experimental Setup

### 5.1 Dataset Construction

Reliable bias testing requires evaluation data that is (i) *unseen* by the judged models and (ii) *decontaminated* from common pretraining corpora. Widely used benchmarks such as CNN/DailyMail [[8](https://arxiv.org/html/2605.23970#bib.bib33)] and XSum [[18](https://arxiv.org/html/2605.23970#bib.bib34)] are likely present, at least in part, in LLM training data, risking confounds from memorization or prior exposure. We therefore curate a new corpus of $N=1{,}000$ *summaries* drawn from publicly available sources spanning business, literature, mathematics, science, and technology to ensure topical diversity. Each document is between 500 and 1,200 tokens.

For each document $d$, we create a comparison set $S_d=\{s_1,\dots,s_K\}$ that mixes traditional extractive systems (TextRank, LexRank, KL-Sum, SumBasic) and LLM summaries (instruction-following prompts with fixed decoding parameters). In initial experiments, LLM judges frequently preferred LLM outputs over extractive baselines; importantly, these preferences often reflected *genuine quality gains* (e.g., higher coverage/fluency) rather than mere self-favoring (see Table [9](https://arxiv.org/html/2605.23970#Ax2.T9) in the Appendix). To *isolate* cue effects from true quality differences, we add two controlled subsets that hold content fixed while perturbing labels:

##### Equal-content pair subset ($\mathcal{C}_{\text{eq-pair}}$)

For each $d$, the *same* LLM produces two paraphrases $\tilde{s}_1,\tilde{s}_2$ that preserve semantics but vary superficially. We then vary cues $C$ (e.g., label one as LLM and the other as TradML) across the T/F/P probes, keeping texts fixed. Any movement in outcomes or explanations on $\mathcal{C}_{\text{eq-pair}}$ quantifies *pure* cue anchoring. Unless explicitly stated otherwise, the experiments use this subset.

##### Single-summary relabel subset ($\mathcal{C}_{\text{single}}$)

We also present the *identical* summary $s$ under different labels across probes while holding the opponent fixed. Because $\mathcal{X}(s)$ is constant, any change in ranking or rationale must be driven by cues $C$, not content (see Table [10](https://arxiv.org/html/2605.23970#Ax2.T10) in the Appendix).

### 5.2 Judges

We employ five open-weight LLMs (Gemma-2-9B, Llama-3.1-8B, Mistral-7B, Qwen2.5-7B, and Zephyr-7B) as *judges*, executed from downloaded checkpoints on our local infrastructure; no third-party APIs were used. For each document $d$, each judge $J$ is shown the full candidate set $S_d$ under probe condition $p\in\mathcal{P}$ and produces (i) a complete ranking $r^{(p)}_{J,d}\in\mathcal{R}_{2}$ and (ii) a free-form explanation $e^{(p)}_{J,d}$. Prompts are standardized across models; only the probe condition varies. All judge generations use temperature 0 to ensure deterministic outputs and reproducibility.

![Refer to caption](https://arxiv.org/html/2605.23970v1/Figures/EDR.png) (a) Equality Detection Rate.
![Refer to caption](https://arxiv.org/html/2605.23970v1/Figures/ND.png) (b) Neutrality Deviation.

Figure 2: Blind-Condition Behavior of Different Judges.

![Refer to caption](https://arxiv.org/html/2605.23970v1/Figures/Flip_percentage.png)

Figure 3: Revision Susceptibility after Label Reveal.

![Refer to caption](https://arxiv.org/html/2605.23970v1/Figures/LAOT.png) (a) Label–Anchoring in Outcomes for True Labels.
![Refer to caption](https://arxiv.org/html/2605.23970v1/Figures/LAOF.png) (b) Label–Anchoring in Outcomes for Flip Labels.
![Refer to caption](https://arxiv.org/html/2605.23970v1/Figures/LAOP.png) (c) Label–Anchoring in Outcomes for Placebo Labels.

Figure 4: Label–Anchoring in Outcomes for Different Judges.

### 5.3 Prompt Examples

We document the prompt templates used for *generation* and *judging*. Our goals are (i) reproducibility via frozen, model-agnostic instructions; (ii) comparability across protocols by holding wording constant except for the protocol-specific constraints; and (iii) diagnosability of bias via controlled label probes (Blind, True, Flip, Placebo, Reveal-after) and style attacks (Verbosity, Confidence). All judges emit a structured verdict and rationale, and all templates forbid external knowledge to isolate effects of labels/cues rather than content leakage. The three protocols differ only in *how* evidence is elicited: Baseline gives a direct verdict with a brief post-hoc explanation; SCoT requires criterion-guided notes and scores (accuracy, completeness, conciseness, fluency) prior to the verdict; and PBP enforces an *evidence lock* (Explain $\rightarrow$ Score $\rightarrow$ Rank), so decisions and rationales must derive from quoted spans captured before scoring. Placeholders in {BRACES} are programmatically filled at runtime, and hyperparameters (e.g., temperature, max tokens, seeds) are fixed across conditions.

Summary Generation (LLM author)

You are a careful writer. Given a document, produce a concise, faithful summary of 500 words.
Document: {DOC_TEXT}
Output: *The summary text*.

Baseline Judge (direct verdict, brief rationale)

You are a careful judge. Two summaries describe the same document. Decide which summary is better overall and briefly explain why.
Document: {DOC_TEXT}
Summary 1 (label = {LABEL_1}): {SUM1}
Summary 2 (label = {LABEL_2}): {SUM2}
Output: {{ "verdict": "[1,2]" | "[2,1]" | "Tie", "explanation": "2–4 sentences in plain English" }}

SCoT Judge (criterion–guided decision)

You are a judge. Evaluate along the following criteria: accuracy, completeness, conciseness, fluency. Decide which summary is better overall and briefly explain why.
Document: {DOC_TEXT}
Summary 1 (label = {LABEL_1}): {SUM1}
Summary 2 (label = {LABEL_2}): {SUM2}
Output: {{ "verdict": "[1,2]" | "[2,1]" | "Tie", "explanation": "2–4 sentences in plain English" }}

PBP Judge

*Lock Evidence* (no scores, no verdict): You are an impartial judge. First, write criterion-wise notes for accuracy, completeness, conciseness, fluency. For every note, quote or reference the exact span(s) from the Document or Summaries that support it. Do not provide a verdict.
*Score* using locked notes only: Using only the locked_notes above, assign 0–5 scores for each criterion and summary. Do *not* add new evidence. Aggregate the criterion scores (equal weights unless specified) and produce the overall verdict and a one–sentence justification that references the locked notes.
Document: {DOC_TEXT}
Summary 1 (label = {LABEL_1}): {SUM1}
Summary 2 (label = {LABEL_2}): {SUM2}
Output: {{ "verdict": "[1,2]" | "[2,1]" | "Tie", "explanation": "reference locked evidence only" }}
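The templates above can be filled and their outputs parsed roughly as follows. This sketch assumes the doubled braces in the templates are Python `str.format` escapes for literal braces, which is a plausible reading but not stated in the paper; `fill` and `parse_verdict` are hypothetical helper names.

```python
import json

BASELINE_TEMPLATE = (
    "You are a careful judge. Two summaries describe the same document. "
    "Decide which summary is better overall and briefly explain why.\n"
    "Document: {DOC_TEXT}\n"
    "Summary 1 (label = {LABEL_1}): {SUM1}\n"
    "Summary 2 (label = {LABEL_2}): {SUM2}\n"
    'Output: {{ "verdict": "[1,2]" | "[2,1]" | "Tie", '
    '"explanation": "2–4 sentences in plain English" }}'
)

VALID_VERDICTS = {"[1,2]", "[2,1]", "Tie"}


def fill(template: str, **fields: str) -> str:
    """Substitute {PLACEHOLDER} fields; '{{' and '}}' become literal braces."""
    return template.format(**fields)


def parse_verdict(raw: str) -> str:
    """Extract and validate the verdict from a judge's JSON output."""
    verdict = json.loads(raw)["verdict"]
    if verdict not in VALID_VERDICTS:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    return verdict
```

Only the label fields change between probe conditions, so the same `fill` call can serve every probe by swapping `LABEL_1` and `LABEL_2` while `DOC_TEXT`, `SUM1`, and `SUM2` stay fixed.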

### 5.4 Results

#### 5.4.1 Blind-Condition Behavior: Equality Detection and Neutrality

Under the Blind condition (B), we assess two complementary properties of judge behavior: the *Equality Detection Rate* $\mathrm{EDR}_B$ (see Fig. [2(a)](https://arxiv.org/html/2605.23970#S5.F2.sf1)), the fraction of items marked *Tie/No selection*, and the *Neutrality Deviation* $\mathrm{ND}_B$ (see Fig. [2(b)](https://arxiv.org/html/2605.23970#S5.F2.sf2)), the tie-adjusted deviation from a 50–50 split between $[1,2]$ and $[2,1]$ (lower is better). Taken together, these metrics reveal a consistent pattern across models: PBP exhibits the most faithful Blind behavior, SCoT is intermediate, and the Baseline is worst. Specifically, $\mathrm{EDR}_B$ is essentially zero for Baseline (no abstention), rises for SCoT ($\approx 0.71$–$0.84$), and is highest for PBP ($\approx 0.82$–$0.94$), indicating that PBP most often recognizes near-equivalent summaries. The same ordering holds in reverse for $\mathrm{ND}_B$: Baseline shows the largest deviations (e.g., Gemma $0.018$, Llama $0.060$, Mistral $0.100$, Qwen $0.080$, Zephyr $0.020$), SCoT is modestly skewed ($\approx 0.019$–$0.122$), and PBP is closest to neutral ($\approx 0.001$–$0.005$). For instance, *Gemma-2-9B* moves from $\mathrm{EDR}_B\approx 0$ (Baseline) to $\sim 0.84$ (SCoT) and $0.937$ (PBP), while $\mathrm{ND}_B$ drops from $0.018$ (Baseline) to $0.005$ (PBP); *Llama-3.1-8B* shows a similar reduction in $\mathrm{ND}_B$ from $0.060$ (Baseline) to $0.002$ (PBP).
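A rough sketch of the two Blind-condition quantities follows. The Equality Detection Rate is taken directly from the description above (share of ties). For the Neutrality Deviation, the code assumes one plausible reading of "tie-adjusted": drop ties, then measure how far the share of $[1,2]$ verdicts is from 0.5. The paper's formal definition may normalize differently, so treat this as an assumption.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def blind_metrics(verdicts: Sequence[str]) -> Tuple[Optional[float], Optional[float]]:
    """Return (EDR_B, ND_B) for a list of Blind verdicts.
    Verdicts are the strings "[1,2]", "[2,1]", or "Tie"."""
    n = len(verdicts)
    if n == 0:
        return None, None
    counts = Counter(verdicts)
    edr = counts["Tie"] / n
    decided = counts["[1,2]"] + counts["[2,1]"]
    # Assumed reading of ND: deviation from a 50-50 split among non-tie verdicts.
    nd = abs(counts["[1,2]"] / decided - 0.5) if decided else 0.0
    return edr, nd
```

Under this reading, a judge that ties on most items and splits the rest evenly scores high on the first metric and zero on the second, which is the behavior expected on equal-content pairs.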

![Refer to caption](https://arxiv.org/html/2605.23970v1/Figures/Delta_E.png) (a) Explanation drift under Same–Decision.
![Refer to caption](https://arxiv.org/html/2605.23970v1/Figures/LAI.png) (b) Label-aligned explanation under Same–Decision.

Figure 5: Explanation rationalization with the verdict held constant.

### 5.5 Label–Anchoring in Outcomes (LAO)

We quantify label susceptibility using the tie-aware Label–Anchoring in Outcomes (LAO) metric. It captures the fraction of total outcome movement that aligns with the presented label or cue, with higher values indicating stronger label-directed anchoring. Fig. [4](https://arxiv.org/html/2605.23970#S5.F4) presents LAO values under three probe conditions: True labels, Flip labels, and Placebo labels. Under True labels (see Fig. [4(a)](https://arxiv.org/html/2605.23970#S5.F4.sf1)), the Baseline protocol exhibits maximal anchoring (LAO = 1.00) across all models. SCoT also shows high anchoring, with LAO values ranging from approximately 0.91 to 1.00. In contrast, PBP consistently exhibits lower LAO values across models (approximately 0.00–0.77), indicating reduced alignment between outcomes and label cues when evidence locking is enforced.

Under Flip labels (see Fig. [4(b)](https://arxiv.org/html/2605.23970#S5.F4.sf2)), where misleading labels are presented, the Baseline protocol shows negligible anchoring (LAO $\approx$ 0), indicating minimal movement toward incorrect label-favored outcomes. SCoT exhibits selective anchoring, with some models showing low LAO and others showing elevated anchoring, notably Qwen2.5-7B approaching LAO $\approx$ 1.00. PBP generally maintains lower anchoring than SCoT, with most values near zero and moderate anchoring observed in only a subset of models.

Under Placebo labels (see Fig. [4(c)](https://arxiv.org/html/2605.23970#S5.F4.sf3)), which introduce non-informative but credible cues, the Baseline protocol again shows strong anchoring in several models (LAO $\approx$ 1.00 in Gemma-2-9B, Mistral-7B, and Zephyr-7B). SCoT also exhibits elevated anchoring across models, with values ranging from approximately 0.68 to 1.00. In contrast, PBP maintains lower LAO values overall, although moderate anchoring is observed in some models (approximately 0.50–0.63), while remaining near zero in others.

Overall, these results show that the Baseline protocol is highly sensitive to label and cue signals, SCoT reduces anchoring under some conditions but remains susceptible when cues are present, and PBP consistently exhibits lower anchoring across probe conditions\.

![Refer to caption](https://arxiv.org/html/2605.23970v1/Figures/VerbosityLAO.png) (a) Variation in LAO.
![Refer to caption](https://arxiv.org/html/2605.23970v1/Figures/VerbosityLDS.png) (b) Variation in LDS.
![Refer to caption](https://arxiv.org/html/2605.23970v1/Figures/VerbosityDeltaE.png) (c) Variation in Explanation drift under Same–Decision.
![Refer to caption](https://arxiv.org/html/2605.23970v1/Figures/VerbosityCue.png) (d) Variation in Cue.

Figure 6: Variation in Different Parameters under Verbosity attack.

![Refer to caption](https://arxiv.org/html/2605.23970v1/Figures/ConfidienceLAO.png) (a) Variation in LAO.
![Refer to caption](https://arxiv.org/html/2605.23970v1/Figures/ConfidienceLDS.png) (b) Variation in LDS.
![Refer to caption](https://arxiv.org/html/2605.23970v1/Figures/ConfidienceDelta.png) (c) Variation in Explanation drift under Same–Decision.
![Refer to caption](https://arxiv.org/html/2605.23970v1/Figures/ConfidienceCue.png) (d) Variation in Cue.

Figure 7: Variation in Different Parameters under Confidence attack.

### 5.6 Revision Susceptibility after Label Reveal

We quantify revision after revealing true labels following an initial judgment under *Flip* labels:

$$\Pr\big[r^{(\mathrm{R})}_{J,d}\neq r^{(\mathrm{F})}_{J,d}\big].$$

Across models, Baseline/SCoT show high flip-after-reveal rates (75–85%), while PBP is markedly lower (5–22%) (see Fig. [3](https://arxiv.org/html/2605.23970#S5.F3)). Concretely, for (Gemma, Llama, Mistral, Qwen, Zephyr): Baseline $\{85,78,76,83,82\}\%$, SCoT $\{80,80,76,82,80\}\%$, PBP $\{22,5,8,18,15\}\%$. This indicates strong label-anchored revision for Baseline/SCoT and robust resistance for PBP.
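The revision probability is an empirical frequency over documents. A minimal sketch, assuming rankings are stored as comparable values (for example the verdict strings) in matching document order; `revision_rate` is a hypothetical name:

```python
from typing import Optional, Sequence


def revision_rate(
    flip_rankings: Sequence[str], reveal_rankings: Sequence[str]
) -> Optional[float]:
    """Share of documents whose ranking after label reveal differs from
    the ranking given under Flip labels. Returns None for empty input."""
    if len(flip_rankings) != len(reveal_rankings):
        raise ValueError("rankings must be aligned per document")
    if not flip_rankings:
        return None
    changed = sum(1 for f, r in zip(flip_rankings, reveal_rankings) if f != r)
    return changed / len(flip_rankings)
```

A change to or from a tie counts as a revision here, since the two rankings are compared for exact equality.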

### 5.7 Explanation Rationalization under Same–Decision

We evaluate explanation rationalization while holding the verdict fixed using (i) *explanation drift* $\Delta e\mid\text{Same}$ (distance between probe and Blind explanations; see Fig. [5(a)](https://arxiv.org/html/2605.23970#S5.F5.sf1)) and (ii) *label–aligned explanation* $\mathrm{LAI}_{\text{text}}\mid\text{Same}$ (share of label-conforming rhetoric; see Fig. [5(b)](https://arxiv.org/html/2605.23970#S5.F5.sf2)). Across probes and models, $\Delta e\mid\text{Same}$ is high ($\approx 0.75$–$0.81$), indicating substantial rewriting even when outcomes do not change. In contrast, $\mathrm{LAI}_{\text{text}}\mid\text{Same}$ differentiates models: *Mistral-7B* and *Llama-3.1-8B* show the strongest label alignment ($\approx 0.43$ and $0.25$–$0.35$, respectively), whereas *Gemma-2-9B*, *Qwen2.5-7B*, and *Zephyr-7B* remain low ($\approx 0.00$–$0.10$). Notably, Placebo yields alignment comparable to Flip/True for the more susceptible models, suggesting that these models align their rhetoric with the presented cue even when it is non-informative.

### 5.8 Robustness under a Verbosity Attack

We inflate one candidate with redundant but fluent tokens and evaluate four complementary views of susceptibility: (i) *LAO* (tie–aware, directional outcome anchoring; see Fig. [6(a)](https://arxiv.org/html/2605.23970#S5.F6.sf1)), (ii) *LDS* (absolute outcome mass moving toward the verbosity–favored side; see Fig. [6(b)](https://arxiv.org/html/2605.23970#S5.F6.sf2)), (iii) $\Delta e\mid\text{Same}$ (explanation drift vs. Blind on items whose verdict does not change; see Fig. [6(c)](https://arxiv.org/html/2605.23970#S5.F6.sf3)), and (iv) *Cue–align %* (share of explanations whose rhetoric aligns with the injected cue; see Fig. [6(d)](https://arxiv.org/html/2605.23970#S5.F6.sf4)). The four panels collectively show a coherent ordering across models: PBP is most robust, SCoT is most susceptible, and the Baseline lies in between. Concretely, PBP keeps LAO low (Gemma 0.15, Llama 0.18, Mistral 0.00, Qwen 0.12, Zephyr 0.10) with *LDS* near zero (0.00–0.02), and its textual changes are modest ($\Delta e\mid\text{Same}\approx 0.08$–$0.11$) with muted cue alignment (15–26%). The Baseline shows moderate anchoring (e.g., LAO 0.35–0.70; LDS 0.04–0.11), larger rewrites ($\Delta e\mid\text{Same}\approx 0.14$–$0.18$), and higher cue alignment (30–55%). In contrast, SCoT exhibits pronounced label–directed movement (LAO 0.90–0.97 with LDS 0.09–0.14), the largest narrative drift ($\Delta e\mid\text{Same}\approx 0.22$–$0.28$), and the strongest cue alignment (48–70%).

### 5.9 Robustness to a Confidence Attack

We evaluate susceptibility to assertive phrasing by increasing the confidence of one candidate and measuring four indicators: (i) the tie–aware, directional label–anchoring of outcomes $\mathrm{LAO}=\mathrm{LDS}/(\mathrm{LDS}+\mathrm{TS}+\mathrm{OLS})$ (see Fig. [7(a)](https://arxiv.org/html/2605.23970#S5.F7.sf1)); (ii) the absolute label–directed shift in outcome probability $\mathrm{LDS}$ (see Fig. [7(b)](https://arxiv.org/html/2605.23970#S5.F7.sf2)); (iii) the explanation drift under same–decision $\Delta e\mid\text{Same}$ (distance between probe and Blind explanations, conditional on unchanged verdicts; see Fig. [7(c)](https://arxiv.org/html/2605.23970#S5.F7.sf3)); and (iv) the fraction of cue–aligned explanations (*Cue*; see Fig. [7(d)](https://arxiv.org/html/2605.23970#S5.F7.sf4)). The ordering is consistent across models: PBP exhibits the lowest outcome anchoring and minimal text drift, the Baseline shows moderate effects, and SCoT is most affected. Concretely, under PBP, $\mathrm{LAO}$ remains small (approximately $0.00$–$0.15$) with near–zero $\mathrm{LDS}$ ($0.00$–$0.02$), while explanations change only modestly ($\Delta e\approx 0.08$–$0.11$) and cue alignment is limited ($\approx 18$–$26\%$). The Baseline exhibits intermediate susceptibility ($\mathrm{LAO}\approx 0.35$–$0.65$, $\mathrm{LDS}\approx 0.04$–$0.11$, $\Delta e\approx 0.13$–$0.18$, Cue $\approx 48$–$67\%$). By contrast, SCoT displays pronounced anchoring and rationalization ($\mathrm{LAO}\approx 0.90$–$0.96$, $\mathrm{LDS}\approx 0.08$–$0.14$, $\Delta e\approx 0.22$–$0.28$, Cue $\approx 50$–$70\%$), indicating that rubric dimensions such as fluency/authority are over–rewarded by assertive tone.
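Because LAO is a ratio, it can approach 1 even when the absolute shift LDS is small, which is the pattern reported for SCoT. The sketch below computes the ratio from the three components named in the formula above. TS and OLS are not expanded in this section, so the code treats them simply as nonnegative amounts of outcome movement that is not cue-directed; that reading, and the function name `lao`, are assumptions.

```python
from typing import Optional


def lao(lds: float, ts: float, ols: float) -> Optional[float]:
    """LAO = LDS / (LDS + TS + OLS).
    Returns None when there is no outcome movement at all."""
    if min(lds, ts, ols) < 0:
        raise ValueError("components must be nonnegative")
    total = lds + ts + ols
    if total == 0:
        return None
    return lds / total


# Illustrative inputs only: a small absolute shift that is almost entirely
# cue-directed still gives a ratio close to 1.
example = lao(lds=0.09, ts=0.01, ols=0.0)
```

Reading LAO and LDS together therefore matters: LAO says how much of the movement follows the cue, while LDS says how much movement there was.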

## 6 Limitations and Scope

This paper targets a specific reliability failure in LLM-based evaluation, *cue-driven rationalization*, where decisions and explanations change under non-evidential cues while the underlying texts remain fixed. We frame *cue invariance* as a necessary (but not sufficient) condition for reliable explanations and use controlled interventions to diagnose violations of this condition, rather than to fully validate explanation faithfulness. Our explanation-level metrics function as relative diagnostics under fixed evidence and are not intended as ground-truth semantic judgments; their behavior may depend on the choice of scorer. Empirically, we focus on pairwise summarization and report aggregate effects without confidence intervals, multi-seed variance, or measured cost/latency; the PBP protocol also increases abstention when candidates are near-equivalent and introduces modest, predictable overhead due to serialized prompting. These constraints reflect deliberate design trade-offs to isolate a clean causal signal, and motivate future work on cross-task generalization, uncertainty estimation, human validation of explanation alignment, and cost–robustness analysis.

## 7 Conclusion

In this paper, we investigated whether explanations provided by LLM judges were faithful to their decision-making or served as post-hoc rationalizations. We introduced a causal probing framework that manipulated metadata labels and allowed us to isolate the influence of external cues on explanations. We defined a suite of metrics to measure fabrication, label alignment, cue susceptibility, stereotype intrusion, and consistency. We further designed rationalization attacks that exposed systematic vulnerabilities and compared two mitigation strategies: structured chain-of-thought prompting and PBP. Our experiments showed that LLM judges frequently rationalized under label perturbations, that PBP markedly reduced rationalization bias, and that SCoT reduced anchoring only under some conditions. Overall, this work established a foundation for auditing and mitigating explanation faithfulness in LLM judges.

However, our evaluation is limited in scope and statistical depth: we study summarization only, rely on explanation\-level metrics that depend on specific rationale scorers, and do not report confidence intervals or multi\-seed variance\. These are principled deferrals; future work will test transfer to tasks with tighter contextual control, compare rationale scorers to disentangle semantic and stylistic effects, and measure cost\-robustness trade\-offs across models and serving stacks\.

## References

- \[1\]G\. H\. Chen, S\. Chen, Z\. Liu, F\. Jiang, and B\. Wang\(2024\-11\)Humans or LLMs as the judge? a study on judgement bias\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 8301–8327\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.474/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.474)Cited by:[§2\.1](https://arxiv.org/html/2605.23970#S2.SS1.p2.1),[§2\.1](https://arxiv.org/html/2605.23970#S2.SS1.p3.1)\.
- \[2\]Y\. Chen, J\. Benton, A\. Radhakrishnan, J\. Uesato, C\. Denison, J\. Schulman, A\. Somani, P\. Hase, M\. Wagner, F\. Roger, V\. Mikulik, S\. Bowman, J\. Leike, J\. Kaplan, E\. Perez, and A\. Alignment Science Team\(2025\)Reasoning models don’t always say what they think\.Anthropic Research Report\.Note:Working paper; available at Anthropic’s website under “Reasoning Models Don’t Always Say What They Think”External Links:[Link](https://assets.anthropic.com/m/71876fabef0f0ed4/original/reasoning_models_paper.pdf)Cited by:[§2\.2](https://arxiv.org/html/2605.23970#S2.SS2.p1.1)\.
- \[3\]Y\. Chuang, G\. Wang, C\. Chang, R\. Tang, S\. Zhong, F\. Yang, M\. Du, X\. Cai, and X\. Hu\(2024\)FaithLM: towards faithful explanations for large language models\.External Links:2402\.04678,[Link](https://arxiv.org/abs/2402.04678)Cited by:[§2\.2](https://arxiv.org/html/2605.23970#S2.SS2.p1.1)\.
- \[4\]E\. Croxford, Y\. Gao, N\. Pellegrino,et al\.\(2025\-02\)Current and future state of evaluation of large language models for medical summarization tasks\.npj Health Systems2\(6\),pp\. 6\.External Links:[Document](https://dx.doi.org/10.1038/s44401-024-00011-2),[Link](https://doi.org/10.1038/s44401-024-00011-2)Cited by:[§2\.1](https://arxiv.org/html/2605.23970#S2.SS1.p3.1)\.
- \[5\]M\. Desmond, Z\. Ashktorab, W\. Geyer, E\. M\. Daly, M\. S\. Cooper, Q\. Pan, R\. Nair, N\. Wagner, and T\. Pedapati\(2025\)EvalAssist: llm\-as\-a\-judge simplified\.InProceedings of the AAAI Conference on Artificial Intelligence, Demonstration Track,Vol\.39,pp\. 35351\.External Links:[Document](https://dx.doi.org/10.1609/aaai.v39i28.35351),[Link](https://doi.org/10.1609/aaai.v39i28.35351)Cited by:[§1](https://arxiv.org/html/2605.23970#S1.p1.1)\.
- \[6\]A\. R\. Fabbri, W\. Kryściński, B\. McCann, C\. Xiong, R\. Socher, and D\. Radev\(2021\-04\)SummEval: re\-evaluating summarization evaluation\.Transactions of the Association for Computational Linguistics9,pp\. 391–409\.External Links:ISSN 2307\-387X,[Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00373),[Link](https://doi.org/10.1162/tacl_a_00373),https://direct\.mit\.edu/tacl/article\-pdf/doi/10\.1162/tacl\_a\_00373/1923949/tacl\_a\_00373\.pdfCited by:[§1](https://arxiv.org/html/2605.23970#S1.p1.1)\.
- \[7\]T\. Goyal and G\. Durrett\(2021\-06\)Annotating and modeling fine\-grained factuality in summarization\.InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,K\. Toutanova, A\. Rumshisky, L\. Zettlemoyer, D\. Hakkani\-Tur, I\. Beltagy, S\. Bethard, R\. Cotterell, T\. Chakraborty, and Y\. Zhou \(Eds\.\),Online,pp\. 1449–1462\.External Links:[Link](https://aclanthology.org/2021.naacl-main.114/),[Document](https://dx.doi.org/10.18653/v1/2021.naacl-main.114)Cited by:[Rationale for Choosing Evaluation Categories\.](https://arxiv.org/html/2605.23970#Ax1.SS0.SSS0.Px1.p1.1)\.
- \[8\]K\. M\. Hermann, T\. Kocisky, E\. Grefenstette, L\. Espeholt, W\. Kay, M\. Suleyman, and P\. Blunsom\(2015\)Teaching machines to read and comprehend\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Vol\.28\.Cited by:[§5\.1](https://arxiv.org/html/2605.23970#S5.SS1.p1.1)\.
- \[9\]R\. Koo, M\. Lee, V\. Raheja, J\. I\. Park, Z\. M\. Kim, and D\. Kang\(2024\-08\)Benchmarking cognitive biases in large language models as evaluators\.InFindings of the Association for Computational Linguistics: ACL 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 517–545\.External Links:[Link](https://aclanthology.org/2024.findings-acl.29/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.29)Cited by:[§2\.1](https://arxiv.org/html/2605.23970#S2.SS1.p2.1)\.
- \[10\]F\. Koto, T\. Baldwin, and J\. H\. Lau\(2022\)Ffci: a framework for interpretable automatic evaluation of summarization\.Journal of Artificial Intelligence Research73,pp\. 1553–1607\.External Links:[Link](https://dl.acm.org/doi/10.1613/jair.1.13167)Cited by:[Rationale for Choosing Evaluation Categories\.](https://arxiv.org/html/2605.23970#Ax1.SS0.SSS0.Px1.p1.1)\.
- \[11\]T\. Lanham, A\. Chen, A\. Radhakrishnan, B\. Steiner, C\. Denison, D\. Hernandez, D\. Li, E\. Durmus, E\. Hubinger, J\. Kernion, K\. Lukošiūtė, K\. Nguyen, N\. Cheng, N\. Joseph, N\. Schiefer, O\. Rausch, R\. Larson, S\. McCandlish, S\. Kundu, S\. Kadavath, S\. Yang, T\. Henighan, T\. Maxwell, T\. Telleen\-Lawton, T\. Hume, Z\. Hatfield\-Dodds, J\. Kaplan, J\. Brauner, S\. R\. Bowman, and E\. Perez\(2023\)Measuring faithfulness in chain\-of\-thought reasoning\.External Links:2307\.13702,[Link](https://arxiv.org/abs/2307.13702)Cited by:[§2\.2](https://arxiv.org/html/2605.23970#S2.SS2.p1.1)\.
- [12] D. Lee, Y. Hwang, Y. Kim, J. Park, and K. Jung (2025-01) Are llm-judges robust to expressions of uncertainty? investigating the effect of epistemic markers on llm-based evaluation. pp. 8962–8984. External Links: [Document](https://dx.doi.org/10.18653/v1/2025.naacl-long.452), [Link](https://aclanthology.org/2025.naacl-long.452.pdf) Cited by: [§2.1](https://arxiv.org/html/2605.23970#S2.SS1.p2.1).
- \[13\]S\. Lewis\-Lim, X\. Tan, Z\. Zhao, and N\. Aletras\(2025\)Analysing chain of thought dynamics: active guidance or unfaithful post\-hoc rationalisation?\.External Links:2508\.19827,[Link](https://arxiv.org/abs/2508.19827)Cited by:[§2\.2](https://arxiv.org/html/2605.23970#S2.SS2.p1.1)\.
- \[14\]J\. Li, H\. Yan, and Y\. He\(2025\-07\)Drift: enhancing LLM faithfulness in rationale generation via dual\-reward probabilistic inference\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 6850–6866\.External Links:[Link](https://aclanthology.org/2025.acl-long.340/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.340),ISBN 979\-8\-89176\-251\-0Cited by:[§2\.2](https://arxiv.org/html/2605.23970#S2.SS2.p1.1)\.
- \[15\]Z\. Li, C\. Wang, P\. Ma, D\. Wu, S\. Wang, C\. Gao, and Y\. Liu\(2024\-11\)Split and merge: aligning position biases in LLM\-based evaluators\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 11084–11108\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.621/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.621)Cited by:[§2\.1](https://arxiv.org/html/2605.23970#S2.SS1.p3.1)\.
- \[16\]C\. Lin and E\. Hovy\(2003\)Automatic evaluation of summaries using n\-gram co\-occurrence statistics\.InProceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics,pp\. 150–157\.External Links:[Link](https://aclanthology.org/N03-1020/)Cited by:[Rationale for Choosing Evaluation Categories\.](https://arxiv.org/html/2605.23970#Ax1.SS0.SSS0.Px1.p1.1)\.
- \[17\]J\. Maynez, S\. Narayan, B\. Bohnet, and R\. T\. Mcdonald\(2020\)On faithfulness and factuality in abstractive summarization\.InProceedings of The 58th Annual Meeting of the Association for Computational Linguistics \(ACL\),External Links:[Link](https://aclanthology.org/2020.acl-main.173.pdf)Cited by:[Rationale for Choosing Evaluation Categories\.](https://arxiv.org/html/2605.23970#Ax1.SS0.SSS0.Px1.p1.1)\.
- \[18\]S\. Narayan, S\. B\. Cohen, and M\. Lapata\(2018\)Don’t give me the details, just the summary\! topic\-aware convolutional neural networks for extreme summarization\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),pp\. 1797–1807\.Cited by:[§5\.1](https://arxiv.org/html/2605.23970#S5.SS1.p1.1)\.
- \[19\]A\. Nenkova and K\. McKeown\(2011\)Automatic summarization\.Foundations and Trends® in Information Retrieval5\(2–3\),pp\. 103–233\.External Links:[Link](http://dx.doi.org/10.1561/1500000015),[Document](https://dx.doi.org/10.1561/1500000015),ISSN 1554\-0669Cited by:[§1](https://arxiv.org/html/2605.23970#S1.p1.1)\.
- [20] A. Nenkova and R. Passonneau (2004-05) Evaluating content selection in summarization: the pyramid method. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, Boston, Massachusetts, USA, pp. 145–152. External Links: [Link](https://aclanthology.org/N04-1019/) Cited by: [Rationale for Choosing Evaluation Categories.](https://arxiv.org/html/2605.23970#Ax1.SS0.SSS0.Px1.p1.1).
- \[21\]V\. Raina, A\. Liusie, and M\. Gales\(2024\-11\)Is LLM\-as\-a\-judge robust? investigating universal adversarial attacks on zero\-shot LLM assessment\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 7499–7517\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.427/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.427)Cited by:[§2\.1](https://arxiv.org/html/2605.23970#S2.SS1.p3.1)\.
- \[22\]A\. Szymanski, N\. Ziems, H\. A\. Eicher\-Miller, T\. J\. Li, M\. Jiang, and R\. A\. Metoyer\(2025\)Limitations of the llm\-as\-a\-judge approach for evaluating llm outputs in expert knowledge tasks\.InProceedings of the 30th International Conference on Intelligent User Interfaces,IUI ’25,New York, NY, USA,pp\. 952–966\.External Links:ISBN 9798400713064,[Link](https://doi.org/10.1145/3708359.3712091),[Document](https://dx.doi.org/10.1145/3708359.3712091)Cited by:[§2\.1](https://arxiv.org/html/2605.23970#S2.SS1.p3.1)\.
- \[23\]A\. S\. Thakur, K\. Choudhary, V\. S\. Ramayapally, S\. Vaidyanathan, and D\. Hupkes\(2025\-07\)Judging the judges: evaluating alignment and vulnerabilities in LLMs\-as\-judges\.InProceedings of the Fourth Workshop on Generation, Evaluation and Metrics \(GEM²\),Vienna, Austria and virtual meeting,pp\. 404–430\.External Links:[Link](https://aclanthology.org/2025.gem-1.33/),ISBN 979\-8\-89176\-261\-9Cited by:[§2\.1](https://arxiv.org/html/2605.23970#S2.SS1.p3.1)\.
- [24] M. Turpin, J. Michael, E. Perez, and S. R. Bowman (2023) Language models don't always say what they think: unfaithful explanations in chain-of-thought prompting. In Advances in Neural Information Processing Systems (NIPS 2023), Poster. External Links: [Link](https://dl.acm.org/doi/10.5555/3666122.3669397) Cited by: [§1](https://arxiv.org/html/2605.23970#S1.p1.1), [§2.2](https://arxiv.org/html/2605.23970#S2.SS2.p1.1).
- \[25\]M\. Wu and A\. F\. Aji\(2025\-01\)Style over substance: evaluation biases for large language models\.InProceedings of the 31st International Conference on Computational Linguistics,O\. Rambow, L\. Wanner, M\. Apidianaki, H\. Al\-Khalifa, B\. D\. Eugenio, and S\. Schockaert \(Eds\.\),Abu Dhabi, UAE,pp\. 297–312\.External Links:[Link](https://aclanthology.org/2025.coling-main.21/)Cited by:[§2\.1](https://arxiv.org/html/2605.23970#S2.SS1.p1.1)\.
- \[26\]J\. Ye, Y\. Wang, Y\. Huang, D\. Chen, Q\. Zhang, N\. Moniz, T\. Gao, W\. Geyer, C\. Huang, P\. Chen, N\. V\. Chawla, and X\. Zhang\(2025\)Justice or prejudice? quantifying biases in LLM\-as\-a\-judge\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=3GTtZFiajM)Cited by:[§2\.1](https://arxiv.org/html/2605.23970#S2.SS1.p2.1)\.
- \[27\]J\. Zhang, Y\. Zhao, M\. Saleh, and P\. J\. Liu\(2020\)PEGASUS: pre\-training with extracted gap\-sentences for abstractive summarization\.InProceedings of the 37th International Conference on Machine Learning,ICML’20\.External Links:[Link](https://dl.acm.org/doi/abs/10.5555/3524938.3525989)Cited by:[§1](https://arxiv.org/html/2605.23970#S1.p1.1)\.
- \[28\]L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica\(2023\)Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.InProceedings of the 37th International Conference on Neural Information Processing Systems,NIPS ’23,Red Hook, NY, USA\.External Links:[Link](https://neurips.cc/virtual/2023/poster/73434)Cited by:[§2\.1](https://arxiv.org/html/2605.23970#S2.SS1.p1.1)\.

## Category & Likert Scale

Table 1: Summary evaluation categories and their definitions.

| Category | Definition |
| --- | --- |
| Factual Accuracy | Measures how faithfully the summary reflects the information in the source document, without introducing hallucinations, distortions, or fabricated details. |
| Completeness (Content Coverage) | Evaluates how well the summary captures the key information, main points, and essential content units of the source document. |
| Coherence & Fluency | Assesses the linguistic quality of the summary, including clarity, grammatical correctness, logical flow, and readability. |

##### Rationale for Choosing Evaluation Categories.

We evaluate summaries using three core dimensions: Factual Accuracy, Completeness, and Coherence & Fluency\. Factual Accuracy is essential because modern abstractive models often generate fluent but incorrect statements; prior work shows that factual consistency is a crucial determinant of trustworthy summarization\[[17](https://arxiv.org/html/2605.23970#bib.bib27),[7](https://arxiv.org/html/2605.23970#bib.bib28)\]\. Completeness ensures that summaries retain key content elements from the source, aligned with classical frameworks such as the Pyramid Method and ROUGE, which emphasise coverage of important information\[[20](https://arxiv.org/html/2605.23970#bib.bib29),[16](https://arxiv.org/html/2605.23970#bib.bib30)\]\. Coherence & Fluency capture the readability and structural quality of the summary, reflecting grammaticality, clarity, and logical flow, dimensions shown to correlate strongly with human preferences in recent evaluation benchmarks\[[10](https://arxiv.org/html/2605.23970#bib.bib32)\]\. Together, these categories form a comprehensive and balanced framework for assessing both content and linguistic quality in summarization\.

##### Scoring Rubric:

For each evaluation category, we adopt a Likert\-scale scoring rubric ranging from 1 to 5, where 1 denotes the lowest quality and 5 denotes the highest\. This scale provides fine\-grained resolution while remaining intuitive for both human evaluators and LLM\-based judges\. A score of 1 indicates severe deficiencies, such as major factual errors, missing essential content, or incoherent writing, whereas a score of 5 reflects excellent factual grounding, comprehensive content coverage, and highly fluent and well\-structured summaries\. The 1–5 scale is widely used in summarization evaluation because it offers a reliable and interpretable method for comparing models across multiple dimensions\.
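As an illustration of how the rubric constrains a judge's output, the following minimal Python sketch checks that a parsed response carries one integer score from 1 to 5 for each of the three categories. The function name `validate_scores` and the dictionary layout are assumptions made for this example; they are not taken from the paper's pipeline.

```python
# Hypothetical helper: the name and the dict-based layout are illustrative assumptions.
CATEGORIES = ("Factual Accuracy", "Completeness", "Coherence & Fluency")


def validate_scores(scores: dict) -> dict:
    """Return the three rubric scores, or raise ValueError if any is missing or invalid."""
    missing = [c for c in CATEGORIES if c not in scores]
    if missing:
        raise ValueError(f"missing categories: {missing}")
    for category in CATEGORIES:
        value = scores[category]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{category}: score must be an integer, got {value!r}")
        if not 1 <= value <= 5:
            raise ValueError(f"{category}: score {value} is outside the 1-5 scale")
    # Drop any extra keys so downstream code sees only the rubric categories.
    return {c: scores[c] for c in CATEGORIES}


example = validate_scores(
    {"Factual Accuracy": 4, "Completeness": 3, "Coherence & Fluency": 5}
)
print(example)
```

Rejecting out-of-range or non-integer values at parse time keeps malformed judge responses from silently entering later aggregates.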

## Appendix A LLM Judge Prompt.

We instruct the LLM to evaluate a candidate summary with respect to three dimensions: Factual Accuracy, Completeness, and Coherence & Fluency\. The model is explicitly informed of the definition of each dimension and is required to assign a score from 1 to 5 \(1 = worst, 5 = best\) using a Likert\-scale rubric\. The full prompt is shown below\.

> You are an expert evaluator of text summarization quality. You will be given:
>
> 1. A Source Document, and
> 2. A Candidate Summary.
>
> Your task is to evaluate the quality of the summary using the three criteria defined below. Assign a score from 1 to 5 for each criterion, where 1 indicates very poor performance and 5 indicates excellent performance.
>
> Evaluation Criteria
>
> - Factual Accuracy: Assess how faithfully the summary reflects the information in the source document. The summary should not introduce hallucinated content, distort facts, or omit critical causal or factual relationships.
> - Completeness (Content Coverage): Evaluate how well the summary captures the key information, major points, and essential content units from the source document. Consider whether important content is missing.
> - Coherence & Fluency: Assess the linguistic quality of the summary. The summary should be grammatically correct, logically structured, clear, and easy to read. Sentences should flow smoothly and maintain a consistent style.
>
> Scoring Rubric (1–5 Scale)
>
> - 1: Very poor. Major errors, severe factual inconsistencies, missing key content, or extremely unclear writing.
> - 2: Poor. Multiple issues, limited content coverage, or low fluency.
> - 3: Acceptable. Partially correct and somewhat informative, but with noticeable inaccuracies, omissions, or awkward writing.
> - 4: Good. Mostly correct, captures most key information, and reads well with minor issues.
> - 5: Excellent. Fully accurate, complete, and highly fluent with no notable errors.

## Ablation study

### A.1 Human Check for Near-Equivalent Pairs

To confirm that many instances are genuinely hard to separate in overall quality, we conducted a small human check. We randomly sampled $n=100$ documents for manual review. Two annotators independently compared the two summaries for each sampled document and gave a single overall label: either selecting the better summary or marking *Tie* when neither summary was clearly better. Annotator 1 marked *Tie* on 94 of the 100 items, and Annotator 2 marked *Tie* on 92 of the 100 items. These results support our assumption that a substantial portion of our comparisons are near-equivalent, which motivates tie-aware analysis.

### A.2 Blind Judgments (no labels revealed).

Table [2](https://arxiv.org/html/2605.23970#Ax2.T2) reports raw Blind counts per model and scheme for *No Selection* (Tie) and strict pairings $[1,2]$ and $[2,1]$. The pattern is consistent across judges: PBP yields the largest abstention rates (e.g., Gemma 937/1000; Llama 924/1000; Mistral 824/1000; Qwen 885/1000; Zephyr 897/1000), SCoT attains substantial but lower tie rates (Gemma 840/1000; Llama 804/1000; Mistral 710/1000; Qwen 728/1000; Zephyr 735/1000), and the Baseline never abstains (all zeros in the *No Selection* column). Consequently, equality detection $\mathrm{EDR}_B$ is highest for PBP (approximately 0.82–0.94), moderate for SCoT (approximately 0.71–0.84), and zero for Baseline. The residual non-tie choices under SCoT/PBP are roughly balanced (e.g., Gemma SCoT: 97 vs. 63; PBP: 34 vs. 29), leading to very small neutrality deviation $\mathrm{ND}_B$ for PBP (approximately 0.001–0.005) and modest values for SCoT (approximately 0.019–0.122), while Baseline shows skew because it is forced to pick a side (e.g., Gemma 540 vs. 460 over 1000 items). Overall, the Blind counts already separate the schemes: PBP exhibits the highest abstention rates under Blind conditions, SCoT helps but still finds small differences, and Baseline cannot abstain.

Table 2: Blind (no-label) decisions by scheme and model. Each cell reports raw counts of *No Selection* (Tie), $[1,2]$ (LLM $\succ$ TradML), and $[2,1]$ (TradML $\succ$ LLM). Totals per model are 1000 for Baseline, SCoT, and PBP. The consistent ordering, PBP $\gg$ SCoT $\gg$ Baseline in *No Selection*, implies highest equality detection and lowest neutrality deviation for PBP.

| Model (Judge) | Baseline: No Selection | Baseline: [1,2] | Baseline: [2,1] | SCoT: No Selection | SCoT: [1,2] | SCoT: [2,1] | PBP: No Selection | PBP: [1,2] | PBP: [2,1] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemma-2-9B | 0 | 540 | 460 | 840 | 97 | 63 | 937 | 34 | 29 |
| Llama-3.1-8B | 0 | 470 | 530 | 804 | 110 | 86 | 924 | 37 | 39 |
| Mistral-7B | 0 | 550 | 450 | 710 | 206 | 84 | 824 | 93 | 88 |
| Qwen2.5-7B | 0 | 460 | 540 | 728 | 168 | 104 | 885 | 58 | 57 |
| Zephyr-7B | 0 | 510 | 490 | 735 | 142 | 123 | 897 | 50 | 53 |

Table 3: True-label decisions by scheme and model. Counts for *No Selection* (Tie) and strict pairings $[1,2]$ (LLM $\succ$ TradML) and $[2,1]$ (TradML $\succ$ LLM). Baseline exhibits maximal label anchoring (no ties, heavy $[1,2]$), SCoT shows reduced but still substantial anchoring with limited abstention, and PBP sustains high abstention with near-balanced non-tie choices, reflecting minimal outcome-level label susceptibility.

| Model (Judge) | Baseline: No Selection | Baseline: [1,2] | Baseline: [2,1] | SCoT: No Selection | SCoT: [1,2] | SCoT: [2,1] | PBP: No Selection | PBP: [1,2] | PBP: [2,1] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemma-2-9B | 0 | 1000 | 0 | 215 | 680 | 105 | 925 | 39 | 36 |
| Llama-3.1-8B | 0 | 860 | 140 | 198 | 660 | 142 | 920 | 42 | 38 |
| Mistral-7B | 0 | 1000 | 0 | 76 | 920 | 4 | 825 | 90 | 85 |
| Qwen2.5-7B | 0 | 750 | 250 | 103 | 867 | 30 | 872 | 68 | 60 |
| Zephyr-7B | 0 | 1000 | 0 | 128 | 805 | 67 | 895 | 51 | 54 |

### A.3 Decisions with True Labels Revealed.

Table [3](https://arxiv.org/html/2605.23970#Ax2.T3) reports raw counts when the correct labels are shown. The Baseline is maximally label-anchored: it never abstains and selects $[1,2]$ almost exclusively (e.g., Gemma/Mistral/Zephyr 1000:0, Llama 860:140, Qwen 750:250). SCoT introduces some abstention (No Selection 76–215) but still strongly favors the label-advantaged option (e.g., Gemma 680:105, Mistral 920:4), yielding large label-directed shifts. In contrast, PBP maintains high abstention (No Selection 825–925) with small, nearly balanced non-tie choices (e.g., Gemma 39:36, Llama 42:38, Mistral 90:85), indicating minimal label susceptibility in outcomes. Overall, revealing true labels sharply separates the schemes: Baseline $\gg$ SCoT in label anchoring, while PBP largely preserves Blind neutrality and tie behavior.

### A.4 Decisions with FLIP (Misleading) Labels in Summaries.

Table [4](https://arxiv.org/html/2605.23970#Ax2.T4) presents raw counts when summaries carry *FLIP* labels; the identification of candidates remains fixed (1 = LLM, 2 = TradML). The Baseline again shows strong anchoring toward $[1,2]$ despite labels being incorrect (e.g., Gemma 800:200, Llama 760:240, Qwen 750:250, and 1000:0 for Mistral/Zephyr), with no abstention. SCoT introduces limited abstention (No Selection 71–210) but still exhibits substantial label-directed preference (e.g., Gemma 683:107, Llama 645:155, Mistral 898:31, Qwen 861:42, Zephyr 820:57), indicating susceptibility to misleading cues. In contrast, PBP largely preserves Blind behavior: high abstention (No Selection 823–923) and non-tie choices that remain small and near-balanced (e.g., Gemma 39:38, Llama 41:42, Mistral 91:86, Qwen 65:60, Zephyr 50:53). Overall, when exposed to incorrect labels, Baseline and SCoT continue to favor the label-indicated side, whereas PBP resists such anchoring by keeping most mass in *Tie* and limiting outcome shifts.

Table 4: FLIP-label decisions by scheme and model. Counts for *No Selection* (Tie) and strict pairings $[1,2]$ (LLM $\succ$ TradML) and $[2,1]$ (TradML $\succ$ LLM). Despite labels being misleading, Baseline and SCoT still prefer $[1,2]$ strongly, while PBP maintains high abstention and near-balanced non-tie choices, indicating robustness to incorrect cues.

| Model (Judge) | Baseline: No Selection | Baseline: [1,2] | Baseline: [2,1] | SCoT: No Selection | SCoT: [1,2] | SCoT: [2,1] | PBP: No Selection | PBP: [1,2] | PBP: [2,1] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemma-2-9B | 0 | 800 | 200 | 210 | 683 | 107 | 923 | 39 | 38 |
| Llama-3.1-8B | 0 | 760 | 240 | 200 | 645 | 155 | 917 | 41 | 42 |
| Mistral-7B | 0 | 1000 | 0 | 71 | 898 | 31 | 823 | 91 | 86 |
| Qwen2.5-7B | 0 | 750 | 250 | 97 | 861 | 42 | 875 | 65 | 60 |
| Zephyr-7B | 0 | 1000 | 0 | 123 | 820 | 57 | 897 | 50 | 53 |

Table 5: Placebo-label decisions by scheme and model (1 = LLM, 2 = TradML). Counts for *No Selection* (Tie) and strict pairings $[1,2]$ and $[2,1]$. Placebo labels should be irrelevant; nonetheless, Baseline and SCoT exhibit sizable shifts toward one side (and a consistent $[2,1]$ preference for *Qwen2.5-7B*), whereas PBP sustains high abstention and near-balanced non-ties, indicating robustness to irrelevant cues.

| Model (Judge) | Baseline: No Selection | Baseline: [1,2] | Baseline: [2,1] | SCoT: No Selection | SCoT: [1,2] | SCoT: [2,1] | PBP: No Selection | PBP: [1,2] | PBP: [2,1] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemma-2-9B | 0 | 960 | 40 | 240 | 540 | 220 | 928 | 39 | 33 |
| Llama-3.1-8B | 0 | 680 | 320 | 213 | 620 | 167 | 917 | 44 | 39 |
| Mistral-7B | 0 | 1000 | 0 | 95 | 880 | 25 | 816 | 97 | 92 |
| Qwen2.5-7B | 0 | 230 | 770 | 103 | 127 | 770 | 879 | 62 | 59 |
| Zephyr-7B | 0 | 920 | 80 | 140 | 718 | 142 | 890 | 54 | 56 |

Table 6: Decomposition of outcome shifts from Blind to True when both candidates are LLM outputs (two independent samples), while labels still present 1 = LLM and 2 = TradML. $\mathrm{LDS}_T$ denotes movement toward the label-favored side $[1,2]$, $\mathrm{TS}_T$ movement into Tie, and $\mathrm{OLS}_T$ movement toward $[2,1]$. Large $\mathrm{LDS}_T$ with $\mathrm{TS}_T \approx 0$ and $\mathrm{OLS}_T \approx 0$ evidences pure label anchoring; PBP nearly eliminates it.

| Model (Judge) | Baseline $\mathrm{LDS}_T$ | Baseline $\mathrm{TS}_T$ | Baseline $\mathrm{OLS}_T$ | SCoT $\mathrm{LDS}_T$ | SCoT $\mathrm{TS}_T$ | SCoT $\mathrm{OLS}_T$ | PBP $\mathrm{LDS}_T$ | PBP $\mathrm{TS}_T$ | PBP $\mathrm{OLS}_T$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemma-2-9B | 0.509 | 0.000 | 0.000 | 0.583 | 0.000 | 0.041 | 0.005 | 0.000 | 0.007 |
| Llama-3.1-8B | 0.390 | 0.000 | 0.000 | 0.550 | 0.000 | 0.056 | 0.005 | 0.000 | 0.005 |
| Mistral-7B | 0.450 | 0.000 | 0.000 | 0.714 | 0.000 | 0.000 | 0.000 | 0.004 | 0.000 |
| Qwen2.5-7B | 0.290 | 0.000 | 0.000 | 0.699 | 0.000 | 0.000 | 0.010 | 0.000 | 0.003 |
| Zephyr-7B | 0.490 | 0.000 | 0.000 | 0.583 | 0.000 | 0.041 | 0.001 | 0.000 | 0.001 |

Table 7: Decomposition of outcome shifts from Blind under *FLIP* labels when both candidates are LLM outputs (two independent draws). $\mathrm{LDS}_F$ = movement toward $[1,2]$ (label-favored); $\mathrm{TS}_F$ = into Tie; $\mathrm{OLS}_F$ = toward $[2,1]$. PBP $\approx 0$ on all components; Baseline moves slightly opposite the cue; SCoT shows mixed behavior with large label-directed shifts for some models (e.g., Qwen).

| Model (Judge) | Baseline $\mathrm{LDS}_F$ | Baseline $\mathrm{TS}_F$ | Baseline $\mathrm{OLS}_F$ | SCoT $\mathrm{LDS}_F$ | SCoT $\mathrm{TS}_F$ | SCoT $\mathrm{OLS}_F$ | PBP $\mathrm{LDS}_F$ | PBP $\mathrm{TS}_F$ | PBP $\mathrm{OLS}_F$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemma-2-9B | 0.000 | 0.000 | 0.240 | 0.044 | 0.000 | 0.142 | 0.009 | 0.000 | 0.005 |
| Llama-3.1-8B | 0.000 | 0.000 | 0.230 | 0.081 | 0.000 | 0.248 | 0.003 | 0.000 | 0.004 |
| Mistral-7B | 0.000 | 0.000 | 0.250 | 0.000 | 0.000 | 0.028 | 0.000 | 0.006 | 0.000 |
| Qwen2.5-7B | 0.000 | 0.000 | 0.250 | 0.666 | 0.000 | 0.000 | 0.003 | 0.000 | 0.007 |
| Zephyr-7B | 0.000 | 0.000 | 0.210 | 0.019 | 0.000 | 0.225 | 0.000 | 0.003 | 0.000 |

Table 8: Decomposition of outcome shifts from Blind under *placebo* labels with both candidates drawn from the same LLM. $\mathrm{LDS}_P$ = movement toward the cue-favored outcome $[1,2]$; $\mathrm{TS}_P$ = into *Tie*; $\mathrm{OLS}_P$ = toward $[2,1]$. PBP $\approx 0$ across components (best); Baseline shows moderate placebo anchoring; SCoT is most susceptible.

| Model (Judge) | Baseline $\mathrm{LDS}_P$ | Baseline $\mathrm{TS}_P$ | Baseline $\mathrm{OLS}_P$ | SCoT $\mathrm{LDS}_P$ | SCoT $\mathrm{TS}_P$ | SCoT $\mathrm{OLS}_P$ | PBP $\mathrm{LDS}_P$ | PBP $\mathrm{TS}_P$ | PBP $\mathrm{OLS}_P$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemma-2-9B | 0.420 | 0.000 | 0.000 | 0.586 | 0.000 | 0.279 | 0.005 | 0.000 | 0.005 |
| Llama-3.1-8B | 0.180 | 0.000 | 0.084 | 0.560 | 0.000 | 0.302 | 0.005 | 0.000 | 0.004 |
| Mistral-7B | 0.490 | 0.000 | 0.000 | 0.688 | 0.000 | 0.236 | 0.000 | 0.005 | 0.000 |
| Qwen2.5-7B | 0.000 | 0.270 | 0.000 | 0.000 | 0.000 | 0.576 | 0.010 | 0.000 | 0.006 |
| Zephyr-7B | 0.420 | 0.000 | 0.000 | 0.587 | 0.000 | 0.326 | 0.000 | 0.004 | 0.000 |

### A.5 Decisions with Placebo Labels (Placebo labels, 1 = LLM, 2 = TradML).

Table [5](https://arxiv.org/html/2605.23970#Ax2.T5) reports raw outcomes when summaries carry *placebo* labels that should be ignored. A robust judge ought to preserve its Blind behavior, i.e., maintain high abstention and near-balanced non-tie choices. The Baseline fails this desideratum: it never abstains and remains strongly polarized toward $[1,2]$ for most models (e.g., Gemma 960:40, Mistral 1000:0, Zephyr 920:80), with a notable reversal for *Qwen2.5-7B* (230:770 toward $[2,1]$), indicating substantial sensitivity even to Placebo labels. SCoT introduces some abstention (No Selection 95–240) yet still displays large placebo-driven preferences (e.g., Gemma 540:220, Llama 620:167, Mistral 880:25), and again markedly favors $[2,1]$ for *Qwen2.5-7B* (127:770), consistent with strong cue susceptibility. In contrast, PBP largely preserves Blind neutrality: abstention remains high (No Selection 816–928) and the residual non-tie decisions are both small and nearly balanced (e.g., Gemma 39:33, Llama 44:39, Mistral 97:92, Qwen 62:59, Zephyr 54:56). These counts align with the tie-aware, directional analysis ($\mathrm{LAO}_{\text{Placebo}}$ low for PBP, high for Baseline/SCoT): placebo labels spur outcome shifts for Baseline and SCoT but are largely defused by PBP’s evidence-first workflow.

### A.6 Label anchoring when both candidates are LLM (two independent draws).

To isolate *pure* label effects, we compare two summaries sampled independently from the same LLM (candidates 1 and 2 are both LLM outputs) while keeping the display labels fixed as 1 = LLM and 2 = TradML. We decompose the change from Blind to True into *label-directed shift* ($\mathrm{LDS}_T$), *movement into Tie* ($\mathrm{TS}_T$), and *movement toward the opposite side* ($\mathrm{OLS}_T$). As Table [6](https://arxiv.org/html/2605.23970#Ax2.T6) shows, the Baseline channels a large fraction of probability mass toward the label-favored decision $[1,2]$ (e.g., Gemma 0.509, Llama 0.390, Mistral 0.450, Qwen 0.290, Zephyr 0.490) with essentially no countervailing movement ($\mathrm{TS}_T = 0$, $\mathrm{OLS}_T = 0$), indicating strong label anchoring even though the candidates are content-equivalent in provenance. SCoT amplifies this effect: $\mathrm{LDS}_T$ is larger than Baseline for every model (up to 0.714 for Mistral), while $\mathrm{OLS}_T$ is only marginally nonzero (0.041–0.056), yielding a net pull toward the label-favored outcome. In contrast, PBP substantially suppresses label influence: $\mathrm{LDS}_T$ is near zero (Gemma 0.005, Llama 0.005, Mistral 0.000, Qwen 0.010, Zephyr 0.001), with small compensating moves into Tie or the opposite side (e.g., Mistral $\mathrm{TS}_T = 0.004$), effectively preserving Blind behavior. Because the two candidates are generated by the *same* model, these shifts cannot be attributed to content quality; rather, they quantify *rationalization bias* driven solely by the label framing, with PBP offering the strongest mitigation, Baseline moderate susceptibility, and SCoT the highest susceptibility.

Table 9: Head-to-head decisions for candidate 1 (LLM) versus candidate 2 (TradML) across four extractive baselines. Entries are counts of $[1,2]$ (LLM $\succ$ TradML) and $[2,1]$ (TradML $\succ$ LLM). The LLM dominates universally, with only two minor exceptions (Llama vs. SumBasic: 995:5; Qwen vs. LexRank: 970:30).

| Model (Judge) | LexRank: [1,2] | LexRank: [2,1] | TextRank: [1,2] | TextRank: [2,1] | KL-Sum: [1,2] | KL-Sum: [2,1] | SumBasic: [1,2] | SumBasic: [2,1] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemma-2-9B | 1000 | 0 | 1000 | 0 | 1000 | 0 | 1000 | 0 |
| Llama-3.1-8B | 1000 | 0 | 1000 | 0 | 1000 | 0 | 995 | 5 |
| Mistral-7B | 1000 | 0 | 1000 | 0 | 1000 | 0 | 1000 | 0 |
| Qwen2.5-7B | 970 | 30 | 1000 | 0 | 1000 | 0 | 1000 | 0 |
| Zephyr-7B | 1000 | 0 | 1000 | 0 | 1000 | 0 | 1000 | 0 |

Table 10: Identical-summary test: raw counts when both candidates contain the *same* text but are labeled 1 (LLM) and 2 (TradML). The very high abstention and near-balanced rare picks indicate minimal bias when content is controlled.

| Model (Judge) | No Selection | [1,2] | [2,1] |
| --- | --- | --- | --- |
| Gemma-2-9B | 987 | 6 | 7 |
| Llama-3.1-8B | 985 | 9 | 6 |
| Mistral-7B | 976 | 13 | 11 |
| Qwen2.5-7B | 982 | 10 | 8 |
| Zephyr-7B | 981 | 9 | 10 |

### A.7 FLIP-Label Outcome Shift with Same-Model Candidates (LDS/TS/OLS).

Table [7](https://arxiv.org/html/2605.23970#Ax2.T7) decomposes the change from Blind when *FLIP* (misleading) labels are displayed, for the case in which both candidates are independently sampled from the same LLM (labels fixed as 1 = LLM, 2 = TradML). We report $\mathrm{LDS}_F$ (positive movement toward the label-favored side $[1,2]$), $\mathrm{TS}_F$ (into Tie), and $\mathrm{OLS}_F$ (toward the opposite side $[2,1]$). The Baseline shows *no* label-ward movement ($\mathrm{LDS}_F = 0$ across models) and a moderate shift *against* the misleading cue ($\mathrm{OLS}_F \approx 0.21$–$0.25$), with no mass going to Tie ($\mathrm{TS}_F = 0$). SCoT is mixed but notably vulnerable: while some models still move opposite the cue (e.g., Gemma $\mathrm{OLS}_F = 0.142$, Llama 0.248, Zephyr 0.225), others exhibit substantial *label-directed* movement despite the labels being incorrect (e.g., Qwen $\mathrm{LDS}_F = 0.666$; Llama $\mathrm{LDS}_F = 0.081$). In contrast, PBP keeps all components near zero ($\mathrm{LDS}_F \leq 0.009$, $\mathrm{TS}_F \leq 0.006$, $\mathrm{OLS}_F \leq 0.007$), effectively preserving Blind behavior under misleading cues. Overall, these results confirm that PBP minimizes framing-only rationalization under FLIP labels, Baseline exhibits modest corrective movement away from the cue, and SCoT remains the most susceptible to misleading label pressure.

### A.8 Placebo-Label Outcome Shift with Same-Model Candidates (LDS/TS/OLS).

Table [8](https://arxiv.org/html/2605.23970#Ax2.T8) decomposes the change from Blind when *placebo* (irrelevant) labels are displayed, holding provenance fixed by sampling both candidates from the same LLM (1 = LLM, 2 = TradML). We report $\mathrm{LDS}_P$ (positive movement toward the cue-favored side $[1,2]$), $\mathrm{TS}_P$ (movement into *Tie*), and $\mathrm{OLS}_P$ (movement toward the opposite side $[2,1]$). Ideally, placebo cues should be ignored, yielding $\mathrm{LDS}_P \approx 0$ and at most small $\mathrm{TS}_P$ (benign) or $\mathrm{OLS}_P$ (counter-cue). The Baseline shows moderate placebo anchoring with sizeable $\mathrm{LDS}_P$ for most judges (Gemma 0.420, Mistral 0.490, Zephyr 0.420; Llama 0.180 plus some opposite shift $\mathrm{OLS}_P = 0.084$), while *Qwen2.5-7B* pushes mass into *Tie* ($\mathrm{TS}_P = 0.270$), reflecting partial cue immunity. SCoT is markedly susceptible: $\mathrm{LDS}_P$ is large across models (Gemma 0.586, Llama 0.560, Mistral 0.688, Zephyr 0.587), with nontrivial opposite shifts for some (e.g., Gemma $\mathrm{OLS}_P = 0.279$, Llama 0.302, Zephyr 0.326), and a striking counter-cue for *Qwen2.5-7B* ($\mathrm{OLS}_P = 0.576$) but *no* move into *Tie*. In contrast, PBP remains near Blind: all components are tiny ($\mathrm{LDS}_P \leq 0.010$, $\mathrm{TS}_P \leq 0.005$, $\mathrm{OLS}_P \leq 0.006$), indicating that Placebo labels are largely ignored. Overall, placebo labels elicit outcome shifts for Baseline and especially SCoT, whereas PBP effectively neutralizes them.

### A.9 Sanity check with identical summaries (labels only).

To isolate pure label effects, we present *identical* summaries while assigning them different labels (1 = LLM, 2 = TradML). As shown in Table [10](https://arxiv.org/html/2605.23970#Ax2.T10), judges overwhelmingly abstain (*No Selection* = 976–987/1000), yielding very high equality detection ($\mathrm{EDR} \approx 0.976$–$0.987$). The residual non-tie choices are rare and nearly balanced (e.g., Gemma 6 vs. 7; Llama 9 vs. 6; Zephyr 9 vs. 10), implying a near-zero neutrality deviation and confirming that, when content is strictly controlled, decisions are not driven by the label identity. This sanity check supports our interpretation of label-anchoring results: outcome shifts observed under True/Flip/Placebo arise from cue susceptibility rather than an inherent preference for the LLM (1) or TradML (2) label.

### A.10 LLM vs. Classical Extractive Baselines.

We compare candidate 1 (LLM) against candidate 2 (TradML) across four extractive baselines, *LexRank*, *TextRank*, *KL-Sum*, and *SumBasic*. Across judges and datasets, the preference is overwhelmingly in favor of the LLM summaries. For Gemma-2-9B, Mistral-7B, and Zephyr-7B, the LLM is selected in *all* evaluations for all four baselines ($[1,2] = 1000$, $[2,1] = 0$). Llama-3.1-8B shows the same unanimity for LexRank, TextRank, and KL-Sum (1000:0 each), with a single minor deviation against SumBasic (995:5). Qwen2.5-7B is likewise unanimous for TextRank, KL-Sum, and SumBasic (1000:0), and remains decisively in favor of the LLM against LexRank (970:30). These results indicate a consistent and substantive quality advantage of LLM-generated summaries over classical extractive methods, rather than an effect of label susceptibility: when content differs, judges nearly always prefer the LLM output.
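The tie-aware quantities quoted in A.2 and A.6 can be recomputed from raw outcome counts. The Python sketch below is one possible reading, not the paper's reference implementation: it assumes that equality detection is the share of *Tie* decisions, that neutrality deviation is the absolute gap between the two strict outcomes divided by the total, and that each shift component is the positive part of the change in an outcome's share relative to Blind. These assumptions are inferred from the ranges reported in this appendix, and the names `Counts`, `edr`, `nd`, and `shift` are illustrative.

```python
from typing import NamedTuple


class Counts(NamedTuple):
    """Raw outcome counts for one judge under one scheme and one condition."""
    tie: int   # "No Selection"
    n12: int   # [1,2]: candidate 1 preferred
    n21: int   # [2,1]: candidate 2 preferred

    @property
    def total(self) -> int:
        return self.tie + self.n12 + self.n21


def edr(c: Counts) -> float:
    """Equality detection rate, read here as the share of Tie decisions (assumption)."""
    return c.tie / c.total


def nd(c: Counts) -> float:
    """Neutrality deviation, read here as |n12 - n21| / total (assumption)."""
    return abs(c.n12 - c.n21) / c.total


def shift(blind: Counts, cued: Counts) -> dict:
    """Positive-part change of each outcome share from Blind to a cued condition (assumption)."""
    def gain(after: float, before: float) -> float:
        return max(0.0, after - before)

    return {
        "LDS": gain(cued.n12 / cued.total, blind.n12 / blind.total),  # toward [1,2]
        "TS": gain(cued.tie / cued.total, blind.tie / blind.total),   # into Tie
        "OLS": gain(cued.n21 / cued.total, blind.n21 / blind.total),  # toward [2,1]
    }


# Llama-3.1-8B under SCoT: counts from Table 2 (Blind) and Table 3 (True labels).
blind = Counts(tie=804, n12=110, n21=86)
true_ = Counts(tie=198, n12=660, n21=142)

print(round(edr(blind), 3), round(nd(blind), 3))                  # 0.804 0.024
print({k: round(v, 3) for k, v in shift(blind, true_).items()})   # LDS 0.55, TS 0.0, OLS 0.056
```

For these inputs the sketch gives a tie share of 0.804 and a neutrality deviation of 0.024, both inside the SCoT ranges quoted in A.2, and shift components of 0.550, 0.000, and 0.056, which coincide with the Llama SCoT cell of Table 6. That table is described as coming from the same-LLM setup, so the agreement is best treated as a plausibility check on the assumed formulas rather than a derivation.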
