Can LLMs Judge Better Than They Generate? Evaluating Task Asymmetry, Mechanistic Interpretability and Transferability for In-Context QA
Summary
This paper tests the assumption that LLMs judge better than they generate in in-context QA, finding generation accuracy exceeds self-evaluation on most benchmarks, with evaluation attending less to context. The findings challenge core assumptions in self-evaluation pipelines.
# Can LLMs Judge Better Than They Generate? Evaluating Task Asymmetry, Mechanistic Interpretability and Transferability for In-Context QA
Source: [https://arxiv.org/html/2606.28050](https://arxiv.org/html/2606.28050)
###### Abstract
LLM-as-a-Judge and self-evaluation pipelines implicitly assume that evaluation is easier than generation. We test this in a controlled in-context QA setting where a context passage is the sole information source and each model judges the answer it generated, removing the parametric-knowledge confound of open-domain comparisons. Across four benchmarks (SQuAD 2.0, DROP, HotpotQA, MuSiQue) and two models, evaluation is not uniformly easier: generation accuracy exceeds self-evaluation on three of four, with multi-hop MuSiQue the exception. Attention analysis reveals why: evaluation attends to context 3–5× less than generation does and barely reads the candidate answer. LoRA fine-tuning confirms the asymmetry is not a training artifact: generation fine-tuning induces over-acceptance and evaluation fine-tuning degrades generation. These findings challenge core assumptions in self-evaluation pipelines.
Sambaran Bandyopadhyay, Adobe Research, sambaranb@adobe.com
## 1 Introduction
LLMs are now deployed not just as generators of content but as evaluators of it, a shift with sweeping implications for both research and industry. LLM-as-a-Judge pipelines (Zheng et al., [2023](https://arxiv.org/html/2606.28050#bib.bib7); Liu et al., [2023](https://arxiv.org/html/2606.28050#bib.bib8)) have rapidly become the workhorse of large-scale text evaluation, powering leaderboards, automatic benchmarking, and production quality assessment by substituting for prohibitively costly human annotation (Bavaresco et al., [2025](https://arxiv.org/html/2606.28050#bib.bib6)). Self-reflection and self-correction frameworks (Asai et al., [2024](https://arxiv.org/html/2606.28050#bib.bib18); Bai et al., [2022](https://arxiv.org/html/2606.28050#bib.bib17); Huang et al., [2024](https://arxiv.org/html/2606.28050#bib.bib9)) use the same model to critique and iteratively revise its own outputs, underpinning the agentic and reasoning systems that have proliferated in recent product releases. Reinforcement learning from human feedback (Ouyang et al., [2022](https://arxiv.org/html/2606.28050#bib.bib16)) relies on a reward model, often itself an LLM, to provide the preference signals that align frontier models to human values; it is foundational to the post-training pipeline of nearly every modern instruction-tuned model. Together, these uses make LLM-based evaluation not a peripheral capability but a load-bearing component of how LLMs are trained, deployed, and trusted. Yet all of them rest on an *implicit foundational assumption*, rarely tested directly: that LLMs can reliably, and often more easily, judge the correctness of an answer than produce one from scratch.
Recent empirical work has pushed back. Oh et al. ([2024](https://arxiv.org/html/2606.28050#bib.bib3)) find LLMs achieve *lower* accuracy judging generated answers than producing them on TriviaQA; Jiang et al. ([2025](https://arxiv.org/html/2606.28050#bib.bib4)) show discriminative selection among self-generated candidates is not reliably superior to direct generation; and Lin et al. ([2025](https://arxiv.org/html/2606.28050#bib.bib2)) find only weak correlation between generation and judgment ability across 21 tasks. These works share an important confound, however: they operate in the open-domain setting, where both tasks draw on parametric knowledge. A model that has memorized a fact can evaluate answers about it even when free recall fails, so any observed gap may reflect differential memory retrieval rather than intrinsic task difficulty.
#### Our contributions\.
We address this gap with a controlled in-context QA framework and three complementary studies. First, we measure the gap between generation accuracy (GA) and evaluation accuracy (EA) on four benchmarks spanning extractive, multi-hop, and numerical reasoning, where each model judges the answer it just generated: a direct test of self-evaluation. Second, we probe last-token attention patterns on both tasks to understand why the gap exists; to our knowledge, this is the first application of mechanistic interpretability tools to the generation–evaluation gap. Third, we fine-tune with LoRA on each task individually and jointly, then evaluate all checkpoints on both tasks, testing whether the parametric structure for generation and evaluation is shared or distinct.
We scope the study to in-context factual QA with explicit gold answers: correctness is binary and unambiguous, sidestepping the subjectivity of long-form evaluation (Nandy and Bandyopadhyay, [2025](https://arxiv.org/html/2606.28050#bib.bib1)), and the context passage as the sole information source ensures any GA–EA gap reflects the relative difficulty of synthesis and self-verification rather than differential memory retrieval.
## 2 Related Work
#### LLMs as evaluators in modern NLP pipelines\.
LLMs are now used as judges, critics, and reward models throughout the model development lifecycle. LLM-as-a-Judge frameworks (Zheng et al., [2023](https://arxiv.org/html/2606.28050#bib.bib7); Liu et al., [2023](https://arxiv.org/html/2606.28050#bib.bib8); Bavaresco et al., [2025](https://arxiv.org/html/2606.28050#bib.bib6)) score or rank generated text at scale, replacing costly human annotation. Self-reflection and self-correction pipelines (Asai et al., [2024](https://arxiv.org/html/2606.28050#bib.bib18); Bai et al., [2022](https://arxiv.org/html/2606.28050#bib.bib17); Huang et al., [2024](https://arxiv.org/html/2606.28050#bib.bib9)) have a model critique and revise its own outputs. LLM-based reward models drive alignment via RLHF (Ouyang et al., [2022](https://arxiv.org/html/2606.28050#bib.bib16)), supplying preference signals for post-training. Known biases such as self-preference (Panickssery et al., [2024](https://arxiv.org/html/2606.28050#bib.bib10)) make systematic characterization of LLM evaluation behavior consequential.
#### Generation–evaluation asymmetry\.
A growing literature compares the same model’s generation and evaluation abilities directly. Oh et al. ([2024](https://arxiv.org/html/2606.28050#bib.bib3)) find that LLMs perform worse at evaluating their TriviaQA generations than at producing them. Jiang et al. ([2025](https://arxiv.org/html/2606.28050#bib.bib4)) show that discriminative selection among self-generated candidates is not reliably better than direct generation. Lin et al. ([2025](https://arxiv.org/html/2606.28050#bib.bib2)) examine 21 tasks across 11 models and report only weak generation–judgment correlation. All three operate in the open-domain setting where parametric knowledge confounds task difficulty; we complement them by restricting both tasks to a shared in-context source and adding mechanistic and transfer-based analyses.
#### Mechanistic interpretability of LLM decisions\.
Mechanistic interpretability seeks to explain how transformers route information internally. Elhage et al. ([2021](https://arxiv.org/html/2606.28050#bib.bib23)) formalize attention heads as read/write circuits; Olsson et al. ([2022](https://arxiv.org/html/2606.28050#bib.bib24)) identify induction heads as a key mechanism for in-context learning. Meng et al. ([2022](https://arxiv.org/html/2606.28050#bib.bib25)) and Geva et al. ([2023](https://arxiv.org/html/2606.28050#bib.bib27)) localize factual associations to MLP layers, and Belrose et al. ([2023](https://arxiv.org/html/2606.28050#bib.bib26)) show that semantic representations consolidate in later layers of decoder-only transformers. None of these works target the evaluation task or the generation–evaluation gap specifically.
#### Cross\-task transfer with parameter\-efficient fine\-tuning\.
LoRA (Hu et al., [2022](https://arxiv.org/html/2606.28050#bib.bib22)) adapters isolate the parametric structure each task induces by training on one task and evaluating on another. Dymkiewicz et al. ([2025](https://arxiv.org/html/2606.28050#bib.bib19)) document asymmetric donor–recipient transfer patterns across NLP benchmarks. We apply this setup to test whether generation and evaluation share parametric structure within a single QA domain.
## 3 Methodology
We address the three studies introduced above in Sections [3.1](https://arxiv.org/html/2606.28050#S3.SS1)–[3.3](https://arxiv.org/html/2606.28050#S3.SS3). Figure [1](https://arxiv.org/html/2606.28050#S3.F1) sketches the shared pipeline they build on.
Figure 1: Core task-asymmetry pipeline. Model $\mathcal{L}$ is tested on two tasks per instance: generation ($\mathcal{T}_{\text{gen}}$) produces answer $a$ from $(c,q)$; self-evaluation ($\mathcal{T}_{\text{eval}}$) judges whether $a$ is correct given $(c,q,a)$. Oracle $\mathcal{L}^{*}$ scores $a$ against gold $a^{*}$, yielding $y^{*}$ as ground truth for both metrics: GA is $P(y^{*}=\textsc{Correct})$ and EA is $P(y=y^{*})$. Dashed arrows are data-passing operations with no LLM call. $\boldsymbol{\Delta}=\mathrm{EA}-\mathrm{GA}$ is the primary asymmetry measure. The mechanistic and transferability studies extend this pipeline.
### 3.1 Task Asymmetry Analysis
Each benchmark instance is a triple $(c,q,a^{*})$, where $c$ is a context passage, $q$ is a question, and $a^{*}$ is the gold answer synthesized directly from $c$. We measure a language model $\mathcal{L}$ on two sequential tasks over the same instances.
#### Generation task ($\mathcal{T}_{\text{gen}}$).
Given $(c,q)$, $\mathcal{L}$ produces a free-text answer:
$$a=\mathcal{L}(c,\,q). \tag{1}$$
#### Evaluation task ($\mathcal{T}_{\text{eval}}$).
The generated answer $a$ from $\mathcal{T}_{\text{gen}}$ is passed directly to $\mathcal{L}$ as a candidate to be judged. Given $(c,q,a)$, $\mathcal{L}$ produces a binary judgment:
$$y=\mathcal{L}(c,\,q,\,a)\;\in\;\{\textsc{Correct},\,\textsc{Incorrect}\}. \tag{2}$$
This sequential design makes $\mathcal{T}_{\text{eval}}$ a self-evaluation task: $\mathcal{L}$ judges the quality of an answer it has just produced. The same sampled instances are used for both tasks, ensuring direct comparability. We summarize the resulting asymmetry by $\Delta=\mathrm{EA}-\mathrm{GA}$, with EA and GA formally defined in Section [4.4](https://arxiv.org/html/2606.28050#S4.SS4).
#### Oracle scoring.
We use an oracle LLM $\mathcal{L}^{*}$ to score the generated answer $a$ against the gold answer $a^{*}$:
$$y^{*}=\mathcal{L}^{*}(c,\,q,\,a,\,a^{*})\;\in\;\{\textsc{Correct},\,\textsc{Incorrect}\}. \tag{3}$$
$y^{*}$ serves a dual purpose: it defines generation accuracy (whether $a$ is correct) and provides the ground-truth evaluation label against which $\mathcal{L}$’s judgment $y$ is compared. Using a single oracle signal for both tasks ensures GA and EA are grounded in a consistent external reference.
#### Prompting\.
Both tasks use zero-shot prompting with fixed templates. The generation prompt instructs $\mathcal{L}$ to answer using only the provided context, responding concisely. The evaluation prompt presents $(c,q,a)$ and instructs $\mathcal{L}$ to judge whether $a$ is correct given the context, responding with exactly “Correct” or “Incorrect.” Identical context formatting is used across both tasks.
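The per-instance flow of Section 3.1 (generate, score with the oracle, then self-evaluate the same answer) can be sketched as follows. This is a minimal illustration, not the authors' code: the `Record` and `Outcome` types and the three callables `generate`, `oracle`, and `self_evaluate` are hypothetical stand-ins for the prompted model calls.

```python
from typing import Callable, NamedTuple


class Record(NamedTuple):
    context: str   # c
    question: str  # q
    gold: str      # a*


class Outcome(NamedTuple):
    answer: str        # a, produced on the generation task
    oracle_label: str  # y*, the oracle's verdict on a
    self_label: str    # y, the model's verdict on its own answer


def run_instance(
    rec: Record,
    generate: Callable[[str, str], str],          # hypothetical: (c, q) -> a
    oracle: Callable[[str, str, str, str], str],  # hypothetical: (c, q, a, a*) -> y*
    self_evaluate: Callable[[str, str, str], str],  # hypothetical: (c, q, a) -> y
) -> Outcome:
    """Run generation, oracle scoring, and self-evaluation on one instance."""
    a = generate(rec.context, rec.question)
    y_star = oracle(rec.context, rec.question, a, rec.gold)
    # The evaluator sees the context, question, and candidate, never the gold answer.
    y = self_evaluate(rec.context, rec.question, a)
    return Outcome(a, y_star, y)
```

Passing the model calls in as functions keeps the sketch independent of any particular inference API; only the oracle receives $a^{*}$.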
### 3.2 Mechanistic Interpretation
To understand the mechanism behind any observed gap, we examine how $\mathcal{L}$ internally allocates attention on the two tasks. For each prompt we perform a single forward pass with attention outputs enabled and extract the last-token attention row, that is, the distribution over all input positions the model uses when predicting the first output token. This is the natural decision point for both tasks, since $\mathcal{T}_{\text{gen}}$ and $\mathcal{T}_{\text{eval}}$ both produce short outputs (typically 1–3 tokens for $\mathcal{T}_{\text{gen}}$ gold answers and a single token for $\mathcal{T}_{\text{eval}}$), making the first-token decision the most consequential one.
We measure the fraction of attention mass falling on three labeled prompt spans: context $c$, question $q$, and candidate answer $a$ (the latter only in $\mathcal{T}_{\text{eval}}$). This yields a span-level fingerprint of each task’s allocation strategy. Attention is averaged over all heads and over the final eight transformer layers (layers 24–31 of the 32-layer Llama-3.1-8B-Instruct). Late layers in decoder-only transformers consolidate high-level semantic and task-relevant representations, while earlier layers focus on lexical and positional features (Belrose et al., [2023](https://arxiv.org/html/2606.28050#bib.bib26); Geva et al., [2023](https://arxiv.org/html/2606.28050#bib.bib27)); we therefore restrict the ratio computation to this band to isolate the semantic, decision-relevant attention pattern from low-level token-matching behavior.
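The span-level measurement can be illustrated with a small pure-Python sketch. It assumes the last-token attention rows have already been extracted into a nested list indexed `[layer][head][position]` and that span boundaries are given as token index pairs; the function name, argument layout, and half-open span convention are illustrative assumptions rather than the paper's implementation.

```python
from typing import Dict, List, Tuple


def span_attention_fractions(
    last_token_attn: List[List[List[float]]],  # assumed layout: [layer][head][position]
    spans: Dict[str, Tuple[int, int]],         # span name -> (start, end), end exclusive
    layers: range = range(24, 32),             # layers 24-31, as stated in the text
) -> Dict[str, float]:
    """Average the last-token attention row over heads and the chosen layers,
    then sum the averaged mass that falls inside each labeled span."""
    n_pos = len(last_token_attn[0][0])
    mean_row = [0.0] * n_pos
    n_rows = 0
    for layer in layers:
        for head_row in last_token_attn[layer]:
            for i, weight in enumerate(head_row):
                mean_row[i] += weight
            n_rows += 1
    mean_row = [w / n_rows for w in mean_row]
    return {name: sum(mean_row[s:e]) for name, (s, e) in spans.items()}
```

Because each attention row sums to one, the returned fractions are directly comparable across prompts of different lengths, and whatever mass is not assigned to a labeled span belongs to the remaining prompt tokens.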
### 3.3 Transferability Analysis
To test whether the parametric structure used for generation and evaluation is shared, we fine-tune $\mathcal{L}$ with LoRA adapters in three configurations:
- LoRA-Gen: trained only on $(c,q)\to a^{*}$ examples.
- LoRA-Eval: trained only on $(c,q,a)\to y^{*}$ examples, with $a$ drawn either from a hard-negative pool (described below) or, when unavailable, from answer rotation.
- LoRA-Both: trained on a balanced 1:1 union of the two task formats.
Each adapter is then evaluated on *both* $\mathcal{T}_{\text{gen}}$ and $\mathcal{T}_{\text{eval}}$ using the same prompting templates and oracle scoring described in Section [3.1](https://arxiv.org/html/2606.28050#S3.SS1). Asymmetric cross-task gains would indicate that the two abilities draw on partially distinct parametric resources; symmetric gains would indicate a shared underlying capability.
#### Hard\-negative generation for evaluator training\.
A naive negative for $\mathcal{T}_{\text{eval}}$ training (an arbitrarily picked wrong answer) is trivially distinguishable from a correct one and produces an evaluator that does not generalize to plausible mistakes. We instead generate plausible but incorrect candidates by querying $\mathcal{L}$ *without* the context passage, forcing it to rely on parametric memory and produce hallucinated answers. Up to five hallucinations are attempted per record; the first whose answer does *not* match the gold $a^{*}$ (judged by the same oracle $\mathcal{L}^{*}$) is kept as the hard negative for that record. Refusal-style outputs (e.g. “I need more information”) are filtered, since they teach the evaluator to detect non-answers rather than to verify factual content; records that yield no usable hard negative fall back to answer rotation, where $a$ is sampled from another instance’s gold answer. The resulting training set thus mirrors the distribution of mistakes the evaluator will face at test time, where it judges actual model-generated answers.
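The hard-negative procedure above can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: `answer_without_context` and `matches_gold` are hypothetical callables standing in for the context-free model query and the oracle check, the `REFUSAL_MARKERS` list is invented for the example, and the sketch assumes that a filtered refusal still counts toward the five attempts.

```python
import random
from typing import Callable, Optional, Sequence

# Hypothetical refusal filter; the source only gives "I need more information" as an example.
REFUSAL_MARKERS = ("i need more information", "i cannot answer", "i don't know")


def mine_hard_negative(
    question: str,
    gold: str,
    answer_without_context: Callable[[str], str],  # hypothetical: query the model with no passage
    matches_gold: Callable[[str, str], bool],      # hypothetical: oracle check against a*
    max_attempts: int = 5,
) -> Optional[str]:
    """Return the first non-refusal hallucination that does not match the gold answer."""
    for _ in range(max_attempts):
        candidate = answer_without_context(question)
        if any(marker in candidate.lower() for marker in REFUSAL_MARKERS):
            continue  # assumption: a refusal consumes one attempt
        if not matches_gold(candidate, gold):
            return candidate
    return None  # caller falls back to answer rotation


def rotation_negative(index: int, golds: Sequence[str], rng: random.Random) -> str:
    """Fallback: sample another instance's gold answer.
    Skipping golds identical to the current one is an added assumption."""
    others = [g for j, g in enumerate(golds) if j != index and g != golds[index]]
    return rng.choice(others)
```

A caller would try `mine_hard_negative` first and use `rotation_negative` only when it returns `None`, mirroring the fallback order described in the text.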
## 4 Experimental Setup
### 4.1 Datasets
We use the validation split of each dataset exclusively for the task-asymmetry and attention analyses, and the training split only for LoRA fine-tuning. A detailed summary of dataset usage, including per-experiment sample counts and the rationale for downsampling, is given in Appendix [B](https://arxiv.org/html/2606.28050#A2) (Table [8](https://arxiv.org/html/2606.28050#A2.T8)).
SQuAD 2.0 (Rajpurkar et al., [2018](https://arxiv.org/html/2606.28050#bib.bib12)) contains extractive reading comprehension over Wikipedia paragraphs; we retain only answerable questions, yielding 5,928 instances. Both tasks make minimal synthesis demands, providing our most extractive benchmark.
DROP (Dua et al., [2019](https://arxiv.org/html/2606.28050#bib.bib14)) tests discrete numerical reasoning over passage-length texts; we retain number-type answers, yielding 5,889 instances. Generating the correct number requires multi-step arithmetic; evaluating it reduces to numeric comparison.
HotpotQA (Yang et al., [2018](https://arxiv.org/html/2606.28050#bib.bib13)) requires 2-hop reasoning across two supporting Wikipedia passages plus eight distractors (all provided); we retain non-yes/no questions, yielding 6,947 instances.
MuSiQue (Trivedi et al., [2022](https://arxiv.org/html/2606.28050#bib.bib11)) is a multi-hop benchmark with annotated hop counts (1,252 2-hop, 760 3-hop, and 405 4-hop questions), allowing us to stratify the gap by reasoning depth. We use the full answerable validation split of 2,417 instances, with 20 paragraphs provided per question.
### 4.2 Models
We evaluate two models as $\mathcal{L}$: Llama-3.1-8B-Instruct (Dubey et al., [2024](https://arxiv.org/html/2606.28050#bib.bib21)), a capable open-source model at modest scale, and GPT-4o-mini, a stronger proprietary model. Comparing these two lets us examine whether the generation–evaluation gap varies with capability. We use GPT-4o as the oracle $\mathcal{L}^{*}$ throughout (Zheng et al., [2023](https://arxiv.org/html/2606.28050#bib.bib7)). For the attention analysis, we use only the open-source Llama-3.1-8B-Instruct, as it provides the internal attention weights required by the analysis.
### 4.3 Implementation Details
All models are queried at temperature $\tau=0$. We sample 500 instances uniformly from each validation split. Each instance is processed sequentially: $\mathcal{L}$ generates answer $a$, the oracle assigns $y^{*}$, and $\mathcal{L}$ evaluates the same $a$. Responses that do not begin with Correct or Incorrect are treated as abstentions and excluded from metric computation.
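The abstention rule can be made concrete with a short parser. This is a sketch, assuming a case-sensitive prefix match after stripping leading whitespace; both choices are assumptions, since the text only states that responses must begin with one of the two labels.

```python
from typing import Optional


def parse_verdict(response: str) -> Optional[str]:
    """Map a raw evaluator response to "Correct", "Incorrect", or None (abstention)."""
    text = response.strip()  # assumption: surrounding whitespace is ignored
    for label in ("Correct", "Incorrect"):
        if text.startswith(label):
            return label
    return None  # abstention: excluded from metric computation
```

Returning `None` rather than guessing a label keeps off-format responses out of the accuracy denominators instead of silently counting them as errors.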
For the attention analysis, we expose raw attention weights via a single forward pass per prompt and extract the last-token attention row. Span boundaries are identified via character-to-token offset mapping. Analysis uses the 184 jointly verified samples (Section [5.1](https://arxiv.org/html/2606.28050#S5.SS1)).
For the transferability study, we fine-tune Llama-3.1-8B-Instruct with LoRA (rank 16, $\alpha=32$) on the query, key, value, and output projections, training on 5,000 instances per dataset. Full hyperparameters and hardware details are in Appendix [A](https://arxiv.org/html/2606.28050#A1).
### 4.4 Metrics
For each instance, we have the oracle label $y^{*}$ and $\mathcal{L}$’s judgment $y$, from which we compute:
Generation Accuracy (GA): $\mathrm{GA}=\frac{1}{n}\sum_{i}\mathbf{1}[y^{*}_{i}=\textsc{Correct}]$, the fraction of generated answers judged correct by the oracle.
Evaluation Accuracy (EA): $\mathrm{EA}=\frac{1}{n}\sum_{i}\mathbf{1}[y_{i}=y^{*}_{i}]$, the fraction where $\mathcal{L}$’s self-judgment matches the oracle.
Gap: $\boldsymbol{\Delta}=\mathrm{EA}-\mathrm{GA}$ is our primary asymmetry measure (positive $\Rightarrow$ self-evaluation exceeds generation).
Evaluation Precision, Recall, and F1 (EP / ER / EF1) characterize $\mathcal{L}$ as a binary classifier with Correct as the positive class, and are used to diagnose over-acceptance bias (Liu et al., [2023](https://arxiv.org/html/2606.28050#bib.bib8)).
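The metrics above can be computed from the two label sequences as in the following sketch. It assumes abstentions are encoded as `None` and dropped before every metric, GA included; how abstentions enter the GA denominator is an assumption, and the function name is illustrative.

```python
from typing import Dict, List, Optional


def asymmetry_metrics(oracle: List[str], judged: List[Optional[str]]) -> Dict[str, float]:
    """Compute GA, EA, Delta, and EP/ER/EF1 with "Correct" as the positive class."""
    # Assumption: abstentions (None) are removed before computing every metric.
    pairs = [(o, j) for o, j in zip(oracle, judged) if j is not None]
    n = len(pairs)
    ga = sum(o == "Correct" for o, _ in pairs) / n
    ea = sum(o == j for o, j in pairs) / n
    tp = sum(o == "Correct" and j == "Correct" for o, j in pairs)
    fp = sum(o == "Incorrect" and j == "Correct" for o, j in pairs)
    fn = sum(o == "Correct" and j == "Incorrect" for o, j in pairs)
    ep = tp / (tp + fp) if tp + fp else 0.0
    er = tp / (tp + fn) if tp + fn else 0.0
    ef1 = 2 * ep * er / (ep + er) if ep + er else 0.0
    return {"GA": ga, "EA": ea, "delta": ea - ga, "EP": ep, "ER": er, "EF1": ef1}
```

A high ER combined with a low EP signals over-acceptance, while the reverse pattern signals a conservative evaluator.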
## 5 Results and Discussion
We organize our findings around five research questions, beginning with the reliability of the oracle that grounds GA and EA\.
### 5.1 R1: Is the Oracle LLM Reliable?
Before turning to the substantive questions, we validate the oracle $\mathcal{L}^{*}$ that grounds both GA and EA. A concern with any LLM-as-oracle framework is that the oracle’s judgments may themselves be unreliable. In our setting this concern is mitigated by the nature of the task: answers on all four benchmarks are short factual responses (typically a single word, number, or key phrase) paired with an explicit gold answer $a^{*}$, making the oracle’s task closer to fuzzy string matching than open-ended evaluation.
To quantify this empirically, we randomly sampled 50 instances per dataset (200 total) from the Llama-3.1-8B-Instruct results and presented each $(c,q,a,a^{*})$ tuple to two judges independently: GPT-4o (our main oracle) and GPT-5.4 (a stronger, more recent model, used as a super-oracle reference). Table [1](https://arxiv.org/html/2606.28050#S5.T1) reports inter-model agreement.
Table 1: GPT-4o vs. GPT-5.4 oracle agreement on 50 sampled Llama-3.1-8B-Instruct outputs per dataset. FP = GPT-4o Correct, GPT-5.4 Incorrect; FN = opposite. GPT-4o intra-model consistency (fresh vs. stored call) is 99.5% across all 200 samples.
GPT-4o and GPT-5.4 agree on 92% of judgments overall, with near-perfect agreement on extractive SQuAD 2.0 (98%) and no systematic directional bias across the 16 disagreements. GPT-4o’s re-call consistency is 99.5%. We therefore treat GPT-4o as a reliable oracle and use the 184 jointly verified samples (where both judges agree) as the basis for the attention analysis.
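The agreement check and the selection of jointly verified samples can be sketched as follows, assuming each judge's verdicts are stored as parallel lists of "Correct"/"Incorrect" strings; the function name and return layout are illustrative.

```python
from typing import Dict, List


def oracle_agreement(main: List[str], reference: List[str]) -> Dict[str, object]:
    """Compare the main oracle with a reference judge on the same outputs."""
    pairs = list(zip(main, reference))
    verified = [i for i, (m, r) in enumerate(pairs) if m == r]
    fp = sum(m == "Correct" and r == "Incorrect" for m, r in pairs)  # main accepts, reference rejects
    fn = sum(m == "Incorrect" and r == "Correct" for m, r in pairs)  # main rejects, reference accepts
    return {
        "agreement": len(verified) / len(pairs),
        "fp": fp,
        "fn": fn,
        "verified": verified,  # indices where both judges agree
    }
```

With 200 samples and 16 disagreements, this yields 184 verified indices and an agreement rate of 184/200 = 0.92, matching the figures reported above.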
### 5.2 R2: Is Self-Evaluation Easier Than Generation?
Table 2: Generation Accuracy (GA) and Evaluation Accuracy (EA) (%) per model–dataset pair. $\Delta=\mathrm{EA}-\mathrm{GA}$; positive indicates self-evaluation exceeds generation.
Table [2](https://arxiv.org/html/2606.28050#S5.T2) reports GA, EA and $\Delta$ across all model–dataset pairs. The answer to our titular question is: not in general. $\Delta<0$ on three of four benchmarks for both models, indicating that generation accuracy *exceeds* self-evaluation accuracy in the majority of settings. The lone exception is MuSiQue, where $\Delta=+3.6$ (Llama) and $+4.0$ (GPT-4o-mini).
#### Task\-type dependence\.
The pre-experiment intuition that verifying a numeric answer reduces to exact-match checking is sharply overturned on DROP: $\Delta=-1.0$ (Llama) and $-5.8$ (GPT-4o-mini), indicating that numeric self-evaluation is *harder*, not easier, than generation. A plausible explanation is that a model relying on a faulty arithmetic procedure to produce an answer cannot detect the resulting error during evaluation, because the same flawed procedure underlies its verification attempt. The largest negative gap overall belongs to HotpotQA for Llama ($-14.2$); the presence of the candidate answer $a$ in the evaluation prompt appears to disrupt the model’s ability to independently trace the 2-hop reasoning chain needed to assess it. SQuAD 2.0 shows the smallest absolute gaps for both models, consistent with minimal synthesis demands in the extractive setting.
Table 3: Full evaluation breakdown: Precision (EP), Recall (ER), and F1 (EF1, %) per model–dataset pair.
#### Acquiescence bias.
The EP/ER decomposition (Table [3](https://arxiv.org/html/2606.28050#S5.T3)) reveals opposite biases: Llama exhibits EP $>$ ER on all four datasets, most pronounced on DROP ($74.5/61.8$) and MuSiQue ($67.5/51.4$), indicating a *conservative bias* that over-predicts Incorrect and directly explains its severe negative $\Delta$ on HotpotQA. GPT-4o-mini shows the inverse on three of four datasets, most pronounced on MuSiQue (ER $=88.3$%, EP $=70.8$%): a mild over-acceptance bias that inflates EA there and partially drives $\Delta>0$. Counter-intuitively, the stronger model is the more over-accepting evaluator.
### 5.3 R3: How Does Reasoning Complexity Modulate the Gap?
MuSiQue’s per-question hop annotations let us stratify the gap by reasoning depth within a single dataset, an analysis impossible with HotpotQA, which is uniformly 2-hop.
Table 4: Generation–evaluation gap on MuSiQue stratified by hop count. The sharp $\Delta$ increase from 2 to 3 hops for Llama reflects generation quality degrading faster than evaluation accuracy as depth grows.
Table [4](https://arxiv.org/html/2606.28050#S5.T4) shows the evolution of $\Delta$ across hop counts. For Llama, $\Delta$ transitions sharply from $-1.1$ at 2 hops to $+10.7$ at 3 hops and $+7.7$ at 4 hops: generation accuracy falls to 55.0% and 48.7% at 3 and 4 hops respectively, while evaluation accuracy remains comparatively stable (65.7%, 56.4%). This asymmetry is consistent with a *candidate-answer-insulation* hypothesis: when $a$ is already present in the evaluation prompt, the model need not reconstruct the full reasoning chain, gaining a partial advantage over generation at high depth. GPT-4o-mini exhibits a consistently positive $\Delta$ at all hop counts ($+4.6$, $+2.1$, $+5.1$), with less variation, reflecting its greater robustness on both tasks. The modest compression at 3 hops for both models suggests a qualitative threshold between 2-hop and higher-order reasoning rather than smooth linear degradation.
### 5.4 R4: What Internal Mechanisms Underlie the Gap?
To probe the mechanism behind the observed gap, we analyze last-token attention patterns of Llama-3.1-8B-Instruct on $\mathcal{T}_{\text{gen}}$ and $\mathcal{T}_{\text{eval}}$ across the 184 jointly verified samples (Section [5.1](https://arxiv.org/html/2606.28050#S5.SS1)). The single forward pass per prompt extracts the attention distribution that the model assigns over input positions when predicting the first output token. Attention is averaged over all heads and over layers 24–31. We measure the fraction of attention mass directed at three labeled spans: context $c$, question $q$, and candidate answer $a$ ($\mathcal{T}_{\text{eval}}$ only). The remaining mass, typically 80–98%, is absorbed by tokens outside these spans (system prompt, chat template markers, instruction text).
Figure 2: Mean last-token attention fraction directed to context ($c$) and candidate answer ($a$) for $\mathcal{T}_{\text{gen}}$ and $\mathcal{T}_{\text{eval}}$, averaged over layers 24–31 and 184 jointly verified Llama-3.1-8B-Instruct samples. $\mathcal{T}_{\text{eval}}$ consistently de-attends to context by 3–5$\times$ relative to $\mathcal{T}_{\text{gen}}$ and allocates negligible attention (0.3–0.5%) to the candidate answer it is judging.
Figure [2](https://arxiv.org/html/2606.28050#S5.F2) reports the mean attention fractions. $\mathcal{T}_{\text{gen}}$ directs 11.3–20.7% of last-token attention to the context passage, with the fraction growing with task complexity (SQuAD 2.0/DROP: 11.3% $<$ HotpotQA: 16.7% $<$ MuSiQue: 20.7%), consistent with multi-hop generation requiring deeper context engagement. $\mathcal{T}_{\text{eval}}$, by contrast, allocates only 1.4–5.4% of attention to context, a 3–5$\times$ reduction across all four datasets. More strikingly, $\mathcal{T}_{\text{eval}}$ devotes only 0.3–0.5% of attention to the candidate answer $a$ it is tasked with judging.
#### Interpretation\.
The consistent context de-attention in $\mathcal{T}_{\text{eval}}$ is a *structural signature* of the evaluation task: rather than re-reading the passage to verify the answer, the model’s last-token decision relies primarily on structural and instructional tokens. This pattern directly explains why $\Delta$ is negative on three of the four benchmarks. On SQuAD 2.0, DROP, and HotpotQA, $\mathcal{T}_{\text{gen}}$ allocates 11.3–16.7% of last-token attention to the context while $\mathcal{T}_{\text{eval}}$ allocates only 1.4–4.1%, which is insufficient to re-trace the lookup $\mathcal{T}_{\text{gen}}$ performs at synthesis time. The gap widens precisely where context engagement matters most: HotpotQA’s largest negative $\Delta=-14.2$ (Llama) coincides with the largest gen-to-eval reduction in context attention (16.7% $\to$ 4.1%), and SQuAD 2.0’s smaller $\Delta=-3.4$ matches the smallest such reduction (11.3% $\to$ 1.4%) on a task whose synthesis demands are minimal. The lone exception is MuSiQue, where $\Delta>0$ despite an even sharper context-attention drop ($\Delta_{\text{ctx}}=-0.153$); the attention pattern alone is therefore not sufficient to explain this inversion, and we defer a fuller discussion of it to the Limitations section.
#### Extended analyses\.
Appendix [C](https://arxiv.org/html/2606.28050#A3) extends this analysis along three orthogonal axes: paragraph-level selectivity (supporting vs. distractor), answer-mention concentration within the context, and causal ablations of the candidate-answer slot. Together these sharpen the mechanistic picture sketched above.
Table 5: GA, EA, and $\Delta$ (%) under LoRA fine-tuning. Joint trains on all datasets; per-ds on one at a time. EA gains under LoRA-Gen/Both are acquiescence-driven (Table [6](https://arxiv.org/html/2606.28050#S5.T6)), not genuine transfer.
Table 6: EP, ER, and EF1 (%) per checkpoint, Correct as positive. LoRA-Gen/Both drive ER $\to 100\%$ (acquiescence); LoRA-Eval drives it too low. MuSiQue abstention: 13.0% (Gen/Both), 15.4% (Eval).
Table 7: Average gain over base across four datasets (pp), joint adapters only.
### 5.5 R5: Does Cross-Task Transfer Occur Under LoRA Fine-Tuning?
To test whether the parametric structure underlying generation and evaluation is shared, we fine-tune Llama-3.1-8B-Instruct with LoRA in three configurations (LoRA-Gen, LoRA-Eval, LoRA-Both; see Section [3.3](https://arxiv.org/html/2606.28050#S3.SS3)) and evaluate each on both tasks across all four datasets.
#### Hypotheses\.
We pre-register three expectations: (i) LoRA-Gen should raise GA; whether EA follows tests generation-to-evaluation transfer. (ii) LoRA-Eval should raise EA; whether GA improves tests the reverse. (iii) LoRA-Both tests whether joint training delivers both gains or one ability dominates the parameter budget. Asymmetric outcomes would indicate the two abilities draw on overlapping but not identical parametric structure (Dymkiewicz et al., [2025](https://arxiv.org/html/2606.28050#bib.bib19)).
#### LoRA-Gen pushes $\Delta$ non-negative on every dataset, but via acquiescence.
After LoRA-Gen, $\Delta\geq 0$ on all four datasets (Table [5](https://arxiv.org/html/2606.28050#S5.T5)), with the largest swing on HotpotQA ($-14.2\to+0.4$, driven by EA rising from 69.0 to 84.4%). However, the EP/ER decomposition (Table [6](https://arxiv.org/html/2606.28050#S5.T6)) reveals this is entirely a recall-side bias shift: evaluator recall reaches 100% on HotpotQA and MuSiQue, 99.6% on SQuAD 2.0, and 94.8% on DROP, meaning the adapters predict Correct for nearly every input. LoRA-Both shows the same acquiescence pattern ($\mathrm{ER}\geq 97\%$ on every dataset). EA rises only because the underlying GA is high enough that near-uniform Correct responses happen to match the oracle most of the time. This is the acquiescence failure mode flagged in Section [5.2](https://arxiv.org/html/2606.28050#S5.SS2), driven to an extreme by training.
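A short decomposition makes the link between acquiescence and $\Delta$ explicit. It is a sketch that ignores abstentions and introduces one symbol not used in the paper, the true-negative rate $\mathrm{TNR}=P(y=\textsc{Incorrect}\mid y^{*}=\textsc{Incorrect})$. Splitting EA by the oracle label gives

$$\mathrm{EA}=\mathrm{GA}\cdot\mathrm{ER}+(1-\mathrm{GA})\cdot\mathrm{TNR},\qquad \Delta=\mathrm{EA}-\mathrm{GA}=(1-\mathrm{GA})\cdot\mathrm{TNR}-\mathrm{GA}\cdot(1-\mathrm{ER}).$$

When $\mathrm{ER}\to 1$ the second term vanishes, so $\Delta=(1-\mathrm{GA})\cdot\mathrm{TNR}\geq 0$ regardless of how well the evaluator discriminates, and an evaluator that always answers Correct ($\mathrm{TNR}=0$) yields exactly $\mathrm{EA}=\mathrm{GA}$. As an illustration using the HotpotQA figures above, $\mathrm{EA}=84.4\%$ and $\Delta=+0.4$ imply $\mathrm{GA}=84.0\%$; with $\mathrm{ER}=100\%$ this gives $\mathrm{TNR}=0.4/16.0=2.5\%$ under the no-abstention assumption, meaning the adapter rejects almost none of its wrong answers.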
#### LoRA\-Eval degrades both tasks\.
Training only on the evaluation task is the most striking failure mode. GA falls by 16.8 pp on DROP, 17.6 pp on MuSiQue, and 6.6 pp on HotpotQA, even though the adapter never saw a generation objective. EA also drops on SQuAD 2.0 and HotpotQA. The EP/ER pattern is the inverse of LoRA-Gen: ER drops to 34.8% on DROP and 58.8% on MuSiQue, making the model overly conservative. LoRA-Eval as configured is not a usable adapter.
#### The MuSiQue $\Delta > 0$ inversion is structural.
MuSiQue is the only dataset where the base model had $\Delta > 0$. The inversion persists across all three adapters, although its magnitude varies ($+7.8$, $+22.8$, $+9.0$). This persistence across very different training regimes suggests that the inversion reflects the underlying task pair (multi-hop generation is intrinsically harder than verifying a presented answer) rather than a base-model artifact, supporting the candidate-answer-insulation hypothesis of Section [5.3](https://arxiv.org/html/2606.28050#S5.SS3). Detailed per-dataset analyses (why LoRA-Gen fails to improve GA, the HotpotQA evaluator collapse, and per-dataset LoRA-Both calibration) are given in Appendix [D](https://arxiv.org/html/2606.28050#A4).
#### Takeaway\.
Generation SFT suppresses evaluation discrimination (acquiescence, $\mathrm{ER} \to 100\%$); evaluation SFT suppresses generation and can destroy the evaluator on individual datasets. Per-dataset LoRA-Both is the only configuration achieving $|\Delta| \leq 1$ pp without acquiescence on three of four datasets, but it cannot recover base GA on harder datasets. The generation–evaluation asymmetry is an intrinsic property of in-context QA that SFT can redistribute but not eliminate.
## 6 Conclusion
Across four benchmarks and two models, generation accuracy exceeds self\-evaluation on three of four—self\-evaluation is not uniformly easier—with MuSiQue the exception due to a generation\-difficulty ceiling at high hop counts\. Attention analysis reveals a consistent structural cause: the evaluation task attends to context 3–5x less than generation does and barely reads the candidate answer\. LoRA fine\-tuning confirms this asymmetry is not a training artifact: generation fine\-tuning induces over\-acceptance and evaluation fine\-tuning degrades generation\.
Evaluation failures in self-assessment pipelines are structurally rooted in the model’s attention-routing strategy at inference time, not in insufficient training signal. Pipelines relying on self-assessment, especially on multi-hop or numerical tasks, should account for the direction and magnitude of the $\Delta$ gap reported here, and treat acquiescence bias as a real risk whenever generation fine-tuning is part of the training recipe.
## Limitations
The attention analysis is restricted to the open-source Llama-3.1-8B-Instruct; GPT-4o-mini’s internals are inaccessible. We also analyze only the first-token decision point, i.e. the attention distribution at the position predicting the first output token. For short-answer QA this is the most consequential position (gold answers are typically 1–3 tokens, and the output of $\mathcal{T}_{\text{eval}}$ is exactly one token), but the analysis does not generalize directly to long-form generation, where later positions could exhibit qualitatively different attention patterns.
Our mechanistic account is most informative for the three benchmarks where $\Delta < 0$ (SQuAD 2.0, DROP, HotpotQA); we do not isolate a corresponding mechanism for the positive $\Delta$ on MuSiQue. The MuSiQue inversion is more parsimoniously explained by generation accuracy reaching a floor at high hop counts (Section [5.3](https://arxiv.org/html/2606.28050#S5.SS3)) than by any attention-level advantage in $\mathcal{T}_{\text{eval}}$ specific to that dataset. The fact that the model barely attends to the candidate answer is itself surprising and a target for follow-up: it suggests that decisions on $\mathcal{T}_{\text{eval}}$ may often be driven by structural priors rather than by a genuine semantic comparison. Locating the layer at which the candidate slot fuses with the context (e.g. via logit-lens or tuned-lens analysis) is a natural next step we leave to future work.
All prompts are fixed and zero\-shot; results may differ with few\-shot demonstrations or paraphrased prompts, and we leave prompt sensitivity analysis to future work\. All benchmarks are English\-only; cross\-lingual generalization is not guaranteed\. Our framework is scoped to QA settings where answer correctness admits an unambiguous binary label; abstractive or opinion\-based QA presents a fundamentally different evaluation challenge outside our current scope\.
## Ethical Considerations
This work studies the behavior of publicly available LLMs \(Llama\-3\.1\-8B\-Instruct, GPT\-4o\-mini, GPT\-4o\) on existing English\-language QA benchmarks \(SQuAD 2\.0, DROP, HotpotQA, MuSiQue\), all of which are released for research use\. No new data was collected, no human subjects were involved, and the findings are descriptive and analytical in nature\. We do not foresee any direct harms arising from this research\. There are no major ethical concerns with this submission\.
## References
- A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi (2024) Self-RAG: learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations. External Links: [Link](https://arxiv.org/abs/2310.11511)
- Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al. (2022) Constitutional AI: harmlessness from AI feedback. arXiv preprint arXiv:2212.08073. External Links: [Link](https://arxiv.org/abs/2212.08073)
- A. Bavaresco, R. Bernardi, L. Bertolazzi, D. Elliott, R. Fernández, A. Gatt, E. Ghaleb, M. Giulianelli, M. Hanna, A. Koller, A. F. T. Martins, P. Mondorf, V. Neplenbroek, S. Pezzelle, B. Plank, D. Schlangen, A. Suglia, A. K. Surikuchi, E. Takmaz, and A. Testoni (2025) LLMs instead of human judges? A large scale empirical study across 20 NLP evaluation tasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vienna, Austria, pp. 238–255.
- N. Belrose, Z. Furman, L. Smith, D. Halawi, I. Ostrovsky, L. McKinney, S. Biderman, and J. Steinhardt (2023) Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112. External Links: [Link](https://arxiv.org/abs/2303.08112)
- D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner (2019) DROP: a reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2368–2378. External Links: [Link](https://aclanthology.org/N19-1246)
- A. Dubey et al. (2024) The Llama 3 herd of models. arXiv preprint arXiv:2407.21783. External Links: [Link](https://arxiv.org/abs/2407.21783)
- K. Dymkiewicz, I. Vulic, H. Yannakoudakis, E. Shapira, R. Reichart, and A. Korhonen (2025) Donors and recipients: on asymmetric transfer across tasks and languages with parameter-efficient fine-tuning. External Links: 2511.13368
- N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, et al. (2021) A mathematical framework for transformer circuits. Transformer Circuits Thread. External Links: [Link](https://transformer-circuits.pub/2021/framework/index.html)
- M. Geva, J. Bastings, K. Filippova, and A. Globerson (2023) Dissecting recall of factual associations in auto-regressive language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). External Links: [Link](https://aclanthology.org/2023.emnlp-main.751)
- X. Ho, A. Duong Nguyen, S. Sugawara, and A. Aizawa (2020) Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics (COLING), pp. 6609–6625. External Links: [Link](https://aclanthology.org/2020.coling-main.580)
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen (2022) LoRA: low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR). External Links: [Link](https://arxiv.org/abs/2106.09685)
- J. Huang, X. Chen, S. Mishra, H. S. Zheng, A. W. Yu, X. Song, and D. Zhou (2024) Large language models cannot self-correct reasoning yet. In The Twelfth International Conference on Learning Representations. External Links: [Link](https://arxiv.org/abs/2310.01798)
- D. Jiang, J. Zhang, O. Weller, N. Weir, B. Van Durme, and D. Khashabi (2025) SELF-[IN]CORRECT: LLMs struggle with discriminating self-generated responses. In Proceedings of the 39th AAAI Conference on Artificial Intelligence. External Links: [Link](https://ojs.aaai.org/index.php/AAAI/article/view/34603)
- W. Lin, S. Wei, H. Huang, and H. Chen (2025) Do before you judge: self-reference as a pathway to better LLM evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, pp. 24651–24672. External Links: [Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.1342)
- Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu (2023) G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 2511–2522. External Links: [Link](https://aclanthology.org/2023.emnlp-main.153)
- K. Meng, D. Bau, A. Andonian, and Y. Belinkov (2022) Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems (NeurIPS). External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2022/hash/6f1d43d5a82a37e89b0665b33bf3a182-Abstract-Conference.html)
- A. Nandy and S. Bandyopadhyay (2025) Language models of code are few-shot planners and reasoners for multi-document summarization with attribution. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 24930–24938.
- J. Oh, E. Kim, I. Cha, and A. Oh (2024) The generative AI paradox on evaluation: “what it can solve, it may not evaluate”. In Proceedings of the Student Research Workshop at the 18th Conference of the European Chapter of the Association for Computational Linguistics, St. Julian’s, Malta, pp. 248–257. External Links: [Link](https://aclanthology.org/2024.eacl-srw.19)
- C. Olsson, N. Elhage, N. Nanda, N. Joseph, N. DasSarma, T. Henighan, B. Mann, A. Askell, et al. (2022) In-context learning and induction heads. Transformer Circuits Thread. External Links: [Link](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html)
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. (2022) Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, Vol. 35. External Links: [Link](https://arxiv.org/abs/2203.02155)
- A. Panickssery, S. R. Bowman, and S. Feng (2024) LLM evaluators recognize and favor their own generations. In Advances in Neural Information Processing Systems, Vol. 37. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2024/hash/7f1f0218e45f5414c79c0679633e47bc-Abstract-Conference.html)
- P. Rajpurkar, R. Jia, and P. Liang (2018) Know what you don’t know: unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 784–789. External Links: [Link](https://aclanthology.org/P18-2124)
- H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal (2022) MuSiQue: multistep question answering via single-hop question composition. Transactions of the Association for Computational Linguistics 10, pp. 539–554. External Links: [Link](https://aclanthology.org/2022.tacl-1.31)
- Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning (2018) HotpotQA: a dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2369–2380. External Links: [Link](https://aclanthology.org/D18-1259)
- L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica (2023) Judging LLM-as-a-judge with MT-bench and chatbot arena. In Advances in Neural Information Processing Systems, Vol. 36. External Links: [Link](https://arxiv.org/abs/2306.05685)
## Appendix
## Appendix A Reproducibility
#### General setup\.
All models are queried at temperature $\tau = 0$ with greedy decoding. API calls use fixed seeds where supported. Evaluation responses that do not begin with Correct or Incorrect after stripping leading whitespace are treated as abstentions and excluded from metric computation; abstention rates are reported alongside all results.
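The abstention rule above can be sketched as follows. The helper names `parse_verdict` and `evaluation_accuracy` are hypothetical, and case-sensitive prefix matching is an assumption, since the text does not say how casing is handled.

```python
def parse_verdict(response):
    """Map a raw evaluator response to True (Correct), False (Incorrect), or None (abstention)."""
    text = response.lstrip()  # strip leading whitespace only, as described in the text
    if text.startswith("Incorrect"):
        return False
    if text.startswith("Correct"):
        return True
    return None  # anything else counts as an abstention


def evaluation_accuracy(responses, oracle):
    """Return (EA over non-abstaining instances, abstention rate)."""
    verdicts = [parse_verdict(r) for r in responses]
    kept = [(v, o) for v, o in zip(verdicts, oracle) if v is not None]
    abstention_rate = 1 - len(kept) / len(verdicts)
    ea = sum(v == o for v, o in kept) / len(kept) if kept else float("nan")
    return ea, abstention_rate


# Hypothetical responses; the third one is an abstention and is excluded from EA.
responses = ["Correct", "  Incorrect.", "I think so", "Correct"]
oracle = [True, True, True, False]
ea, abstention_rate = evaluation_accuracy(responses, oracle)
```

Because abstentions are dropped from the denominator, EA and the abstention rate have to be read together, which is why the two are reported side by side.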
#### Attention analysis\.
We set `attn_implementation="eager"` in the Llama-3.1-8B-Instruct forward pass to expose raw attention tensors. A single forward pass is performed per prompt; only the last-token attention row is retained to keep memory bounded. Context, question, and answer spans are located via the tokenizer’s character-to-token offset mapping. Attention weights are averaged over all heads and over layers 24–31 (the final eight layers of the 32-layer model).
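The aggregation step can be sketched independently of any model library. The function name `context_attention_fraction`, the nested-list layout, and the toy values below are assumptions for illustration; in practice the last-token rows would come from the model's attention outputs.

```python
def context_attention_fraction(attn_last_row, span, layers=range(24, 32)):
    """Fraction of last-token attention that lands on a token span.

    attn_last_row[layer][head][pos]: attention from the final position to position pos.
    span: (start, end) token indices of the span, end exclusive.
    The result is averaged over all heads and over the given layers.
    """
    start, end = span
    total = 0.0
    rows = 0
    for layer in layers:
        for head_row in attn_last_row[layer]:
            total += sum(head_row[start:end])
            rows += 1
    return total / rows


# Toy input (hypothetical): 32 layers, 2 heads, 4 positions, uniform attention.
toy = [[[0.25, 0.25, 0.25, 0.25] for _ in range(2)] for _ in range(32)]
fraction = context_attention_fraction(toy, span=(1, 3))
```

Since each row sums to one, averaging the span mass over heads and layers gives the same number as first averaging the rows and then summing over the span.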
#### LoRA fine\-tuning\.
We fine-tune Llama-3.1-8B-Instruct with LoRA adapters (rank 16, $\alpha = 32$) applied to `q_proj`, `k_proj`, `v_proj`, and `o_proj`. Training uses cross-entropy loss on response tokens only (`completion_only_loss=True`), 3 epochs, effective batch size 16, learning rate 2e-4 with a cosine schedule, and bfloat16 precision. Hard negatives are generated with up to 5 hallucination attempts per record (Section [3.3](https://arxiv.org/html/2606.28050#S3.SS3)). After training, each adapter is evaluated on the same 500-instance validation split used for the task-asymmetry analysis.
#### Hardware\.
All API-based experiments (oracle scoring, GPT-4o-mini generation and evaluation, oracle reliability study) are run from a standard CPU host. Llama-3.1-8B-Instruct inference, the attention analysis, and all LoRA training and evaluation are run on a server with 4× NVIDIA A100-SXM4-80GB GPUs (CUDA 13.0).
## Appendix B Use of Datasets
This appendix documents the per-experiment sampling choices and the rationale behind them. Table [8](https://arxiv.org/html/2606.28050#A2.T8) lists the full official split sizes, the per-experiment sample counts, and the average context length per dataset.
Table 8: Extended split usage. #Train is the official training split size; #Val (filtered) is the validation size after the dataset-specific filters described in §[4](https://arxiv.org/html/2606.28050#S4) (e.g., answerable only for SQuAD 2.0, number-type only for DROP). LoRA train: per-dataset budget drawn from the train split for LoRA-Gen, LoRA-Eval, and LoRA-Both. Task-asym. / LoRA eval: shared evaluation sample, drawn once and reused across the base model and all LoRA checkpoints for direct comparability. Attention (verified): subset of the 50-per-dataset oracle-reliability sample where GPT-4o and GPT-5.4 agree, used as the basis for the attention analysis (184 total). MuSiQue validation hop distribution: 1,252 / 760 / 405 for 2 / 3 / 4 hops.

#### Cost and compute.
The most expensive operation per evaluation instance is the oracle call (GPT-4o), invoked once for every $(c, q, a, a^{*})$ tuple. Across two models (Llama-3.1-8B-Instruct and GPT-4o-mini) and four datasets, the full filtered validation sets would total roughly 42,000 oracle-bearing instances, against $\sim$4,000 at 500 per (model, dataset). Each task-asymmetry instance further triggers a generation call and a self-evaluation call, tripling the API budget. For the attention analysis the binding constraint is GPU memory rather than API cost: each forward pass with `output_attentions=True` requires $O(L \cdot H \cdot S^{2})$ activations, which for MuSiQue’s $\sim$2–4k-token contexts is already GPU-heavy on an A100. The LoRA fine-tuning runs are similarly compute-bound: three independent adapter trainings, 3 epochs each at effective batch 16, total $\sim$30 GPU-hours even at the 5,000-per-dataset training budget.
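A back-of-the-envelope calculation shows why the $O(L \cdot H \cdot S^{2})$ term matters at these context lengths. The sketch below assumes 32 layers (as stated for this model), 32 attention heads per layer, 2 bytes per value (bfloat16), batch size 1, and that all attention maps are held at once; the head count, the dtype, and the helper name `attention_bytes` are assumptions rather than details given in the text.

```python
def attention_bytes(num_layers, num_heads, seq_len, bytes_per_value=2):
    """Bytes needed to hold every layer's full attention map for one prompt."""
    return num_layers * num_heads * seq_len * seq_len * bytes_per_value


GIB = 1024 ** 3
for seq_len in (2048, 4096):
    gib = attention_bytes(32, 32, seq_len) / GIB
    print(f"S={seq_len}: {gib:.0f} GiB")
# prints:
# S=2048: 8 GiB
# S=4096: 32 GiB
```

Under these assumptions, doubling the context length quadruples the footprint, which is consistent with retaining only the last-token attention row per prompt.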
#### Statistical sufficiency for a binary outcome\.
GA, EA, EP, and ER are binomial proportions, and $\Delta$ is a difference of two of them. At $n = 500$ the 95% confidence half-width of a proportion is at most $\pm 4.4$ pp (at $p = 0.5$) and tighter at the observed values (e.g. $\pm 1.9$ pp at $\mathrm{GA} = 95\%$). The $\Delta$ values reported in Table [2](https://arxiv.org/html/2606.28050#S5.T2) range from $-14.2$ to $+4.0$ pp; on MuSiQue stratified by hop count (Table [4](https://arxiv.org/html/2606.28050#S5.T4)) they reach $+10.7$ pp. The larger gaps lie well outside the noise floor of a 500-instance sample, while the smallest ones (for example $-1.0$ pp on DROP) fall within it and should be read as directional. The smallest cell in the hop analysis (MuSiQue 4-hop, $n = 78$) has a half-width of about $\pm 11$ pp, which is wider than its observed $\Delta = +7.7$, so that cell is indicative only. Increasing $n$ beyond 500 would refine the precision of individual cells but is unlikely to change the sign or qualitative pattern of the larger reported gaps. Binary-outcome inference is unusually efficient in this respect: in contrast to continuous metrics such as ROUGE or BLEU where 500 samples are often borderline, accuracy-style metrics over Bernoulli outcomes are well-resolved at this scale.
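The half-widths quoted above can be reproduced with the normal-approximation (Wald) interval. That the paper uses this particular interval is an assumption; the function name `half_width` is illustrative.

```python
import math


def half_width(p, n, z=1.96):
    """Wald 95% confidence half-width for a binomial proportion, in percentage points."""
    return 100 * z * math.sqrt(p * (1 - p) / n)


print(round(half_width(0.5, 500), 1))  # worst case at n = 500 -> 4.4
print(round(half_width(0.5, 78), 1))   # worst case at n = 78  -> 11.1
```

The bound is widest at $p = 0.5$ and shrinks as the proportion moves toward 0 or 1, so cells with very high or very low accuracy are resolved more tightly than the worst-case figure suggests.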
#### Comparability across models, datasets, and conditions\.
Using a fixed seed and the same 500\-instance sample per dataset across all conditions gives every model and every checkpoint exactly the same evaluation set\. Two consequences follow\. First, comparisons across models \(Llama vs\. GPT\-4o\-mini\) and across LoRA checkpoints \(Base vs\. LoRA\-Gen / LoRA\-Eval / LoRA\-Both\) isolate the variable of interest from sampling variance—if a metric moves, it moved because of the model or the training, not because the underlying instance pool changed\. Second, datasets carry equal weight in cross\-dataset averages: a free\-floating evaluation on full validation sets would let HotpotQA \(6,947 filtered\) and SQuAD 2\.0 \(5,928\) dominate MuSiQue \(2,417\) when reporting any aggregate\.
#### LoRA training budget\.
The 5,000\-per\-dataset training budget \(20,000 total for LoRA\-Gen, 40,000 for LoRA\-Eval after hard negatives, 40,000 for LoRA\-Both\) serves the same comparability goal: it balances the four datasets in joint training, preventing the much larger SQuAD 2\.0 and HotpotQA training pools from dwarfing MuSiQue\. It also sits comfortably within the data regime where rank\-16 LoRA adapters reliably saturate, so the choice trades off little expected adapter quality for substantial training\-cost savings\.
#### Attention\-analysis sub\-sample\.
The further restriction to the 184 jointly verified samples (where GPT-4o and GPT-5.4 agree) is a *methodological* choice rather than a budgetary one. The interpretation of the attention analysis depends on knowing whether the output of $\mathcal{T}_{\text{gen}}$ is genuinely correct and the judgment of $\mathcal{T}_{\text{eval}}$ genuinely right or wrong; restricting to instances where two strong oracles agree removes ambiguity introduced by single-oracle noise. The per-dataset verified counts (49, 46, 44, 45) yield per-cell standard errors of $\sim$2–3 pp on the attention-fraction estimates, comfortably tighter than the 9–15 pp $\Delta_{\text{ctx}}$ effects we report in Section [5.4](https://arxiv.org/html/2606.28050#S5.SS4).
## Appendix C Extended Attention and Causal Ablation Analyses
This appendix reports three additional analyses that decompose the last-token attention findings of Section [5.4](https://arxiv.org/html/2606.28050#S5.SS4) and provide a more granular mechanistic account of why $\Delta < 0$ on three of the four benchmarks. The analyses are: (i) a paragraph-level decomposition of context attention into gold-supporting vs. distractor spans (Appendix [C.1](https://arxiv.org/html/2606.28050#A3.SS1)); (ii) attention to context tokens that mention the candidate answer (Appendix [C.2](https://arxiv.org/html/2606.28050#A3.SS2)); and (iii) causal ablations of the candidate-answer slot in $\mathcal{T}_{\text{eval}}$ (Appendix [C.3](https://arxiv.org/html/2606.28050#A3.SS3)). All experiments use the same Llama-3.1-8B-Instruct model and analysis configuration (layers 24–31, head-averaged, last-token attention row) as Section [5.4](https://arxiv.org/html/2606.28050#S5.SS4). Analyses (i) and (ii) reuse the 184 jointly verified samples; (iii) is run on the full 500-instance subset per dataset for statistical power on accuracy estimates. Across the three analyses, MuSiQue’s positive $\Delta$ is consistently the case the experiments are *not* sufficient to explain; we revisit this limitation at the end of Appendix [C.3](https://arxiv.org/html/2606.28050#A3.SS3).
### C.1 Paragraph-level selectivity: supporting vs. distractor
#### Motivation\.
Section [5.4](https://arxiv.org/html/2606.28050#S5.SS4) reports that $\mathcal{T}_{\text{eval}}$ directs only 1.4–5.4% of last-token attention at the context, a 3–5$\times$ reduction relative to $\mathcal{T}_{\text{gen}}$. A natural hypothesis is that this smaller budget is more *selectively* concentrated on the gold supporting paragraphs (with the candidate answer acting as a retrieval anchor), which would be especially advantageous on MuSiQue, where retrieval cost is highest (2 of 20 paragraphs are gold).
#### Method\.
For HotpotQA and MuSiQue we re-fetch the per-paragraph `is_supporting` annotations from the source HuggingFace datasets, split the concatenated context into paragraphs by the join delimiter used in preprocessing, and sum the head-averaged last-token attention mass into each paragraph’s token range. Per-token concentration normalizes for paragraph length: `sup_per_tok` is the mass on supporting paragraphs divided by the number of supporting-paragraph tokens, and analogously for distractors.
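The two quantities defined in this method can be sketched as below. The function name `selectivity`, the paragraph tuple layout, and the toy attention vector are hypothetical; the input is assumed to be the already head- and layer-averaged last-token attention row.

```python
def selectivity(attn, paragraphs):
    """Return (sup_share, ratio) for one prompt.

    attn: per-token attention (head- and layer-averaged last-token row).
    paragraphs: list of (start, end, is_supporting) token ranges, end exclusive.
    sup_share: fraction of paragraph-level attention on supporting paragraphs.
    ratio: per-token attention on supporting paragraphs divided by that on distractors.
    """
    sup_mass = dis_mass = 0.0
    sup_tok = dis_tok = 0
    for start, end, is_sup in paragraphs:
        mass = sum(attn[start:end])
        if is_sup:
            sup_mass += mass
            sup_tok += end - start
        else:
            dis_mass += mass
            dis_tok += end - start
    sup_share = sup_mass / (sup_mass + dis_mass)
    ratio = (sup_mass / sup_tok) / (dis_mass / dis_tok)
    return sup_share, ratio


# Toy prompt (hypothetical): a 5-token supporting paragraph and a 20-token distractor.
attn = [0.04] * 5 + [0.01] * 20
sup_share, ratio = selectivity(attn, [(0, 5, True), (5, 25, False)])
```

In the toy case both paragraphs receive the same total mass (share 0.5), but the supporting paragraph gets four times more attention per token, which is why the per-token ratio is reported alongside the share.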
Table 9: Paragraph-level selectivity of last-token attention. sup share is the fraction of paragraph-level context attention falling on gold supporting paragraphs (uniform-attention baseline: 0.20 for HotpotQA’s 2-of-10, 0.10 for MuSiQue’s 2-of-20). ratio is the per-token attention concentration on supporting paragraphs divided by the same quantity on distractors; a value of 1 indicates no per-token preference. $\mathcal{T}_{\text{gen}}$ is sharply selective (3.6–4.6$\times$); $\mathcal{T}_{\text{eval}}$ is essentially uniform (1.2–1.4$\times$).
#### Result\.
Table [9](https://arxiv.org/html/2606.28050#A3.T9) reports the per-token concentration ratio. $\mathcal{T}_{\text{gen}}$ allocates 3.6–4.6$\times$ more attention per token to supporting paragraphs than to distractors. $\mathcal{T}_{\text{eval}}$ collapses this ratio to 1.2–1.4$\times$, essentially uniform across the context. The supporting share itself drops similarly: from 0.41 to 0.19 on HotpotQA and from 0.31 to 0.16 on MuSiQue (both eval values lie close to the respective uniform baselines of 0.20 and 0.10).
#### Interpretation\.
On HotpotQA the generator’s 4.59$\times$ per-token preference for supporting paragraphs reflects the targeted retrieval needed to chain the two relevant passages. The evaluator’s near-uniform ratio (1.38$\times$) shows that no equivalent re-retrieval happens at the verification decision point, providing a paragraph-level mechanism for HotpotQA’s large negative $\Delta = -14.2$: the generator selectively reads what it needs to answer; the evaluator does not selectively read what it would need to verify. On MuSiQue, $\mathcal{T}_{\text{eval}}$ exhibits the same collapse in selectivity (1.21$\times$) but $\Delta > 0$, so the inversion is *not* attributable to sharper context routing. As Section [5.3](https://arxiv.org/html/2606.28050#S5.SS3) argues, MuSiQue’s inversion is more parsimoniously explained by generation accuracy bottoming out at higher hop counts.
### C.2 Answer-mention concentration in the context
#### Motivation\.
A second hypothesis is that $\mathcal{T}_{\text{eval}}$ performs needle-in-haystack verification: locate tokens in the context that mention the candidate answer and concentrate attention there. The candidate’s presence in the prompt should make this strategy especially natural for $\mathcal{T}_{\text{eval}}$ relative to $\mathcal{T}_{\text{gen}}$.
#### Method\.
For each prompt we identify token positions in the context that match content words ($\geq 3$ characters) of the candidate answer, case-insensitively and with non-alphanumeric word boundaries (with a fallback to shorter words for single-digit DROP answers). We then compare the per-token attention concentration on those positions to that of the remaining (non-mention) context tokens.
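The matching rule can be sketched with a regular expression. This version works on character spans, which would then be mapped to token positions through the tokenizer's offset mapping; the function name `mention_spans` is hypothetical, and using word length as the only proxy for "content word" is an assumption.

```python
import re


def mention_spans(context, answer, min_len=3):
    """Character spans in `context` that match words of the candidate `answer`."""
    words = [w for w in re.findall(r"[A-Za-z0-9]+", answer) if len(w) >= min_len]
    if not words:
        # Fallback for short answers such as single digits.
        words = re.findall(r"[A-Za-z0-9]+", answer)
    spans = []
    for word in words:
        # Require non-alphanumeric characters (or string edges) on both sides.
        pattern = rf"(?<![A-Za-z0-9]){re.escape(word)}(?![A-Za-z0-9])"
        spans += [m.span() for m in re.finditer(pattern, context, flags=re.IGNORECASE)]
    return sorted(set(spans))


paris = mention_spans("Paris is in France. PARIS!", "Paris")
digit = mention_spans("He scored 7 of 17 goals", "7")
```

The boundary check keeps a single-digit answer such as 7 from matching inside 17, which matters for numeric DROP answers.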
Table 10: Per-token attention concentration ratio between context tokens that match content words of the candidate answer (“mentions”) and the remainder of the context. A value of 1 indicates no per-token preference. $\mathcal{T}_{\text{gen}}$ concentrates sharply on answer-mention positions (5–23$\times$); $\mathcal{T}_{\text{eval}}$ is near-uniform (1.0–2.1$\times$).

Figure 3: Per-token attention ratio on answer-mention tokens vs. non-mention context tokens. $\mathcal{T}_{\text{gen}}$ exhibits sharp needle-in-haystack lookup behaviour; $\mathcal{T}_{\text{eval}}$ does not.
#### Result\.
Table [10](https://arxiv.org/html/2606.28050#A3.T10) and Figure [3](https://arxiv.org/html/2606.28050#A3.F3) show that $\mathcal{T}_{\text{gen}}$ concentrates 5–23$\times$ more attention per token on answer-mention positions than on the rest of the context, with the ratio largest on the multi-hop benchmarks (HotpotQA 17.24, MuSiQue 22.66) where the answer span must be located within long, distractor-heavy contexts. $\mathcal{T}_{\text{eval}}$ collapses this ratio to 1.0–2.1$\times$.
#### Interpretation\.
$\mathcal{T}_{\text{gen}}$, not $\mathcal{T}_{\text{eval}}$, is the task that performs the answer-locating lookup. This asymmetry provides a complementary mechanistic explanation for $\Delta < 0$ on all three benchmarks where the gap is negative. The generator’s sharp lookup converts context engagement into a correct span, most strongly on the multi-hop HotpotQA (17.24$\times$, $\Delta = -14.2$) and the extractive SQuAD 2.0 (13.03$\times$, $\Delta = -3.4$), with a milder but qualitatively identical pattern on DROP (5.05$\times$, $\Delta = -1.0$). The evaluator’s near-uniform attention leaves it unable to perform the same lookup at verification time, so subtly-correct answers are misjudged as wrong (and vice versa) more often than the generator misses them in the first place. On MuSiQue the eval-side ratio is similarly diffuse (2.12$\times$) yet $\Delta > 0$, again indicating that the inversion is not produced by the attention pattern at the last-token decision and is instead consistent with the generation-floor account of Section [5.3](https://arxiv.org/html/2606.28050#S5.SS3). Together with Appendix [C.1](https://arxiv.org/html/2606.28050#A3.SS1), the picture across negative-$\Delta$ datasets is consistent: the last-token attention of $\mathcal{T}_{\text{eval}}$ is *structurally diffuse*, not selectively targeted at task-relevant content, and this diffuseness plausibly costs the evaluator the accuracy that generation enjoys.
### C.3 Causal ablations of the candidate-answer slot
#### Motivation\.
The analyses above are correlational. To probe *causally* how much of the behavior of $\mathcal{T}_{\text{eval}}$ is driven by the candidate answer $a$ in the prompt, we intervene on the candidate-answer slot in two complementary ways.
- C-MASK replaces $a$ with the placeholder `[REDACTED]`. The original oracle verdict remains the reference label, so EA under C-MASK measures how often the model recovers the right verdict *without* access to the candidate.
- C-SWAP replaces $a$ with the gold answer of another instance from the same dataset, selected by a fixed index offset for reproducibility (swap-collisions with the original gold are dropped). The new candidate is almost always wrong, so the reference label is Incorrect; the rejection rate measures whether $\mathcal{T}_{\text{eval}}$ is performing real verification or merely rubber-stamping the candidate.
Both ablations are run on all 500 instances per dataset; the binary judgment is recovered from a single forward pass by comparing the logits of the first BPE tokens of “Correct” and “Incorrect”\.
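The C-SWAP construction and the logit-based readout can be sketched as follows. The offset value, the wrap-around indexing, the case-insensitive collision test, and the tie-breaking toward Incorrect are all assumptions; the function names are hypothetical, and the logits would come from the model's single forward pass.

```python
def build_c_swap(golds, offset=1):
    """Pair each instance index with another instance's gold answer.

    Returns (index, swapped_candidate) pairs; instances whose swapped candidate
    collides with their own gold answer are dropped.
    """
    n = len(golds)
    pairs = []
    for i, gold in enumerate(golds):
        swapped = golds[(i + offset) % n]  # fixed offset, wrapping at the end
        if swapped.strip().lower() == gold.strip().lower():
            continue  # collision: the swapped answer would still be correct
        pairs.append((i, swapped))
    return pairs


def judge_from_logits(logit_correct, logit_incorrect):
    """Binary judgment from the first-token logits of the two labels."""
    return "Correct" if logit_correct > logit_incorrect else "Incorrect"


def rejection_rate(judgments):
    """Fraction of judgments that say Incorrect (the reference label under C-SWAP)."""
    return sum(j == "Incorrect" for j in judgments) / len(judgments)


# Toy gold answers (hypothetical); with offset 2, instances 0 and 2 collide.
golds = ["Paris", "1912", "paris", "Nile"]
pairs = build_c_swap(golds, offset=2)
judgments = [judge_from_logits(-1.2, 3.4), judge_from_logits(0.7, 0.1)]
rate = rejection_rate(judgments)
```

Reading the verdict from two logits avoids free-form decoding, so every instance yields a judgment and no abstention handling is needed for the ablations.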
Table 11: C-MASK ablation: candidate $a$ replaced by [REDACTED]. $\Delta=\mathrm{EA}_{\text{C-MASK}}-\mathrm{EA}_{\text{base}}$ (all four are drops). flip: fraction of instances where the C-MASK judgment differs from baseline. MuSiQue’s EA collapses to exactly chance (50.0%).

Table 12: C-SWAP ablation: candidate $a$ replaced by another instance’s gold answer (same dataset; 9 collisions with the original gold were dropped). Reference label is Incorrect; Rej.% is the fraction of judgments that correctly say Incorrect. The final column conditions on the baseline judgment being Correct, isolating cases where the model flipped from endorsement to rejection. C-SWAP rejection on MuSiQue is universal under this conditioning.
#### Result\.
C-MASK (Table [11](https://arxiv.org/html/2606.28050#A3.T11)) drops EA by 5.8–23.4 pp and flips 31–39% of judgments; the candidate-answer slot is materially load-bearing on every dataset. MuSiQue’s EA collapses from 59.2% to chance (50.0%) under C-MASK, the strongest single signal of candidate dependence in the experiment. C-SWAP (Table [12](https://arxiv.org/html/2606.28050#A3.T12)) rejects the swapped (incorrect) candidate on 95.7–99.8% of instances, and *universally* (100%) on the MuSiQue subset where the baseline had said Correct.
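As a reading aid for the quantities reported in the two ablation tables, the sketch below computes the EA change, the flip rate, and the overall and conditional rejection rates from lists of verdicts. The helper names and the five-instance toy data are assumptions made for illustration; none of the numbers are the paper's.

```python
from typing import List, Tuple


def accuracy(judgments: List[str], labels: List[str]) -> float:
    """Percentage of judgments that match the reference labels."""
    return 100.0 * sum(j == l for j, l in zip(judgments, labels)) / len(labels)


def flip_rate(base: List[str], ablated: List[str]) -> float:
    """Percentage of instances whose judgment changes under the ablation."""
    return 100.0 * sum(b != a for b, a in zip(base, ablated)) / len(base)


def rejection_rates(base: List[str], swapped: List[str]) -> Tuple[float, float]:
    """Overall C-SWAP rejection, and rejection given a baseline 'Correct'."""
    rejected = [s == "Incorrect" for s in swapped]
    overall = 100.0 * sum(rejected) / len(rejected)
    cond = [r for r, b in zip(rejected, base) if b == "Correct"]
    conditional = 100.0 * sum(cond) / len(cond) if cond else float("nan")
    return overall, conditional


# Toy verdicts (hypothetical).
oracle = ["Correct", "Correct", "Correct", "Incorrect", "Incorrect"]
base = ["Correct", "Correct", "Correct", "Incorrect", "Correct"]
masked = ["Incorrect", "Incorrect", "Correct", "Incorrect", "Incorrect"]
swapped = ["Incorrect", "Incorrect", "Correct", "Incorrect", "Incorrect"]

delta_ea = accuracy(masked, oracle) - accuracy(base, oracle)
flips = flip_rate(base, masked)
overall_rej, cond_rej = rejection_rates(base, swapped)
print(delta_ea, flips, overall_rej, cond_rej)
```

Only flipped instances can change correctness, so the flip rate is an upper bound on the magnitude of the EA change. Flips toward and away from the oracle label partly cancel, which is how a flip rate of 31–39% can coexist with a smaller EA drop.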
#### Interpretation\.
The two probes are mutually supportive. C-SWAP shows that $\mathcal{T}_{\text{eval}}$ *is* performing genuine verification when the candidate is unambiguously wrong (it does not rubber-stamp), and C-MASK shows that the candidate slot is also *required*: when it is removed, EA degrades sharply on every dataset. The evaluator’s strategy is therefore best characterized as *candidate-anchored shallow verification*: a fast check anchored at the candidate-answer slot in the prompt rather than at the last-token attention row over the context. This account directly explains the consistent pattern of $\Delta<0$ on SQuAD 2.0, DROP, and HotpotQA: when the generator produces a near-miss, the evaluator’s shallow check is insufficient to detect the subtle error, even though, as C-SWAP confirms, it would readily reject an unrelated answer. Combined with Appendices [C.1](https://arxiv.org/html/2606.28050#A3.SS1)–[C.2](https://arxiv.org/html/2606.28050#A3.SS2), the mechanistic picture across these three datasets is consistent: verification is integrated into the residual stream by earlier layers around the candidate-answer slot, after which the last-token attention serves mainly to read out a decision that has already been formed, with the evaluator paying for that efficiency in lost accuracy relative to the more attention-engaged generator.
#### Note on MuSiQue\.
The eval-side attention pattern on MuSiQue is qualitatively the same as on the negative-$\Delta$ datasets (diffuse paragraph attention, low mention concentration), the C-MASK drop is in the same direction (post-ablation EA reaches exactly chance), and C-SWAP rejection is similarly near-universal. The three analyses thus do not isolate a MuSiQue-specific mechanism; we discuss this limitation and the most plausible alternative account in the Limitations section.
### C.4 Additional multi-hop dataset: 2WikiMultiHopQA
#### Motivation\.
The four benchmarks of the main analysis include two multi-hop datasets (HotpotQA, MuSiQue) that already display divergent behavior: $\Delta<0$ on HotpotQA but $\Delta>0$ on MuSiQue. To test whether the mechanistic findings of Section [5.4](https://arxiv.org/html/2606.28050#S5.SS4) generalize beyond these two datasets, we replicate the task-asymmetry and attention analyses on a third multi-hop benchmark, 2WikiMultiHopQA (Ho et al., [2020](https://arxiv.org/html/2606.28050#bib.bib15)), which has the additional benefit of providing human-annotated question-type labels (bridge-comparison, comparison, compositional, inference). This enables a within-dataset breakdown of how reasoning structure modulates the gap and the underlying attention pattern.
#### Method\.
We sample 500 validation instances from 2WikiMultiHopQA following the same protocol used in Section [4](https://arxiv.org/html/2606.28050#S4), run Llama-3.1-8B-Instruct as $\mathcal{L}$ for both $\mathcal{T}_{\text{gen}}$ and $\mathcal{T}_{\text{eval}}$ at $\tau=0$, and score the generations with GPT-4o as the oracle $\mathcal{L}^{*}$. The attention analysis follows Appendix [C](https://arxiv.org/html/2606.28050#A3): attn_implementation="eager", last-token attention row, head-averaged, restricted to layers 24–31. We use all $n=495$ valid instances (those with a non-error oracle verdict) rather than the GPT-4o $\cap$ GPT-5.4 verified subset; the GPT-5.4 endpoint used in Section [5.1](https://arxiv.org/html/2606.28050#S5.SS1) was unavailable for this supplementary run.
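The aggregation described above (last-token row, averaged over heads, restricted to a layer range) can be illustrated with a small, dependency-free sketch. The function name, the nested-list layout, and the toy values are assumptions; with Hugging Face Transformers, rows of this kind can typically be read from the per-layer tensors returned when a model loaded with attn_implementation="eager" is called with output_attentions=True, though the authors' exact extraction code is not shown in the paper.

```python
from typing import List, Sequence, Tuple

Row = List[float]        # last-token attention over all key positions, one head
LayerRows = List[Row]    # one row per head


def span_attention_fraction(
    last_rows: Sequence[LayerRows],
    span: Tuple[int, int],
    layers: Sequence[int],
) -> float:
    """Percentage of last-token attention that lands on key positions
    [start, end), averaged over heads and then over the chosen layers."""
    start, end = span
    layer_means = []
    for layer in layers:
        head_masses = [sum(row[start:end]) for row in last_rows[layer]]
        layer_means.append(sum(head_masses) / len(head_masses))
    return 100.0 * sum(layer_means) / len(layer_means)


# Toy input: 2 layers x 2 heads x 4 key positions; each row sums to 1.
toy = [
    [[0.1, 0.2, 0.3, 0.4], [0.25, 0.25, 0.25, 0.25]],
    [[0.7, 0.1, 0.1, 0.1], [0.4, 0.2, 0.2, 0.2]],
]
context_span = (1, 3)  # hypothetical token span of the context passage

frac_layer0 = span_attention_fraction(toy, context_span, layers=[0])
frac_layer1 = span_attention_fraction(toy, context_span, layers=[1])
frac_both = span_attention_fraction(toy, context_span, layers=[0, 1])
print(frac_layer0, frac_layer1, frac_both)
```

For a 32-layer model, the restriction to layers 24–31 corresponds to passing layers=range(24, 32). The same function applied to the candidate-answer span instead of the context span gives the second quantity reported in the attention table.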
#### Behavioral result\.
Table [13](https://arxiv.org/html/2606.28050#A3.T13) reports the task-asymmetry metrics. The overall $\Delta=-11.1$ on 2WikiMultiHopQA places this benchmark in the negative-$\Delta$ regime, qualitatively closer to HotpotQA ($-14.2$) than to MuSiQue ($+3.6$). However, the within-dataset breakdown reveals substantial heterogeneity: three of the four question types yield negative $\Delta$ (with bridge-comparison reaching $-25.2$), while the inference subset, which requires composing a one-hop deduction over an explicitly stated relation, exhibits a strongly positive $\Delta=+21.6$. This mirrors the MuSiQue inversion within a single dataset and isolates it to a specific reasoning structure rather than a dataset-level artifact.
Table 13: 2WikiMultiHopQA task asymmetry for Llama-3.1-8B-Instruct, overall and by question type. The inference subset is the only one with $\Delta>0$, replicating the MuSiQue-style inversion within a single multi-hop dataset.
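The per-question-type breakdown can be sketched as below. The sign convention $\Delta=\mathrm{EA}-\mathrm{GA}$ matches the figures quoted in this appendix; the readings of GA as the share of oracle-Correct generations and of EA as the share of self-evaluations that agree with the oracle are assumptions based on the main text, and the record layout and toy data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def asymmetry_by_group(records: List[dict]) -> Dict[str, dict]:
    """GA, EA and delta = EA - GA (percentage points), overall and per qtype.

    Each record has 'qtype', 'oracle' (y*) and 'self_eval' (y), with the
    latter two taking the values "Correct" or "Incorrect".
    """
    groups = defaultdict(list)
    for r in records:
        groups[r["qtype"]].append(r)
        groups["overall"].append(r)
    out = {}
    for name, rs in groups.items():
        ga = 100.0 * sum(r["oracle"] == "Correct" for r in rs) / len(rs)
        ea = 100.0 * sum(r["self_eval"] == r["oracle"] for r in rs) / len(rs)
        out[name] = {"n": len(rs), "GA": ga, "EA": ea, "delta": ea - ga}
    return out


def rec(qtype: str, oracle: str, self_eval: str) -> dict:
    return {"qtype": qtype, "oracle": oracle, "self_eval": self_eval}


C, I = "Correct", "Incorrect"
toy = [
    rec("inference", C, C), rec("inference", I, I),
    rec("inference", I, I), rec("inference", I, C),
    rec("comparison", C, C), rec("comparison", C, I),
    rec("comparison", C, I), rec("comparison", I, I),
]
result = asymmetry_by_group(toy)
for name, row in result.items():
    print(name, row)
```

In the toy data the low-GA group has a positive gap and the high-GA group a negative one, which is the qualitative shape of the inference versus bridge-comparison contrast reported above; the toy magnitudes carry no meaning.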
#### Attention result\.
Table [14](https://arxiv.org/html/2606.28050#A3.T14) reports the last-token attention fractions. The aggregate pattern of Section [5.4](https://arxiv.org/html/2606.28050#S5.SS4) reproduces cleanly: $\mathcal{T}_{\text{eval}}$ allocates only 3.8% of last-token attention to context, compared with 15.6% for $\mathcal{T}_{\text{gen}}$, a $4.08\times$ drop, squarely within the 3–5$\times$ range reported on the four primary benchmarks. The candidate answer $a$ receives only 0.33% of attention overall (range 0.21–0.43% across question types), matching the 0.3–0.5% range from the main analysis. The context-attention de-allocation is consistent across all four question types ($-8.5$ to $-14.2$ pp).
Table 14: 2WikiMultiHopQA last-token attention fractions (%) directed to context $c$ and candidate answer $a$, averaged over layers 24–31 and all heads. $\Delta_{\text{ctx}}$ is the eval-minus-gen difference on context (in percentage points). The 3–5$\times$ context de-attention pattern of Section [5.4](https://arxiv.org/html/2606.28050#S5.SS4) reproduces uniformly across all four question types.
#### Interpretation\.
The structural attention signature of $\mathcal{T}_{\text{eval}}$, a 3–5$\times$ reduction in context attention coupled with negligible attention on the candidate answer, generalizes to a third multi-hop dataset, supporting its status as a general property of the evaluation task rather than a feature of any single benchmark. The within-dataset breakdown additionally suggests that the sign of $\Delta$ is governed by an interaction between this structural pattern (which is broadly invariant across reasoning types) and the difficulty of the generation task itself: on inference questions, where Llama’s generation accuracy collapses to 29.4%, evaluation becomes easier than generation despite the same attention de-allocation. This is consistent with the generation-floor account proposed for MuSiQue in Section [5.3](https://arxiv.org/html/2606.28050#S5.SS3), and indicates that the same mechanism produces both signs of the gap depending on where generation accuracy lands.
#### Caveat and outlook\.
Two qualifications constrain the conclusions of this section. First, the $n=495$ samples here are not filtered through a GPT-5.4 super-oracle (Section [5.1](https://arxiv.org/html/2606.28050#S5.SS1)), so the verdicts that enter the analysis are slightly noisier than those of the main paper; we do not expect this to invert any sign, but exact magnitudes (especially in the small inference subset, $n=51$) should be treated as indicative. Second, this remains a single additional dataset and a single model, with one specific multi-hop ontology of question types; the bridge-comparison vs. inference dissociation observed here is suggestive but does not yet establish a typology. More broadly, the pattern emerging across our three multi-hop benchmarks (HotpotQA, MuSiQue, and 2WikiMultiHopQA) is that the mechanistic attention signature is robust, but the behavioral $\Delta$ depends sensitively on the underlying reasoning structure and on the model’s generation ceiling for that structure. Disentangling these two factors, ideally with controlled variations of hop count, distractor density, and supporting-fact configuration on a common context substrate, and with stronger models, is a natural direction for future work.
## Appendix D Detailed LoRA Transfer Analysis
This appendix provides the per\-dataset analyses that underlie the summary results of Section[5\.5](https://arxiv.org/html/2606.28050#S5.SS5)\.
#### Why does LoRA\-Gen not improve GA?
Contrary to the pre-experiment expectation, LoRA-Gen *lowers* GA on three of four datasets (SQuAD 2.0: $-1.6$, DROP: $-2.2$, MuSiQue: $-6.6$); only HotpotQA improves ($+0.8$). Three mechanisms plausibly contribute. (i) Saturation: Llama-3.1-8B’s pretraining and instruction-tuning data very likely cover these benchmarks closely, leaving little headroom (SQuAD 2.0 starts at GA = 95.6%). (ii) Format shift: cross-entropy on response tokens pushes the model toward the exact gold-string form, which can be more brittle to the oracle’s paraphrase-tolerant scoring than the base model’s pre-finetuning output. (iii) Joint-training interference: the four datasets span extractive, numerical, 2-hop, and 2–4-hop reasoning with very different response distributions; the rank-16 adapter must encode all four styles simultaneously, and MuSiQue, the most heterogeneous response distribution, shows the largest GA regression ($-6.6$ pp), consistent with averaging across styles. Training loss converges smoothly on all three runs (final 1.29, 1.16, 1.16 for LoRA-Gen, LoRA-Eval, LoRA-Both); validation loss, tracked with a 500-instance held-out split in the per-dataset experiments below, is monotonically decreasing across all twelve per-dataset adapters, ruling out overfitting.
#### Per\-dataset adapters confirm over\-acceptance is intrinsic, not interference\-driven\.
To isolate mechanism (iii), we train one adapter per dataset per task variant (twelve adapters total; Table [5](https://arxiv.org/html/2606.28050#S5.T5), per-ds rows). If joint-training interference were the primary cause of GA regression, per-dataset LoRA-Gen should substantially recover GA. Instead, GA regressions persist on SQuAD 2.0 ($-3.2$ pp vs. base), HotpotQA ($-0.2$ pp), and MuSiQue ($-7.2$ pp); only DROP improves ($+0.6$ pp, 64.0 vs. 63.4 base), suggesting that dataset had genuine headroom that joint training was suppressing. Crucially, evaluator recall remains near 100% under per-dataset LoRA-Gen on SQuAD 2.0 and HotpotQA, and the MuSiQue adapter, despite having ER = 40.6% and a conservative 13% abstention rate, still shows a GA regression of 7.2 pp. Over-acceptance is therefore an intrinsic consequence of generation-only SFT rather than a multi-dataset averaging artifact.
#### Per\-dataset LoRA\-Eval exposes a HotpotQA evaluator collapse\.
Per-dataset LoRA-Eval degrades GA even more severely than the joint variant (MuSiQue: 36.8% vs. 38.0%; DROP: 41.4% vs. 46.6%), confirming that evaluation-only SFT universally suppresses generation. The most striking finding is HotpotQA: the per-dataset evaluator collapses to EA = 47.5% ($\Delta=-29.7$ pp), well below the base model’s 69.0%, despite GA remaining reasonable at 77.2%. The joint variant, trained on all four datasets, had EA = 63.6% on HotpotQA, worse than base but far less catastrophic. Isolating to HotpotQA therefore *worsens* the evaluator on that very dataset, which is paradoxical if evaluation is task-specific. The dissociation (competent generation, broken evaluation) on the same model and dataset is consistent with 2-hop reasoning drawing on representational resources that are shared between the two tasks; evaluation-only SFT disrupts that shared substrate without the counterbalance of a generation signal.
#### Per\-dataset LoRA\-Both is the only calibrated regime\.
Joint LoRA-Both already reduced GA regressions relative to LoRA-Gen, but its ER remained near 100% (Table [6](https://arxiv.org/html/2606.28050#S5.T6)), indicating persistent over-acceptance. Per-dataset LoRA-Both breaks this pattern: $|\Delta|\leq 1$ pp on SQuAD 2.0 ($-0.2$), DROP ($-0.8$), and HotpotQA ($0.0$), with no ER collapse. MuSiQue retains $\Delta=+9.2$, consistent across all variants and attributable to the structural difficulty of multi-hop generation rather than training bias. However, even the most balanced training regime cannot recover the base GA: MuSiQue falls from 55.6% to 44.3%, an 11.3 pp regression that survives both joint and per-dataset training. The generation–evaluation gap is therefore not an artifact that SFT can straightforwardly correct; it reflects an allocation of parametric capacity that fine-tuning reallocates but does not expand.
## Appendix E Case Studies: Self-Evaluation Error Analysis
We present four representative instances from the base Llama-3.1-8B-Instruct evaluation (SQuAD 2.0 and DROP) illustrating the four possible outcomes of the generation–self-evaluation pipeline. Each example shows the context $c$, question $q$, gold answer $a^{*}$, generated answer $a$, oracle verdict $y^{*}$ (Eq. 3), and self-evaluation judgment $y$ (Eq. 2). Key spans are bolded.
#### True Positive \(TP\) — DROP\.
The model generates the correct answer and correctly judges it as correct\.
> Context: … Baltimore continued to suppress Tampa Bay in the second half, as Stover kicked a **42-yard** field goal in the third quarter to put Baltimore up 20–0. …
>
> Question: How many yards did Stover kick in the third?
>
> Gold ($a^{*}$): 42
>
> Generated ($a$): 42 yards.
>
> Oracle ($y^{*}$): Correct
>
> Self-eval ($y$): Correct $\surd$
The answer is explicitly stated in the context\. The self\-evaluator has no difficulty confirming a direct numerical match\.
#### True Negative \(TN\) — SQuAD 2\.0\.
The model confuses two entities and the self\-evaluator correctly catches the error\.
> Context: **Gasquet** (1908) claimed that the Latin name atra mors (Black Death) first appeared in modern times in 1631 in a book on Danish history by **J.I. Pontanus**: “Vulgo & ab effectu atram mortem vocatibant.” The name spread through Scandinavia and then Germany, gradually becoming attached to the mid 14th-century epidemic as a proper name.
>
> Question: Who allegedly coined the name Black Death?
>
> Gold ($a^{*}$): J.I. Pontanus
>
> Generated ($a$): Gasquet
>
> Oracle ($y^{*}$): Incorrect
>
> Self-eval ($y$): Incorrect $\surd$

The model is misled by sentence position: Gasquet is the grammatical subject of the opening clause while J.I. Pontanus, the alleged coiner, appears only in a subordinate clause. The model returns the historian who *cited* Pontanus rather than Pontanus himself. Crucially, the self-evaluator correctly identifies “Gasquet” $\neq$ “J.I. Pontanus” and flags the answer as incorrect.
#### False Positive \(FP\) — SQuAD 2\.0\.
The model returns a context\-grounded but wrong answer; the self\-evaluator fails to detect the error\.
> Context: British imperialism often used the concept of **Terra nullius** (Latin expression which stems from **Roman law** meaning ‘empty land’). The country of Australia serves as a case study: British settlement was premised on terra nullius, and its settlers considered it unused by its sparse Aboriginal inhabitants.
>
> Question: What law justified British imperialism?
>
> Gold ($a^{*}$): Terra nullius
>
> Generated ($a$): Roman law.
>
> Oracle ($y^{*}$): Incorrect
>
> Self-eval ($y$): Correct $\times$

The model confuses the *origin* of the legal concept (“stems from Roman law”) with the concept itself (Terra nullius). Because “Roman law” appears verbatim in the context and is semantically adjacent to the correct answer, the self-evaluator accepts it as plausible. Detecting the error requires composing the relation “$X$ stems from $Y$” $\neq$ “$X=Y$”, an inference the evaluator fails to make, illustrating how surface-level context grounding misleads self-evaluation.
#### False Negative \(FN\) — SQuAD 2\.0\.
The model generates a factually correct but concise answer; the self\-evaluator over\-penalises the omission of descriptive detail\.
> Context: … Washington left with a small party, picking up along the way **Jacob Van Braam** as an interpreter; **Christopher Gist**, a company surveyor working in the area; and a few **Mingo led by Tanaghrisson**.
>
> Question: Who was added to the party as Washington went on his way?
>
> Gold ($a^{*}$): Jacob Van Braam as an interpreter; Christopher Gist, a company surveyor working in the area; and a few Mingo led by Tanaghrisson
>
> Generated ($a$): Jacob Van Braam, Christopher Gist, and a few Mingo led by Tanaghrisson.
>
> Oracle ($y^{*}$): Correct
>
> Self-eval ($y$): Incorrect $\times$
The generated answer names all three correct parties but omits the role descriptors (“as an interpreter”, “a company surveyor working in the area”). The oracle correctly judges the core factual content as sufficient; the self-evaluator applies a stricter criterion and marks the answer wrong. This illustrates a systematic source of false negatives: the evaluator conflates *completeness* with *correctness*, penalising concise but accurate answers.
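The four outcome labels used in these case studies follow from the pair (oracle verdict $y^{*}$, self-evaluation $y$), with “Correct” treated as the positive class of the evaluator. The sketch below makes that mapping explicit; the function name and the tallying step are illustrative assumptions, and the four pairs are the ones shown in the examples above.

```python
from collections import Counter


def outcome(oracle: str, self_eval: str) -> str:
    """Map (y*, y) to TP / TN / FP / FN, where 'positive' means the
    self-evaluator said "Correct"."""
    said_correct = self_eval == "Correct"
    if self_eval == oracle:
        return "TP" if said_correct else "TN"
    return "FP" if said_correct else "FN"


# (oracle verdict, self-evaluation) for the four case studies, in order.
cases = [
    ("Correct", "Correct"),      # DROP, Stover field goal
    ("Incorrect", "Incorrect"),  # SQuAD 2.0, Gasquet vs. Pontanus
    ("Incorrect", "Correct"),    # SQuAD 2.0, Roman law vs. Terra nullius
    ("Correct", "Incorrect"),    # SQuAD 2.0, Washington's party
]
labels = [outcome(o, s) for o, s in cases]
print(labels)
print(Counter(labels))
```

Under this mapping, over-acceptance corresponds to a surplus of FP outcomes and over-strictness to a surplus of FN outcomes.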