Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the Answer
Summary
This paper investigates the phenomenon where large language models hallucinate despite having the correct answer available in their generation-time distribution. By introducing a semantic notion of answer availability, the authors show that 16-47% of instruction-tuned model hallucinations occur when the correct concept is already represented, and that this rate increases with scale. They identify that instruction tuning sharpens answer commitment, making helpfulness and confident hallucination two sides of the same coin.
# Larger LLMs Misfire Despite Knowing the Answer
Source: [https://arxiv.org/html/2605.22007](https://arxiv.org/html/2605.22007)
## Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the Answer
Jewon Yeom¹, Jaewon Sok², Heejun Kim³, Seonghyeon Park⁴, Jeongjae Park¹, Taesup Kim¹

¹Graduate School of Data Science, Seoul National University; ²Department of Rural Systems Engineering, Seoul National University; ³Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology; ⁴Department of Aerospace Engineering, Seoul National University
###### Abstract
Hallucination is often viewed as a direct consequence of missing knowledge: a model answers incorrectly when the correct answer is absent from its generation-time distribution, and correctly when it is present. We test this assumption by introducing a semantic notion of answer availability that aggregates token-level variants expressing the same answer concept, and asks whether the correct concept is already available at the moment the model commits to an answer. Across Qwen and Llama models from 0.8B to 72B in both Instruct and Base variants, 16–47% of Instruct hallucinations occur with substantial probability mass already on the correct concept, and the rate rises monotonically with scale. Comparing such failures against correct generations with matched semantic support, the distinguishing factor is not whether the correct concept is represented, but how its probability is distributed: correct generations concentrate mass on a single surface form, hallucinations disperse it across alternatives. The same sharpening asymmetry extends across multi-token generation and is detectable in pre-generation hidden states. Together, these results identify a single mechanism: instruction tuning sharpens answer commitment with scale, making helpfulness and confident hallucination two consequences of the same underlying disposition.
## 1 Introduction
Large language models (LLMs) frequently produce fluent but factually incorrect outputs, known as hallucinations, that undermine their reliability in safety-critical applications (Ji et al., [2023](https://arxiv.org/html/2605.22007#bib.bib1)). A natural first question is *where in the generation trajectory* a hallucination is decided. If hallucination emerges everywhere, post-hoc analysis of full sequences is necessary; if it localizes at specific steps, both detection and intervention should target those steps.
Recent work in the reasoning literature suggests sharp localization. Inspecting token-level entropy $H(y_t \mid Q, y_{<t})$ during greedy decoding reveals that entropy is highly non-uniform across the sequence: at most steps it is near zero (the next token is essentially deterministic) but at a small number of steps it spikes sharply. Wang et al. ([2025](https://arxiv.org/html/2605.22007#bib.bib2)) report that $\sim$20% of tokens in chain-of-thought traces carry high entropy and act as “forking tokens” that determine reasoning paths; Vassoyan et al. ([2025](https://arxiv.org/html/2605.22007#bib.bib3)) identify “critical tokens” as decision points where models are most error-prone. Figure [1](https://arxiv.org/html/2605.22007#S1.F1) shows the same phenomenon in a QA setting: an early spike fixes the domain of the answer (Britain), a later spike selects the answer entity (Nicola), and the steps in between are syntactic continuations following automatically.
Figure 1: Token entropy $H(y_t \mid Q, y_{<t})$ across a representative generation trajectory (Qwen3.5-9B Instruct). Entropy is near zero at most steps but spikes sharply at a small number of *commitment steps*.

A natural follow-up is whether entropy at these spikes is itself a hallucination signal. Existing work has established that it is not, in a stronger form than we will need: Simhi et al. ([2025](https://arxiv.org/html/2605.22007#bib.bib14)) document hallucinations produced with high certainty even when the model demonstrably has the correct knowledge, Xu et al. ([2025](https://arxiv.org/html/2605.22007#bib.bib15)) formalize “high-belief hallucinations” as a distinct phenomenon from confidence-based detection, and the original semantic entropy work of Farquhar et al. ([2024](https://arxiv.org/html/2605.22007#bib.bib10)) explicitly notes that confidently wrong outputs are a separate phenomenon from the confabulation regime that semantic entropy targets. The literature’s response has been to develop more sophisticated estimators (Kuhn et al., [2023](https://arxiv.org/html/2605.22007#bib.bib9); Ma et al., [2025](https://arxiv.org/html/2605.22007#bib.bib12)) or perturbation-based diagnostics (Simhi et al., [2025](https://arxiv.org/html/2605.22007#bib.bib14)) that better separate hallucinated from non-hallucinated outputs. We pursue an orthogonal direction: rather than designing a better detector, we ask what the model’s distribution is doing at the commitment step when the final answer is hallucinated, regardless of whether the wrong answer is emitted with high confidence or not.
We refer to the answer-emission step as the *commitment step* $t_c$. In short-form QA with instruction-tuned models, $t_c = 1$ (§[4.1](https://arxiv.org/html/2605.22007#S4.SS1)), so the prompt format fixes the commitment step at $t = 1$ and lets us inspect the distribution at a single, known step. What it reveals is that “high entropy” has two structurally different sources: mass spread over genuinely different answers, and mass spread over different surface forms of the same answer (Paris, Paris, paris, or St, Saint, C for St. Basil’s Cathedral; Figure [2](https://arxiv.org/html/2605.22007#S1.F2)). To distinguish them, we define a *concept* as the equivalence class of token completions denoting the same answer and introduce the *per-step semantic probability mass* $P_{\mathrm{mass}}(t; c) = \sum_{v \in S_c} P_\theta(v \mid Q, y_{<t})$, where $S_c$ collects the first-token IDs of the concept’s surface forms. With $S_c$ built from ground-truth aliases, $P_{\mathrm{mass}}(t; c^{*})$ is an analytical probe that requires the answer concept as input, not a deployable detector, but precisely this property lets us ask whether the model put substantial mass on the right answer at the moment of commitment.
Figure 2: Vocabulary fragmentation at the commitment step: the correct concept’s mass (0.501 total) is split across Saint (0.244), St (0.115), and C (0.131); the greedy token is the competing Mos (0.312). Greedy decoding hides what the distribution says about the correct concept. This pattern is most pronounced in small and Base models; in large Instruct models, fragmentation collapses (§[4.3](https://arxiv.org/html/2605.22007#S4.SS3)) and commitment failures arise instead from a wrong concept’s token being even sharper than the correct concept’s collapsed mass.

The headline finding is that across nine instruction-tuned Qwen and Llama models from 0.8B to 72B, 16% to 47% of hallucinated outputs have $P_{\mathrm{mass}}(t_c; c^{*}) \geq 0.2$: the model assigned non-trivial mass to the correct concept yet produced a wrong final answer. We call these *commitment failures*, and the rate rises monotonically with scale across both Qwen and Llama families. Commitment failures decompose into two cases: in $\sim$20%, the greedy first token does not match any surface form of $c^{*}$ at all (*first-token selection failures*); in the remaining $\sim$80%, the greedy first token does land on a surface form of $c^{*}$ but the continuation diverges (*multi-token divergences*). The first sub-population isolates a particularly clean question: when the model put substantial mass on $c^{*}$ at the commitment step yet selected a token outside $S_{c^{*}}$, what does its distribution look like? We compare against *matched correct samples* (correct outputs with $P_{\mathrm{mass}}(t_c; c^{*}) \geq 0.2$, drawn from the same range of correct-concept mass) and find that selection failures have a three-fold lower maximum probability on any single surface-form token of $c^{*}$ (0.26 vs. 0.78). The model has the same *amount* of mass on the correct concept; it just has it spread across alias forms (Saint, St, C) rather than concentrated on one, so a competing concept’s single dominant token wins the argmax.
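The Figure 2 numbers make the argmax effect concrete. The sketch below is illustrative only: it holds just the four probabilities quoted in the caption, and the names `first_token_probs` and `correct_forms` are hypothetical. The three listed surface forms sum to 0.490, so the remaining 0.011 of the reported 0.501 presumably sits on surface forms the caption does not list.

```python
# Probabilities quoted in the Figure 2 caption; every other token is omitted.
first_token_probs = {"Saint": 0.244, "St": 0.115, "C": 0.131, "Mos": 0.312}
# Surface forms of the correct concept that the caption lists (illustrative set).
correct_forms = {"Saint", "St", "C"}

# Concept-level view: add up the mass on the listed correct surface forms.
listed_correct_mass = sum(
    p for token, p in first_token_probs.items() if token in correct_forms
)
# Token-level view: greedy decoding takes the single most probable token.
greedy_token = max(first_token_probs, key=first_token_probs.get)

print(f"listed correct-concept mass = {listed_correct_mass:.3f}")
print(f"greedy token = {greedy_token} (p = {first_token_probs[greedy_token]:.3f})")
```

The correct concept holds more total mass than the winning token, but no single correct surface form exceeds 0.312, so the greedy choice lands outside the concept.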
The empirical driver of the scale trend is instruction tuning, not scale alone. The probability assigned to the (wrong) greedy token in first-token selection failures rises monotonically across Instruct models (Qwen: 0.31 at 0.8B to 0.57 at 72B; Llama: 0.33 at 1B to 0.49 at 70B) but stays flat at $\sim$0.30 across Base models of the same sizes. The same pattern extends to multi-token answers: in 70B+ Instruct models, $H_{t=2}$ is $\approx 0.05$ when the bigram $(y_1, y_2)$ stays on a valid alias prefix of $c^{*}$ and substantially higher when it diverges (Cohen’s $d = 1.29$ across 18 models), so instruction tuning sharpens commitment specifically along $c^{*}$-aligned phrases. Instruction tuning sharpens commitments at multiple levels, and the same sharpening produces confident correctness when the committed phrase is right and confident misselection when it is wrong. This makes confident hallucination one face of the broader “alignment tax” that has been documented as accuracy loss (Ouyang et al., [2022](https://arxiv.org/html/2605.22007#bib.bib38)), calibration loss (OpenAI, [2023](https://arxiv.org/html/2605.22007#bib.bib35); Hu et al., [2025](https://arxiv.org/html/2605.22007#bib.bib39)), and mode collapse. The analytical probe $P_{\mathrm{mass}}$ (§[3](https://arxiv.org/html/2605.22007#S3)), the commitment-failure phenomenon and its scale dependence (§[4.2](https://arxiv.org/html/2605.22007#S4.SS2)), the within-population characterization (§[4.3](https://arxiv.org/html/2605.22007#S4.SS3)), and the representation-level evidence for instruction-induced sharpening (§[4.4](https://arxiv.org/html/2605.22007#S4.SS4)) together constitute the contribution of this paper.
## 2 Related Work
Confident hallucination and uncertainty-based detection. Token- and sequence-level uncertainty has been the dominant lens for hallucination detection: perplexity (Ren et al., [2023](https://arxiv.org/html/2605.22007#bib.bib4)), length-normalized NLL (Malinin and Gales, [2021](https://arxiv.org/html/2605.22007#bib.bib5)), predictive entropy (Kadavath et al., [2022](https://arxiv.org/html/2605.22007#bib.bib6)), importance-weighted scoring (MARS; Bakman et al., [2024](https://arxiv.org/html/2605.22007#bib.bib7)), semantic entropy and its variants (Kuhn et al., [2023](https://arxiv.org/html/2605.22007#bib.bib9); Farquhar et al., [2024](https://arxiv.org/html/2605.22007#bib.bib10); Ma et al., [2025](https://arxiv.org/html/2605.22007#bib.bib12)), sampling consistency (SelfCheckGPT; Manakul et al., [2023](https://arxiv.org/html/2605.22007#bib.bib11)), consistency-confidence aggregation (CoCoA; Vashurin et al., [2025](https://arxiv.org/html/2605.22007#bib.bib13)), and confidence elicitation (Xiong et al., [2024](https://arxiv.org/html/2605.22007#bib.bib8)). A growing line of work documents that confident hallucination is itself a phenomenon: Farquhar et al. ([2024](https://arxiv.org/html/2605.22007#bib.bib10)) note that semantic entropy does not address confidently wrong outputs, Simhi et al. ([2025](https://arxiv.org/html/2605.22007#bib.bib14)) formalize “CHOKE” (certain hallucinations overriding known evidence), and Xu et al. ([2025](https://arxiv.org/html/2605.22007#bib.bib15)) characterize “delusions” as high-belief hallucinations. Calderon et al. ([2026](https://arxiv.org/html/2605.22007#bib.bib16)) characterize a closely related distinction (“empty shelves” vs. “lost keys”) through behavioral fact-level profiling, finding that recall, not encoding, is the dominant bottleneck even in frontier models; we provide complementary distributional analysis at the commitment step. These works establish the phenomenon at the response level; we provide its structural account at the commitment step (§[4.2](https://arxiv.org/html/2605.22007#S4.SS2), [4.3](https://arxiv.org/html/2605.22007#S4.SS3)).
Calibration and the alignment tax. Kadavath et al. ([2022](https://arxiv.org/html/2605.22007#bib.bib6)) showed pretrained LLMs are well-calibrated under appropriate elicitation, while OpenAI ([2023](https://arxiv.org/html/2605.22007#bib.bib35)) reported that pretraining yields well-calibrated probabilities but RLHF post-training degrades calibration substantially, a finding extended in Xie et al. ([2024](https://arxiv.org/html/2605.22007#bib.bib36)) for the token-level case and Chhikara ([2025](https://arxiv.org/html/2605.22007#bib.bib37)) for the question-type case. More broadly, the “alignment tax” originally framed as a drop in task accuracy after RLHF (Ouyang et al., [2022](https://arxiv.org/html/2605.22007#bib.bib38)) is increasingly understood to include calibration loss and mode collapse: Hu et al. ([2025](https://arxiv.org/html/2605.22007#bib.bib39)) document that alignment makes models overconfident with reduced output diversity, framing this as a calibration–alignment trade-off. We give this its mechanism at the moment of commitment (§[4.3](https://arxiv.org/html/2605.22007#S4.SS3)): the same sharpening drives both confident correctness and confident misselection.
Token-level decision points. Wang et al. ([2025](https://arxiv.org/html/2605.22007#bib.bib2)) show that high-entropy “forking tokens” in chain-of-thought reasoning carry a disproportionate share of the learning signal in RLVR. Vassoyan et al. ([2025](https://arxiv.org/html/2605.22007#bib.bib3)) identify “critical tokens” as decision points where models are most error-prone. We extend the same phenomenon to short-form QA (Figure [1](https://arxiv.org/html/2605.22007#S1.F1)) and show that the relevant signal at the step is concept-grouped mass, not individual-token entropy.
Internal representations and first-token signal. Truthfulness is linearly decodable from hidden states (Burns et al., [2023](https://arxiv.org/html/2605.22007#bib.bib17); Azaria and Mitchell, [2023](https://arxiv.org/html/2605.22007#bib.bib18); Marks and Tegmark, [2024](https://arxiv.org/html/2605.22007#bib.bib19)); DoLa (Chuang et al., [2024](https://arxiv.org/html/2605.22007#bib.bib20)) and ITI (Li et al., [2023](https://arxiv.org/html/2605.22007#bib.bib21)) act on this for decoding-time intervention. SEP (Kossen et al., [2024](https://arxiv.org/html/2605.22007#bib.bib23)) introduces token-before-generation (TBG) probing. Token-level analyses have converged on first-token importance (Snel and Oh, [2025](https://arxiv.org/html/2605.22007#bib.bib24); Zhao et al., [2024](https://arxiv.org/html/2605.22007#bib.bib25)); HaMI (Niu et al., [2025](https://arxiv.org/html/2605.22007#bib.bib26)) adaptively selects informative tokens. We refine this: what matters is not position but the answer-level commitment step (first token in instruction-tuned short-form QA, but migrating in long-form generation; §[4.1](https://arxiv.org/html/2605.22007#S4.SS1)), and we provide the first systematic Instruct–Base comparison for the TBG setting (§[4.4](https://arxiv.org/html/2605.22007#S4.SS4)).
## 3 Setup
Concept and $P_{\mathrm{mass}}$. Given a query $Q$, an autoregressive LLM $P_\theta$ generates tokens $y_1, y_2, \ldots$. A *concept* $c$ is an equivalence class of token completions denoting the same answer; its first-token surface forms collect into a *concept token set* $S_c$. We study the *per-step semantic probability mass*

$$P_{\mathrm{mass}}(t; c) = \sum_{v \in S_c} P_\theta(v \mid Q, y_{<t}), \tag{1}$$

the total mass at step $t$ on any first-token surface form of $c$. To analyze hallucination we set $c = c^{*}$, the ground-truth concept, with $S_{c^{*}}$ built deterministically from the dataset’s aliases (Appendix [I](https://arxiv.org/html/2605.22007#A9)). Under a latent-concept generation model, $P_{\mathrm{mass}}(t; c^{*})$ approximates the (unobservable) *concept belief* $P_\theta(c^{*} \mid Q, y_{<t})$ when $S_{c^{*}}$ is alias-complete and concepts are well-separated; we make this precise in Appendix [A](https://arxiv.org/html/2605.22007#A1) (Proposition [1](https://arxiv.org/html/2605.22007#Thmproposition1)).
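Equation (1) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the actual construction of the concept token set is given in its Appendix I, which is not reproduced here, so `concept_token_set`, `toy_tokenize`, the toy vocabulary, and the token IDs are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Set


def concept_token_set(
    aliases: Iterable[str], tokenize: Callable[[str], List[int]]
) -> Set[int]:
    """Collect the first-token ID of each alias (a stand-in for S_c)."""
    token_ids: Set[int] = set()
    for alias in aliases:
        ids = tokenize(alias)
        if ids:  # skip aliases the tokenizer maps to nothing
            token_ids.add(ids[0])
    return token_ids


def p_mass(next_token_probs: Dict[int, float], token_set: Set[int]) -> float:
    """Eq. (1): sum of next-token probability over the concept token set."""
    return sum(next_token_probs.get(v, 0.0) for v in token_set)


# Hypothetical toy vocabulary and whitespace tokenizer, for illustration only.
TOY_VOCAB = {"Paris": 10, "paris": 11, "London": 12, "France": 13}


def toy_tokenize(text: str) -> List[int]:
    return [TOY_VOCAB[word] for word in text.split() if word in TOY_VOCAB]


s_c = concept_token_set(["Paris", "paris", "Paris France"], toy_tokenize)
step_probs = {10: 0.5, 11: 0.2, 12: 0.3}  # made-up distribution at step t
mass = p_mass(step_probs, s_c)
print(sorted(s_c), round(mass, 3))
```

Using a set for the token IDs means aliases that share a first token (here "Paris" and "Paris France") are counted once, which keeps the sum a valid probability.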
Models and data. Our primary scale ablation uses Qwen3.5 (Yang et al., [2025](https://arxiv.org/html/2605.22007#bib.bib32)) at four sizes (0.8B, 2B, 4B, 9B) in both Instruct and Base variants, all with 4-bit NF4 quantization, paired with Llama-3.2 (1B, 3B) and Llama-3.1 (8B) (Grattafiori et al., [2024](https://arxiv.org/html/2605.22007#bib.bib33)) in both variants: fourteen models total at small to mid scale. We extend the scale ablation to four large models (Qwen2.5-72B (Yang et al., [2024](https://arxiv.org/html/2605.22007#bib.bib31)) and Llama-3.1-70B in both Instruct and Base) for the wrong-token sharpening and commitment-failure rate analyses. We use TriviaQA (Joshi et al., [2017](https://arxiv.org/html/2605.22007#bib.bib27)) and NQ-Open (Kwiatkowski et al., [2019](https://arxiv.org/html/2605.22007#bib.bib28)) for short-form QA (3,000 samples per model) and MMLU (Hendrycks et al., [2021](https://arxiv.org/html/2605.22007#bib.bib29)) together with ARC-Challenge (Clark et al., [2018](https://arxiv.org/html/2605.22007#bib.bib30)) for multiple-choice QA (2,672 samples per model); the representation analyses in §[4.4](https://arxiv.org/html/2605.22007#S4.SS4) use a 1,500-sample subset (1,000 MCQA + 500 Short-QA).
Hallucination. Throughout, a response is a *hallucination* if it fails substring matching against the ground-truth aliases (Short-QA) or selects a wrong option (MCQA); we use “hallucinated” and “incorrect” interchangeably, following the convention of SE (Farquhar et al., [2024](https://arxiv.org/html/2605.22007#bib.bib10)) and SEP (Kossen et al., [2024](https://arxiv.org/html/2605.22007#bib.bib23)).
Metrics. Detection performance is AUROC with hallucination as the positive class. Probes are 5-fold CV logistic regression on hidden states. Calibration uses ECE (Guo et al., [2017](https://arxiv.org/html/2605.22007#bib.bib34)) and Brier score.
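For readers who want the three metrics spelled out, the following is a minimal pure-Python sketch. It makes several assumptions the text does not state: AUROC is computed by pairwise comparison with ties counted as one half, ECE uses ten equal-width bins, and the detection score for hallucination is taken to be one minus the correct-concept mass. The function names and the toy data are hypothetical.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(confidences, outcomes):
    """Mean squared gap between confidence and the 0/1 outcome."""
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(outcomes)


def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, o in zip(confidences, outcomes):
        index = min(int(c * n_bins), n_bins - 1)  # put c == 1.0 in the last bin
        bins[index].append((c, o))
    total = len(outcomes)
    error = 0.0
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(o for _, o in members) / len(members)
            error += len(members) / total * abs(accuracy - mean_conf)
    return error


# Toy data: correct-concept mass used as a confidence, with correctness labels.
mass_values = [0.9, 0.8, 0.3, 0.1]
is_correct = [1, 1, 0, 0]
is_hallucination = [1 - y for y in is_correct]  # hallucination = positive class
detection_scores = [1.0 - m for m in mass_values]

toy_auroc = auroc(detection_scores, is_hallucination)
toy_brier = brier_score(mass_values, is_correct)
toy_ece = expected_calibration_error(mass_values, is_correct)
```

The quadratic pairwise AUROC is fine for a few thousand samples; a rank-based formulation would be the natural swap for larger sets.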
## 4 Results
### 4.1 Where in the trajectory does correctness signal live?
Before turning to the central analysis, we verify the entropy/$P_{\mathrm{mass}}$ picture in our own data using long-form generation, where the commitment step $t_c$ is not at $t = 1$. We collected 500 long-form Qwen3.5-9B Instruct responses (prompt: “Answer in a complete sentence”) and centered each trajectory on its commitment step $t_c$.
Figure [3](https://arxiv.org/html/2605.22007#S4.F3)(a) plots entropy as a function of step relative to $t_c$. Entropy peaks sharply at $t_c$ for both correct and hallucinated samples, and hallucinated trajectories carry uniformly higher entropy at every relative step, but the per-sample $H(t_c)$ distributions overlap substantially (Wilcoxon $p = 0.055$). The max-entropy step $t_H$ exactly matches $t_c$ in only 20% of samples, and falls within one step in 32%: a noisy localization at best.
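The comparison between the max-entropy step and the commitment step can be sketched as follows. This is an illustrative snippet, not the paper's code: entropy is computed in nats (the text does not state the unit), steps are 1-indexed, and the toy trajectory and its commitment step are made up.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def max_entropy_step(step_distributions):
    """Return the 1-indexed step with the highest entropy and all entropies."""
    entropies = [token_entropy(dist) for dist in step_distributions]
    best_index = max(range(len(entropies)), key=entropies.__getitem__)
    return best_index + 1, entropies


# Made-up four-step trajectory; each inner list is a next-token distribution.
toy_trajectory = [
    [1.0],                     # deterministic continuation
    [0.5, 0.5],                # two-way fork
    [0.9, 0.1],                # mostly decided
    [0.25, 0.25, 0.25, 0.25],  # four-way fork
]
toy_commitment_step = 2  # hypothetical annotation of t_c for this trajectory

t_h, entropies = max_entropy_step(toy_trajectory)
exact_match = t_h == toy_commitment_step
within_one = abs(t_h - toy_commitment_step) <= 1
```

In this toy case the entropy peak falls two steps after the annotated commitment step, the kind of mismatch the 20% exact-match figure describes.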
Figure [3](https://arxiv.org/html/2605.22007#S4.F3)(b) re-plots the same trajectories using $P_{\mathrm{mass}}(t; c^{*})$ instead. $P_{\mathrm{mass}}$ is essentially zero everywhere except $t_c$, where it spikes to 0.92 for correct samples and 0.77 for hallucinated ones; generated-token probability $P(y_t)$ is nearly flat across both groups (Appendix Figure [7](https://arxiv.org/html/2605.22007#A8.F7)). The relevant fact is not that $P_{\mathrm{mass}}$ separates the two classes (it has access to ground-truth aliases) but that the gap is small: at the commitment step, hallucinated samples place a substantial 0.77 average mass on the correct concept yet still emit a different one. This sub-population is what we analyze in the rest of the paper.
The commitment step is where the information lives. Figure [3](https://arxiv.org/html/2605.22007#S4.F3)(c) makes this concrete: per-step detection AUROC of $P_{\mathrm{mass}}(t; c^{*})$ is at chance two or more steps before $t_c$, climbs to its peak at $t_c$, and decays back to chance within a few steps after. Generated-token probability $P(y_t)$ over the same window is flat (0.55–0.62 throughout, no peak): token-level confidence barely moves around $t_c$, while concept-grouped mass rises and falls sharply. This is the structural justification for studying the model’s distribution at $t_c$ specifically rather than aggregating across the trajectory.
Figure 3: 500 long-form Qwen3.5-9B Instruct responses aligned to each trajectory’s commitment step $t_c$. *(a) Entropy* peaks at $t_c$ for both groups; hallucinated trajectories run uniformly higher, but per-sample distributions at $t_c$ overlap. *(b) $P_{\mathrm{mass}}(t; c^{*})$* is essentially zero except at $t_c$, where it concentrates to 0.92 (correct) and 0.77 (hallucinated). The relevant fact is not the separation between groups but the high $P_{\mathrm{mass}}$ on the hallucinated side: at the commitment step, the model often has substantial mass on the correct concept yet emits a competing one. *(c) Detection AUROC* of $P_{\mathrm{mass}}$ peaks sharply at $t_c$ and decays to chance off-step, showing that the predictive information about correctness is localized to the commitment step itself.
### 4.2 Does the model have the answer when it hallucinates?
We now turn to short-form QA, where instruction-tuned models answer immediately and the commitment step is fixed at $t = 1$ (Appendix Table [8](https://arxiv.org/html/2605.22007#A12.T8)). This lets us inspect the model’s distribution at a single, known step, and ask the central question: among hallucinated outputs, what does $P_{\mathrm{mass}}(t_c; c^{*})$ look like?
$P_{\mathrm{mass}}$ recovers a coherent confidence signal. $P_{\mathrm{mass}}$ is well-calibrated (ECE 0.023–0.096 across 7 Instruct models; Appendix Figure [9](https://arxiv.org/html/2605.22007#A12.F9)), with accuracy increasing monotonically across $P_{\mathrm{mass}}$ bins, and it assigns substantially more mass to the correct concept than the generated greedy token’s probability whenever the answer has multiple surface forms (Appendix Table [10](https://arxiv.org/html/2605.22007#A12.T10)): it captures the concept-level structure that token-level confidence misses. The per-step result in Figure [3](https://arxiv.org/html/2605.22007#S4.F3)(c) confirms $t_c$ is the right inspection point: aggregating $P_{\mathrm{mass}}$ across the trajectory underperforms the single-step quantity at $t_c$ (Appendix [M](https://arxiv.org/html/2605.22007#A13)).
Commitment failures. A hallucinated sample is a *commitment failure* if $P_{\mathrm{mass}}(t_c; c^{*}) \geq 0.2$: substantial mass on the correct concept yet a wrong final answer. The phenomenon arises because greedy emission depends on the maximum single-token probability, not on $P_{\mathrm{mass}}$, so individual surface-form tokens of $c^{*}$ may each be small enough that a competing concept’s single dominant token wins, or the multi-token continuation may diverge after a correct first token. CF% rises monotonically from 16% at 0.8B to 47% at 70B Instruct (Table [1](https://arxiv.org/html/2605.22007#S4.T1)), holding across both Qwen and Llama families. Larger models do not produce uniformly fewer errors; they shift the error distribution toward commitment failures. The 0.2 threshold is conservative; the trend is robust to threshold choice in $[0.1, 0.4]$ (Appendix [G](https://arxiv.org/html/2605.22007#A7)).
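The commitment-failure rate and a threshold sweep over the stated range can be sketched as below. The record layout (`p_mass`, `correct`) and the toy samples are hypothetical; only the 0.2 default and the $[0.1, 0.4]$ range come from the text.

```python
def commitment_failure_rate(samples, threshold=0.2):
    """Fraction of hallucinated samples whose correct-concept mass >= threshold."""
    hallucinated = [s for s in samples if not s["correct"]]
    if not hallucinated:
        return 0.0
    failures = [s for s in hallucinated if s["p_mass"] >= threshold]
    return len(failures) / len(hallucinated)


# Made-up samples: p_mass is the correct-concept mass at the commitment step.
toy_samples = [
    {"p_mass": 0.90, "correct": True},
    {"p_mass": 0.05, "correct": False},
    {"p_mass": 0.15, "correct": False},
    {"p_mass": 0.25, "correct": False},
    {"p_mass": 0.45, "correct": False},
    {"p_mass": 0.60, "correct": False},
]

# Sweep the threshold across the range the text reports as robust.
rates = {
    thr: commitment_failure_rate(toy_samples, thr) for thr in (0.1, 0.2, 0.3, 0.4)
}
```

Note that the denominator is the number of hallucinated samples, not all samples, matching the definition of CF% as a share of hallucinations.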
Two structural sub-populations. Commitment failures decompose into two cases at the token level. In a *first-token selection failure*, the greedy emission $y_1$ does not match any surface form of $c^{*}$ at all (e.g., the answer is Saint Petersburg and the model emits Mos to begin Moscow). In a *multi-token divergence*, $y_1$ does land on a surface form of $c^{*}$ but the multi-token continuation diverges from any valid alias (e.g., the answer is Adam Smith but the model emits Adam Levine). The two are distributionally different and we treat them separately. Across the scale ablation, first-token selection failures account for roughly 20% of commitment failures, rising monotonically with scale: from 2.9% of all hallucinations at 0.8B to 6.0% at 72B in Qwen Instruct, and from 3.3% at 1B to 10.2% at 70B in Llama Instruct (Table [1](https://arxiv.org/html/2605.22007#S4.T1)). The remaining $\sim$80% are multi-token divergences: hallucinations where the first token is on track but the continuation is not. Multi-token divergences are analyzed in detail in §[4.3](https://arxiv.org/html/2605.22007#S4.SS3) and Appendix [F](https://arxiv.org/html/2605.22007#A6).
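The decomposition reduces to a membership test on the greedy first token. The sketch below assumes the sample is already known to be a hallucination; it uses word-level strings in place of real token IDs, and the function name, labels, and example values are hypothetical.

```python
def classify_hallucination(greedy_first_token, correct_token_set, p_mass, threshold=0.2):
    """Label one hallucinated sample using the two-way decomposition in the text."""
    if p_mass < threshold:
        return "not_commitment_failure"
    if greedy_first_token in correct_token_set:
        # First token is on a surface form of the correct concept, so the
        # error must come from the continuation.
        return "multi_token_divergence"
    return "first_token_selection_failure"


# Answer "Saint Petersburg", model begins "Moscow" with the token "Mos".
label_sf = classify_hallucination("Mos", {"Saint", "St"}, p_mass=0.45)
# Answer "Adam Smith", model emits "Adam Levine".
label_div = classify_hallucination("Adam", {"Adam"}, p_mass=0.80)
# Little mass on the correct concept: outside the commitment-failure population.
label_other = classify_hallucination("Mos", {"Saint", "St"}, p_mass=0.05)
```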
Table 1: Commitment-failure rate (CF%) and decomposition into first-token selection failures (SF, greedy $\notin S_{c^{*}}$) and multi-token divergences (Div, greedy $\in S_{c^{*}}$ but final answer wrong) across the Instruct scale ablation. Acc: Short-QA accuracy. AUROC: $P_{\mathrm{mass}}(t{=}1)$ AUROC. Halluc: hallucinated sample count. SF%: SF as fraction of all hallucinations. Full per-model details including Base models in Appendix Table [9](https://arxiv.org/html/2605.22007#A12.T9).
### 4.3 Why does the model commit to the wrong token?
We focus here on first-token selection failures, the strict sub-population where greedy emission $y_1 \notin S_{c^{*}}$ despite $P_{\mathrm{mass}}(t_c; c^{*}) \geq 0.2$. These cases are directly inspectable at the token level: the model assigned substantial mass to the correct concept but a single token of a competing concept won the argmax. We ask two questions: (i) within a model, what distinguishes the distribution at a selection failure from a correct-but-comparable sample (one with a similar amount of mass on $c^{*}$)? (ii) How do these distributions change with scale?
Within-population: less concentrated correct mass. To isolate the token-level structure, we compare two groups in Qwen3.5-9B Instruct, both restricted to samples with $P_{\mathrm{mass}} \geq 0.2$: first-token selection failures (those that hallucinated, $N = 128$) and *matched correct samples* (those that answered correctly, $N = 840$). Both groups have substantial mass on the correct concept; the only difference is the outcome. The maximum probability assigned to any single surface form of $c^{*}$, $\max_{v \in S_{c^{*}}} P_\theta(v)$, is dramatically smaller in the failure group: mean 0.26 vs. 0.78 (Welch $t = 47.6$, $p < 10^{-180}$, Cohen’s $d = 2.98$; Appendix [B](https://arxiv.org/html/2605.22007#A2)). The same comparison across all 18 models gives $d < 0$ in 100% of cases (median $|d| = 1.93$, range $[1.01, 4.30]$); within Instruct models $|d|$ tends to grow with scale (Qwen Inst: 1.34 $\to$ 2.98 $\to$ 4.30 from 0.8B to 72B; Appendix Table [2](https://arxiv.org/html/2605.22007#A4.T2)). Vocabulary fragmentation across alias forms (Figure [2](https://arxiv.org/html/2605.22007#S1.F2), e.g. Saint, St, C) is the most concrete realization in small and Base models; in mid-to-large Instruct models, the within-concept distribution has typically already collapsed onto a single alias and the SF–Corr gap arises from a different route, characterized below.
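The effect-size and test statistics quoted above can be computed as in this sketch. It assumes the pooled-standard-deviation form of Cohen's d and the sign convention "failure group minus correct group" (the text reports both a positive magnitude and $d < 0$, so the convention here is a guess); the toy values are made up and far smaller than the paper's samples.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def welch_t(group_a, group_b):
    """Welch's t statistic, which does not assume equal variances."""
    standard_error = math.sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


# Made-up top-alias probabilities: max over the correct concept's surface forms.
failure_top_alias = [0.20, 0.30, 0.25, 0.30]
correct_top_alias = [0.70, 0.80, 0.85, 0.75]

d = cohens_d(failure_top_alias, correct_top_alias)
t = welch_t(failure_top_alias, correct_top_alias)
```

Swapping the group order flips the sign of both statistics without changing their magnitude, which is worth keeping in mind when comparing against reported values.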
Figure 4: Within first-token selection failures, mean wrong-token probability $P_\theta(y_1)$ across models. *(a) Instruct models*: monotonic sharpening from $\sim$0.31 at 1B to $\sim$0.49–0.57 at 70B+, in both families. *(b) Base models*: flat at $\sim$0.26–0.33 across the same scale range. Marker size $\propto$ number of true selection failures; the dotted line at 0.31 marks the typical Base level. The contrast is family-independent: instruction tuning, not scale alone, is the driver.

Across scale: monotonic sharpening, modulated by instruction tuning. Within first-token selection failures, the greedy emitted token is by definition outside $S_{c^{*}}$; we call its probability $P_\theta(y_1)$ the *wrong-token probability*. This rises monotonically with model size in instruction-tuned models: Qwen Instruct 0.31 (0.8B) $\to$ 0.36 $\to$ 0.40 $\to$ 0.44 $\to$ 0.57 (72B); Llama Instruct 0.33 (1B) $\to$ 0.43 $\to$ 0.46 $\to$ 0.49 (70B) (Figure [4](https://arxiv.org/html/2605.22007#S4.F4), left). Base models behave differently: across the same size range, wrong-token probability stays near 0.30 (Llama Base: 0.26 at 1B to 0.33 at 70B; Qwen Base 0.8B/72B: 0.27/0.31). Scale alone does not produce sharpening; the combination of scale and instruction tuning does. The same sharpening that produces front-loaded correctness signal in §[4.4](https://arxiv.org/html/2605.22007#S4.SS4) produces decisive misselection when the committed concept is wrong.
Sharpening extends to multi-token answers. The same sharpening operates beyond the first token. Within multi-token divergences, $H_{t=2}$ (entropy of the next-token distribution after $y_1$) measures whether the model has committed to a specific multi-token phrase at $t = 1$ (low $H_{t=2}$) or is still deciding. We classify each divergence by whether the second token continues an alias of $c^{*}$. *Type A*: bigram $(y_1, y_2)$ matches the start of a valid alias but the continuation diverges into a different entity (e.g., George Washington Carver, the agricultural scientist, when the answer is George Washington, the first U.S. president; the bigram George Washington is shared with the alias prefix, but Carver fixes a different person). *Type B*: bigram diverges already at $y_2$ (e.g., Adam Lambert when the answer is Adam Smith: $y_1 =$ Adam is in $S_{c^{*}}$ but Lambert breaks the alignment). Across 18 Instruct/Base models, Type A divergences have substantially lower $H_{t=2}$ than Type B (median Cohen’s $d = 1.29$, $d > 1$ in 100% of models with sufficient $N$; Appendix [F](https://arxiv.org/html/2605.22007#A6)): when the bigram is on track, the continuation is near-deterministic. The Type A fraction grows with both scale and instruction tuning, from 21–44% at 0.8B/1B to 77–83% at 70B+ Instruct. At 70B+ Instruct, $H_{t=2}$ within Type A divergences is 0.05–0.10: the model commits to wrong multi-token phrases with residual entropy comparable to deterministic continuations.
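The Type A / Type B split is a prefix test on the first two generated tokens. The sketch below uses whole words as stand-ins for model tokens, and the helper names are hypothetical; a real pipeline would compare tokenizer IDs and would need its own rule for single-token aliases, which this version treats as unmatched by any bigram.

```python
def starts_valid_alias(prefix, alias_token_seqs):
    """True if `prefix` equals the first len(prefix) tokens of some alias."""
    n = len(prefix)
    return any(
        len(alias) >= n and list(alias[:n]) == list(prefix)
        for alias in alias_token_seqs
    )


def divergence_type(generated_tokens, alias_token_seqs):
    """Classify a multi-token divergence by its first bigram (y1, y2)."""
    if len(generated_tokens) < 2:
        raise ValueError("need at least two generated tokens")
    bigram = generated_tokens[:2]
    return "A" if starts_valid_alias(bigram, alias_token_seqs) else "B"


# Examples from the text, tokenized by word for illustration.
type_carver = divergence_type(
    ["George", "Washington", "Carver"], [["George", "Washington"]]
)
type_lambert = divergence_type(["Adam", "Lambert"], [["Adam", "Smith"]])
```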
Two faces of commitment failure. The within-population analysis above measures top-1 alias mass without normalizing by $P_{\mathrm{mass}}(c^{*})$. To separate *within-concept* structure (how the correct mass is distributed across alias forms) from *between-concept* structure (how the wrong greedy token compares to the correct concept), we measure two ratios on SF samples across all 18 models: $D_2 = \max_{v \in S_{c^{*}}} P(v) / P_{\mathrm{mass}}(c^{*})$ and $D_3 = P(\text{greedy}) / P_{\mathrm{mass}}(c^{*})$. Within Llama Instruct, $D_2$ grows monotonically (1B 0.76 $\to$ 70B 0.99) while Llama Base plateaus (0.66 $\to$ 0.81); Qwen2.5-72B shows the same contrast (Inst 1.00 vs Base 0.76). $D_3$ follows the same pattern: Inst grows monotonically with scale (Llama 1.13 $\to$ 1.45; Qwen 1.13 $\to$ 1.65), Base stays flat ($\sim$1.0–1.13). The two ratios separate two failure modes: *fragmentation-driven failures* (low $D_2$, low $D_3$) in small or Base models where the correct mass is split across alias tokens (the regime Figure [2](https://arxiv.org/html/2605.22007#S1.F2) illustrates), and *wrong-attractor failures* (high $D_2$, high $D_3$) dominating large Instruct models, where the correct concept has collapsed onto a single alias but a wrong concept’s token is even sharper. Both arise from the same instruction-induced sharpening: weak sharpening leaves correct mass spread; strong sharpening collapses fragmentation but strengthens wrong attractors at least as fast (Appendix [E](https://arxiv.org/html/2605.22007#A5)).
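As a worked instance of the two ratios, the sketch below plugs in the probabilities quoted in the Figure 2 caption. It is only an illustration for that single example: the function name is hypothetical, the concept mass is passed in as the reported 0.501 because the caption does not list every surface form, and the greedy probability is taken as the maximum over the tokens shown.

```python
def sharpening_ratios(first_token_probs, correct_forms, concept_mass):
    """Return (D2, D3): top-alias and greedy probability over concept mass."""
    top_alias = max(
        p for token, p in first_token_probs.items() if token in correct_forms
    )
    greedy = max(first_token_probs.values())
    return top_alias / concept_mass, greedy / concept_mass


figure2_probs = {"Saint": 0.244, "St": 0.115, "C": 0.131, "Mos": 0.312}
d2, d3 = sharpening_ratios(figure2_probs, {"Saint", "St", "C"}, concept_mass=0.501)
print(f"D2 = {d2:.3f}, D3 = {d3:.3f}")
```

Here less than half of the correct concept's mass sits on its strongest alias, which is the fragmented, low-$D_2$ pattern.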
Instruction-induced sharpening therefore acts at three structural levels: at the first token (selection failures, sharper wrong-token probability), across multi-token answers (early commitment to specific phrase continuations), and within the correct concept’s alias distribution ($D_2$ collapse).
### 4.4 When does the model “know” it is going to fail?
A separate but related observation is what makes $t_c = 1$ in instruction-tuned short-form QA. The prompt format alone is not enough: Base models, given the same short-form prompt, emit filler tokens first and delay $t_c$ to later steps (Appendix Table [8](https://arxiv.org/html/2605.22007#A12.T8)). The Instruct–Base contrast lets us ask whether the front-loading is purely an output-formatting effect or reflects deeper changes in the representation.
Output-level detection. On MCQA, Instruct models reach near-perfect detection AUROC (0.974–0.999) using only $P(\text{correct option})$, while Base models span 0.558–0.748 (Figure [5](https://arxiv.org/html/2605.22007#S4.F5), right; +0.29 average gap).
Attention to the question. At $t=1$, Instruct models allocate a higher fraction of last-layer attention to question tokens (+0.09 average; Figure [5](https://arxiv.org/html/2605.22007#S4.F5), middle), consistent with retrieving the answer concept from $Q$ rather than first emitting filler.
Hidden-state probes (pre-generation). Logistic regression on the last-layer hidden state at $t=1$, before any token is generated, yields Instruct $>$ Base in all four sizes (+0.08 average; Figure [5](https://arxiv.org/html/2605.22007#S4.F5), left). The pattern holds across nearly all layers (Appendix [J](https://arxiv.org/html/2605.22007#A10)), peaking at mid-layers, ruling out a purely output-formatting explanation. This is the first systematic Instruct–Base comparison for the TBG setting of Kossen et al. ([2024](https://arxiv.org/html/2605.22007#bib.bib23)): pre-generation probe AUROC is 0.61–0.87 for Instruct versus 0.50–0.63 for Llama Base, and Base models gain substantially more from one generated token (avg pre $\to$ post $\Delta_B = +0.030$ vs. $\Delta_I = +0.005$; Appendix Table [7](https://arxiv.org/html/2605.22007#A11.T7)). Correctness-predictive information is genuinely front-loaded in Instruct models.
Figure 5: Three-level Instruct–Base comparison at $t=1$. Hidden Probe: 5-fold CV AUROC on last-layer hidden states (MCQA, $N = 1{,}000$). Q-attn: fraction of last-layer attention on question tokens (Short-QA, $N = 500$). Output AUROC: $P(\text{correct option})$ on MCQA. Average Instruct–Base gaps: $+0.08$ (Hidden Probe), $+0.09$ (Q-attn), $+0.29$ (Output AUROC). The output-level gap (+0.29) is larger than the hidden-state gap (+0.08): instruction tuning’s effect is partly representational (information is in the hidden states) and substantially in the output mapping (the projection amplifies it sharply).
## 5 Discussion and Limitations
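The detection AUROC values in this subsection can be read as the probability that a correctly answered item receives a higher score than an incorrectly answered one. A minimal pure-Python sketch under that reading follows. The function name `auroc` and the toy scores are hypothetical, and treating answer correctness as the binary label is an assumption about the evaluation setup rather than something specified in this section.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5).

    scores: one confidence score per item, e.g. P(correct option).
    labels: 1 if the model answered the item correctly, else 0 (assumed).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative item")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy scores, invented for illustration only.
scores = [0.9, 0.4, 0.5, 0.2]
labels = [1, 1, 0, 0]
print(auroc(scores, labels))  # 0.75
```

The quadratic pairwise loop is chosen for clarity; for large $N$ a rank-based formulation would be the usual choice.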
Hallucination as commitment failure. Across 18 models from 0.8B to 72B, 16% to 47% of Instruct hallucinations leave non-trivial mass on the correct concept at the commitment step yet produce a wrong final answer, with the rate rising monotonically with scale. As models scale, a growing share of hallucinations are commitment failures in which the correct answer is present in the generation-time distribution, rather than cases where the answer is absent. The within-population finding sharpens this: at matched $P_{\mathrm{mass}}$, failures consistently have lower top-token mass on $c^*$ ($d < 0$ in 100% of 18 models, with $|d|$ growing with scale within each Instruct family from 1.34 to 4.30; Appendix [D](https://arxiv.org/html/2605.22007#A4)). The structural difference between hallucination and correctness at comparable concept-level mass is whether any single surface form is concentrated enough to win. The empirical driver is instruction-induced sharpening: Instruct models sharpen first-token commitments with scale (wrong-token probability rising from 0.31 to 0.57), while Base models remain flat at $\sim$0.30. The same sharpening produces front-loaded correctness signal (§[4.4](https://arxiv.org/html/2605.22007#S4.SS4)) and decisive misselection when the committed concept is wrong.
Sharpening operates at multiple granularities. The first-token effect extends both inward and forward. Within multi-token divergences, the second-token entropy after an alias-prefix-aligned bigram falls steeply with scale and instruction tuning, reaching 0.05–0.10 at 70B+ Instruct (Appendix [F](https://arxiv.org/html/2605.22007#A6)); by the second token, the model has effectively committed to a specific multi-token continuation, and a wrong continuation is selected with the same residual entropy as a deterministic one. The same sharpening also operates within the correct concept’s alias distribution: in 70B+ Instruct, mass on the correct concept has typically collapsed onto a single alias token (Appendix [E](https://arxiv.org/html/2605.22007#A5)), removing fragmentation as a recoverable failure mode. Confident hallucination is therefore not a momentary slip at $t=1$ but the natural endpoint of a sharpening process operating at three structural levels (first-token selection, multi-token phrase commitment, and within-concept alias collapse): more decisive answers when right, more decisive misselections when wrong.
Implications for the picture of confident hallucination. A confident hallucination is one where the model places high probability on a wrong answer’s tokens, and the standard framing treats this as evidence that the model has the wrong answer in its distribution and not the right one. Our results complicate this picture: in commitment failures the model has placed substantial mass on the *correct* concept ($P_{\mathrm{mass}} \geq 0.2$) while still emitting a wrong token confidently. Token-level confidence is not coming from the absence of the right concept but from concentration on the wrong concept in spite of it. The same sharpness produces confident correctness when the committed concept is right. “Confidently wrong” and “confidently right” are two outcomes of one distributional disposition, not two different epistemic states, which may explain why uncertainty does not flag confident hallucinations (Simhi et al., [2025](https://arxiv.org/html/2605.22007#bib.bib14); Xu et al., [2025](https://arxiv.org/html/2605.22007#bib.bib15); Farquhar et al., [2024](https://arxiv.org/html/2605.22007#bib.bib10)).
Limitations. $P_{\mathrm{mass}}$ is an analytical probe, not a deployable detector: it requires the ground-truth alias set $S_{c^*}$ as input, so any practical use depends on inferring $S_{c^*}$ from context (e.g., by clustering top-$k$ tokens at $t_c$ by semantic equivalence). Within multi-token divergences, $P_{\mathrm{mass}}(t=1; c^*) \geq 0.2$ does not distinguish concept-level belief from phrase-level commitment to specific multi-token continuations (Appendix [F](https://arxiv.org/html/2605.22007#A6)); the within-population analysis in §[4.3](https://arxiv.org/html/2605.22007#S4.SS3) controls for this by restricting to first-token selection failures. Our experiments use 1–3 token answers and greedy decoding; concept-segmented $P_{\mathrm{mass}}$ for longer generation, and the behavior of commitment failures under temperature sampling or top-$p$ truncation, are natural extensions. The analyses also span only Qwen and Llama families; we expect the same instruction-induced sharpening pattern in other open-weight families, but cannot verify it for closed models without first-token distribution access.
Future work. The commitment-failure phenomenon directly suggests two directions. First, greedy decoding has a structural limitation when $P_{\mathrm{mass}}(c^*)$ is high but split across surface forms: a *concept-aware* decoding rule that takes the argmax over alias-clustered top-$k$ tokens, rather than over individual vocabulary entries, could convert a meaningful fraction of selection failures into correct answers without retraining. Quantifying the upper bound and approximating it with semantic-similarity clustering of the top-$k$ tokens is a concrete next step. Second, our setup fixes the commitment step at $t=1$ via short-form prompting; long-form generation has multiple commitment events that a richer typology should distinguish: domain commitment (Britain), answer commitment (Nicola), and possibly others (rhetorical-frame, sub-claim). Identifying these reliably is itself a problem: entropy is a noisy localizer (exact match in only 20% of long-form trajectories, §[4.1](https://arxiv.org/html/2605.22007#S4.SS1)), so non-entropy signals (hidden-state probes, attention concentration on $Q$, or the rate of change $\Delta P_{\mathrm{mass}}(t)$ at each candidate spike) are likely better candidates and worth systematic comparison.
## 6 Conclusion
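The concept-aware decoding idea can be sketched as follows. This is only an illustration of the rule described above, not an implementation from the paper: the clustering is assumed to be given as a plain dictionary (`cluster_of`), whereas in practice it would have to be inferred, and the function name, tokens, and probabilities are all hypothetical.

```python
def concept_aware_argmax(top_k, cluster_of):
    """Pick the concept with the largest summed probability.

    top_k: dict token -> probability for the top-k candidates at t_c.
    cluster_of: dict token -> concept label (hypothetical clustering
        output); tokens missing from it form their own singleton concept.
    Returns (token, concept_mass): the highest-probability token inside
    the winning concept and that concept's total mass.
    """
    mass, best = {}, {}
    for tok, p in top_k.items():
        concept = cluster_of.get(tok, ("singleton", tok))
        mass[concept] = mass.get(concept, 0.0) + p
        if concept not in best or p > top_k[best[concept]]:
            best[concept] = tok
    winner = max(mass, key=mass.get)
    return best[winner], mass[winner]


# Toy distribution, invented for illustration only.
top_k = {"Lyon": 0.30, "Paris": 0.15, " Paris": 0.12, "paris": 0.08}
cluster_of = {"Paris": "PARIS", " Paris": "PARIS", "paris": "PARIS"}

greedy_token = max(top_k, key=top_k.get)
token, concept_mass = concept_aware_argmax(top_k, cluster_of)
print(greedy_token)  # Lyon
print(token)         # Paris
```

Token-level greedy decoding picks the single sharpest token, while the concept-level rule picks the alias cluster with the most total mass and then emits its strongest surface form.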
We asked what is happening at the moment of hallucination, viewed through the model’s distribution at the commitment step. Defining concepts as equivalence classes of token completions and introducing per-step semantic probability mass as an analytical probe, we found that a substantial fraction of Instruct hallucinations are commitment failures: the model puts non-trivial mass on the correct concept yet produces a wrong final answer, with the rate rising monotonically with scale. Larger models do not just know more; they also misfire more often on what they know. Within these failures, the structural difference from matched correct generations is not whether the correct concept is represented, but how its mass is distributed across surface forms. Across scale, the same sharpening pattern operates at the first token, across multi-token continuations, and within the correct concept’s alias distribution: uniformly in Instruct models, absent in Base. This reframes confident hallucination as a structural consequence of how mass is shaped at the commitment step, and situates it as one face of the broader alignment tax, suggesting concept-aware decoding and finer commitment-step typologies as natural follow-ups.
## References
- A. Azaria and T. Mitchell (2023) The internal state of an LLM knows when it’s lying. In Findings of the Association for Computational Linguistics: EMNLP 2023.
- Y. F. Bakman, D. N. Yaldiz, B. Buyukates, C. Tao, D. Dimitriadis, and S. Avestimehr (2024) MARS: meaning-aware response scoring for uncertainty estimation in generative LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL).
- C. Burns, H. Ye, D. Klein, and J. Steinhardt (2023) Discovering latent knowledge in language models without supervision. In International Conference on Learning Representations (ICLR).
- N. Calderon, E. Ben-David, Z. Gekhman, E. Ofek, and G. Yona (2026) Empty shelves or lost keys? Recall is the bottleneck for parametric factuality. arXiv preprint arXiv:2602.14080.
- P. Chhikara (2025) Mind the confidence gap: overconfidence, calibration, and distractor effects in large language models. Transactions on Machine Learning Research (TMLR). arXiv:2502.11028.
- Y. Chuang, Y. Xie, H. Luo, Y. Kim, J. Glass, and P. He (2024) DoLa: decoding by contrasting layers improves factuality in large language models. In International Conference on Learning Representations (ICLR).
- P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018) Think you have solved question answering? try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457.
- J. Cohen (1988) Statistical power analysis for the behavioral sciences. 2nd edition, Lawrence Erlbaum Associates.
- S. Farquhar, J. Kossen, L. Kuhn, and Y. Gal (2024) Detecting hallucinations in large language models using semantic entropy. Nature 630, pp. 625–630.
- R. A. Fisher (1925) Statistical methods for research workers. Oliver and Boyd.
- A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, et al. (2024) The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In International Conference on Machine Learning (ICML).
- D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt (2021) Measuring massive multitask language understanding. In International Conference on Learning Representations (ICLR).
- T. Hu, B. Minixhofer, and N. Collier (2025) Navigating the alignment-calibration trade-off: a Pareto-superior frontier via model merging. arXiv preprint arXiv:2510.17426.
- Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. Bang, D. Chen, W. Dai, H. S. Chan, A. Madotto, and P. Fung (2023) Survey of hallucination in natural language generation. ACM Computing Surveys 55 (12), pp. 1–38.
- M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer (2017) TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL).
- S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds, N. DasSarma, E. Tran-Johnson, et al. (2022) Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.
- J. Kossen, J. Han, M. Razzak, L. Schut, S. Malik, and Y. Gal (2024) Semantic entropy probes: robust and cheap hallucination detection in LLMs. arXiv preprint arXiv:2406.15927.
- L. Kuhn, Y. Gal, and S. Farquhar (2023) Semantic uncertainty: linguistic invariances for uncertainty estimation in natural language generation. In International Conference on Learning Representations (ICLR).
- T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov (2019) Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics 7, pp. 453–466.
- K. Li, O. Patel, F. Viégas, H. Pfister, and M. Wattenberg (2023) Inference-time intervention: eliciting truthful answers from a language model. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:2306.03341.
- H. Ma, J. Pan, J. Liu, Y. Chen, J. T. Zhou, G. Wang, Q. Hu, H. Wu, C. Zhang, and H. Wang (2025) Semantic energy: detecting LLM hallucination beyond entropy. arXiv preprint arXiv:2508.14496.
- A. Malinin and M. Gales (2021) Uncertainty estimation in autoregressive structured prediction. In International Conference on Learning Representations (ICLR).
- P. Manakul, A. Liusie, and M. J. F. Gales (2023) SelfCheckGPT: zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP).
- H. B. Mann and D. R. Whitney (1947) On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics 18 (1), pp. 50–60.
- S. Marks and M. Tegmark (2024) The geometry of truth: emergent linear structure in large language model representations of true/false datasets. In Conference on Language Modeling (COLM).
- M. Niu, H. Haddadi, and G. Pang (2025) Robust hallucination detection in LLMs via adaptive token selection. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:2504.07863.
- OpenAI (2023) GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. (2022) Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems (NeurIPS).
- J. Ren, J. Luo, Y. Zhao, K. Krishna, M. Saleh, B. Lakshminarayanan, and P. J. Liu (2023) Out-of-distribution detection and selective generation for conditional language models. In International Conference on Learning Representations (ICLR).
- A. Simhi, I. Itzhak, F. Barez, G. Stanovsky, and Y. Belinkov (2025) Trust me, I’m wrong: LLMs hallucinate with certainty despite knowing the answer. In Findings of the Association for Computational Linguistics: EMNLP 2025, pp. 14665–14688. arXiv:2502.12964.
- J. Snel and S. J. Oh (2025) First hallucination tokens are different from conditional ones. arXiv preprint arXiv:2507.20836.
- R. Vashurin, M. Goloburda, A. Ilina, A. Rubashevskii, P. Nakov, A. Shelmanov, and M. Panov (2025) CoCoA: a minimum Bayes risk framework bridging confidence and consistency for uncertainty quantification in LLMs. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:2502.04964.
- J. Vassoyan, N. Beau, and R. Plaud (2025) Ignore the KL penalty! boosting exploration on critical tokens to enhance RL fine-tuning. In Findings of the North American Chapter of the Association for Computational Linguistics (NAACL). arXiv:2502.06533.
- S. Wang, L. Yu, C. Gao, C. Zheng, S. Liu, R. Lu, K. Dang, X. Chen, J. Yang, Z. Zhang, Y. Liu, A. Yang, A. Zhao, Y. Yue, S. Song, B. Yu, G. Huang, and J. Lin (2025) Beyond the 80/20 rule: high-entropy minority tokens drive effective reinforcement learning for LLM reasoning. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:2506.01939.
- B. L. Welch (1947) The generalization of ‘Student’s’ problem when several different population variances are involved. Biometrika 34 (1/2), pp. 28–35.
- J. Xie, A. S. Chen, Y. Lee, E. Mitchell, and C. Finn (2024) Calibrating language models with adaptive temperature scaling. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP). arXiv:2409.19817.
- M. Xiong, Z. Hu, X. Lu, Y. Li, J. Fu, J. He, and B. Hooi (2024) Can LLMs express their uncertainty? an empirical evaluation of confidence elicitation in LLMs. In International Conference on Learning Representations (ICLR).
- H. Xu, Z. Yang, Z. Zhu, K. Lan, Z. Wang, M. Wu, Z. Ji, L. Chen, P. Fung, and K. Yu (2025) Delusions of large language models. arXiv preprint arXiv:2503.06709.
- A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, et al. (2024) Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.
- A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- Q. Zhao, M. Xu, K. Gupta, A. Asthana, L. Zheng, and S. Gould (2024) The first to know: how token distributions reveal hidden knowledge in large vision-language models? In European Conference on Computer Vision (ECCV). arXiv:2403.09037.
## Appendix A Theoretical Analysis
We work under a latent-concept generation model: at each step the model implicitly considers candidate concepts $c \in \mathcal{C}$ before emitting tokens, so $P_\theta(y_t \mid Q, y_{<t}) = \sum_{c \in \mathcal{C}} P_\theta(y_t \mid c, Q, y_{<t}) \cdot P_\theta(c \mid Q, y_{<t})$. We refer to $P_\theta(c \mid Q, y_{<t})$ as the model’s *concept belief* and to $P_\theta(y_t \mid c, Q, y_{<t})$ as the concept-conditioned emission distribution.
###### Proposition 1 ($P_{\mathrm{mass}}$ as Concept-Belief Proxy).
Let $\gamma_c = P_\theta(y_t \in S_c \mid c, Q, y_{<t})$ (*completeness* of $S_c$ under $c$) and $\epsilon = \max_{c' \neq c} P_\theta(y_t \in S_c \mid c', Q, y_{<t})$ (*leakage* from competing concepts). With $K = |\mathcal{C}|$,

$$\bigl|P_{\mathrm{mass}}(t;c) - P_\theta(c \mid Q, y_{<t})\bigr| \;\leq\; (1-\gamma_c) + (K-1)\epsilon.$$
###### Proof of Proposition [1](https://arxiv.org/html/2605.22007#Thmproposition1).
By the law of total probability,

$$P_{\mathrm{mass}}(t) = \sum_{v \in S_c} \sum_{c' \in \mathcal{C}} P_\theta(v \mid c', Q, y_{<t}) \cdot P_\theta(c' \mid Q, y_{<t}) = \sum_{c' \in \mathcal{C}} P_\theta(c' \mid Q, y_{<t}) \cdot \alpha_{c'}(t), \tag{2}$$

where $\alpha_{c'}(t) = P_\theta(y_t \in S_c \mid c', Q, y_{<t})$, so that $\alpha_c(t) \in [\gamma_c, 1]$ for the target concept and $\alpha_{c'}(t) \in [0, \epsilon]$ for $c' \neq c$. Then

$$P_{\mathrm{mass}}(t) \leq P_\theta(c \mid Q, y_{<t}) + \bigl(1 - P_\theta(c \mid Q, y_{<t})\bigr) \cdot \epsilon, \tag{3}$$

$$P_{\mathrm{mass}}(t) \geq P_\theta(c \mid Q, y_{<t}) \cdot \gamma_c. \tag{4}$$

By (3), $P_{\mathrm{mass}}(t) - P_\theta(c \mid Q, y_{<t}) \leq \epsilon \leq (K-1)\epsilon$ whenever $K \geq 2$; by (4), $P_\theta(c \mid Q, y_{<t}) - P_{\mathrm{mass}}(t) \leq (1-\gamma_c) \cdot P_\theta(c \mid Q, y_{<t}) \leq 1-\gamma_c$. Combining gives $|P_{\mathrm{mass}}(t) - P_\theta(c \mid Q, y_{<t})| \leq (1-\gamma_c) + (K-1)\epsilon$. ∎
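A small numeric check can make the bound concrete. The sketch below instantiates the latent-concept mixture with three concepts and compares the gap $|P_{\mathrm{mass}} - P_\theta(c \mid \cdot)|$ with $(1-\gamma_c) + (K-1)\epsilon$. The concept names, beliefs, and emission probabilities are invented for illustration, and `pmass_gap_and_bound` is a hypothetical helper.

```python
def pmass_gap_and_bound(belief, alpha, target):
    """Compare |P_mass - belief[target]| with the Proposition 1 bound.

    belief: dict concept -> P(c' | Q, y_<t), summing to 1.
    alpha: dict concept -> P(y_t in S_target | c', Q, y_<t).
    """
    p_mass = sum(belief[c] * alpha[c] for c in belief)   # mixture as in Eq. (2)
    gamma = alpha[target]                                # completeness
    eps = max(alpha[c] for c in belief if c != target)   # leakage
    gap = abs(p_mass - belief[target])
    bound = (1 - gamma) + (len(belief) - 1) * eps
    return p_mass, gap, bound


# Toy numbers, invented for illustration only.
belief = {"target": 0.5, "rival_1": 0.3, "rival_2": 0.2}
alpha = {"target": 0.9, "rival_1": 0.05, "rival_2": 0.02}

p_mass, gap, bound = pmass_gap_and_bound(belief, alpha, "target")
print(round(p_mass, 3), round(gap, 3), round(bound, 3))  # 0.469 0.031 0.2
```

In this toy case the gap is well inside the bound, which is expected since the $(K-1)\epsilon$ term is a worst-case allowance for leakage.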
###### Proposition 2 (Posterior Concentration at Concept Emission, Auxiliary).
Let $t_c$ denote the first step at which a token from some concept’s first-token set is emitted, $y_{t_c} \in S_{\hat{c}}$. Under the latent-concept model with completeness $\gamma_{\hat{c}}$ and leakage $\epsilon$,

$$P_\theta(\hat{c} \mid Q, y_{\leq t_c}) \;\geq\; \frac{\gamma_{\hat{c}} \cdot P_\theta(\hat{c} \mid Q, y_{<t_c})}{\gamma_{\hat{c}} \cdot P_\theta(\hat{c} \mid Q, y_{<t_c}) + (K-1)\epsilon}. \tag{5}$$

With $\epsilon = 0$, the posterior concentrates to a delta function on $\hat{c}$, regardless of correctness. This formalizes the deterministic continuation between entropy spikes in Figure [1](https://arxiv.org/html/2605.22007#S1.F1); it is auxiliary to the empirical analysis in §[4](https://arxiv.org/html/2605.22007#S4).
###### Proof.
Bayes’ rule at step $t_c$ gives $P_\theta(\hat{c} \mid Q, y_{\leq t_c}) = P_\theta(y_{t_c} \mid \hat{c}, Q, y_{<t_c}) \cdot P_\theta(\hat{c} \mid Q, y_{<t_c}) / P_\theta(y_{t_c} \mid Q, y_{<t_c})$. The numerator is at least $\gamma_{\hat{c}} \cdot P_\theta(\hat{c} \mid Q, y_{<t_c})$ by completeness. The denominator decomposes via total probability and is bounded above by $P_\theta(y_{t_c} \mid \hat{c}, Q, y_{<t_c}) \cdot P_\theta(\hat{c} \mid Q, y_{<t_c}) + (K-1)\epsilon$, using $P_\theta(y_{t_c} \mid c', Q, y_{<t_c}) \leq \epsilon$ for $c' \neq \hat{c}$. With $P_\theta(y_{t_c} \mid \hat{c}, Q, y_{<t_c}) \leq 1$ in the denominator, the bound follows. ∎
## Appendix B Statistical Primer
We use a small number of standard statistics throughout this paper; this appendix is a brief reference for readers unfamiliar with them.
Cohen’s $d$ (effect size, Cohen, [1988](https://arxiv.org/html/2605.22007#bib.bib40)). For two groups with means $\mu_1, \mu_2$ and pooled standard deviation $\sigma$, $d = (\mu_1 - \mu_2)/\sigma$. It expresses how far apart the two group means are in units of typical within-group spread. Conventional thresholds: $|d| \approx 0.2$ small, $|d| \approx 0.5$ medium, $|d| \geq 0.8$ large, $|d| > 1.5$ very large. Effect-size statistics like $d$ complement $p$-values, which only tell you whether a difference is reliably nonzero, not how big it is.
Welch’s $t$-test [Welch, [1947](https://arxiv.org/html/2605.22007#bib.bib41)]. A two-sample $t$-test that does not assume equal variances between the two groups; reports a $t$-statistic and a $p$-value for the null hypothesis “the two group means are equal.”
Mann–Whitney $U$ test [Mann and Whitney, [1947](https://arxiv.org/html/2605.22007#bib.bib42)]. A non-parametric two-sample test on whether one group’s values systematically dominate the other’s. It does not assume normality. We use it to confirm Welch’s $t$-test results in cases where group distributions are skewed.
Pearson’s $r$. Linear correlation coefficient between two scalar quantities, in $[-1, 1]$.
Fisher’s combined $p$ [Fisher, [1925](https://arxiv.org/html/2605.22007#bib.bib43)]. A way to combine $k$ independent $p$-values into a single one: $-2\sum_i \ln p_i$ is $\chi^2$-distributed with $2k$ degrees of freedom under a global null. Used here for pooled meta-analysis across models.
Cohen’s $d$ sign convention. Throughout, when we report negative $d$ between SF and Correct groups, the sign indicates SF $<$ Corr (i.e., the failure group has lower top-token mass on $c^*$); the magnitude is what matters for the conclusion.
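For readers who prefer code to formulas, three of the statistics above can be sketched with the Python standard library alone. The function names are hypothetical, the sample data are toy values, and Welch’s $p$-value is omitted because the standard library has no $t$-distribution; the closed form used for Fisher’s method relies on the chi-square distribution having an even number of degrees of freedom ($2k$).

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def welch_t(a, b):
    """Welch's t statistic (no equal-variance assumption)."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def fisher_combined_p(p_values):
    """Fisher's method: -2 * sum(ln p_i) is chi-square with 2k d.o.f.

    For 2k degrees of freedom the chi-square survival function is
    exp(-x/2) * sum_{i<k} (x/2)^i / i!, evaluated here at x/2.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # statistic divided by 2
    series = sum(half_stat ** i / math.factorial(i) for i in range(k))
    return math.exp(-half_stat) * series


# Toy data, invented for illustration only.
print(cohens_d([1, 2, 3], [2, 3, 4]))  # -1.0
print(welch_t([1, 2, 3], [2, 3, 4]))
print(fisher_combined_p([0.05, 0.05]))
```

With a single $p$-value, `fisher_combined_p` returns that value unchanged, which is a quick sanity check on the closed form.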
## Appendix C Spread-Based Diagnostics
The within-population result reported in §[4.3](https://arxiv.org/html/2605.22007#S4.SS3) uses the simple statistic $\max_{v \in S_{c^*}} P_\theta(v)$. We tested an alternative summary statistic, the inverse Simpson diversity index (also called effective number of types):

$$\mathrm{Spread}(c^*; i, t) \;=\; \frac{\bigl(P_{\mathrm{mass}}(t; c^*)\bigr)^2}{\sum_{v \in S_{c^*}} P_\theta(v \mid Q_i, y_{<t})^2}.$$

$\mathrm{Spread} = 1$ when all mass is on a single token; $\mathrm{Spread} = k$ when uniformly distributed over $k$ tokens. The two summary statistics measure different things ($\max$ measures the dominant token’s probability, $\mathrm{Spread}$ measures the evenness of the distribution), and within first-token selection failures, the two give different pictures.
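The two boundary cases just stated can be checked directly. The following sketch computes $\mathrm{Spread}$ from a list of alias-token probabilities; the function name `spread` and the input values are hypothetical.

```python
def spread(alias_probs):
    """Inverse Simpson index of the alias distribution: P_mass^2 / sum(p^2)."""
    mass = sum(alias_probs)
    sum_sq = sum(p * p for p in alias_probs)
    if sum_sq == 0:
        raise ValueError("alias set has no probability mass")
    return mass * mass / sum_sq


print(spread([0.4]))                 # all mass on one token: 1.0
print(spread([0.1, 0.1, 0.1, 0.1]))  # uniform over 4 tokens: about 4
```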
Across-scale comparison (Qwen and Llama Instruct). The table below reports $\mathrm{Spread}(c^*)$ *conditioned on the sample being a first-token selection failure*, i.e., the average over the SF subset, not over all samples. This is the relevant statistic for asking how dispersed the correct concept’s mass is when a selection failure occurs.
The SF-conditional $\mathrm{Spread}$ decreases with scale in both families: Qwen3.5–Qwen2.5 from 1.91 (0.8B) to 1.11 (72B), Llama-3.2–Llama-3.1 from 1.83 (1B) to 1.18 (70B). At first glance this might seem paradoxical: $\mathrm{Spread}$ measures how scattered the correct concept’s mass is across alias tokens, and we showed in §[4.3](https://arxiv.org/html/2605.22007#S4.SS3) that selection failures are precisely cases where the correct mass is spread out (within-population $|d| = 2.98$ on top-token mass), so one might expect models that hallucinate more to also have higher $\mathrm{Spread}$. The resolution is that the across-scale decrease in SF-conditional $\mathrm{Spread}$ is a *consequence* of wrong-token sharpening, not a cause of it: as the wrong token’s probability rises (0.31 to 0.57 across Instruct scale), the correct concept’s remaining mass within selection-failure samples is squeezed and concentrates on fewer effective surface forms. The within-population fragmentation effect (which holds across all 18 models, $|d| \geq 1.0$, Appendix [D](https://arxiv.org/html/2605.22007#A4)) is a separate phenomenon from the across-scale $\mathrm{Spread}$ trend. Llama Instruct sharpens earlier (the largest drop is from 1B to 3B), Qwen more gradually, but both converge near 1.1–1.2 at the largest scales. Note that this SF-conditional $\mathrm{Spread}$ is distinct from the unconditional $\mathrm{Spread}$ averaged over all samples (which is more sensitive to the bulk of low-$P_{\mathrm{mass}}$ samples), and from the within-population test on $\max_{v \in S_{c^*}} P_\theta(v)$ in Appendix [D](https://arxiv.org/html/2605.22007#A4) (which controls for $P_{\mathrm{mass}}$).
Within\-population comparison usingSpread\\mathrm\{Spread\}\(Qwen3\.5\-9B Instruct,Pmass≥0\.2P\_\{\\mathrm\{mass\}\}\\geq 0\.2\)\.For completeness, we apply the within\-population test of §[4\.3](https://arxiv.org/html/2605.22007#S4.SS3)but withSpread\\mathrm\{Spread\}in place ofmaxv∈Sc∗Pθ\(v\)\\max\_\{v\\in S\_\{c^\{\*\}\}\}P\_\{\\theta\}\(v\):
Welcht=2\.94t=2\.94,p=3\.4×10−3p=3\.4\\\!\\times\\\!10^\{\-3\}, Cohen’sd=0\.32d=0\.32\. The direction matches expectation \(failures have slightly higherSpread\\mathrm\{Spread\}than correct samples\) and is statistically significant, but the effect size is much smaller than the corresponding test onmaxPθ\(v\)\\max P\_\{\\theta\}\(v\)\(d=2\.98d=2\.98in §[4\.3](https://arxiv.org/html/2605.22007#S4.SS3)\)\. The reason:maxPθ\(v\)\\max P\_\{\\theta\}\(v\)is a sharper signal for selection at the argmax level thanSpread\\mathrm\{Spread\}is—Spread\\mathrm\{Spread\}averages over the whole alias distribution and treats, say, “one alias token at 0\.5 plus three at 0\.05” similarly to “four alias tokens at 0\.15 each” \(similarSpread\\mathrm\{Spread\}\), even though only the first wins the argmax against a competing 0\.4\-probability token\. The main\-text analysis therefore usesmaxPθ\(v\)\\max P\_\{\\theta\}\(v\)as the primary measure; we reportSpread\\mathrm\{Spread\}here for completeness\.
Within\-belief stratification bySpread\\mathrm\{Spread\}\(commitment failures, allPmassP\_\{\\mathrm\{mass\}\}\)\.Among hallucinated samples withPmass≥0\.2P\_\{\\mathrm\{mass\}\}\\geq 0\.2in Qwen3\.5\-9B Instruct \(i\.e\., commitment failures, including both first\-token selection failures and multi\-token divergences\):
A directional positive trend appears in three of four bands \(∼\\sim5–10pp\), inconsistent in the highest bin whereNNis small\.
## Appendix D Within-Population Effect Across Models
For each model, we replicate the within-population test of §[4.3](https://arxiv.org/html/2605.22007#S4.SS3): among samples with $P_{\mathrm{mass}}\geq 0.2$, compare $\max_{v\in S_{c^{*}}}P_{\theta}(v)$ between first-token selection failures (greedy $\notin S_{c^{*}}$, “SF”) and correct samples in the same $P_{\mathrm{mass}}$ range. Table [2](https://arxiv.org/html/2605.22007#A4.T2) reports the comparison across all 18 models. The within-population effect ($d<0$) is uniform: SF samples have lower top-token mass on $c^{*}$ than correct samples, in 100% of models. Effect sizes range from $|d|=1.01$ (Qwen3.5-4B Base, $N_{\text{corr}}=12$) to $|d|=4.30$ (Qwen2.5-72B Inst), with median $|d|=1.93$.
Table 2: Within-population effect at $P_{\mathrm{mass}}(t_{c};c^{*})\geq 0.2$, comparing $\max_{v\in S_{c^{*}}}P_{\theta}(v)$ between first-token selection failures (greedy $\notin S_{c^{*}}$, “SF”) and correct samples in the same $P_{\mathrm{mass}}$ range. $d$: Cohen’s $d$ on Top1, SF vs. Corr (negative means SF $<$ Corr, the expected direction). $p$: Welch’s $t$-test $p$-value. The within-population effect is consistent across all 18 models ($d<0$ in 100%, $|d|\geq 1.0$, all $p<10^{-2}$).
Effect-size patterns. Two robust patterns emerge: (1) within Instruct models, $|d|$ grows monotonically with scale (Qwen Inst: 1.34 $\to$ 1.88 $\to$ 2.79 $\to$ 2.98 $\to$ 4.30 from 0.8B to 72B; Llama Inst: 2.18 $\to$ 2.73 $\to$ 2.80 $\to$ 2.97 from 1B to 70B); (2) Instruct models show larger $|d|$ than size-matched Base models in nearly every case (e.g., 9B: 2.98 Inst vs. 1.45 Base; 70B+: 2.97–4.30 Inst vs. 1.47–1.84 Base; 0.8B is the one exception, with both at $|d|=1.34$). Statistical significance is overwhelming throughout: all 18 models give $p<10^{-2}$, with $p<10^{-130}$ for the 13 models with $N_{\text{SF}},N_{\text{Corr}}\geq 100$. The Instruct–Base contrast in correct-sample top-token mass shows that the difference is one of distributional sharpness throughout, not specific to selection failures.
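The two statistics reported in Table 2 can be illustrated with a minimal sketch. It assumes the common pooled-standard-deviation form of Cohen's $d$ (the exact variant used in the paper is defined in its statistics primer, which is not reproduced here), and the input values are invented toy numbers rather than the paper's data. The function names `cohens_d` and `welch_t` are hypothetical.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d for a minus b, using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def welch_t(a, b):
    """Welch's t statistic for a minus b (no equal-variance assumption)."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Toy top-1 alias probabilities, invented for illustration (not the paper's data).
sf_top1 = [0.20, 0.25, 0.30, 0.28, 0.22]    # first-token selection failures
corr_top1 = [0.70, 0.80, 0.85, 0.75, 0.78]  # correct samples in the same P_mass range

d = cohens_d(sf_top1, corr_top1)
t = welch_t(sf_top1, corr_top1)
print(f"Cohen's d = {d:.2f}, Welch t = {t:.2f}")  # both negative when SF < Corr
```

A negative $d$ here corresponds to the "expected direction" in the table caption. Turning the Welch statistic into a $p$-value additionally requires the Welch–Satterthwaite degrees of freedom and a $t$-distribution CDF, which this sketch omits.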
## Appendix E Within-Concept and Between-Concept Mass Decomposition ($D_{2}$, $D_{3}$)
This appendix supports the “two faces of commitment failure” analysis in §[4.3](https://arxiv.org/html/2605.22007#S4.SS3). We measure two ratios on first-token selection failure (SF) samples:

$$D_{2}=\frac{\max_{v\in S_{c^{*}}}P_{\theta}(v\mid Q)}{P_{\mathrm{mass}}(t_{c};c^{*})}\quad\text{(within-concept top-1 share)}\tag{6}$$

$$D_{3}=\frac{P_{\theta}(\text{greedy})}{P_{\mathrm{mass}}(t_{c};c^{*})}\quad\text{(wrong-token dominance)}\tag{7}$$

$D_{2}$ measures how concentrated the correct concept’s mass is on its single most-probable alias token: $D_{2}\to 1$ means the mass has effectively collapsed onto one alias; lower $D_{2}$ indicates fragmentation across multiple alias forms. $D_{3}$ measures the wrong greedy token’s mass relative to the correct concept’s total: $D_{3}>1$ means the wrong token alone exceeds the entire correct concept.
Two failure modes. Together, $D_{2}$ and $D_{3}$ separate two structurally different mechanisms of commitment failure:
- *Fragmentation-driven* (low $D_{2}$, low $D_{3}$): correct mass is spread across alias surface forms (e.g., Figure [2](https://arxiv.org/html/2605.22007#S1.F2)); no single correct alias is large enough to win, but neither is the wrong token strongly dominant.
- *Wrong-attractor-driven* (high $D_{2}$, high $D_{3}$): correct mass has collapsed onto a single alias, but a wrong concept’s token is even sharper.
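As a concrete illustration of Eqs. (6) and (7), the sketch below computes $D_{2}$ and $D_{3}$ for one sample from a dictionary of next-token probabilities. The function names, the toy probabilities, and the two-way labelling rule are assumptions made for this example: the cut $D_{2}\geq 0.95$ mirrors the "effectively collapsed" criterion and $D_{3}>1$ mirrors the "wrong token alone exceeds the concept" condition, but the paper does not state that it classifies samples with exactly these thresholds.

```python
def mass_ratios(token_probs, alias_tokens, greedy_token):
    """Return (D2, D3) for one selection-failure sample.

    token_probs:  dict mapping token -> probability at the commitment step
    alias_tokens: set of first tokens belonging to the correct concept (S_{c*})
    greedy_token: the (wrong) token actually emitted by greedy decoding
    """
    alias_probs = [p for tok, p in token_probs.items() if tok in alias_tokens]
    p_mass = sum(alias_probs)
    if p_mass <= 0.0:
        raise ValueError("no probability mass on the correct concept")
    d2 = max(alias_probs) / p_mass            # within-concept top-1 share
    d3 = token_probs[greedy_token] / p_mass   # wrong-token dominance
    return d2, d3


def failure_mode(d2, d3, d2_cut=0.95, d3_cut=1.0):
    """Hypothetical labelling rule; the cut-offs are illustrative assumptions."""
    if d2 < d2_cut and d3 <= d3_cut:
        return "fragmentation-driven"
    if d2 >= d2_cut and d3 > d3_cut:
        return "wrong-attractor-driven"
    return "mixed"


aliases = {"Paris", " Paris", "paris"}

# Toy distributions (invented): same P_mass of about 0.30, different structure.
fragmented = {"Lyon": 0.25, "Paris": 0.12, " Paris": 0.10, "paris": 0.08}
collapsed = {"Lyon": 0.45, "Paris": 0.29, " Paris": 0.01}

frag_d2, frag_d3 = mass_ratios(fragmented, aliases, "Lyon")
coll_d2, coll_d3 = mass_ratios(collapsed, aliases, "Lyon")
print(failure_mode(frag_d2, frag_d3), failure_mode(coll_d2, coll_d3))
```

Both toy samples carry the same total mass on the correct concept, so the difference between them is visible only through the two ratios.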
Per-model statistics. Table [3](https://arxiv.org/html/2605.22007#A5.T3) reports median $D_{2}$, median $D_{3}$, and the fraction of SF samples with $D_{2}\geq 0.95$ (correct mass effectively collapsed) across all 18 models.
Table 3: Within- and between-concept mass ratios for SF samples across 18 models. $D_{2}$ measures how much of the correct concept’s mass is in its top-1 alias token. $D_{3}$ measures the wrong greedy token’s mass relative to the entire correct concept. Within Instruct models, $D_{2}$ grows with scale and the high-$D_{2}$ fraction climbs from 28% (1B) to 82% (72B); $D_{3}$ similarly grows from 1.13 to 1.65. Within Base models, both quantities stay comparatively flat. Sharpening collapses fragmentation in Instruct but strengthens wrong attractors in parallel.
Patterns. Within Llama, the contrast is clean: Instruct $D_{2}$ rises and saturates (0.76 $\to$ 0.98 $\to$ 0.97 $\to$ 0.99, monotonic apart from a 0.01 dip between 3B and 8B) while Base stays lower throughout (0.66 $\to$ 0.68 $\to$ 0.77 $\to$ 0.81), widening the Inst–Base gap from $+0.10$ at 1B to $+0.18$ at 70B. The $D_{2}\geq 0.95$ fraction climbs from 28.6% (Llama-1B Inst) to 66.5% (Llama-70B Inst), while Llama Base stays in 11.4%–31.9% across the entire range. Qwen2.5-72B shows the largest absolute Inst–Base gap ($+0.24$, Inst 1.00 vs. Base 0.76) and confirms the same pattern at the largest scale; for Qwen3.5 Base in the 2B–9B range, $N_{\text{SF}}$ is small (13–42), making medians noisy estimates. $D_{3}$ tracks the same trend: monotonic growth in Instruct (Llama 1.13 $\to$ 1.45; Qwen 1.13 $\to$ 1.65) and flat in Llama Base (0.99–1.13).
Connection to within-population $|d|$. The within-population effect ($d=2.98$ on Qwen-9B Inst, §[4.3](https://arxiv.org/html/2605.22007#S4.SS3)) measures the absolute top-1 alias probability difference between SF and matched-correct samples (0.26 vs. 0.78). $D_{2}$ normalizes this top-1 by $P_{\mathrm{mass}}(c^{*})$ to isolate within-concept structure, revealing that at Qwen-9B Inst the SF samples’ top-1 already accounts for $\sim$98% of the correct concept’s mass: the SF-Corr top-1 gap arises mostly from $P_{\mathrm{mass}}$ differences (SF samples have $P_{\mathrm{mass}}\approx 0.26$, correct samples $\approx 0.79$), not from fragmentation within the correct concept. This refines the §[4.3](https://arxiv.org/html/2605.22007#S4.SS3) narrative: in mid-to-large Instruct models, the structural difference between SF and matched-correct is the absolute mass on $c^{*}$ (and therefore on its top alias), with within-concept fragmentation playing a secondary role.
Connection to direct decoding intervention. Replacing greedy argmax with cluster-argmax over normalized top-50 tokens at $t=t_{c}$ recovers 5.7% of SF samples in Llama-70B Base and 6.7% in Qwen-72B Base, but only 2.4% and 1.8% in the corresponding Instruct models. The recovery rate is anti-correlated with $D_{2}$: when $D_{2}\to 1$ there is no within-concept aggregation to do at decoding time. This confirms the mechanism dichotomy and supports the future-work direction (§[5](https://arxiv.org/html/2605.22007#S5)) of concept-aware decoding for non-Instruct or smaller models, where fragmentation is recoverable.
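The cluster-argmax intervention can be sketched as follows. This is an illustrative reading of the description above, not the authors' implementation: it assumes that the top-$k$ probabilities are renormalized to sum to one, that tokens are grouped into disjoint concept clusters supplied by the caller, that any ungrouped token forms its own singleton cluster, and that the token emitted is the most probable member of the winning cluster. All names and numbers are hypothetical.

```python
def greedy_argmax(top_probs):
    """Standard greedy choice: the single most probable token."""
    return max(top_probs, key=top_probs.get)


def cluster_argmax(top_probs, clusters):
    """Pick the concept cluster with the most renormalized mass, then emit
    its most probable member token.

    top_probs: dict token -> probability for the top-k tokens
    clusters:  dict concept name -> set of tokens (assumed disjoint)
    """
    total = sum(top_probs.values())
    norm = {tok: p / total for tok, p in top_probs.items()}

    candidates = []  # (cluster mass, member tokens present in the top-k)
    clustered = set()
    for tokens in clusters.values():
        present = [tok for tok in tokens if tok in norm]
        if present:
            candidates.append((sum(norm[tok] for tok in present), present))
            clustered.update(present)
    for tok, p in norm.items():
        if tok not in clustered:  # ungrouped tokens compete as singletons
            candidates.append((p, [tok]))

    _, best_members = max(candidates, key=lambda c: c[0])
    return max(best_members, key=norm.get)


# Toy fragmented case (invented numbers): the wrong token wins token-level argmax.
top_probs = {"Lyon": 0.30, "Paris": 0.20, " Paris": 0.15, "paris": 0.10, "Nice": 0.05}
clusters = {"Paris": {"Paris", " Paris", "paris"}}

greedy_choice = greedy_argmax(top_probs)
cluster_choice = cluster_argmax(top_probs, clusters)
print(greedy_choice, "->", cluster_choice)
```

When the correct concept's mass already sits on one alias ($D_{2}\to 1$), the cluster mass equals the top alias probability and the two decoders agree, which is consistent with the low recovery rates reported for large Instruct models.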
## Appendix F Phrase-Level Commitment: $H_{t=2}$ Analysis on Multi-Token Divergences
This appendix supports the phrase-level sharpening claim in §[4.3](https://arxiv.org/html/2605.22007#S4.SS3). We restrict to multi-token divergences (commitment failures where greedy $y_{1}\in S_{c^{*}}$ but the final answer is wrong) and use the entropy of the next-token distribution $H_{t=2}$, conditioned on the emitted $y_{1}$, to probe whether commitment is finalized at $t=1$ (low $H_{t=2}$, deterministic continuation) or distributed across multiple tokens (high $H_{t=2}$, multiple candidate continuations).
Type A vs. Type B classification. For each multi-token divergence sample, we check whether the emitted bigram $(y_{1},y_{2})$ matches the start of any ground-truth alias of $c^{*}$ under the model’s tokenizer. *Type A*: the bigram matches an alias prefix, so the model is on a trajectory consistent with a valid surface form of $c^{*}$ at the first two tokens. *Type B*: the bigram does not match any alias, so $y_{1}$ lies in $S_{c^{*}}$ but the second token already diverges from any valid $c^{*}$ realization. Type A is the signature of phrase-level commitment to a $c^{*}$-aligned phrase at $t=1$.
A subtlety: Type A samples are by definition still hallucinations (final substring match against ground-truth aliases failed), so they are cases where the model’s bigram aligns with an alias prefix but the multi-token completion still diverges into a different entity than $c^{*}$. Common patterns include sharing a personal name with an unrelated figure (`George Washington Carver` when the answer is `George Washington`), sharing a place-name prefix with a different geographic entity (`Saint Petersburg Beach` when the answer is `Saint Petersburg`), or sharing an entity prefix with a related-but-distinct concept (`New York Times` when the answer is `New York`). In each case, the bigram is consistent with a valid alias of $c^{*}$ but the continuation commits to a wrong concept. By contrast, `Adam Lambert` for `Adam Smith` is Type B: the bigram `Adam Lambert` matches no alias of `Adam Smith`.
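The Type A / Type B check reduces to a prefix match on token sequences. The sketch below is a simplified illustration: it uses whitespace splitting as a stand-in for the model's tokenizer (real subword tokenizers will split these strings differently), treats aliases shorter than two tokens as unable to match a bigram, and adds one invented alias (`President Washington`) purely for the example. The function names are hypothetical.

```python
def classify_divergence(y1, y2, alias_token_seqs):
    """Return "A" if the emitted bigram (y1, y2) is the start of some alias
    token sequence of the correct concept, otherwise "B"."""
    for seq in alias_token_seqs:
        # Assumption: an alias with fewer than two tokens cannot match a bigram.
        if len(seq) >= 2 and seq[0] == y1 and seq[1] == y2:
            return "A"
    return "B"


def word_tokens(text):
    """Stand-in tokenizer for illustration: one token per whitespace-separated word."""
    return text.split()


# Alias lists are illustrative; real ones come from the dataset's answer aliases.
washington_aliases = [word_tokens(a) for a in ["George Washington", "President Washington"]]
smith_aliases = [word_tokens(a) for a in ["Adam Smith"]]

y1, y2 = word_tokens("George Washington Carver")[:2]
type_for_carver = classify_divergence(y1, y2, washington_aliases)

y1, y2 = word_tokens("Adam Lambert")[:2]
type_for_lambert = classify_divergence(y1, y2, smith_aliases)

print(type_for_carver, type_for_lambert)
```

In both toy cases the first token alone is in $S_{c^{*}}$; only the second token separates the two types.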
Per-model results. Table [4](https://arxiv.org/html/2605.22007#A6.T4) reports Type A fraction and $H_{t=2}$ statistics across 18 models. Two patterns are robust:
1. Type A has substantially lower $H_{t=2}$ than Type B, in 100% of models (median Cohen’s $d=1.29$, range $[0.71,1.87]$). When the bigram aligns with an alias prefix, the continuation is near-deterministic.
2. Type A fraction grows with both scale and instruction tuning. Small Base: 21–53%. Small Instruct: 36–72%. Large Base (70B+): 59–74%. Large Instruct (70B+): 77–83%.
Table 4: Multi-token divergence diagnostic across 18 models. $N_{\text{Multi}}$: number of multi-token divergences. Type A frac.: fraction where the bigram $(y_{1},y_{2})$ matches a valid alias prefix. $H_{t=2}$ A / B: mean entropy at $t=2$ for Type A / Type B divergences. $d_{B-A}$: Cohen’s $d$ comparing $H_{t=2}$ Type B vs. Type A. The $d_{B-A}$ for Qwen3.5-9B Base is undefined because $N=4$ Type B samples is too small.
Pooled meta-analysis. Across all 18 models, the Type B vs. Type A $H_{t=2}$ comparison gives a median Cohen’s $d=1.29$, with $d>0$ in 100% of models with sufficient $N$ and a Fisher-combined $p<10^{-200}$. Phrase-level commitment is real and pervasive: when the bigram aligns with an alias prefix (Type A), the continuation is near-deterministic; when it does not (Type B), the second token retains substantial entropy across multiple candidates.
Interpretation. The 70B+ Instruct numbers are striking: Llama-3.1-70B Instruct has $H_{t=2}=0.10$ on Type A divergences and Qwen2.5-72B Instruct has $H_{t=2}=0.05$. These values approach the entropy of deterministic continuations: the model has committed to a specific multi-token phrase already at $t=1$, with the second token essentially predetermined. This is the phrase-level analog of the first-token sharpening documented in §[4.3](https://arxiv.org/html/2605.22007#S4.SS3): instruction-induced sharpening operates not just at the token level but across multi-token phrase commitments, and it grows monotonically with scale.
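To give a sense of scale for entropies of 0.05–0.10, the short sketch below evaluates Shannon entropy on two toy next-token distributions. It assumes entropy is measured in nats (natural logarithm); the unit is not stated in this appendix, so the comparison is indicative only.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Toy distributions over the second token (invented for illustration).
near_deterministic = entropy([0.99, 0.01])  # one continuation holds 99% of the mass
two_way_split = entropy([0.5, 0.5])         # two equally likely continuations

print(f"{near_deterministic:.3f} nats vs {two_way_split:.3f} nats")
```

Under the nats assumption, a distribution with 99% of its mass on a single continuation already has entropy of roughly 0.056, so values in the reported range are what a nearly predetermined second token would produce.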
Caveat on $P_{\mathrm{mass}}$ interpretation. Within multi-token divergences, $P_{\mathrm{mass}}(t=1;c^{*})\geq 0.2$ is consistent with two distributional pictures: (i) the model placed mass on $c^{*}$-aligned alias prefixes at $t=1$ as part of a phrase-level commitment to a specific multi-token continuation (Type A); (ii) the model placed mass on alias first tokens that are also the start of competing concepts’ phrases (Type B; e.g., generic `Sir`, `Saint` shared across many entities). $P_{\mathrm{mass}}$ does not distinguish these, but the within-population analysis in §[4.3](https://arxiv.org/html/2605.22007#S4.SS3) (which restricts to first-token selection failures) does not depend on this distinction.
## Appendix G Robustness of CF% to the 0.2 Threshold
The 0.2 threshold defining commitment failures is conservative and somewhat arbitrary. Table [5](https://arxiv.org/html/2605.22007#A7.T5) reports CF% at thresholds $\{0.1,0.2,0.3,0.4\}$ across the scale ablation: the absolute level shifts but the monotonic increase with model size is preserved at all thresholds.
Table 5: CF% across thresholds $\theta\in\{0.1,0.2,0.3,0.4\}$ defining commitment failures as hallucinated samples with $P_{\mathrm{mass}}(t_{c};c^{*})\geq\theta$. The absolute level shifts with threshold, but the monotonic increase with model size is preserved at every column.
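The threshold sweep behind Table 5 amounts to recomputing one ratio at several cut-offs. The sketch below assumes CF% is the percentage of hallucinated samples whose $P_{\mathrm{mass}}$ meets the threshold, which matches the caption's definition; the record layout (`correct`, `p_mass`) and the sample values are hypothetical.

```python
def cf_rate(samples, threshold):
    """Percentage of hallucinated samples with P_mass >= threshold."""
    hallucinated = [s for s in samples if not s["correct"]]
    if not hallucinated:
        return 0.0
    failures = sum(1 for s in hallucinated if s["p_mass"] >= threshold)
    return 100.0 * failures / len(hallucinated)


# Toy per-sample records (invented): correctness flag and P_mass at the commitment step.
samples = [
    {"correct": True, "p_mass": 0.90},
    {"correct": False, "p_mass": 0.05},
    {"correct": False, "p_mass": 0.15},
    {"correct": False, "p_mass": 0.25},
    {"correct": False, "p_mass": 0.35},
    {"correct": False, "p_mass": 0.45},
]

rates = {threshold: cf_rate(samples, threshold) for threshold in (0.1, 0.2, 0.3, 0.4)}
print(rates)
```

Raising the threshold can only shrink the qualifying set, so CF% is non-increasing in $\theta$ for any fixed model; the substantive claim in the table is that the ordering across model sizes survives at each threshold.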
## Appendix H Long-Form Generation: $P_{\mathrm{mass}}$ Trajectories
This appendix supports the discussion in §[4.1](https://arxiv.org/html/2605.22007#S4.SS1) on how $P_{\mathrm{mass}}$ behaves when the commitment step is not at $t=1$. We use Qwen3.5-9B Instruct on TriviaQA + NQ-Open and re-prompt each question with a long-form instruction (`Answer the following question in a complete sentence.`). The model now produces an answer like `The capital of France is Paris.` rather than `Paris.` The commitment step $t_{c}$ is identified manually as the position of the answer entity within the generated sentence (e.g., `Paris` in the example above), and we align trajectories around $t_{c}$. We pre-screen samples to those whose long-form output contains a unique unambiguous reference to a single concept (correct or wrong) to make $t_{c}$ well-defined.
Figure [6](https://arxiv.org/html/2605.22007#A8.F6) shows the resulting $P_{\mathrm{mass}}(t;c^{*})$ trajectories. Correct samples have a sharp peak in $P_{\mathrm{mass}}$ at $t_{c}$ (typically 0.6–0.9) and near-zero mass before and after, consistent with $P_{\mathrm{mass}}$ measuring the model’s commitment to the correct concept at the moment of emission. Hallucinated samples sit near zero at all aligned steps: the model never put substantial mass on $c^{*}$ in the trajectory. This is the long-form analog of “not-CF” hallucinations: cases where the model truly does not know the answer, distinct from CF/SF samples in short-form QA. Long-form $P_{\mathrm{mass}}$ trajectories under the right prompting can therefore separate “model never had it” from “model had it but emitted something else,” but the cleaner signal in our paper comes from instruction-tuned short-form QA where $t_{c}=1$ is fixed.
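Aligning trajectories around $t_{c}$ can be sketched as re-indexing each per-step series by its offset from the commitment step and then averaging per offset. The helper names, the 0-indexed positions, the window radius, and the toy values are assumptions for illustration; in the paper $t_{c}$ is identified manually.

```python
def align_to_commitment(trajectory, t_c, radius=2):
    """Re-index a per-step series by offset from the commitment step t_c
    (0-indexed), keeping only offsets that fall inside the sequence."""
    return {
        offset: trajectory[t_c + offset]
        for offset in range(-radius, radius + 1)
        if 0 <= t_c + offset < len(trajectory)
    }


def mean_by_offset(windows):
    """Average aligned windows offset by offset."""
    sums, counts = {}, {}
    for window in windows:
        for offset, value in window.items():
            sums[offset] = sums.get(offset, 0.0) + value
            counts[offset] = counts.get(offset, 0) + 1
    return {offset: sums[offset] / counts[offset] for offset in sorted(sums)}


# Toy P_mass trajectories for two correct long-form answers (invented values).
windows = [
    align_to_commitment([0.0, 0.01, 0.02, 0.80, 0.01], t_c=3, radius=1),
    align_to_commitment([0.0, 0.70, 0.0], t_c=1, radius=1),
]
profile = mean_by_offset(windows)
print(profile)  # peak expected at offset 0 for correct samples
```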
For comparison, Figure [7](https://arxiv.org/html/2605.22007#A8.F7) shows the analogous trajectory of the generated-token probability $P(y_{t})$. Both correct and hallucinated trajectories are at $\sim$0.85–0.95 throughout, with only a small dip at $t_{c}$. Token-level confidence carries little of the correct/hallucinated signal that $P_{\mathrm{mass}}$ reveals, which is the central methodological point of the paper transferred to the long-form setting: the relevant signal lives at the level of concept-grouped mass, not individual-token entropy.
Figure 6: $P_{\mathrm{mass}}(t;c^{*})$ trajectories under long-form prompting (`Answer in a complete sentence`), Qwen3.5-9B Instruct. Top: correct samples; bottom: hallucinated. Dashed lines mark $t_{c}$ (the position of the answer entity within the generated sentence). Correct samples: $P_{\mathrm{mass}}$ is near-zero before $t_{c}$, peaks at $t_{c}$, collapses afterward. Hallucinated samples: essentially no mass on $c^{*}$ at any step.
Figure 7: Generated-token probability $P(y_{t})$ aligned to $t_{c}$ in long-form generation. Both correct and hallucinated trajectories sit at $\sim$0.85–0.95 throughout, with only a small dip at $t_{c}$ ($\sim$0.78 vs. 0.89). Token-level confidence carries little of the correct/hallucinated signal that $P_{\mathrm{mass}}$ reveals (cf. Figure [3](https://arxiv.org/html/2605.22007#S4.F3)b).
## Appendix I $S_{c}$ Construction
$S_{c}$ is constructed deterministically with no LLM involvement:
- Alias collection. TriviaQA: `answer.value`, `answer.aliases`, `answer.normalized_aliases`. NQ-Open: all entries in `answer`. Typically 5–20 aliases per question.
- Lexical variants. For each alias $a$, six variants: original, lowercase, capitalized, and the same three with a leading space. We exclude `upper()` variants (single capital letters appear in many unrelated $S_{c}$) and newline-prefixed variants (the `\n` token appears in all $S_{c}$).
- First-token extraction. Each variant is tokenized with `add_special_tokens=False`; the first token ID is added to $S_{c}$.
- Deduplication. Final $|S_{c}|$ is typically 12–20.
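The four steps above can be sketched as a single function. This is a minimal illustration under stated assumptions: the tokenizer is passed in as a plain callable from string to token IDs, "capitalized" is read as Python's `str.capitalize()` (which also lowercases the remaining characters; the paper may intend a different rule), and the toy tokenizer exists only so the example runs without external dependencies. The names `build_first_token_set` and `toy_tokenize` are hypothetical.

```python
import re


def build_first_token_set(aliases, tokenize):
    """Collect the first token ID of six lexical variants of every alias.

    tokenize: callable str -> list of token IDs, without special tokens.
    """
    first_tokens = set()  # a set deduplicates IDs automatically
    for alias in aliases:
        base_forms = [alias, alias.lower(), alias.capitalize()]
        # Same three forms again with a leading space; no upper() or newline variants.
        for variant in base_forms + [" " + form for form in base_forms]:
            token_ids = tokenize(variant)
            if token_ids:
                first_tokens.add(token_ids[0])
    return first_tokens


_toy_vocab = {}


def toy_tokenize(text):
    """Stand-in tokenizer: one ID per word, with a leading space kept as part
    of the word, loosely imitating how subword vocabularies mark word starts."""
    pieces = re.findall(r" ?\S+", text)
    return [_toy_vocab.setdefault(piece, len(_toy_vocab)) for piece in pieces]


s_c = build_first_token_set(["Paris", "City of Light"], toy_tokenize)
print(len(s_c))
```

With a Hugging Face tokenizer, the callable could plausibly be `lambda s: tokenizer.encode(s, add_special_tokens=False)`, which matches the flag named in the list above.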
## Appendix J Multi-Layer Probing
Table 6: Multi-layer probe AUROC (MCQA, logistic regression, 5-fold CV).
Figure 8: Layer-wise probe AUROC (Qwen3.5). Instruct (blue) $>$ Base (orange) at nearly every layer; gap peaks at mid-layers.
The first-layer Instruct $>$ Base difference (+0.01 to +0.04) rules out a purely output-formatting explanation. Mid-layer peaks ($\approx +0.06$ to +0.12 gap) indicate that instruction tuning amplifies correctness encoding most strongly in intermediate representations.
## Appendix K Pre- vs. Post-Generation Probes
Table 7: Pre-gen vs. post-gen probe AUROC (MCQA, last layer, $N=1{,}000$).
## Appendix L Extended Tables
Table 8: First-token behavior at $t=1$ (Short-QA). Instruct: $t_{c}=1$; Base: $t_{c}\geq 2$ due to a leading filler.
Table 9: Full scale ablation (Short-QA, $N=3{,}000$). TokP / $P_{\mathrm{mass}}$: AUROC for generated-token probability and $P_{\mathrm{mass}}(t=1)$. $\mathrm{Spread}$: average $\mathrm{Spread}(c^{*})$. CF%: commitment-failure rate.
All 18 models (14 small + 4 large) now have full metric coverage; CF% values are verified against the data dump in §[4.2](https://arxiv.org/html/2605.22007#S4.SS2).
Table 10: $P_{\mathrm{mass}}(t=1)$ vs. generated-token probability (AUROC) with calibration. $P_{\mathrm{mass}}$ recovers a coherent confidence signal across all 14 models; this is the prerequisite (not the headline) for the commitment-failure analysis.
Figure 9: $P_{\mathrm{mass}}(t=1)$ calibration for all 14 models. Accuracy increases monotonically across $P_{\mathrm{mass}}$ bins (Instruct ECE 0.023–0.096).
## Appendix M Aggregation: Probability Space vs. Log Space
A natural question is how $P_{\mathrm{mass}}$ relates to standard sequence-level uncertainty estimators that aggregate per-step quantities across the full trajectory, typically in log space (mean log-probability, length-normalized NLL). For multi-token concepts these correspond to two different ways of forming a per-sequence score:
- *Probability-space first-step:* $P_{\mathrm{mass}}(t=1;c^{*})$, our default.
- *Log-space full sequence:* $\frac{1}{T}\sum_{t=1}^{T}\log P_{\theta}(y_{t}\mid Q,y_{<t})$, the standard length-normalized log-likelihood (LN-NLL).
These differ in two ways: probability vs. log space, and single-step vs. trajectory-averaged. The two differences combine to give a sharp empirical contrast. On Qwen3.5-9B Instruct Short-QA ($N=3{,}000$), $P_{\mathrm{mass}}(t=1;c^{*})$ achieves AUROC 0.887 while LN-NLL achieves 0.806 (the same gap as $P_{\mathrm{mass}}$ vs. TokProb in Table [10](https://arxiv.org/html/2605.22007#A12.T10)). Decomposing the gap by changing one factor at a time:
Two observations. First, the single-step $\to$ full-sequence drop is large in probability space (0.887 $\to$ 0.738) but small in log space (0.812 $\to$ 0.806): aggregating downstream tokens dilutes the probability-space signal because most steps are deterministic continuations whose $P_{\mathrm{mass}}$ is essentially zero. Second, at the commitment step itself, probability space ($P_{\mathrm{mass}}$) outperforms log space ($\log P$) by 0.075 AUROC, because the relevant quantity at the commitment step is the total mass on the correct concept’s surface forms, which is additive in probability, not in log-probability. The same picture holds for the relationship between greedy token probability and its log form: the probability is what is being compared by the argmax, so it is the natural quantity to inspect at the moment of commitment. We use probability space throughout the paper for this reason.
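The two scores being contrasted can be written down in a few lines. The sketch below is illustrative only: the function names and the toy probabilities are assumptions, and it shows the definitions rather than reproducing any AUROC figure.

```python
import math


def p_mass(step_probs, alias_tokens):
    """Probability-space, single-step score: total mass on the concept's aliases."""
    return sum(p for tok, p in step_probs.items() if tok in alias_tokens)


def mean_log_prob(generated_probs):
    """Log-space, full-sequence score: length-normalized log-likelihood of the
    tokens that were actually generated."""
    return sum(math.log(p) for p in generated_probs) / len(generated_probs)


# Toy first-step distribution (invented): the concept's mass is split over three aliases.
first_step = {"Paris": 0.30, " Paris": 0.25, "paris": 0.20, "Lyon": 0.15}
aliases = {"Paris", " Paris", "paris"}

concept_mass = p_mass(first_step, aliases)             # adds alias probabilities
greedy_log_prob = math.log(max(first_step.values()))   # sees only the emitted token
sequence_score = mean_log_prob([0.30, 0.99, 0.99])     # near-deterministic tail tokens

print(f"P_mass = {concept_mass:.2f}, log P(greedy) = {greedy_log_prob:.2f}, "
      f"mean log-prob = {sequence_score:.2f}")
```

In this toy case the concept holds 0.75 of the mass even though the emitted token has probability 0.30, and averaging in two near-deterministic continuation steps pulls the sequence-level score toward zero regardless of what happened at the commitment step.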
## Appendix N Compute Resources
All experiments were run on NVIDIA GPUs. We did not perform any model fine-tuning; all compute consists of forward passes only. Per-model wall-clock cost scales with model size:
- Sub-10B models (Qwen3.5-0.8B/2B/4B/9B, Llama-3.2-1B/3B, Llama-3.1-8B): single NVIDIA A100 (80GB), fp16, batch size 8; 0.5–2 hours per (model, dataset) combination.
- 72B/70B models (Qwen2.5-72B, Llama-3.1-70B): single NVIDIA B200 (180GB), fp16, batch size 8; 4–8 hours per (model, dataset) combination. The B200’s larger memory accommodates the 70B+ models without tensor parallelism.
The full 18-model evaluation across TriviaQA + NQ-Open (3,000 samples per dataset, top-50 token probabilities saved at the commitment step) totals approximately 53,000 forward passes. The Phase 2 within-concept and between-concept ratio analysis ($D_{2}$, $D_{3}$ in Appendix [E](https://arxiv.org/html/2605.22007#A5)) is a pure offline post-processing of the saved top-50 probabilities and required no additional GPU compute.
The probing experiments (Appendix [J](https://arxiv.org/html/2605.22007#A10)) use logistic regression on saved hidden states; each probe trains in under one minute on CPU.
## Appendix O Broader Impacts
This paper provides an analytical characterization of LLM hallucination structure: when a model places non-trivial mass on the correct concept yet emits a wrong final answer. The work introduces no new models, datasets, or deployable methods; the contribution is a probe and an empirical characterization. Potential positive societal impact: a clearer mechanistic account of confident hallucination may inform safer deployment and calibration practices, particularly in high-stakes settings where users treat fluency as evidence of reliability. We do not foresee direct negative societal impact from the analysis itself, beyond the general concern that deeper understanding of model failures could in principle inform adversarial exploitation; however, the analytical probe itself is not a generation or control method and does not enable any new attack surface.
## NeurIPS Paper Checklist
1. 1\.Claims
2. Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
3. Answer:\[Yes\]
4. Justification: The abstract and introduction state that we \(i\) introduce per\-step semantic probability massPmassP\_\{\\mathrm\{mass\}\}as an analytical probe over equivalence classes of token completions, \(ii\) show that 16–47% of Instruct hallucinations occur with substantial mass on the correct concept and that the rate rises monotonically with scale, and \(iii\) characterize the underlying mechanism as instruction\-induced sharpening at three structural levels\. Each claim is empirically supported in §[4](https://arxiv.org/html/2605.22007#S4)\(commitment\-failure rates in Table[1](https://arxiv.org/html/2605.22007#S4.T1); sharpening at three levels in §[4\.3](https://arxiv.org/html/2605.22007#S4.SS3); pre\-generation hidden\-state signal in §[4\.4](https://arxiv.org/html/2605.22007#S4.SS4)\)\.
5. 2\.Limitations
6. Question: Does the paper discuss the limitations of the work performed by the authors?
7. Answer:\[Yes\]
8. Justification: §[5](https://arxiv.org/html/2605.22007#S5)\(Limitations paragraph\) discusses \(i\) the dependence ofPmassP\_\{\\mathrm\{mass\}\}onScS\_\{c\}construction and the implications when alias completeness is imperfect, \(ii\) the restriction to first\-token commitment and the migration oftct\_\{c\}in long\-form generation, \(iii\) the analytical\-probe \(not detector\) status ofPmassP\_\{\\mathrm\{mass\}\}, and \(iv\) the model scope \(Qwen and Llama families up to 72B; no closed\-source or non\-English models tested\)\.
9. 3\.Theory assumptions and proofs
10. Question: For each theoretical result, does the paper provide the full set of assumptions and a complete \(and correct\) proof?
11. Answer:\[Yes\]
12. Justification: The paper has one formal result, Proposition[1](https://arxiv.org/html/2605.22007#Thmproposition1), relatingPmassP\_\{\\mathrm\{mass\}\}to concept belief under a latent\-concept generation model\. Assumptions \(alias completeness and concept separation\) are stated explicitly in §[3](https://arxiv.org/html/2605.22007#S3); the full proof is in Appendix[A](https://arxiv.org/html/2605.22007#A1)\.
13. 4\.Experimental result reproducibility
14. Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper \(regardless of whether the code and data are provided or not\)?
15. Answer:\[Yes\]
16. Justification: All 18 models are publicly available on HuggingFace \(Qwen3\.5/Qwen2\.5\[Yanget al\.,[2025](https://arxiv.org/html/2605.22007#bib.bib32),[2024](https://arxiv.org/html/2605.22007#bib.bib31)\], Llama\-3\.2/3\.1\[Grattafioriet al\.,[2024](https://arxiv.org/html/2605.22007#bib.bib33)\]\); all four datasets \(TriviaQA, NQ\-Open, MMLU, ARC\-Challenge\) are public\. Sample counts, prompts, decoding settings \(greedy\),Sc∗S\_\{c^\{\*\}\}construction \(Appendix[I](https://arxiv.org/html/2605.22007#A9)\), and metric definitions are specified in §[3](https://arxiv.org/html/2605.22007#S3)\.
17. 5\.Open access to data and code
18. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
19. Answer:\[Yes\]
20. Justification: Anonymized code is provided in the supplementary material as a zip archive, including scripts to reproduce the main results \(commitment\-failure rate computation, within\-population analysis,Ht=2H\_\{t=2\}classification,D2D\_\{2\}/D3D\_\{3\}measurements\)\. All datasets are publicly available \(TriviaQA, NQ\-Open, MMLU, ARC\-Challenge\); models are publicly released on HuggingFace\.
21. 6\.Experimental setting/details
22. Question: Does the paper specify all the training and test details \(e\.g\., data splits, hyperparameters, how they were chosen, type of optimizer\) necessary to understand the results?
23. Answer:\[Yes\]
24. Justification: §[3](https://arxiv.org/html/2605.22007#S3)specifies all 18 models with sizes and variants, datasets and sample counts \(3,000 short\-QA per model; 2,672 MCQA per model\), decoding \(greedy\), quantization \(4\-bit NF4 for Qwen3\.5 sub\-10B; fp16 for the rest\), probe protocol \(5\-fold CV logistic regression\), and metric definitions \(AUROC, ECE, Brier\)\.
25. 7\.Experiment statistical significance
26. Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
27. Answer:\[Yes\]
28. Justification: Within\-population comparisons report Welch’stt\-test statistics,pp\-values, and Cohen’sddeffect sizes for all 18 models \(§[4\.3](https://arxiv.org/html/2605.22007#S4.SS3), Appendix Table[2](https://arxiv.org/html/2605.22007#A4.T2)\); theHt=2H\_\{t=2\}multi\-token analysis reports Cohen’sddacross the same 18 models \(Appendix[F](https://arxiv.org/html/2605.22007#A6)\); a primer on these statistics is in Appendix[B](https://arxiv.org/html/2605.22007#A2)\. All experiments use deterministic greedy decoding, so there is no run\-to\-run sampling variance to capture with confidence intervals; the variability captured by Welch’sttis the relevant within\-sample distribution variability\.
29. 8\.Experiments compute resources
30. Question: For each experiment, does the paper provide sufficient information on the computer resources \(type of compute workers, memory, time of execution\) needed to reproduce the experiments?
31. Answer:\[Yes\]
32. Justification: Compute resources are described in Appendix[N](https://arxiv.org/html/2605.22007#A14)\(single A100 80GB for sub\-10B models; single B200 180GB for 70B\+ models; all forward\-pass\-only experiments\)\.
33. 9\.Code of ethics
35. Answer:\[Yes\]
36. Justification: The work analyzes publicly released LLMs on standard public QA benchmarks\. No human subjects, no scraped data, no new models or datasets released\.
37. 10\.Broader impacts
38. Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
39. Answer:\[Yes\]
40. Justification: Potential positive and negative societal impacts are discussed in Appendix[O](https://arxiv.org/html/2605.22007#A15)\.
41. 11\.Safeguards
42. Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse \(e\.g\., pre\-trained language models, image generators, or scraped datasets\)?
43. Answer:\[N/A\]
44. Justification: No new models, datasets, or scraped data are released\.
45. 12\. Licenses for existing assets
46. Question: Are the creators or original owners of assets \(e\.g\., code, data, models\), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
47. Answer: \[Yes\]
48. Justification: Models used: Qwen3\.5/Qwen2\.5 \(Apache 2\.0; \[Yang et al\., [2025](https://arxiv.org/html/2605.22007#bib.bib32), [2024](https://arxiv.org/html/2605.22007#bib.bib31)\]\) and Llama\-3\.1/3\.2 \(Llama Community License; \[Grattafiori et al\., [2024](https://arxiv.org/html/2605.22007#bib.bib33)\]\)\. Datasets: TriviaQA \[Joshi et al\., [2017](https://arxiv.org/html/2605.22007#bib.bib27)\], NQ\-Open \[Kwiatkowski et al\., [2019](https://arxiv.org/html/2605.22007#bib.bib28)\], MMLU \[Hendrycks et al\., [2021](https://arxiv.org/html/2605.22007#bib.bib29)\], ARC\-Challenge \[Clark et al\., [2018](https://arxiv.org/html/2605.22007#bib.bib30)\], all distributed under permissive academic\-use licenses\.
49. 13\. New assets
50. Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
51. Answer: \[N/A\]
52. Justification: No new models, datasets, or other assets are released\.
53. 14\. Crowdsourcing and research with human subjects
54. Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation \(if any\)?
55. Answer: \[N/A\]
56. Justification: No crowdsourcing or human subjects\.
57. 15\. Institutional review board \(IRB\) approvals or equivalent for research with human subjects
58. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board \(IRB\) approvals \(or an equivalent approval/review based on the requirements of your country or institution\) were obtained?
59. Answer: \[N/A\]
60. Justification: No human subjects\.
61. 16\. Declaration of LLM usage
62. Question: Does the paper describe the usage of LLMs if it is an important, original, or non\-standard component of the core methods in this research?
63. Answer: \[N/A\]
64. Justification: LLMs are the subject of study but are not used as a methodological component for core method development\.
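
The within\-population statistics named in the statistical\-significance justification above \(Welch's $t$ and Cohen's $d$\) can be illustrated with a short sketch\. This is not the authors' code: the function names `welch_t` and `cohens_d`, the pooled\-standard\-deviation form of Cohen's $d$, and the toy samples are all assumptions made for illustration; the paper's own definitions are given in its Appendix B\.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Unlike Student's t-test, this does not assume equal variances.
    """
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    se2 = va / na + vb / nb            # squared standard error of the mean difference
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation (an assumed variant)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


# Hypothetical per-example scores for two groups (not data from the paper).
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]

t, df = welch_t(group_a, group_b)
d = cohens_d(group_a, group_b)
print(f"t = {t:.3f}, df = {df:.2f}, d = {d:.2f}")  # t = -1.897, df = 5.88, d = -1.20
```

Converting $t$ and the degrees of freedom into a $p$\-value requires the Student $t$ distribution, which the Python standard library does not provide; a statistics package would be needed for that step\.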