Understanding Why Language Models Hallucinate: Testing Reasoning Against Priors

arXiv cs.CL 07/02/26, 04:00 AM Papers
hallucination language-models inference-misalignment diagnostic-testbed reasoning priors ai-reliability
Summary
This paper studies why language models hallucinate, proposing that hallucinations often stem from biased latent inference (inference misalignment) rather than missing knowledge. It introduces TrapQA, a controlled diagnostic testbed to test reasoning against priors, and demonstrates that hallucinations can arise from misleading latent associations.
arXiv:2607.00447v1 Announce Type: new Abstract: Large language models often produce hallucinated answers that violate prompt-level constraints. A key diagnostic question is whether these failures reflect missing knowledge, or whether the model has the relevant information but follows the wrong inference path. We study this phenomenon as inference misalignment: a mismatch between the answer supported by the prompt and the answer favored by statistically salient latent associations. We formalize this view with a latent key-task model, in which pretraining-frequency imbalance can cause a shortcut path to dominate the constraint-sensitive path and induce positive inference loss. The framework predicts two failure modes: task-retrieval bias in entity disambiguation and key-selection bias in action choice. We introduce TrapQA, a controlled diagnostic testbed with two components. ScientistQA tests disambiguation among similar scientists with supplementary factual probes, while Real-Life Constrained QA tests everyday constraint following under salient shortcuts. Our results show that hallucination can arise from biased latent inference rather than absent knowledge alone.
# Understanding Why Language Models Hallucinate: Testing Reasoning Against Priors
Source: [https://arxiv.org/html/2607.00447](https://arxiv.org/html/2607.00447)
Yangfan Hu∗, Xuhan Tong∗, Haoyue Bai∗†, Xi Ding, Shashank Muralidhar Bharadwaj, Siyang Cao, Robert Nowak, Jiawei Zhang†

University of Wisconsin–Madison

[Project page](https://neohughus.github.io/Understanding_Why_Language_Models_Hallucinate/)

∗Equal contribution. Author order was randomly determined. †Correspondence to: {haoyue.bai, jiawei.zhang}@wisc.edu.

###### Abstract

Large language models often produce hallucinated answers that violate prompt-level constraints. A key diagnostic question is whether these failures reflect missing knowledge, or whether the model has the relevant information but follows the wrong inference path. We study this phenomenon as *inference misalignment*: a mismatch between the answer supported by the prompt and the answer favored by statistically salient latent associations. We formalize this view with a latent key–task model, in which pretraining-frequency imbalance can cause a shortcut path to dominate the constraint-sensitive path and induce positive inference loss. The framework predicts two failure modes: task-retrieval bias in entity disambiguation and key-selection bias in action choice. We introduce TrapQA, a controlled diagnostic testbed with two components. ScientistQA tests disambiguation among similar scientists with supplementary factual probes, while Real-Life Constrained QA tests everyday constraint following under salient shortcuts. Our results show that hallucination can arise from biased latent inference rather than absent knowledge alone.


## 1 Introduction

Large language models (LLMs) have achieved strong performance across many tasks (OpenAI et al., [2024](https://arxiv.org/html/2607.00447#bib.bib66); Team et al., [2025](https://arxiv.org/html/2607.00447#bib.bib67); DeepSeek-AI et al., [2025](https://arxiv.org/html/2607.00447#bib.bib68); Grattafiori et al., [2024](https://arxiv.org/html/2607.00447#bib.bib69)). They are also increasingly integrated with tools and agentic workflows, such as web search and external services (Nakano et al., [2022](https://arxiv.org/html/2607.00447#bib.bib70); Liu et al., [2023](https://arxiv.org/html/2607.00447#bib.bib71); Steinberger, [2026](https://arxiv.org/html/2607.00447#bib.bib65)). As model outputs become more tightly coupled to real-world actions, hallucination remains a central reliability risk.

Hallucination broadly refers to fluent but factually incorrect, unsupported, or context-unfaithful outputs (Ji et al., [2023](https://arxiv.org/html/2607.00447#bib.bib73); Huang et al., [2025](https://arxiv.org/html/2607.00447#bib.bib72)). Such errors are difficult to detect when models sound confident or when users lack domain expertise, and they can be amplified in agentic settings through downstream tool calls or transactions. Importantly, hallucinations can arise even under benign inputs, making them an intrinsic reliability problem rather than only a failure under adversarial attack (Zhang et al., [2025b](https://arxiv.org/html/2607.00447#bib.bib74); Huang et al., [2025](https://arxiv.org/html/2607.00447#bib.bib72)).

Prior work studies hallucination through training-data bias, decoding dynamics, and attribution or mechanistic analysis (Dziri et al., [2022](https://arxiv.org/html/2607.00447#bib.bib5); Zhang et al., [2023](https://arxiv.org/html/2607.00447#bib.bib6); Sun et al., [2025b](https://arxiv.org/html/2607.00447#bib.bib4); Gao et al., [2025](https://arxiv.org/html/2607.00447#bib.bib7)). Existing evaluations such as TruthfulQA and HaluEval measure important aspects of truthfulness and hallucination behavior (Lin et al., [2022](https://arxiv.org/html/2607.00447#bib.bib8); Li et al., [2023](https://arxiv.org/html/2607.00447#bib.bib12)). However, a central mechanistic question remains underexplored: when a model fails, did it lack the needed knowledge, or did it possess the relevant facts but retrieve and apply the wrong inference path?

We address this question by interpreting hallucination as *inference misalignment*: a mismatch between the answer logically supported by the prompt and the answer favored by statistically salient learned associations. In our framework, a prompt activates latent key–task paths. A model hallucinates when a high-frequency shortcut path receives greater posterior weight than the constraint-sensitive path required by the prompt. This view predicts that errors can occur even when the relevant facts or constraints are available: the failure lies not only in stored knowledge, but in selecting and composing the appropriate inference path.

Guided by this theory, we introduce TrapQA, a closed-book diagnostic benchmark suite with two complementary settings. ScientistQA targets *task-retrieval bias*, where a salient entity–relation association overrides a discriminative constraint. Real-Life Constrained QA targets *key-selection bias*, where a superficially salient cue dominates the task-relevant constraint. These settings are designed not as broad-coverage benchmarks, but as controlled tests of the latent key–task framework. External tools, including web search, are disabled in all evaluations.

#### ScientistQA\.

ScientistQA tests constraint-sensitive disambiguation among highly similar scientists. For each pair, we generate a biography-style paragraph broadly compatible with both candidates, then append a decisive fact that rules out exactly one candidate. The model must choose the scientist matching the full description. We also include closed-book probes for candidate-specific facts, allowing us to distinguish missing knowledge from knowledge-deployment failure. Across 2,925 questions and eight model–reasoning configurations spanning GPT, Gemini, Claude, and DeepSeek, hallucination rates range from 2.50% to 37.23% in the retrieval-sensitive names-only setting. Many errors persist even when the model answers the relevant probe facts correctly in isolation, indicating a failure of comparative knowledge deployment rather than factual ignorance alone.

![Refer to caption](https://arxiv.org/html/2607.00447v1/sections/fig/hallucination_example.png)

Figure 1: Hallucination as misaligned inference. In the motivating ScientistQA example, the model selects the wrong scientist by following a high-salience association while underweighting the decisive discriminative constraint. Direct closed-book probes, conducted with external tools disabled, show that the model can answer the relevant candidate-specific facts in isolation, suggesting a comparative knowledge-deployment failure rather than simple factual ignorance.
#### Real\-Life Constrained QA\.

Real-Life Constrained QA complements ScientistQA by testing shortcut failures in everyday action choice. We begin from lexical associations in SWOW (De Deyne et al., [2019](https://arxiv.org/html/2607.00447#bib.bib13)), use them as high-salience shortcut cues, and organize them through eight seed template families such as `vehicle_required`, `delivery_medium`, `recording_medium`, and `tool_required`. GPT instantiates natural forced-choice scenarios in which one option is superficially tempting but violates a prompt-grounded physical, spatial, procedural, or medium-specific constraint. After filtering and controlled perturbations, the final collection contains 500 questions covering 13 aspects of daily life. Claude, GPT, Gemini, and DeepSeek make 81, 44, 18, and 182 mistakes, respectively, corresponding to error rates of 16.2%, 8.8%, 3.6%, and 36.4%. These failures show that inference misalignment is not limited to encyclopedic entity disambiguation: it also appears when models must select an action under ordinary real-world constraints.
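The reported error rates follow directly from the mistake counts and the 500-question total. The short check below reproduces that arithmetic; the variable names are illustrative and not taken from the paper's code.

```python
# Mistake counts reported in the text for the 500-question collection.
TOTAL_QUESTIONS = 500
mistakes = {"Claude": 81, "GPT": 44, "Gemini": 18, "DeepSeek": 182}

# Error rate in percent, rounded to one decimal place.
error_rates = {
    model: round(100 * count / TOTAL_QUESTIONS, 1)
    for model, count in mistakes.items()
}

for model, rate in error_rates.items():
    print(f"{model}: {rate}%")
```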

#### Contributions\.

We make three contributions. (1) We propose a latent key–task framework that formalizes hallucination as inference misalignment caused by posterior dominance of statistically salient shortcut paths over prompt-supported constraint-sensitive paths. (2) We derive theoretical predictions linking pretraining-frequency imbalance to shortcut posterior dominance and positive inference loss, providing a mechanistic account of why hallucinations can occur even under benign prompts. (3) We introduce TrapQA, a closed-book diagnostic benchmark suite with two complementary settings, and evaluate frontier model families with external tools disabled, showing that hallucinations can persist despite isolated factual knowledge and under prompt-grounded real-world constraints.

## 2 Related Work

Hallucination in language generation predates the recent LLM era. Neural data-to-text systems can produce fluent outputs that fail to reflect the underlying records (Wiseman et al., [2017](https://arxiv.org/html/2607.00447#bib.bib3)); neural machine translation models may generate plausible but source-unsupported translations (Lee et al., [2019](https://arxiv.org/html/2607.00447#bib.bib2); Raunak et al., [2021](https://arxiv.org/html/2607.00447#bib.bib26)); and abstractive summarization models often produce content that is not faithful to the input document (Maynez et al., [2020](https://arxiv.org/html/2607.00447#bib.bib1)). Across these settings, hallucination reflects a common failure mode: the model produces plausible language that is insufficiently grounded in the input, retrieved evidence, or relevant knowledge.

Recent work studies hallucination from both theoretical and empirical perspectives. Formal accounts show that hallucination can arise from fundamental learning or statistical limitations (Xu et al., [2025](https://arxiv.org/html/2607.00447#bib.bib15); Kalai and Vempala, [2024](https://arxiv.org/html/2607.00447#bib.bib32)), while mechanistic accounts trace hallucinations to competing associations or latent-state dynamics during generation (Sun et al., [2025a](https://arxiv.org/html/2607.00447#bib.bib16); Cherukuri and Varshney, [2026](https://arxiv.org/html/2607.00447#bib.bib22)). Other work locates hallucination across the LLM pipeline, including distributional imbalance and noise in pretraining data, difficulty acquiring new factual knowledge during fine-tuning, and inference-time reliance on memorized or frequency-biased patterns (Zhang et al., [2025a](https://arxiv.org/html/2607.00447#bib.bib17); Liu et al., [2026](https://arxiv.org/html/2607.00447#bib.bib35); Gekhman et al., [2024](https://arxiv.org/html/2607.00447#bib.bib18); Zhang et al., [2023](https://arxiv.org/html/2607.00447#bib.bib6); McKenna et al., [2023](https://arxiv.org/html/2607.00447#bib.bib41); Berglund et al., [2024](https://arxiv.org/html/2607.00447#bib.bib42)). These findings suggest that hallucination is not merely a matter of missing knowledge; it can also reflect failures in retrieving, comparing, or applying knowledge under the constraints of a particular prompt. Appendix [C](https://arxiv.org/html/2607.00447#A3) provides a fuller discussion, including reinforcement-learning-based mitigation efforts.

A large body of benchmarks evaluates factuality and faithfulness in generated text, including TruthfulQA and HaluEval (Lin et al., [2022](https://arxiv.org/html/2607.00447#bib.bib8); Li et al., [2023](https://arxiv.org/html/2607.00447#bib.bib12)), long-form factuality benchmarks such as FActScore and LongFact (Min et al., [2023](https://arxiv.org/html/2607.00447#bib.bib62); Wei et al., [2024b](https://arxiv.org/html/2607.00447#bib.bib61)), short-form factuality benchmarks such as SimpleQA (Wei et al., [2024a](https://arxiv.org/html/2607.00447#bib.bib60)), and retrieval-augmented generation benchmarks such as RAGTruth and FRAMES (Niu et al., [2024](https://arxiv.org/html/2607.00447#bib.bib58); Krishna et al., [2025](https://arxiv.org/html/2607.00447#bib.bib57)). These benchmarks primarily measure whether an output is factual, faithful, or grounded in evidence. By contrast, our goal is diagnostic: Scientist QA and Real-Life Constrained QA are designed to isolate why a model fails. Scientist QA tests whether models can deploy candidate-specific facts under disambiguation, while Real-Life Constrained QA tests whether models can override SWOW-derived associative cues with prompt-grounded physical, spatial, procedural, or medium-specific constraints (De Deyne et al., [2019](https://arxiv.org/html/2607.00447#bib.bib13)). This design allows us to distinguish simple ignorance from knowledge-deployment failures in which relevant information is available to the model but not used in the relevant inference path.

## 3 Pretraining Frequency Induces Hallucination in Latent Inference

### 3.1 Hallucination as Inference Misalignment

In the study of LLM reliability, *hallucination* is commonly described as the generation of factually incorrect or logically inconsistent content. However, to enable a quantitative mathematical analysis, we argue that the underlying mechanism of hallucination can be more precisely characterized as *inference misalignment*.

Specifically, let $\bm{z}$ denote a prompt sequence of bounded length. A pretrained sequence model induces a conditional distribution $P(\cdot\mid\bm{z})$ over continuations. We further assume the existence of an ideal predictor defining the ground-truth conditional distribution $P_{\star}(\cdot\mid\bm{z})$, representing the correct reasoning process for the task associated with the prompt. For analysis, we work in the embedding space where smoothness properties are well defined. Conceptually, the model’s inference can be viewed as moving from regions well-supported by the pretraining distribution toward the test prompt along an optimal *inference path*, producing an output consistent with $P_{\star}(\cdot\mid\bm{z})$ when the semantic structure and task intent are correctly identified.

However, due to statistical regularities present in the pretraining corpus, the model may instead rely on high-frequency *shortcuts*. These shortcuts bias the model toward inference paths that are statistically dominant but semantically incorrect. Consequently, the model may traverse regions of the representation space that are weakly supported by the training data, leading to unstable behavior.

Guided by this perspective, we define the inference loss as the discrepancy between the model’s predictive distribution and the ideal distribution: $\ell(\bm{z}):=\|P(\cdot\mid\bm{z})-P_{\star}(\cdot\mid\bm{z})\|_{TV}$, where $\|\cdot\|_{TV}$ denotes the total variation distance. Through this formalization, we shift the study of hallucination from heuristic descriptions of textual inconsistency to a principled analysis of inference behavior.
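For a finite answer set, the total variation distance equals half the sum of absolute differences between the two distributions. The sketch below computes the inference loss for one prompt under that discrete assumption; the two distributions are made-up numbers for illustration, not results from the paper.

```python
def total_variation(p: dict, q: dict) -> float:
    """Total variation distance between two discrete distributions.

    Outcomes missing from one distribution are treated as probability 0.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(y, 0.0) - q.get(y, 0.0)) for y in support)


# Hypothetical answer distributions for a single prompt z.
model_dist = {"Einstein": 0.7, "Fermi": 0.3}  # stands in for P(. | z)
ideal_dist = {"Einstein": 0.0, "Fermi": 1.0}  # stands in for P_star(. | z)

loss = total_variation(model_dist, ideal_dist)
print(f"inference loss: {loss:.2f}")
```

In this toy case the loss equals the probability mass the model places on the wrong answer, because the ideal distribution is concentrated on a single answer.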

### 3.2 Latent Key–Task Model of LLM Inference

To analyze how a model generalizes from the pretraining corpus to a new prompt, we introduce a conceptual abstraction of the inference stage. Let $\mathcal{T}=\{t_1,\dots,t_p\}$ denote a finite set of latent tasks and let $\mathcal{K}=\{k_1,\dots,k_m\}$ denote a finite set of task-informative *keys*. A key represents a salient feature pattern in a prompt (for example, a lexical cue or structural pattern) that provides evidence about the underlying task. We model LLM inference as an implicit two-stage reasoning process

$$\bm{z}\;\xrightarrow{\text{identify key}}\;k_i\;\xrightarrow{\text{retrieve task}}\;t_j(k_i)\;\xrightarrow{\text{generate}}\;y.$$

Following the Bayesian perspective of Xie et al. ([2022](https://arxiv.org/html/2607.00447#bib.bib9)), we treat the model’s prompt processing as an implicit inference over latent variables. Under this view, the model associates each prompt with a posterior distribution over key–task pairs and combines their contributions to form the prediction. Given a prompt $\bm{z}$, the model implicitly forms a posterior distribution $P(k,t\mid\bm{z})$ over the latent space and aggregates predictions along these hypotheses:

$$P(y\mid\bm{z})=\sum_{k\in\mathcal{K},\,t\in\mathcal{T}}P(k,t\mid\bm{z})\,P(y\mid\bm{z};k,t),$$

where $P(y\mid\bm{z};k,t)$ denotes the predictive distribution under the hypothesis that the prompt corresponds to key $k$ and task $t$.
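The aggregation over key–task hypotheses can be sketched as a weighted mixture. In the example below, the path labels, posterior weights, and per-path predictive distributions are hypothetical placeholders chosen only to show how the sum is evaluated.

```python
from collections import defaultdict


def predictive(posterior, path_predictions):
    """Compute P(y|z) as the sum over (k, t) of P(k,t|z) * P(y|z;k,t)."""
    out = defaultdict(float)
    for path, weight in posterior.items():
        for answer, prob in path_predictions[path].items():
            out[answer] += weight * prob
    return dict(out)


# Hypothetical posterior over two key-task paths for one prompt.
posterior = {("k_star", "t_star"): 0.25, ("k_s", "t_s"): 0.75}

# Hypothetical per-path predictive distributions P(y | z; k, t).
path_predictions = {
    ("k_star", "t_star"): {"y_star": 0.9, "y_s": 0.1},
    ("k_s", "t_s"): {"y_star": 0.1, "y_s": 0.9},
}

p_y = predictive(posterior, path_predictions)
print(p_y)
```

With three quarters of the posterior mass on the shortcut path, the mixture favors the shortcut answer even though the correct path, taken alone, would answer correctly.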

### 3.3 Frequency-Induced Bias in Inference

We now introduce a framework that explicates how the statistical imbalance of the pretraining corpus leads to inference shortcuts and formally induces hallucination through generalization error\.

#### Pretraining Statistics\.

The distribution of keys and tasks in the pretraining corpus induces a statistical prior that guides latent inference. Let $c_i^{(k)}$ denote the number of occurrences of key $k_i$ in the corpus, and let $C^{(k)}=\sum_{i=1}^{m}c_i^{(k)}$. Conditioned on a key $k$, let $c_j^{(t)}(k)$ denote the number of occurrences in which task $t_j$ co-occurs with $k$, and let $C^{(t)}(k)=\sum_{j=1}^{p}c_j^{(t)}(k)$. The empirical key distribution and conditional task distribution are

$$\pi^{(k)}(k_i)=\frac{c_i^{(k)}}{C^{(k)}},\quad\pi^{(t)}(t_j\mid k)=\frac{c_j^{(t)}(k)}{C^{(t)}(k)}.$$

Throughout the analysis, we use these empirical frequencies as proxies for the underlying pretraining probabilities. These statistics define a joint prior over key–task pairs, $\pi(k,t)=\pi^{(k)}(k)\,\pi^{(t)}(t\mid k)$.
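As a small illustration of these definitions, the sketch below builds the joint prior from raw counts. The key names, task names, and counts are invented for the example and do not come from any corpus measurement.

```python
def joint_prior(key_counts, task_counts):
    """Build pi(k, t) = pi_k(k) * pi_t(t | k) from occurrence counts.

    key_counts[k] plays the role of c^(k); task_counts[k][t] plays the
    role of c^(t)(k), the co-occurrence count of task t with key k.
    """
    total_keys = sum(key_counts.values())
    prior = {}
    for key, key_count in key_counts.items():
        total_tasks = sum(task_counts[key].values())
        for task, task_count in task_counts[key].items():
            prior[(key, task)] = (key_count / total_keys) * (task_count / total_tasks)
    return prior


# Hypothetical counts: the "distance" key is far more frequent.
key_counts = {"distance": 900, "car_wash": 100}
task_counts = {
    "distance": {"choose_by_distance": 800, "choose_by_feasibility": 100},
    "car_wash": {"choose_by_distance": 20, "choose_by_feasibility": 80},
}

prior = joint_prior(key_counts, task_counts)
for path, mass in sorted(prior.items(), key=lambda item: -item[1]):
    print(path, round(mass, 3))
```

Because each conditional task distribution sums to one, the joint prior sums to one across all key–task pairs.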

#### Event\-based Perspective\.

We now specialize the latent inference framework to a binary-choice setting, in which the model is asked to select between two candidate entities or actions. In this setting, the prompt $\bm{z}$ explicitly presents two candidate keys $k^{\ast}$ and $k_s$: $k^{\ast}$ is the candidate consistent with the prompt’s decisive constraint and corresponds to the correct answer, while $k_s$ is the alternative candidate. We refer to $k_s$ as the *shortcut key* when its associated key–task pair has higher pretraining frequency than that of $k^{\ast}$, so that statistical salience favors $k_s$ even though the prompt-level evidence favors $k^{\ast}$. Correspondingly, let $\{t^{\ast},t_s\}$ denote the correct task and the shortcut task, respectively.

###### Assumption 3.1 (Activated Key Restriction)

For a prompt $\bm{z}$ presenting candidates $k^{\ast}$ and $k_s$:

1. Negligible mass outside the candidate pair. Latent keys other than $k^{\ast}$ and $k_s$ receive vanishing posterior: $P\!\left(k\notin\{k^{\ast},k_s\}\,\middle|\,\bm{z}\right)\ll 1$.
2. Prior-driven posterior on the candidate pair. Within the candidate pair, the prompt does not differentially update the relative posterior of the two keys: $\bm{z}\perp k\mid\{k\in\{k^{\ast},k_s\}\}$.

A useful way to interpret Assumption [3.1](https://arxiv.org/html/2607.00447#S3.Thmtheorem1) is that (i) candidate keys outside the explicitly presented pair receive negligible posterior support; and (ii) once the candidate set is fixed by the prompt template, the remaining prompt content does not, on its own, alter the relative plausibility of the two activated keys at the level of latent identification. The model’s choice between $k^{\ast}$ and $k_s$ is then driven by the pretraining prior over keys rather than by within-prompt likelihood asymmetries.

###### Assumption 3.2 (Activated Task Restriction)

For each path, let $t^{\ast}$ and $t_s$ denote the tasks associated with $k^{\ast}$ and $k_s$, respectively.

1. Negligible mass outside the candidate tasks. Conditional on the activated key being $k^{\ast}$ (resp. $k_s$), the task posterior concentrates on the candidate set $\{t^{\ast},t_s\}$: $P\!\left(t\notin\{t^{\ast},t_s\}\,\middle|\,\bm{z},k^{\ast}\right)\ll 1$ and $P\!\left(t\notin\{t^{\ast},t_s\}\,\middle|\,\bm{z},k_s\right)\ll 1$.
2. Prior-driven posterior on the candidate tasks. Within the candidate task pair, the prompt does not differentially update the relative posterior of the two tasks given the activated key: $\bm{z}\perp t\mid k,\{t\in\{t^{\ast},t_s\}\}$.

###### Assumption 3.3 (Output Separation)

Let $y^{\ast}$ denote the correct answer and let $y_s$ denote the shortcut-induced answer. We assume that the two paths induce separated predictions: $P(y^{\ast}\mid\bm{z};k_s,t_s)\ll 1$ and $P(y_s\mid\bm{z};k^{\ast},t^{\ast})\ll 1$. Moreover, the shortcut path is at least as confident in the shortcut answer as the correct path is in the correct answer: $P(y_s\mid\bm{z};k_s,t_s)\geq P(y^{\ast}\mid\bm{z};k^{\ast},t^{\ast})$.

###### Theorem 3.4 (Shortcut Probability Dominance)

Under Assumptions [3.1](https://arxiv.org/html/2607.00447#S3.Thmtheorem1)–[3.3](https://arxiv.org/html/2607.00447#S3.Thmtheorem3), consider a fixed prompt $\bm{z}$ and the two main competing paths $(k^{\ast},t^{\ast})$ and $(k_s,t_s)$. Then

$$\frac{P(y_s\mid\bm{z})}{P(y^{\ast}\mid\bm{z})}\gtrsim\frac{\pi(k_s,t_s)}{\pi(k^{\ast},t^{\ast})}\cdot\frac{P(y_s\mid\bm{z};k_s,t_s)}{P(y^{\ast}\mid\bm{z};k^{\ast},t^{\ast})}\gtrsim 1.$$

Theorem [3.4](https://arxiv.org/html/2607.00447#S3.Thmtheorem4) shows that when the pretraining frequency of the shortcut pair is sufficiently larger than that of the correct pair, the shortcut posterior can dominate the correct posterior even if the prompt contains semantically correct evidence. We can decompose this frequency-induced hallucination into two distinct modes:

#### I\. Key Selection Bias\.

The first term indicates that the model may fail to attend to the correct semantic anchor because a shortcut key appears much more frequently in the pretraining corpus\.

Consider the question: *“I want to go to a car wash. The car wash is only 50 meters away. Should I walk there or drive there?”* Many models answer that one should walk, since 50 meters is a very short distance, even though the task cannot be completed without driving the car to the car wash. This suggests that the model attends to the statistically dominant distance key (i.e., “50 meters”) rather than the semantically decisive key (i.e., “car wash”), since many pretraining examples associate short distances with walking and long distances with driving. As a result, the shortcut key $k_s$ corresponding to distance-based transportation choice can dominate the correct key $k^{\ast}$ corresponding to task feasibility (see Appendix [G](https://arxiv.org/html/2607.00447#A7) for the corresponding experiment).

#### II\. Task Retrieval Bias\.

The theorem also shows that even with the correct key, the model may still retrieve the wrong relation if another task strongly dominates that key family in pretraining\.

During pretraining, discussions of special relativity overwhelmingly associate the concept with Albert Einstein. Consider a prompt asking the model to choose between Enrico Fermi and Albert Einstein: *“A physicist and university teacher made major contributions to modern physics, but did not formulate the theory of special relativity.”* Although the explicit constraint “did not formulate special relativity” rules out Einstein and supports Fermi, the model may still answer “Albert Einstein”. The reason is that the affirmative shortcut key linking *special relativity* to Einstein ($k_s$) dominates the rarer negated constraint ($k^{\ast}$) in the pretraining distribution. As a result, the model attends to the statistically dominant association and effectively ignores the negation (see Section [5.1](https://arxiv.org/html/2607.00447#S5.SS1) for the experiment).

Note that according to our theory, hallucination arises only when a dominant shortcut key–task pair has been learned during pretraining. If no representative shortcut pattern exists, the posterior suppression mechanism does not occur, and the model will instead rely on the information provided in the prompt rather than retrieving a biased prior. This prediction is consistent with our empirical observations in the *Knowledge Consistent* setting (Section [5.2](https://arxiv.org/html/2607.00447#S5.SS2)), where the absence of strong pretraining shortcuts leads to reduced hallucination rates. The complete proof of Theorem 3.4 is provided in Appendix [K.2](https://arxiv.org/html/2607.00447#A11.SS2).
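The middle expression in Theorem 3.4 is a product of a prior ratio and a path-confidence ratio, and it can be evaluated directly for toy numbers. The sketch below does only that: it does not reproduce the approximate inequalities of the theorem, and every probability in it is a hypothetical value.

```python
def shortcut_ratio(prior_shortcut, prior_correct, conf_shortcut, conf_correct):
    """Evaluate (pi(k_s,t_s) / pi(k*,t*)) * (P(y_s|z;k_s,t_s) / P(y*|z;k*,t*))."""
    return (prior_shortcut / prior_correct) * (conf_shortcut / conf_correct)


# Imbalanced pretraining prior: the shortcut pair is three times as frequent.
imbalanced = shortcut_ratio(prior_shortcut=0.6, prior_correct=0.2,
                            conf_shortcut=0.9, conf_correct=0.9)

# No dominant shortcut: both pairs are equally frequent.
balanced = shortcut_ratio(prior_shortcut=0.4, prior_correct=0.4,
                          conf_shortcut=0.9, conf_correct=0.9)

print(f"imbalanced prior -> ratio {imbalanced:.2f}")
print(f"balanced prior   -> ratio {balanced:.2f}")
```

With equal path confidence, the ratio reduces to the prior ratio. A value well above 1 corresponds to the regime in which the shortcut answer is predicted to dominate, while a balanced prior gives a ratio of 1 and no frequency-induced preference.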

#### When Assumption [3.1](https://arxiv.org/html/2607.00447#S3.Thmtheorem1) does not hold.

Assumption [3.1](https://arxiv.org/html/2607.00447#S3.Thmtheorem1) is best understood as describing the regime in which $k^{\ast}$ and $k_s$ are *pretraining-independent*: the two candidate keys rarely co-occur in the same context during pretraining, so the model has not internalized any joint structure relating them. In this regime, even though the prompt $\bm{z}$ presents both keys together, the model cannot leverage $\bm{z}$ to extract joint information beyond what is already encoded in the marginal priors $\pi^{(k)}(k^{\ast})$ and $\pi^{(k)}(k_s)$; the posterior thus degenerates to the prior ratio.

The complementary regime is when $k^{\ast}$ and $k_s$ have been seen together during pretraining and the model has learned how their relative importance shifts in joint contexts. In this case, the prompt $\bm{z}$ is not merely a pair of activated keys but a context whose surface form pattern-matches against pretraining co-occurrence statistics that the model has internalized. The model can then use $\bm{z}$ to identify which of the two keys is task-relevant in this particular context, essentially deploying a learned within-context disambiguation rather than falling back on marginal frequency. Assumption [3.1](https://arxiv.org/html/2607.00447#S3.Thmtheorem1) is then violated, the posterior departs from the prior ratio, and hallucination may be avoided.

### 3.4 From Shortcut to Inference Loss

We now show that when this posterior dominance changes the model’s preferred answer, it directly induces a positive inference loss\.

###### Assumption 3.5 (Target Preference Margin)

The target distribution prefers the correct answer over the shortcut answer by a positive margin: $\gamma_{\star}(\bm{z}):=P_{\star}(y^{\ast}\mid\bm{z})-P_{\star}(y_s\mid\bm{z})>0$.

###### Theorem 3.6 (Hallucination Lower Bound)

Suppose Assumptions [3.3](https://arxiv.org/html/2607.00447#S3.Thmtheorem3) and [3.5](https://arxiv.org/html/2607.00447#S3.Thmtheorem5) hold. If the shortcut posterior dominates the correct posterior (i.e., $P(y^{\ast}\mid\bm{z})<P(y_s\mid\bm{z})$, as suggested by Theorem [3.4](https://arxiv.org/html/2607.00447#S3.Thmtheorem4)), the inference loss measured by total variation satisfies

$$\ell(\bm{z})\geq\frac{1}{2}(\gamma(\bm{z})+\gamma_{\star}(\bm{z})),$$

where $\gamma(\bm{z}):=P(y_s\mid\bm{z})-P(y^{\ast}\mid\bm{z})>0$.

Theorem [3.6](https://arxiv.org/html/2607.00447#S3.Thmtheorem6) provides a direct connection between shortcut posterior dominance and inference loss. Once frequency-induced posterior dominance reverses the model’s preference between these two answers, the model distribution and the target distribution must differ by a non-vanishing margin. The complete proof is provided in Appendix [K.3](https://arxiv.org/html/2607.00447#A11.SS3).
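The lower bound can be checked numerically on a toy example. The sketch below assumes a finite answer set, so that total variation is half the sum of absolute differences, and uses invented distributions in which the model prefers the shortcut answer while the target prefers the correct one. It is a sanity check on one instance, not a substitute for the proof in the appendix.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(y, 0.0) - q.get(y, 0.0)) for y in support)


def margin_lower_bound(p_model, p_target, correct, shortcut):
    """Return 0.5 * (gamma + gamma_star) as defined in Theorem 3.6."""
    gamma = p_model[shortcut] - p_model[correct]          # model margin
    gamma_star = p_target[correct] - p_target[shortcut]   # target margin
    return 0.5 * (gamma + gamma_star)


# Hypothetical distributions over three possible answers.
p_model = {"y_star": 0.25, "y_s": 0.625, "other": 0.125}
p_target = {"y_star": 0.875, "y_s": 0.125, "other": 0.0}

loss = total_variation(p_model, p_target)
bound = margin_lower_bound(p_model, p_target, "y_star", "y_s")
print(f"inference loss = {loss}, lower bound = {bound}")
```

Here the loss exceeds the bound because the two distributions also disagree on the third answer; when all disagreement sits on the two competing answers, the bound can be attained with equality.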

## 4 Evaluation Setup

We evaluate the latent-inference account in Section [3](https://arxiv.org/html/2607.00447#S3) with two controlled diagnostic settings. Scientist QA targets task-retrieval bias in entity disambiguation, while Real-Life Constrained QA targets key-selection bias in everyday action choice. External tools, including web search, are disabled in all evaluations. The names-only Scientist QA condition and supplementary probes are closed-book; the profiles-in-context condition is a retrieval-relaxed control.

#### Scientist QA\.

Scientist QA is constructed from Wikipedia-linked scientist profiles (Wikipedia contributors, [2026](https://arxiv.org/html/2607.00447#bib.bib75)). We remove names from structured profiles, embed the remaining attributes with `text-embedding-3-small` (OpenAI, [2024](https://arxiv.org/html/2607.00447#bib.bib63)), retain highly similar scientist pairs under a sparsity-adjusted similarity score, and use Gemini to generate a shared biographical description plus a decisive constraint that rules out exactly one candidate. After filtering and removing invalid items with GPT, the evaluation set contains 2,925 questions. Each item is tested under a *names-only* prompt, corresponding to `prepend_names`, and a *profiles-in-context* prompt, corresponding to `prepend_profiles`. We also attach two closed-book probes derived from the decisive constraint to distinguish missing factual knowledge from failures to deploy knowledge in pairwise disambiguation. Construction details are in Appendix [D](https://arxiv.org/html/2607.00447#A4).

#### Real\-Life Constrained QA\.

Real\-Life Constrained QA tests whether models follow prompt\-grounded constraints when a salient associative shortcut suggests the wrong action\. We derive high\-salience cues from SWOW \(De Deyne et al\., [2019](https://arxiv.org/html/2607.00447#bib.bib13)\), organize generation around eight seed template families, and use GPT to instantiate natural two\-option scenarios involving physical, spatial, procedural, or medium\-specific constraints\. After filtering and controlled perturbations with Claude, the final collection contains 500 questions covering 13 aspects of daily life\. Details are in Appendix [G](https://arxiv.org/html/2607.00447#A7)\.

#### Models and scoring\.

For Scientist QA, we evaluate GPT, Claude, Gemini, and DeepSeek under low\- and high\-thinking settings where available; for DeepSeek, deepseek\-chat and deepseek\-reasoner serve as the non\-reasoning and reasoning modes\. For Real\-Life Constrained QA, we report GPT, Claude, Gemini, and DeepSeek\-chat results\. All runs use default decoding settings unless otherwise noted\. Prompt templates, model versions, parsing rules, and answer normalization are reported in Appendix [E](https://arxiv.org/html/2607.00447#A5)\. A response is correct if, after normalization, it matches the ground\-truth candidate or action; choosing the ruled\-out Scientist QA candidate, producing an off\-option Scientist QA response, or selecting the shortcut Real\-Life action is counted as an error\.
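The scoring rule above can be sketched as a small classifier over two\-option responses\. The normalization here \(lowercasing, stripping punctuation, collapsing whitespace\) is only a stand\-in for the parsing rules in Appendix E, and the names `normalize`, `score_response`, and `error_rate` are hypothetical\. Treating an off\-option reply as an error for Real\-Life Constrained QA is also an assumption, since the text states that rule only for Scientist QA\.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace (illustrative only)."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def score_response(response, ground_truth, other_option):
    """Classify a two-option response.

    Returns "correct" when the normalized response matches the ground truth,
    "wrong_option" when it matches the other option (the ruled-out candidate
    or the shortcut action), and "off_option" otherwise. Only "correct"
    counts as correct; the other two labels are errors.
    """
    answer = normalize(response)
    if answer == normalize(ground_truth):
        return "correct"
    if answer == normalize(other_option):
        return "wrong_option"
    return "off_option"


def error_rate(labels):
    """Fraction of responses that are not labeled "correct"."""
    if not labels:
        raise ValueError("no responses to score")
    return sum(label != "correct" for label in labels) / len(labels)


# One scenario, three hypothetical model replies.
GROUND_TRUTH = "Tow the trailer to the pad"
SHORTCUT = "Walk to the pad without the trailer"
labels = [
    score_response("Tow the trailer to the pad.", GROUND_TRUTH, SHORTCUT),
    score_response("walk to the pad without the trailer", GROUND_TRUTH, SHORTCUT),
    score_response("Neither option", GROUND_TRUTH, SHORTCUT),
]
```

Keeping the wrong\-option and off\-option labels separate makes it possible to report off\-option responses on their own while still folding both into a single error rate\.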

| Domain | Scenario | Option A | Option B | Ground truth | Model |
| --- | --- | --- | --- | --- | --- |
| Heavy Vehicle Safety Inspections | The utility\-trailer clearance tag is printed\. The tester is at the pad to pull the breakaway switch and check the brake battery\. Should I tow the trailer to the pad or walk to the pad without it? | Walk to the pad without the trailer\. | Tow the trailer to the pad\. | B | A ✗ |
| Machinery Repair | The mower fuel\-cap key blank is ready\. Maintenance is at the bay to read the mower serial plate and try the blank in the actual cap\. For this step, what should I do? | Drive the mower to the bay\. | Walk to the bay without the mower\. | A | B ✗ |
| Live Theater Operations | I need to tell the projection operator that the backup cable is coiled under the console\. The projection operator is running live supertitles\. How should I pass the message? | Walk up and tell the operator\. | Send the note over the show chat\. | B | A ✗ |
| High\-Stakes Event Production | I need to tell the lighting operator that the spare cable is under the console\. The operator is running cues during a live performance\. What should I do? | Send a headset text/chat cue\. | Tap and speak to the operator\. | A | B ✗ |
| Corporate Administrative Workflows | I am documenting a toll error for payroll\. Payroll scans receipts, while the toll agency requires the original toll receipt for a refund\. What should I submit to payroll? | The original toll receipt\. | A receipt copy\. | B | A ✗ |
| Agricultural Machinery Maintenance | The compact\-tractor belt\-slip case is open\. The mechanic is at the ramp to watch the drive belt under load\. For this step, what should I do? | Bring the compact tractor to the ramp\. | Walk to the ramp without the tractor\. | A | B ✗ |

Figure 2: Representative examples from Real\-Life Constrained QA\. Each card shows a complete two\-option scenario, the ground\-truth action, and an observed incorrect model prediction\. The examples illustrate key\-selection failures in which a salient associative shortcut conflicts with a prompt\-grounded physical, spatial, procedural, or medium\-specific constraint\.

## 5 Empirical Findings

We evaluate 2,925 Scientist QA questions across four model families and two thinking settings, together with 500 Real\-Life Constrained QA questions\. All runs disable external tools, including web search\. For Scientist QA, the primary setting is the names\-only prompt, where the model sees only the two candidate names and must retrieve the decisive relation internally\. We additionally report a retrieval\-relaxed profiles\-in\-context control, where both candidate profiles are supplied in the prompt\. Each Scientist QA item is also paired with two closed\-book probes targeting the decisive relation\. Across the primary names\-only Scientist QA runs, only two responses, both from Claude\-low, fail to match either candidate after normalization; we count these off\-option responses as hallucinations\.

### 5\.1 Retrieval\-Sensitive Disambiguation Remains Difficult

Table [1](https://arxiv.org/html/2607.00447#S5.T1) compares the primary retrieval\-sensitive Scientist QA setting with the retrieval\-relaxed profile control\. In the names\-only prompt condition, hallucination rates range from 2\.50% to 37\.23%\. In the profiles\-in\-context condition, the maximum error rate is 3\.38%, and six of the eight model settings achieve zero error\. Thus, Scientist QA is not primarily testing whether models can compare two supplied profiles; it tests whether they can retrieve and apply the decisive relation when only the candidate names and ambiguous description are given\.

| Model version | Inference setting | Names\-attached \(names\-only\): errors / rate | Profiles\-attached \(profiles\-in\-context\): errors / rate |
| --- | --- | --- | --- |
| Claude | low thinking | 699 / 23\.90% | 5 / 0\.17% |
| Claude | high thinking | 182 / 6\.22% | 0 / 0\.00% |
| DeepSeek | non\-reasoning | 1089 / 37\.23% | 99 / 3\.38% |
| DeepSeek | reasoning | 309 / 10\.56% | 0 / 0\.00% |
| Gemini | low thinking | 73 / 2\.50% | 0 / 0\.00% |
| Gemini | high thinking | 92 / 3\.15% | 0 / 0\.00% |
| GPT | low thinking | 344 / 11\.76% | 0 / 0\.00% |
| GPT | high thinking | 300 / 10\.26% | 0 / 0\.00% |

Table 1: Scientist QA results over 2,925 questions for frontier model versions evaluated with external tools disabled\. The names\-only prompt condition corresponds to prepend\_names: the model receives only the two candidate names and must retrieve the decisive relation internally\. The profiles\-in\-context prompt condition corresponds to prepend\_profiles: the model receives both candidate profiles in the prompt\. Entries report the number and percentage of errors\.

Thinking effort has model\-dependent effects in the names\-only setting\. Higher thinking substantially reduces errors for Claude, from 23\.90% to 6\.22%, and for DeepSeek, from 37\.23% to 10\.56%\. It modestly improves GPT, from 11\.76% to 10\.26%\. By contrast, Gemini\-low achieves the best result in this setting at 2\.50%, while Gemini\-high is slightly worse at 3\.15%, showing that additional inference effort is not a monotone solution\.
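As a quick arithmetic check, the reported percentages can be recomputed from the error counts in Table 1 and the 2,925\-question total\. The sketch below only re\-derives numbers already stated in the text; the dictionary layout and the helper name `rate` are illustrative choices\.

```python
TOTAL_QUESTIONS = 2925

# (model, setting) -> (names-only errors, profiles-in-context errors), from Table 1.
TABLE_1_ERRORS = {
    ("Claude", "low thinking"): (699, 5),
    ("Claude", "high thinking"): (182, 0),
    ("DeepSeek", "non-reasoning"): (1089, 99),
    ("DeepSeek", "reasoning"): (309, 0),
    ("Gemini", "low thinking"): (73, 0),
    ("Gemini", "high thinking"): (92, 0),
    ("GPT", "low thinking"): (344, 0),
    ("GPT", "high thinking"): (300, 0),
}


def rate(errors, total=TOTAL_QUESTIONS):
    """Error rate as a percentage rounded to two decimals."""
    return round(100.0 * errors / total, 2)


names_rates = {key: rate(names) for key, (names, _) in TABLE_1_ERRORS.items()}
profile_rates = {key: rate(profiles) for key, (_, profiles) in TABLE_1_ERRORS.items()}

# Number of model settings with no errors once profiles are supplied in context.
zero_error_settings = sum(1 for value in profile_rates.values() if value == 0.0)
```

Recomputing from counts reproduces the 2\.50% to 37\.23% names\-only range, the 3\.38% maximum in the profiles\-in\-context condition, and the six zero\-error settings\.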

### 5\.2 Direct Probes Separate Ignorance from Knowledge\-Deployment Failure

Each Scientist QA item has two closed\-book probes targeting the decisive relation: one eliminative probe for the distractor and one compatibility probe for the correct candidate\. Table [2](https://arxiv.org/html/2607.00447#S5.T2) conditions pairwise hallucination on these probe outcomes\.

| Model | Mode | Pairwise hall\. | Both probes correct | Hall\. ∣ both | Hall\. ∣ not both | Known\-fact hall\. | Probe\-absent hall\. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Sonnet 4\.6 | low | 23\.90% | 76\.34% | 18\.99% | 39\.74% | 60\.66% | 1\.29% |
| Claude Sonnet 4\.6 | high | 6\.22% | 86\.26% | 2\.62% | 28\.86% | 36\.26% | 4\.95% |
| DeepSeek V3\.2 Chat | low | 37\.23% | 59\.52% | 34\.46% | 41\.30% | 55\.10% | 1\.93% |
| DeepSeek V3\.2 Reasoner | high | 10\.56% | 79\.28% | 6\.04% | 27\.89% | 45\.31% | 5\.50% |
| Gemini 3\.1 Pro Preview | low | 2\.50% | 97\.13% | 2\.01% | 19\.05% | 78\.08% | 0\.00% |
| Gemini 3\.1 Pro Preview | high | 3\.15% | 97\.74% | 2\.59% | 27\.27% | 80\.43% | 0\.00% |
| GPT\-5\.2 | low | 11\.76% | 85\.54% | 7\.91% | 34\.52% | 57\.56% | 3\.49% |
| GPT\-5\.2 | high | 10\.26% | 87\.04% | 6\.25% | 37\.20% | 53\.00% | 2\.67% |

Table 2: Probe\-conditioned results for the names\-only prompt condition over 2,925 Scientist QA questions\. “Hall\. ∣ both” and “Hall\. ∣ not both” condition pairwise hallucination on whether both probes are correct\. “Known\-fact hall\.” and “Probe\-absent hall\.” report the fractions of hallucinations where both probes are correct or both probes are wrong, respectively\.

Probe knowledge helps but cannot fully explain the errors\. Hallucination is much higher when not both probes are correct, yet many errors remain in the both\-probe\-correct regime\. Complete probe\-level ignorance accounts for at most 5\.50% of errors\. Thus, many are not simply missing facts\. The model can answer the relevant facts in isolation but still fail to deploy them in pairwise disambiguation\.
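The columns of Table 2 are linked by the law of total probability: the pairwise hallucination rate equals the both\-probes\-correct share times “Hall\. ∣ both” plus the remaining share times “Hall\. ∣ not both”\. For Claude Sonnet 4\.6 \(low\), 0\.7634 × 18\.99% \+ 0\.2366 × 39\.74% ≈ 23\.90%, which agrees with the reported rate\. The sketch below shows how such columns could be derived from per\-item outcomes; the record layout and the function name `probe_conditioned_stats` are assumptions, and the toy records are made up\.

```python
def probe_conditioned_stats(items):
    """Compute Table 2 style columns from per-item outcomes.

    Each item is a tuple (hallucinated, eliminative_ok, compatibility_ok)
    of booleans. Values are fractions in [0, 1]; a ratio with an empty
    denominator is reported as None.
    """
    def ratio(num, den):
        return num / den if den else None

    n = len(items)
    both = [h for h, e, c in items if e and c]            # both probes correct
    not_both = [h for h, e, c in items if not (e and c)]  # at least one probe wrong
    hall_total = sum(h for h, _, _ in items)
    hall_both_wrong = sum(1 for h, e, c in items if h and not e and not c)

    return {
        "pairwise_hall": ratio(hall_total, n),
        "both_probes_correct": ratio(len(both), n),
        "hall_given_both": ratio(sum(both), len(both)),
        "hall_given_not_both": ratio(sum(not_both), len(not_both)),
        "known_fact_hall": ratio(sum(both), hall_total),
        "probe_absent_hall": ratio(hall_both_wrong, hall_total),
    }


# Ten made-up items: six with both probes correct, two mixed, two with both wrong.
toy_items = (
    [(True, True, True)]
    + [(False, True, True)] * 5
    + [(True, True, False), (False, False, True)]
    + [(True, False, False), (False, False, False)]
)
stats = probe_conditioned_stats(toy_items)
```

In the toy data one of three hallucinations occurs with both probes correct, so `known_fact_hall` is one third; the same ratio computed on the real records would correspond to the “Known\-fact hall\.” column\.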

### 5\.3 Raw Fame Does Not Explain the Shortcut

Because our theory emphasizes frequency\-induced shortcuts, we test whether the observed Scientist QA failures reduce to a simpler fame prior\. For each scientist, we define a fame score using the normalized page\-view count of their Wikipedia page, the normalized length of that page, and the normalized number of external links from that page\. The wrong candidate is more famous in 61\.30% of Scientist QA questions\. However, hallucination does not increase in those cases\. In fact, for every model setting, hallucination is lower when the wrong candidate is more famous than when it is not; for example, GPT\-high drops from 13\.52% to 8\.20%, Claude\-low from 34\.19% to 17\.40%, and DeepSeek\-low from 41\.25% to 34\.69%\. The same pattern holds from the perspective of hallucinated cases: among hallucinations, the wrong candidate is more famous only 44\.64% to 57\.12% of the time, below the dataset base rate of 61\.30%\. Moreover, very famous candidates often make the task easier rather than harder\. When at least one candidate is in the top 1% by fame rank, hallucination rates are lower in all eight names\-only model settings; for instance, GPT\-high falls from 10\.87% to 5\.11%, and DeepSeek\-low falls from 38\.36% to 27\.80%\. Full fame\-based analyses are in Appendix [H\.5](https://arxiv.org/html/2607.00447#A8.SS5)\. These results suggest that the shortcut is not a raw preference for famous names, but a relation\-specific association between entities and attributes, such as institutions, awards, roles, or fields, that can override the prompt’s decisive constraint\.
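A minimal sketch of the fame analysis follows\. The text names the three signals but not how they are normalized or combined, so min\-max scaling and an equal\-weight mean are assumptions made purely for illustration, as are the function names and the toy numbers\.

```python
def min_max(values):
    """Scale a list of numbers to [0, 1]; constant lists map to 0.0."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def fame_scores(page_views, page_lengths, external_links):
    """Equal-weight mean of three min-max normalized signals (an assumption)."""
    columns = [min_max(page_views), min_max(page_lengths), min_max(external_links)]
    return [sum(parts) / len(parts) for parts in zip(*columns)]


def rate_by_fame(items):
    """Hallucination rate split by whether the wrong candidate is more famous.

    Each item is (wrong_fame, correct_fame, hallucinated).
    """
    groups = {"wrong_more_famous": [], "wrong_not_more_famous": []}
    for wrong_fame, correct_fame, hallucinated in items:
        key = "wrong_more_famous" if wrong_fame > correct_fame else "wrong_not_more_famous"
        groups[key].append(hallucinated)
    return {key: (sum(vals) / len(vals) if vals else None) for key, vals in groups.items()}


# Toy signals for three scientists (made-up numbers).
scores = fame_scores([100, 1000, 550], [10, 30, 20], [5, 25, 15])

# Toy question outcomes: (fame of wrong candidate, fame of correct one, hallucinated?).
toy_items = [
    (0.9, 0.2, False),
    (0.8, 0.3, False),
    (0.7, 0.1, True),
    (0.2, 0.6, True),
    (0.1, 0.5, False),
]
rates = rate_by_fame(toy_items)
```

The reported split is also internally consistent with the overall rate: for GPT\-high, 0\.6130 × 8\.20% \+ 0\.3870 × 13\.52% ≈ 10\.26%, the names\-only rate in Table 1\.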

### 5\.4 Everyday Constraints Induce the Same Failure Pattern

Real\-Life Constrained QA tests whether the same failure pattern appears outside encyclopedic entity disambiguation\. Across 500 two\-option scenarios, Claude, GPT, Gemini, and DeepSeek\-chat make 81, 44, 18, and 182 errors, respectively, corresponding to error rates of 16\.20%, 8\.80%, 3\.60%, and 36\.40% \(Table [11](https://arxiv.org/html/2607.00447#A8.T11)\)\. These errors occur when a salient shortcut action conflicts with a physical, spatial, procedural, or medium\-specific constraint stated in the prompt\. Thus, inference misalignment is not limited to biographical facts; it also appears in everyday action choice\.

## 6 Conclusion

We study hallucination as a form of *misaligned inference*: a model may possess the relevant facts, yet still follow a statistically dominant shortcut path that is inconsistent with the prompt’s decisive constraint\. We formalize this view through a latent key–task model, showing how pretraining\-frequency imbalance can suppress the correct inference path and induce a non\-vanishing hallucination floor\. To evaluate this perspective, we introduce a scientist disambiguation benchmark built from highly confusable Wikipedia profiles\. By pairing each primary question with supplementary factual probes, we separate factual ignorance from inference failure\. Across frontier models, many errors occur even when both probes are answered correctly, while providing explicit profile context nearly eliminates the errors\. These findings suggest that hallucination is often not a failure of knowledge storage, but a failure to deploy known facts along the correct inference path\. Our results highlight the need for methods that go beyond adding factual coverage, and instead improve how models select, weight, and execute latent inference paths under competing cues\.

## Limitations

This research has some limitations\. First, although we covered several frontier model families, our results are limited to the tested models: GPT\-5\.2, Gemini 3\.1 Pro Preview, Claude Sonnet 4\.6, and DeepSeek V3\.2 Chat/Reasoner\. We have reported the thinking settings and API versions, but reruns may differ as provider\-hosted systems change\. Second, although we aim to construct questions whose answers are stable over time, some items may still be affected by temporal drift\. Scientific breakthroughs, technological changes, or shifts in industry practice may alter what is regarded as common sense, and scientists may later receive new honors, change positions, or become associated with new fields as their careers continue\. Thus, future evaluations should treat the released answers as tied to the dataset construction time and re\-audit items when using TrapQA in substantially later model evaluations\.

## References

- The reversal curse: LLMs trained on "a is b" fail to learn "b is a"\. External Links: 2309\.12288, [Link](https://arxiv.org/abs/2309.12288)\. Cited by: [§C\.1](https://arxiv.org/html/2607.00447#A3.SS1.SSS0.Px3.p3.1), [§2](https://arxiv.org/html/2607.00447#S2.p2.1)\.
- S\. Casper, X\. Davies, C\. Shi, T\. K\. Gilbert, J\. Scheurer, J\. Rando, R\. Freedman, T\. Korbak, D\. Lindner, P\. Freire, T\. Wang, S\. Marks, C\. Segerie, M\. Carroll, A\. Peng, P\. Christoffersen, M\. Damani, S\. Slocum, U\. Anwar, A\. Siththaranjan, M\. Nadeau, E\. J\. Michaud, J\. Pfau, D\. Krasheninnikov, X\. Chen, L\. Langosco, P\. Hase, E\. Bıyık, A\. Dragan, D\. Krueger, D\. Sadigh, and D\. Hadfield\-Menell \(2023\) Open problems and fundamental limitations of reinforcement learning from human feedback\. External Links: 2307\.15217, [Link](https://arxiv.org/abs/2307.15217)\. Cited by: [§C\.1](https://arxiv.org/html/2607.00447#A3.SS1.SSS0.Px4.p1.1)\.
- X\. Chen, C\. Wang, Y\. Xue, N\. Zhang, X\. Yang, Q\. Li, Y\. Shen, L\. Liang, J\. Gu, and H\. Chen \(2024\) Unified hallucination detection for multimodal large language models\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\), Bangkok, Thailand, pp\. 3235–3252\. External Links: [Link](https://aclanthology.org/2024.acl-long.178/), [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.178)\. Cited by: [§C\.2](https://arxiv.org/html/2607.00447#A3.SS2.p2.1)\.
- Q\. Cheng, T\. Sun, W\. Zhang, S\. Wang, X\. Liu, M\. Zhang, J\. He, M\. Huang, Z\. Yin, K\. Chen, and X\. Qiu \(2023\) Evaluating hallucinations in Chinese large language models\. External Links: 2310\.03368, [Link](https://arxiv.org/abs/2310.03368)\. Cited by: [§C\.2](https://arxiv.org/html/2607.00447#A3.SS2.p1.1)\.
- K\. Cherukuri and L\. R\. Varshney \(2026\) Hallucination basins: a dynamic framework for understanding and controlling LLM hallucinations\. External Links: 2604\.04743, [Link](https://arxiv.org/abs/2604.04743)\. Cited by: [§C\.1](https://arxiv.org/html/2607.00447#A3.SS1.SSS0.Px2.p1.1), [§2](https://arxiv.org/html/2607.00447#S2.p2.1)\.
- P\. Christiano, J\. Leike, T\. B\. Brown, M\. Martic, S\. Legg, and D\. Amodei \(2023\) Deep reinforcement learning from human preferences\. External Links: 1706\.03741, [Link](https://arxiv.org/abs/1706.03741)\. Cited by: [§C\.1](https://arxiv.org/html/2607.00447#A3.SS1.SSS0.Px4.p1.1)\.
- D\. Dale, E\. Voita, J\. Lam, P\. Hansanti, C\. Ropers, E\. Kalbassi, C\. Gao, L\. Barrault, and M\. R\. Costa\-jussà \(2023\) HalOmi: a manually annotated benchmark for multilingual hallucination and omission detection in machine translation\. External Links: 2305\.11746, [Link](https://arxiv.org/abs/2305.11746)\. Cited by: [§C\.2](https://arxiv.org/html/2607.00447#A3.SS2.p1.1)\.
- S\. De Deyne, D\. J\. Navarro, A\. Perfors, M\. Brysbaert, and G\. Storms \(2019\) The “small world of words” English word association norms for over 12,000 cue words\. Behavior Research Methods 51\(3\), pp\. 987–1006\. Cited by: [§C\.2](https://arxiv.org/html/2607.00447#A3.SS2.p3.1), [§G\.1](https://arxiv.org/html/2607.00447#A7.SS1.p1.1), [§1](https://arxiv.org/html/2607.00447#S1.SS0.SSS0.Px2.p1.1), [§2](https://arxiv.org/html/2607.00447#S2.p3.1), [§4](https://arxiv.org/html/2607.00447#S4.SS0.SSS0.Px2.p1.1)\.
- DeepSeek\-AI, A\. Liu, B\. Feng, B\. Xue, B\. Wang, B\. Wu, C\. Lu, C\. Zhao, C\. Deng, C\. Zhang, C\. Ruan, D\. Dai, D\. Guo, D\. Yang, D\. Chen, D\. Ji, E\. Li, F\. Lin, F\. Dai, F\. Luo, G\. Hao, G\. Chen, G\. Li, H\. Zhang, H\. Bao, H\. Xu, H\. Wang, H\. Zhang, H\. Ding, H\. Xin, H\. Gao, H\. Li, H\. Qu, J\. L\. Cai, J\. Liang, J\. Guo, J\. Ni, J\. Li, J\. Wang, J\. Chen, J\. Chen, J\. Yuan, J\. Qiu, J\. Li, J\. Song, K\. Dong, K\. Hu, K\. Gao, K\. Guan, K\. Huang, K\. Yu, L\. Wang, L\. Zhang, L\. Xu, L\. Xia, L\. Zhao, L\. Wang, L\. Zhang, M\. Li, M\. Wang, M\. Zhang, M\. Zhang, M\. Tang, M\. Li, N\. Tian, P\. Huang, P\. Wang, P\. Zhang, Q\. Wang, Q\. Zhu, Q\. Chen, Q\. Du, R\. J\. Chen, R\. L\. Jin, R\. Ge, R\. Zhang, R\. Pan, R\. Wang, R\. Xu, R\. Zhang, R\. Chen, S\. S\. Li, S\. Lu, S\. Zhou, S\. Chen, S\. Wu, S\. Ye, S\. Ye, S\. Ma, S\. Wang, S\. Zhou, S\. Yu, S\. Zhou, S\. Pan, T\. Wang, T\. Yun, T\. Pei, T\. Sun, W\. L\. Xiao, W\. Zeng, W\. Zhao, W\. An, W\. Liu, W\. Liang, W\. Gao, W\. Yu, W\. Zhang, X\. Q\. Li, X\. Jin, X\. Wang, X\. Bi, X\. Liu, X\. Wang, X\. Shen, X\. Chen, X\. Zhang, X\. Chen, X\. Nie, X\. Sun, X\. Wang, X\. Cheng, X\. Liu, X\. Xie, X\. Liu, X\. Yu, X\. Song, X\. Shan, X\. Zhou, X\. Yang, X\. Li, X\. Su, X\. Lin, Y\. K\. Li, Y\. Q\. Wang, Y\. X\. Wei, Y\. X\. Zhu, Y\. Zhang, Y\. Xu, Y\. Xu, Y\. Huang, Y\. Li, Y\. Zhao, Y\. Sun, Y\. Li, Y\. Wang, Y\. Yu, Y\. Zheng, Y\. Zhang, Y\. Shi, Y\. Xiong, Y\. He, Y\. Tang, Y\. Piao, Y\. Wang, Y\. Tan, Y\. Ma, Y\. Liu, Y\. Guo, Y\. Wu, Y\. Ou, Y\. Zhu, Y\. Wang, Y\. Gong, Y\. Zou, Y\. He, Y\. Zha, Y\. Xiong, Y\. Ma, Y\. Yan, Y\. Luo, Y\. You, Y\. Liu, Y\. Zhou, Z\. F\. Wu, Z\. Z\. Ren, Z\. Ren, Z\. Sha, Z\. Fu, Z\. Xu, Z\. Huang, Z\. Zhang, Z\. Xie, Z\. Zhang, Z\. Hao, Z\. Gou, Z\. Ma, Z\. Yan, Z\. Shao, Z\. Xu, Z\. Wu, Z\. Zhang, Z\. Li, Z\. Gu, Z\. Zhu, Z\. Liu, Z\. Li, Z\. Xie, Z\. Song, Z\. Gao, and Z\. 
Pan \(2025\) DeepSeek\-V3 technical report\. External Links: 2412\.19437, [Link](https://arxiv.org/abs/2412.19437)\. Cited by: [§1](https://arxiv.org/html/2607.00447#S1.p1.1)\.
- N\. Dziri, S\. Milton, M\. Yu, O\. Zaiane, and S\. Reddy \(2022\) On the origin of hallucinations in conversational models: is it the datasets or the models? In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, M\. Carpuat, M\. de Marneffe, and I\. V\. Meza Ruiz \(Eds\.\), Seattle, United States, pp\. 5271–5285\. External Links: [Link](https://aclanthology.org/2022.naacl-main.387/), [Document](https://dx.doi.org/10.18653/v1/2022.naacl-main.387)\. Cited by: [§1](https://arxiv.org/html/2607.00447#S1.p3.1)\.
- A\. R\. Fabbri, W\. Kryściński, B\. McCann, C\. Xiong, R\. Socher, and D\. Radev \(2021\) SummEval: re\-evaluating summarization evaluation\. External Links: 2007\.12626, [Link](https://arxiv.org/abs/2007.12626)\. Cited by: [§C\.2](https://arxiv.org/html/2607.00447#A3.SS2.p1.1)\.
- C\. Gao, H\. Chen, C\. Xiao, Z\. Chen, Z\. Liu, and M\. Sun \(2025\) H\-neurons: on the existence, impact, and origin of hallucination\-associated neurons in LLMs\. External Links: 2512\.01797, [Link](https://arxiv.org/abs/2512.01797)\. Cited by: [§1](https://arxiv.org/html/2607.00447#S1.p3.1)\.
- Z\. Gekhman, G\. Yona, R\. Aharoni, M\. Eyal, A\. Feder, R\. Reichart, and J\. Herzig \(2024\) Does fine\-tuning LLMs on new knowledge encourage hallucinations? External Links: 2405\.05904, [Link](https://arxiv.org/abs/2405.05904)\. Cited by: [§C\.1](https://arxiv.org/html/2607.00447#A3.SS1.SSS0.Px3.p2.1), [§2](https://arxiv.org/html/2607.00447#S2.p2.1)\.
- G\. R\. Ghosal, T\. Hashimoto, and A\. Raghunathan \(2024\) Understanding finetuning for factual knowledge extraction\. In Forty\-first International Conference on Machine Learning\. External Links: [Link](https://openreview.net/forum?id=cPsn9AcOYh)\. Cited by: [§C\.1](https://arxiv.org/html/2607.00447#A3.SS1.SSS0.Px3.p2.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, A\. Yang, A\. Fan, A\. Goyal, A\. Hartshorn, A\. Yang, A\. Mitra, A\. Sravankumar, A\. Korenev, A\. Hinsvark, A\. Rao, A\. Zhang, A\. Rodriguez, A\. Gregerson, A\. Spataru, B\. Roziere, B\. Biron, B\. Tang, B\. Chern, C\. Caucheteux, C\. Nayak, C\. Bi, C\. Marra, C\. McConnell, C\. Keller, C\. Touret, C\. Wu, C\. Wong, C\. C\. Ferrer, C\. Nikolaidis, D\. Allonsius, D\. Song, D\. Pintz, D\. Livshits, D\. Wyatt, D\. Esiobu, D\. Choudhary, D\. Mahajan, D\. Garcia\-Olano, D\. Perino, D\. Hupkes, E\. Lakomkin, E\. AlBadawy, E\. Lobanova, E\. Dinan, E\. M\. Smith, F\. Radenovic, F\. Guzmán, F\. Zhang, G\. Synnaeve, G\. Lee, G\. L\. Anderson, G\. Thattai, G\. Nail, G\. Mialon, G\. Pang, G\. Cucurell, H\. Nguyen, H\. Korevaar, H\. Xu, H\. Touvron, I\. Zarov, I\. A\. Ibarra, I\. Kloumann, I\. Misra, I\. Evtimov, J\. Zhang, J\. Copet, J\. Lee, J\. Geffert, J\. Vranes, J\. Park, J\. Mahadeokar, J\. Shah, J\. van der Linde, J\. Billock, J\. Hong, J\. Lee, J\. Fu, J\. Chi, J\. Huang, J\. Liu, J\. Wang, J\. Yu, J\. Bitton, J\. Spisak, J\. Park, J\. Rocca, J\. Johnstun, J\. Saxe, J\. Jia, K\. V\. Alwala, K\. Prasad, K\. Upasani, K\. Plawiak, K\. Li, K\. Heafield, K\. Stone, K\. El\-Arini, K\. Iyer, K\. Malik, K\. Chiu, K\. Bhalla, K\. Lakhotia, L\. Rantala\-Yeary, L\. van der Maaten, L\. Chen, L\. Tan, L\. Jenkins, L\. Martin, L\. Madaan, L\. Malo, L\. Blecher, L\. Landzaat, L\. de Oliveira, M\. Muzzi, M\. Pasupuleti, M\. Singh, M\. Paluri, M\. Kardas, M\. Tsimpoukelli, M\. Oldham, M\. Rita, M\. Pavlova, M\. Kambadur, M\. Lewis, M\. Si, M\. K\. Singh, M\. Hassan, N\. Goyal, N\. Torabi, N\. Bashlykov, N\. Bogoychev, N\. Chatterji, N\. Zhang, O\. Duchenne, O\. Çelebi, P\. Alrassy, P\. Zhang, P\. Li, P\. Vasic, P\. Weng, P\. Bhargava, P\. Dubal, P\. Krishnan, P\. S\. Koura, P\. Xu, Q\. He, Q\. Dong, R\. Srinivasan, R\. Ganapathy, R\. Calderer, R\. S\. 
Cabral, R\. Stojnic, R\. Raileanu, R\. Maheswari, R\. Girdhar, R\. Patel, R\. Sauvestre, R\. Polidoro, R\. Sumbaly, R\. Taylor, R\. Silva, R\. Hou, R\. Wang, S\. Hosseini, S\. Chennabasappa, S\. Singh, S\. Bell, S\. S\. Kim, S\. Edunov, S\. Nie, S\. Narang, S\. Raparthy, S\. Shen, S\. Wan, S\. Bhosale, S\. Zhang, S\. Vandenhende, S\. Batra, S\. Whitman, S\. Sootla, S\. Collot, S\. Gururangan, S\. Borodinsky, T\. Herman, T\. Fowler, T\. Sheasha, T\. Georgiou, T\. Scialom, T\. Speckbacher, T\. Mihaylov, T\. Xiao, U\. Karn, V\. Goswami, V\. Gupta, V\. Ramanathan, V\. Kerkez, V\. Gonguet, V\. Do, V\. Vogeti, V\. Albiero, V\. Petrovic, W\. Chu, W\. Xiong, W\. Fu, W\. Meers, X\. Martinet, X\. Wang, X\. Wang, X\. E\. Tan, X\. Xia, X\. Xie, X\. Jia, X\. Wang, Y\. Goldschlag, Y\. Gaur, Y\. Babaei, Y\. Wen, Y\. Song, Y\. Zhang, Y\. Li, Y\. Mao, Z\. D\. Coudert, Z\. Yan, Z\. Chen, Z\. Papakipos, A\. Singh, A\. Srivastava, A\. Jain, A\. Kelsey, A\. Shajnfeld, A\. Gangidi, A\. Victoria, A\. Goldstand, A\. Menon, A\. Sharma, A\. Boesenberg, A\. Baevski, A\. Feinstein, A\. Kallet, A\. Sangani, A\. Teo, A\. Yunus, A\. Lupu, A\. Alvarado, A\. Caples, A\. Gu, A\. Ho, A\. Poulton, A\. Ryan, A\. Ramchandani, A\. Dong, A\. Franco, A\. Goyal, A\. Saraf, A\. Chowdhury, A\. Gabriel, A\. Bharambe, A\. Eisenman, A\. Yazdan, B\. James, B\. Maurer, B\. Leonhardi, B\. Huang, B\. Loyd, B\. D\. Paola, B\. Paranjape, B\. Liu, B\. Wu, B\. Ni, B\. Hancock, B\. Wasti, B\. Spence, B\. Stojkovic, B\. Gamido, B\. Montalvo, C\. Parker, C\. Burton, C\. Mejia, C\. Liu, C\. Wang, C\. Kim, C\. Zhou, C\. Hu, C\. Chu, C\. Cai, C\. Tindal, C\. Feichtenhofer, C\. Gao, D\. Civin, D\. Beaty, D\. Kreymer, D\. Li, D\. Adkins, D\. Xu, D\. Testuggine, D\. David, D\. Parikh, D\. Liskovich, D\. Foss, D\. Wang, D\. Le, D\. Holland, E\. Dowling, E\. Jamil, E\. Montgomery, E\. Presani, E\. Hahn, E\. Wood, E\. Le, E\. Brinkman, E\. Arcaute, E\. Dunbar, E\. Smothers, F\. Sun, F\. Kreuk, F\. Tian, F\. Kokkinos, F\. 
Ozgenel, F\. Caggioni, F\. Kanayet, F\. Seide, G\. M\. Florez, G\. Schwarz, G\. Badeer, G\. Swee, G\. Halpern, G\. Herman, G\. Sizov, Guangyi, Zhang, G\. Lakshminarayanan, H\. Inan, H\. Shojanazeri, H\. Zou, H\. Wang, H\. Zha, H\. Habeeb, H\. Rudolph, H\. Suk, H\. Aspegren, H\. Goldman, H\. Zhan, I\. Damlaj, I\. Molybog, I\. Tufanov, I\. Leontiadis, I\. Veliche, I\. Gat, J\. Weissman, J\. Geboski, J\. Kohli, J\. Lam, J\. Asher, J\. Gaya, J\. Marcus, J\. Tang, J\. Chan, J\. Zhen, J\. Reizenstein, J\. Teboul, J\. Zhong, J\. Jin, J\. Yang, J\. Cummings, J\. Carvill, J\. Shepard, J\. McPhie, J\. Torres, J\. Ginsburg, J\. Wang, K\. Wu, K\. H\. U, K\. Saxena, K\. Khandelwal, K\. Zand, K\. Matosich, K\. Veeraraghavan, K\. Michelena, K\. Li, K\. Jagadeesh, K\. Huang, K\. Chawla, K\. Huang, L\. Chen, L\. Garg, L\. A, L\. Silva, L\. Bell, L\. Zhang, L\. Guo, L\. Yu, L\. Moshkovich, L\. Wehrstedt, M\. Khabsa, M\. Avalani, M\. Bhatt, M\. Mankus, M\. Hasson, M\. Lennie, M\. Reso, M\. Groshev, M\. Naumov, M\. Lathi, M\. Keneally, M\. Liu, M\. L\. Seltzer, M\. Valko, M\. Restrepo, M\. Patel, M\. Vyatskov, M\. Samvelyan, M\. Clark, M\. Macey, M\. Wang, M\. J\. Hermoso, M\. Metanat, M\. Rastegari, M\. Bansal, N\. Santhanam, N\. Parks, N\. White, N\. Bawa, N\. Singhal, N\. Egebo, N\. Usunier, N\. Mehta, N\. P\. Laptev, N\. Dong, N\. Cheng, O\. Chernoguz, O\. Hart, O\. Salpekar, O\. Kalinli, P\. Kent, P\. Parekh, P\. Saab, P\. Balaji, P\. Rittner, P\. Bontrager, P\. Roux, P\. Dollar, P\. Zvyagina, P\. Ratanchandani, P\. Yuvraj, Q\. Liang, R\. Alao, R\. Rodriguez, R\. Ayub, R\. Murthy, R\. Nayani, R\. Mitra, R\. Parthasarathy, R\. Li, R\. Hogan, R\. Battey, R\. Wang, R\. Howes, R\. Rinott, S\. Mehta, S\. Siby, S\. J\. Bondu, S\. Datta, S\. Chugh, S\. Hunt, S\. Dhillon, S\. Sidorov, S\. Pan, S\. Mahajan, S\. Verma, S\. Yamamoto, S\. Ramaswamy, S\. Lindsay, S\. Lindsay, S\. Feng, S\. Lin, S\. C\. Zha, S\. Patil, S\. Shankar, S\. Zhang, S\. Zhang, S\. Wang, S\. Agarwal, S\. 
Sajuyigbe, S\. Chintala, S\. Max, S\. Chen, S\. Kehoe, S\. Satterfield, S\. Govindaprasad, S\. Gupta, S\. Deng, S\. Cho, S\. Virk, S\. Subramanian, S\. Choudhury, S\. Goldman, T\. Remez, T\. Glaser, T\. Best, T\. Koehler, T\. Robinson, T\. Li, T\. Zhang, T\. Matthews, T\. Chou, T\. Shaked, V\. Vontimitta, V\. Ajayi, V\. Montanez, V\. Mohan, V\. S\. Kumar, V\. Mangla, V\. Ionescu, V\. Poenaru, V\. T\. Mihailescu, V\. Ivanov, W\. Li, W\. Wang, W\. Jiang, W\. Bouaziz, W\. Constable, X\. Tang, X\. Wu, X\. Wang, X\. Wu, X\. Gao, Y\. Kleinman, Y\. Chen, Y\. Hu, Y\. Jia, Y\. Qi, Y\. Li, Y\. Zhang, Y\. Zhang, Y\. Adi, Y\. Nam, Yu, Wang, Y\. Zhao, Y\. Hao, Y\. Qian, Y\. Li, Y\. He, Z\. Rait, Z\. DeVito, Z\. Rosnbrick, Z\. Wen, Z\. Yang, Z\. Zhao, and Z\. Ma \(2024\) The Llama 3 herd of models\. External Links: 2407\.21783, [Link](https://arxiv.org/abs/2407.21783)\. Cited by: [§1](https://arxiv.org/html/2607.00447#S1.p1.1)\.
- T\. Guan, F\. Liu, X\. Wu, R\. Xian, Z\. Li, X\. Liu, X\. Wang, L\. Chen, F\. Huang, Y\. Yacoob, D\. Manocha, and T\. Zhou \(2024\) HallusionBench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision\-language models\. External Links: 2310\.14566, [Link](https://arxiv.org/abs/2310.14566)\. Cited by: [§C\.2](https://arxiv.org/html/2607.00447#A3.SS2.p2.1)\.
- O\. Honovich, R\. Aharoni, J\. Herzig, H\. Taitelbaum, D\. Kukliansy, V\. Cohen, T\. Scialom, I\. Szpektor, A\. Hassidim, and Y\. Matias \(2022\) TRUE: re\-evaluating factual consistency evaluation\. External Links: 2204\.04991, [Link](https://arxiv.org/abs/2204.04991)\. Cited by: [§C\.2](https://arxiv.org/html/2607.00447#A3.SS2.p1.1)\.
- L\. Huang, W\. Yu, W\. Ma, W\. Zhong, Z\. Feng, H\. Wang, Q\. Chen, W\. Peng, X\. Feng, B\. Qin, and T\. Liu \(2025\) A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions\. ACM Transactions on Information Systems 43\(2\), pp\. 1–55\. External Links: ISSN 1558\-2868, [Link](http://dx.doi.org/10.1145/3703155), [Document](https://dx.doi.org/10.1145/3703155)\. Cited by: [§C\.1](https://arxiv.org/html/2607.00447#A3.SS1.SSS0.Px3.p1.1), [§C\.1](https://arxiv.org/html/2607.00447#A3.SS1.SSS0.Px3.p3.1), [§1](https://arxiv.org/html/2607.00447#S1.p2.1)\.
- Z\. Ji, N\. Lee, R\. Frieske, T\. Yu, D\. Su, Y\. Xu, E\. Ishii, Y\. J\. Bang, A\. Madotto, and P\. Fung \(2023\) Survey of hallucination in natural language generation\. ACM Computing Surveys 55\(12\), pp\. 1–38\. External Links: ISSN 1557\-7341, [Link](http://dx.doi.org/10.1145/3571730), [Document](https://dx.doi.org/10.1145/3571730)\. Cited by: [§C\.1](https://arxiv.org/html/2607.00447#A3.SS1.SSS0.Px3.p1.1), [§1](https://arxiv.org/html/2607.00447#S1.p2.1)\.
- A\. T\. Kalai, O\. Nachum, S\. S\. Vempala, and E\. Zhang \(2025\) Why language models hallucinate\. External Links: 2509\.04664, [Link](https://arxiv.org/abs/2509.04664)\. Cited by: [§C\.1](https://arxiv.org/html/2607.00447#A3.SS1.SSS0.Px3.p1.1)\.
- A\. T\. Kalai and S\. S\. Vempala \(2024\) Calibrated language models must hallucinate\. External Links: 2311\.14648, [Link](https://arxiv.org/abs/2311.14648)\. Cited by: [§C\.1](https://arxiv.org/html/2607.00447#A3.SS1.SSS0.Px2.p1.1), [§2](https://arxiv.org/html/2607.00447#S2.p2.1)\.
- K\. Kang, E\. Wallace, C\. Tomlin, A\. Kumar, and S\. Levine \(2024\) Unfamiliar finetuning examples control how language models hallucinate\. External Links: 2403\.05612, [Link](https://arxiv.org/abs/2403.05612)\. Cited by: [§C\.1](https://arxiv.org/html/2607.00447#A3.SS1.SSS0.Px3.p2.1)\.
- S\. Krishna, K\. Krishna, A\. Mohananey, S\. Schwarcz, A\. Stambler, S\. Upadhyay, and M\. Faruqui \(2025\) Fact, fetch, and reason: a unified evaluation of retrieval\-augmented generation\. External Links: 2409\.12941, [Link](https://arxiv.org/abs/2409.12941)\. Cited by: [§C\.2](https://arxiv.org/html/2607.00447#A3.SS2.p2.1), [§2](https://arxiv.org/html/2607.00447#S2.p3.1)\.
- W\. Kryściński, B\. McCann, C\. Xiong, and R\. Socher \(2019\) Evaluating the factual consistency of abstractive text summarization\. External Links: 1910\.12840, [Link](https://arxiv.org/abs/1910.12840)\. Cited by: [§C\.2](https://arxiv.org/html/2607.00447#A3.SS2.p1.1)\.
- K\. Lee, O\. Firat, A\. Agarwal, C\. Fannjiang, and D\. Sussillo \(2019\) Hallucinations in neural machine translation\. External Links: [Link](https://openreview.net/forum?id=SkxJ-309FQ)\. Cited by: [§C\.1](https://arxiv.org/html/2607.00447#A3.SS1.SSS0.Px1.p1.1), [§2](https://arxiv.org/html/2607.00447#S2.p1.1)\.
- J\. Li, X\. Cheng, X\. Zhao, J\. Nie, and J\. Wen \(2023\) HaluEval: a large\-scale hallucination evaluation benchmark for large language models\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\), Singapore, pp\. 6449–6464\. External Links: [Link](https://aclanthology.org/2023.emnlp-main.397/), [Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.397)\. Cited by: [§C\.2](https://arxiv.org/html/2607.00447#A3.SS2.p1.1), [§1](https://arxiv.org/html/2607.00447#S1.p3.1), [§2](https://arxiv.org/html/2607.00447#S2.p3.1)\.
- S\. Lin, L\. Gao, B\. Oguz, W\. Xiong, J\. Lin, W\. Yih, and X\. Chen \(2024\) FLAME: factuality\-aware alignment for large language models\. External Links: 2405\.01525, [Link](https://arxiv.org/abs/2405.01525)\. Cited by: [§C\.1](https://arxiv.org/html/2607.00447#A3.SS1.SSS0.Px3.p2.1)\.
- S\. Lin, L\. Duan, P\. Hughes, and Y\. Sheng \(2025\) Harnessing RLHF for robust unanswerability recognition and trustworthy response generation in LLMs\. External Links: 2507\.16951, [Link](https://arxiv.org/abs/2507.16951)\. Cited by: [§C\.1](https://arxiv.org/html/2607.00447#A3.SS1.SSS0.Px4.p1.1)\.
- S\. Lin, J\. Hilton, and O\. Evans \(2022\) TruthfulQA: measuring how models mimic human falsehoods\. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), S\. Muresan, P\. Nakov, and A\. Villavicencio \(Eds\.\), Dublin, Ireland, pp\. 3214–3252\. External Links: [Link](https://aclanthology.org/2022.acl-long.229/), [Document](https://dx.doi.org/10.18653/v1/2022.acl-long.229)\. Cited by: [§C\.2](https://arxiv.org/html/2607.00447#A3.SS2.p1.1), [§1](https://arxiv.org/html/2607.00447#S1.p3.1), [§2](https://arxiv.org/html/2607.00447#S2.p3.1)\.
- L\. Liu, K\. Lv, H\. Chen, W\. Zhang, Y\. Wang, S\. Liu, X\. Tong, Y\. Yuan, Y\. Wang, W\. Su, and B\. Zheng \(2026\) PretrainRL: alleviating factuality hallucination of large language models at the beginning\. External Links: 2602\.01875, [Link](https://arxiv.org/abs/2602.01875)\. Cited by: [§C\.1](https://arxiv.org/html/2607.00447#A3.SS1.SSS0.Px3.p1.1), [§2](https://arxiv.org/html/2607.00447#S2.p2.1)\.
- X\. Liu, H\. Lai, H\. Yu, Y\. Xu, A\. Zeng, Z\. Du, P\. Zhang, Y\. Dong, and J\. Tang \(2023\) WebGLM: towards an efficient web\-enhanced question answering system with human preferences\. External Links: 2306\.07906, [Link](https://arxiv.org/abs/2306.07906)\. Cited by: [§1](https://arxiv.org/html/2607.00447#S1.p1.1)\.
- J\. Maynez, S\. Narayan, B\. Bohnet, and R\. McDonald \(2020\) On faithfulness and factuality in abstractive summarization\. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, D\. Jurafsky, J\. Chai, N\. Schluter, and J\. Tetreault \(Eds\.\), Online, pp\. 1906–1919\. External Links: [Link](https://aclanthology.org/2020.acl-main.173/), [Document](https://dx.doi.org/10.18653/v1/2020.acl-main.173)\. Cited by: [§C\.1](https://arxiv.org/html/2607.00447#A3.SS1.SSS0.Px1.p1.1), [§2](https://arxiv.org/html/2607.00447#S2.p1.1)\.
- N\. McKenna, T\. Li, L\. Cheng, M\. J\. Hosseini, M\. Johnson, and M\. Steedman \(2023\) Sources of hallucination by large language models on inference tasks\. External Links: 2305\.14552, [Link](https://arxiv.org/abs/2305.14552)\. Cited by: [§C\.1](https://arxiv.org/html/2607.00447#A3.SS1.SSS0.Px3.p3.1), [§2](https://arxiv.org/html/2607.00447#S2.p2.1)\.
- S\. Min, K\. Krishna, X\. Lyu, M\. Lewis, W\. Yih, P\. Koh, M\. Iyyer, L\. Zettlemoyer, and H\. Hajishirzi \(2023\) FActScore: fine\-grained atomic evaluation of factual precision in long form text generation\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\), Singapore, pp\. 12076–12100\. External Links: [Link](https://aclanthology.org/2023.emnlp-main.741/), [Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.741)\. Cited by: [§C\.2](https://arxiv.org/html/2607.00447#A3.SS2.p1.1), [§2](https://arxiv.org/html/2607.00447#S2.p3.1)\.
- R\. Nakano, J\. Hilton, S\. Balaji, J\. Wu, L\. Ouyang, C\. Kim, C\. Hesse, S\. Jain, V\. Kosaraju, W\. Saunders, X\. Jiang, K\. Cobbe, T\. Eloundou, G\. Krueger, K\. Button, M\. Knight, B\. Chess, and J\. Schulman \(2022\)WebGPT: browser\-assisted question\-answering with human feedback\.External Links:2112\.09332,[Link](https://arxiv.org/abs/2112.09332)Cited by:[§1](https://arxiv.org/html/2607.00447#S1.p1.1)\.
- C\. Niu, Y\. Wu, J\. Zhu, S\. Xu, K\. Shum, R\. Zhong, J\. Song, and T\. Zhang \(2024\)RAGTruth: a hallucination corpus for developing trustworthy retrieval\-augmented language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 10862–10878\.External Links:[Link](https://aclanthology.org/2024.acl-long.585/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.585)Cited by:[§C\.2](https://arxiv.org/html/2607.00447#A3.SS2.p2.1),[§2](https://arxiv.org/html/2607.00447#S2.p3.1)\.
- OpenAI, J\. Achiam, S\. Adler, S\. Agarwal, L\. Ahmad, I\. Akkaya, F\. L\. Aleman, D\. Almeida, J\. Altenschmidt, S\. Altman, S\. Anadkat, R\. Avila, I\. Babuschkin, S\. Balaji, V\. Balcom, P\. Baltescu, H\. Bao, M\. Bavarian, J\. Belgum, I\. Bello, J\. Berdine, G\. Bernadett\-Shapiro, C\. Berner, L\. Bogdonoff, O\. Boiko, M\. Boyd, A\. Brakman, G\. Brockman, T\. Brooks, M\. Brundage, K\. Button, T\. Cai, R\. Campbell, A\. Cann, B\. Carey, C\. Carlson, R\. Carmichael, B\. Chan, C\. Chang, F\. Chantzis, D\. Chen, S\. Chen, R\. Chen, J\. Chen, M\. Chen, B\. Chess, C\. Cho, C\. Chu, H\. W\. Chung, D\. Cummings, J\. Currier, Y\. Dai, C\. Decareaux, T\. Degry, N\. Deutsch, D\. Deville, A\. Dhar, D\. Dohan, S\. Dowling, S\. Dunning, A\. Ecoffet, A\. Eleti, T\. Eloundou, D\. Farhi, L\. Fedus, N\. Felix, S\. P\. Fishman, J\. Forte, I\. Fulford, L\. Gao, E\. Georges, C\. Gibson, V\. Goel, T\. Gogineni, G\. Goh, R\. Gontijo\-Lopes, J\. Gordon, M\. Grafstein, S\. Gray, R\. Greene, J\. Gross, S\. S\. Gu, Y\. Guo, C\. Hallacy, J\. Han, J\. Harris, Y\. He, M\. Heaton, J\. Heidecke, C\. Hesse, A\. Hickey, W\. Hickey, P\. Hoeschele, B\. Houghton, K\. Hsu, S\. Hu, X\. Hu, J\. Huizinga, S\. Jain, S\. Jain, J\. Jang, A\. Jiang, R\. Jiang, H\. Jin, D\. Jin, S\. Jomoto, B\. Jonn, H\. Jun, T\. Kaftan, Ł\. Kaiser, A\. Kamali, I\. Kanitscheider, N\. S\. Keskar, T\. Khan, L\. Kilpatrick, J\. W\. Kim, C\. Kim, Y\. Kim, J\. H\. Kirchner, J\. Kiros, M\. Knight, D\. Kokotajlo, Ł\. Kondraciuk, A\. Kondrich, A\. Konstantinidis, K\. Kosic, G\. Krueger, V\. Kuo, M\. Lampe, I\. Lan, T\. Lee, J\. Leike, J\. Leung, D\. Levy, C\. M\. Li, R\. Lim, M\. Lin, S\. Lin, M\. Litwin, T\. Lopez, R\. Lowe, P\. Lue, A\. Makanju, K\. Malfacini, S\. Manning, T\. Markov, Y\. Markovski, B\. Martin, K\. Mayer, A\. Mayne, B\. McGrew, S\. M\. McKinney, C\. McLeavey, P\. McMillan, J\. McNeil, D\. Medina, A\. Mehta, J\. Menick, L\. Metz, A\. Mishchenko, P\. Mishkin, V\. Monaco, E\. Morikawa, D\. Mossing, T\. Mu, M\. 
Murati, O\. Murk, D\. Mély, A\. Nair, R\. Nakano, R\. Nayak, A\. Neelakantan, R\. Ngo, H\. Noh, L\. Ouyang, C\. O’Keefe, J\. Pachocki, A\. Paino, J\. Palermo, A\. Pantuliano, G\. Parascandolo, J\. Parish, E\. Parparita, A\. Passos, M\. Pavlov, A\. Peng, A\. Perelman, F\. de Avila Belbute Peres, M\. Petrov, H\. P\. de Oliveira Pinto, Michael, Pokorny, M\. Pokrass, V\. H\. Pong, T\. Powell, A\. Power, B\. Power, E\. Proehl, R\. Puri, A\. Radford, J\. Rae, A\. Ramesh, C\. Raymond, F\. Real, K\. Rimbach, C\. Ross, B\. Rotsted, H\. Roussez, N\. Ryder, M\. Saltarelli, T\. Sanders, S\. Santurkar, G\. Sastry, H\. Schmidt, D\. Schnurr, J\. Schulman, D\. Selsam, K\. Sheppard, T\. Sherbakov, J\. Shieh, S\. Shoker, P\. Shyam, S\. Sidor, E\. Sigler, M\. Simens, J\. Sitkin, K\. Slama, I\. Sohl, B\. Sokolowsky, Y\. Song, N\. Staudacher, F\. P\. Such, N\. Summers, I\. Sutskever, J\. Tang, N\. Tezak, M\. B\. Thompson, P\. Tillet, A\. Tootoonchian, E\. Tseng, P\. Tuggle, N\. Turley, J\. Tworek, J\. F\. C\. Uribe, A\. Vallone, A\. Vijayvergiya, C\. Voss, C\. Wainwright, J\. J\. Wang, A\. Wang, B\. Wang, J\. Ward, J\. Wei, C\. Weinmann, A\. Welihinda, P\. Welinder, J\. Weng, L\. Weng, M\. Wiethoff, D\. Willner, C\. Winter, S\. Wolrich, H\. Wong, L\. Workman, S\. Wu, J\. Wu, M\. Wu, K\. Xiao, T\. Xu, S\. Yoo, K\. Yu, Q\. Yuan, W\. Zaremba, R\. Zellers, C\. Zhang, M\. Zhang, S\. Zhao, T\. Zheng, J\. Zhuang, W\. Zhuk, and B\. Zoph \(2024\)GPT\-4 technical report\.External Links:2303\.08774,[Link](https://arxiv.org/abs/2303.08774)Cited by:[§1](https://arxiv.org/html/2607.00447#S1.p1.1)\.
- OpenAI \(2024\)New embedding models and api updates\.Note:Introduces text\-embedding\-3\-small\.External Links:[Link](https://openai.com/index/new-embedding-models-and-api-updates/)Cited by:[§D\.2](https://arxiv.org/html/2607.00447#A4.SS2.p1.8),[§4](https://arxiv.org/html/2607.00447#S4.SS0.SSS0.Px1.p1.1)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. L\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray, J\. Schulman, J\. Hilton, F\. Kelton, L\. Miller, M\. Simens, A\. Askell, P\. Welinder, P\. Christiano, J\. Leike, and R\. Lowe \(2022\)Training language models to follow instructions with human feedback\.External Links:2203\.02155,[Link](https://arxiv.org/abs/2203.02155)Cited by:[§C\.1](https://arxiv.org/html/2607.00447#A3.SS1.SSS0.Px4.p1.1)\.
- A\. Pagnoni, V\. Balachandran, and Y\. Tsvetkov \(2021\)Understanding factuality in abstractive summarization with frank: a benchmark for factuality metrics\.External Links:2104\.13346,[Link](https://arxiv.org/abs/2104.13346)Cited by:[§C\.2](https://arxiv.org/html/2607.00447#A3.SS2.p1.1)\.
- A\. Parikh, X\. Wang, S\. Gehrmann, M\. Faruqui, B\. Dhingra, D\. Yang, and D\. Das \(2020\)ToTTo: a controlled table\-to\-text generation dataset\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),B\. Webber, T\. Cohn, Y\. He, and Y\. Liu \(Eds\.\),Online,pp\. 1173–1186\.External Links:[Link](https://aclanthology.org/2020.emnlp-main.89/),[Document](https://dx.doi.org/10.18653/v1/2020.emnlp-main.89)Cited by:[§C\.2](https://arxiv.org/html/2607.00447#A3.SS2.p1.1)\.
- V\. Raunak, A\. Menezes, and M\. Junczys\-Dowmunt \(2021\)The curious case of hallucinations in neural machine translation\.External Links:2104\.06683,[Link](https://arxiv.org/abs/2104.06683)Cited by:[§C\.1](https://arxiv.org/html/2607.00447#A3.SS1.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2607.00447#S2.p1.1)\.
- M\. Ren, B\. Cao, H\. Lin, C\. Liu, X\. Han, K\. Zeng, G\. Wan, X\. Cai, and L\. Sun \(2024\)Learning or self\-aligning? rethinking instruction fine\-tuning\.External Links:2402\.18243,[Link](https://arxiv.org/abs/2402.18243)Cited by:[§C\.1](https://arxiv.org/html/2607.00447#A3.SS1.SSS0.Px3.p2.1)\.
- P\. Steinberger \(2026\)Introducing openclaw\.Note:OpenClaw Blog, [https://openclaw\.ai/blog/introducing\-openclaw](https://openclaw.ai/blog/introducing-openclaw)\. Accessed: 2026\-04\-22\.Cited by:[§1](https://arxiv.org/html/2607.00447#S1.p1.1)\.
- Y\. Sun, Y\. Gai, L\. Chen, A\. Ravichander, Y\. Choi, and D\. Song \(2025a\)Why and how llms hallucinate: connecting the dots with subsequence associations\.External Links:2504\.12691,[Link](https://arxiv.org/abs/2504.12691)Cited by:[§C\.1](https://arxiv.org/html/2607.00447#A3.SS1.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2607.00447#S2.p2.1)\.
- Y\. Sun, Y\. Gai, L\. Chen, A\. Ravichander, Y\. Choi, and D\. Song \(2025b\)Why and how llms hallucinate: connecting the dots with subsequence associations\.External Links:2504\.12691,[Link](https://arxiv.org/abs/2504.12691)Cited by:[§1](https://arxiv.org/html/2607.00447#S1.p3.1)\.
- G\. Team, R\. Anil, S\. Borgeaud, J\. Alayrac, J\. Yu, R\. Soricut, J\. Schalkwyk, A\. M\. Dai, A\. Hauth, K\. Millican, D\. Silver, M\. Johnson, I\. Antonoglou, J\. Schrittwieser, A\. Glaese, J\. Chen, E\. Pitler, T\. Lillicrap, A\. Lazaridou, O\. Firat, J\. Molloy, M\. Isard, P\. R\. Barham, T\. Hennigan, B\. Lee, F\. Viola, M\. Reynolds, Y\. Xu, R\. Doherty, E\. Collins, C\. Meyer, E\. Rutherford, E\. Moreira, K\. Ayoub, M\. Goel, J\. Krawczyk, C\. Du, E\. Chi, H\. Cheng, E\. Ni, P\. Shah, P\. Kane, B\. Chan, M\. Faruqui, A\. Severyn, H\. Lin, Y\. Li, Y\. Cheng, A\. Ittycheriah, M\. Mahdieh, M\. Chen, P\. Sun, D\. Tran, S\. Bagri, B\. Lakshminarayanan, J\. Liu, A\. Orban, F\. Güra, H\. Zhou, X\. Song, A\. Boffy, H\. Ganapathy, S\. Zheng, H\. Choe, Á\. Weisz, T\. Zhu, Y\. Lu, S\. Gopal, J\. Kahn, M\. Kula, J\. Pitman, R\. Shah, E\. Taropa, M\. A\. Merey, M\. Baeuml, Z\. Chen, L\. E\. Shafey, Y\. Zhang, O\. Sercinoglu, G\. Tucker, E\. Piqueras, M\. Krikun, I\. Barr, N\. Savinov, I\. Danihelka, B\. Roelofs, A\. White, A\. Andreassen, T\. von Glehn, L\. Yagati, M\. Kazemi, L\. Gonzalez, M\. Khalman, J\. Sygnowski, A\. Frechette, C\. Smith, L\. Culp, L\. Proleev, Y\. Luan, X\. Chen, J\. Lottes, N\. Schucher, F\. Lebron, A\. Rrustemi, N\. Clay, P\. Crone, T\. Kocisky, J\. Zhao, B\. Perz, D\. Yu, H\. Howard, A\. Bloniarz, J\. W\. Rae, H\. Lu, L\. Sifre, M\. Maggioni, F\. Alcober, D\. Garrette, M\. Barnes, S\. Thakoor, J\. Austin, G\. Barth\-Maron, W\. Wong, R\. Joshi, R\. Chaabouni, D\. Fatiha, A\. Ahuja, G\. S\. Tomar, E\. Senter, M\. Chadwick, I\. Kornakov, N\. Attaluri, I\. Iturrate, R\. Liu, Y\. Li, S\. Cogan, J\. Chen, C\. Jia, C\. Gu, Q\. Zhang, J\. Grimstad, A\. J\. Hartman, X\. Garcia, T\. S\. Pillai, J\. Devlin, M\. Laskin, D\. de Las Casas, D\. Valter, C\. Tao, L\. Blanco, A\. P\. Badia, D\. Reitter, M\. Chen, J\. Brennan, C\. Rivera, S\. Brin, S\. Iqbal, G\. Surita, J\. Labanowski, A\. Rao, S\. Winkler, E\. Parisotto, Y\. Gu, K\. Olszewska, R\. 
Addanki, A\. Miech, A\. Louis, D\. Teplyashin, G\. Brown, E\. Catt, J\. Balaguer, J\. Xiang, P\. Wang, Z\. Ashwood, A\. Briukhov, A\. Webson, S\. Ganapathy, S\. Sanghavi, A\. Kannan, M\. Chang, A\. Stjerngren, J\. Djolonga, Y\. Sun, A\. Bapna, M\. Aitchison, P\. Pejman, H\. Michalewski, T\. Yu, C\. Wang, J\. Love, J\. Ahn, D\. Bloxwich, K\. Han, P\. Humphreys, T\. Sellam, J\. Bradbury, V\. Godbole, S\. Samangooei, B\. Damoc, A\. Kaskasoli, S\. M\. R\. Arnold, V\. Vasudevan, S\. Agrawal, J\. Riesa, D\. Lepikhin, R\. Tanburn, S\. Srinivasan, H\. Lim, S\. Hodkinson, P\. Shyam, J\. Ferret, S\. Hand, A\. Garg, T\. L\. Paine, J\. Li, Y\. Li, M\. Giang, A\. Neitz, Z\. Abbas, S\. York, M\. Reid, E\. Cole, A\. Chowdhery, D\. Das, D\. Rogozińska, V\. Nikolaev, P\. Sprechmann, Z\. Nado, L\. Zilka, F\. Prost, L\. He, M\. Monteiro, G\. Mishra, C\. Welty, J\. Newlan, D\. Jia, M\. Allamanis, C\. H\. Hu, R\. de Liedekerke, J\. Gilmer, C\. Saroufim, S\. Rijhwani, S\. Hou, D\. Shrivastava, A\. Baddepudi, A\. Goldin, A\. Ozturel, A\. Cassirer, Y\. Xu, D\. Sohn, D\. Sachan, R\. K\. Amplayo, C\. Swanson, D\. Petrova, S\. Narayan, A\. Guez, S\. Brahma, J\. Landon, M\. Patel, R\. Zhao, K\. Villela, L\. Wang, W\. Jia, M\. Rahtz, M\. Giménez, L\. Yeung, J\. Keeling, P\. Georgiev, D\. Mincu, B\. Wu, S\. Haykal, R\. Saputro, K\. Vodrahalli, J\. Qin, Z\. Cankara, A\. Sharma, N\. Fernando, W\. Hawkins, B\. Neyshabur, S\. Kim, A\. Hutter, P\. Agrawal, A\. Castro\-Ros, G\. van den Driessche, T\. Wang, F\. Yang, S\. Chang, P\. Komarek, R\. McIlroy, M\. Lučić, G\. Zhang, W\. Farhan, M\. Sharman, P\. Natsev, P\. Michel, Y\. Bansal, S\. Qiao, K\. Cao, S\. Shakeri, C\. Butterfield, J\. Chung, P\. K\. Rubenstein, S\. Agrawal, A\. Mensch, K\. Soparkar, K\. Lenc, T\. Chung, A\. Pope, L\. Maggiore, J\. Kay, P\. Jhakra, S\. Wang, J\. Maynez, M\. Phuong, T\. Tobin, A\. Tacchetti, M\. Trebacz, K\. Robinson, Y\. Katariya, S\. Riedel, P\. Bailey, K\. Xiao, N\. Ghelani, L\. Aroyo, A\. Slone, N\. Houlsby, X\. 
Xiong, Z\. Yang, E\. Gribovskaya, J\. Adler, M\. Wirth, L\. Lee, M\. Li, T\. Kagohara, J\. Pavagadhi, S\. Bridgers, A\. Bortsova, S\. Ghemawat, Z\. Ahmed, T\. Liu, R\. Powell, V\. Bolina, M\. Iinuma, P\. Zablotskaia, J\. Besley, D\. Chung, T\. Dozat, R\. Comanescu, X\. Si, J\. Greer, G\. Su, M\. Polacek, R\. L\. Kaufman, S\. Tokumine, H\. Hu, E\. Buchatskaya, Y\. Miao, M\. Elhawaty, A\. Siddhant, N\. Tomasev, J\. Xing, C\. Greer, H\. Miller, S\. Ashraf, A\. Roy, Z\. Zhang, A\. Ma, A\. Filos, M\. Besta, R\. Blevins, T\. Klimenko, C\. Yeh, S\. Changpinyo, J\. Mu, O\. Chang, M\. Pajarskas, C\. Muir, V\. Cohen, C\. L\. Lan, K\. Haridasan, A\. Marathe, S\. Hansen, S\. Douglas, R\. Samuel, M\. Wang, S\. Austin, C\. Lan, J\. Jiang, J\. Chiu, J\. A\. Lorenzo, L\. L\. Sjösund, S\. Cevey, Z\. Gleicher, T\. Avrahami, A\. Boral, H\. Srinivasan, V\. Selo, R\. May, K\. Aisopos, L\. Hussenot, L\. B\. Soares, K\. Baumli, M\. B\. Chang, A\. Recasens, B\. Caine, A\. Pritzel, F\. Pavetic, F\. Pardo, A\. Gergely, J\. Frye, V\. Ramasesh, D\. Horgan, K\. Badola, N\. Kassner, S\. Roy, E\. Dyer, V\. C\. Campos, A\. Tomala, Y\. Tang, D\. E\. Badawy, E\. White, B\. Mustafa, O\. Lang, A\. Jindal, S\. Vikram, Z\. Gong, S\. Caelles, R\. Hemsley, G\. Thornton, F\. Feng, W\. Stokowiec, C\. Zheng, P\. Thacker, Ç\. Ünlü, Z\. Zhang, M\. Saleh, J\. Svensson, M\. Bileschi, P\. Patil, A\. Anand, R\. Ring, K\. Tsihlas, A\. Vezer, M\. Selvi, T\. Shevlane, M\. Rodriguez, T\. Kwiatkowski, S\. Daruki, K\. Rong, A\. Dafoe, N\. FitzGerald, K\. Gu\-Lemberg, M\. Khan, L\. A\. Hendricks, M\. Pellat, V\. Feinberg, J\. Cobon\-Kerr, T\. Sainath, M\. Rauh, S\. H\. Hashemi, R\. Ives, Y\. Hasson, E\. Noland, Y\. Cao, N\. Byrd, L\. Hou, Q\. Wang, T\. Sottiaux, M\. Paganini, J\. Lespiau, A\. Moufarek, S\. Hassan, K\. Shivakumar, J\. van Amersfoort, A\. Mandhane, P\. Joshi, A\. Goyal, M\. Tung, A\. Brock, H\. Sheahan, V\. Misra, C\. Li, N\. Rakićević, M\. Dehghani, F\. Liu, S\. Mittal, J\. Oh, S\. Noury, E\. 
Sezener, F\. Huot, M\. Lamm, N\. D\. Cao, C\. Chen, S\. Mudgal, R\. Stella, K\. Brooks, G\. Vasudevan, C\. Liu, M\. Chain, N\. Melinkeri, A\. Cohen, V\. Wang, K\. Seymore, S\. Zubkov, R\. Goel, S\. Yue, S\. Krishnakumaran, B\. Albert, N\. Hurley, M\. Sano, A\. Mohananey, J\. Joughin, E\. Filonov, T\. Kępa, Y\. Eldawy, J\. Lim, R\. Rishi, S\. Badiezadegan, T\. Bos, J\. Chang, S\. Jain, S\. G\. S\. Padmanabhan, S\. Puttagunta, K\. Krishna, L\. Baker, N\. Kalb, V\. Bedapudi, A\. Kurzrok, S\. Lei, A\. Yu, O\. Litvin, X\. Zhou, Z\. Wu, S\. Sobell, A\. Siciliano, A\. Papir, R\. Neale, J\. Bragagnolo, T\. Toor, T\. Chen, V\. Anklin, F\. Wang, R\. Feng, M\. Gholami, K\. Ling, L\. Liu, J\. Walter, H\. Moghaddam, A\. Kishore, J\. Adamek, T\. Mercado, J\. Mallinson, S\. Wandekar, S\. Cagle, E\. Ofek, G\. Garrido, C\. Lombriser, M\. Mukha, B\. Sun, H\. R\. Mohammad, J\. Matak, Y\. Qian, V\. Peswani, P\. Janus, Q\. Yuan, L\. Schelin, O\. David, A\. Garg, Y\. He, O\. Duzhyi, A\. Älgmyr, T\. Lottaz, Q\. Li, V\. Yadav, L\. Xu, A\. Chinien, R\. Shivanna, A\. Chuklin, J\. Li, C\. Spadine, T\. Wolfe, K\. Mohamed, S\. Das, Z\. Dai, K\. He, D\. von Dincklage, S\. Upadhyay, A\. Maurya, L\. Chi, S\. Krause, K\. Salama, P\. G\. Rabinovitch, P\. K\. R\. M, A\. Selvan, M\. Dektiarev, G\. Ghiasi, E\. Guven, H\. Gupta, B\. Liu, D\. Sharma, I\. H\. Shtacher, S\. Paul, O\. Akerlund, F\. Aubet, T\. Huang, C\. Zhu, E\. Zhu, E\. Teixeira, M\. Fritze, F\. Bertolini, L\. Marinescu, M\. Bölle, D\. Paulus, K\. Gupta, T\. Latkar, M\. Chang, J\. Sanders, R\. Wilson, X\. Wu, Y\. Tan, L\. N\. Thiet, T\. Doshi, S\. Lall, S\. Mishra, W\. Chen, T\. Luong, S\. Benjamin, J\. Lee, E\. Andrejczuk, D\. Rabiej, V\. Ranjan, K\. Styrc, P\. Yin, J\. Simon, M\. R\. Harriott, M\. Bansal, A\. Robsky, G\. Bacon, D\. Greene, D\. Mirylenka, C\. Zhou, O\. Sarvana, A\. Goyal, S\. Andermatt, P\. Siegler, B\. Horn, A\. Israel, F\. Pongetti, C\. "\. Chen, M\. Selvatici, P\. Silva, K\. Wang, J\. Tolins, K\. Guu, R\. Yogev, X\. 
Cai, A\. Agostini, M\. Shah, H\. Nguyen, N\. Ó\. Donnaile, S\. Pereira, L\. Friso, A\. Stambler, A\. Kurzrok, C\. Kuang, Y\. Romanikhin, M\. Geller, Z\. Yan, K\. Jang, C\. Lee, W\. Fica, E\. Malmi, Q\. Tan, D\. Banica, D\. Balle, R\. Pham, Y\. Huang, D\. Avram, H\. Shi, J\. Singh, C\. Hidey, N\. Ahuja, P\. Saxena, D\. Dooley, S\. P\. Potharaju, E\. O’Neill, A\. Gokulchandran, R\. Foley, K\. Zhao, M\. Dusenberry, Y\. Liu, P\. Mehta, R\. Kotikalapudi, C\. Safranek\-Shrader, A\. Goodman, J\. Kessinger, E\. Globen, P\. Kolhar, C\. Gorgolewski, A\. Ibrahim, Y\. Song, A\. Eichenbaum, T\. Brovelli, S\. Potluri, P\. Lahoti, C\. Baetu, A\. Ghorbani, C\. Chen, A\. Crawford, S\. Pal, M\. Sridhar, P\. Gurita, A\. Mujika, I\. Petrovski, P\. Cedoz, C\. Li, S\. Chen, N\. D\. Santo, S\. Goyal, J\. Punjabi, K\. Kappaganthu, C\. Kwak, P\. LV, S\. Velury, H\. Choudhury, J\. Hall, P\. Shah, R\. Figueira, M\. Thomas, M\. Lu, T\. Zhou, C\. Kumar, T\. Jurdi, S\. Chikkerur, Y\. Ma, A\. Yu, S\. Kwak, V\. Ähdel, S\. Rajayogam, T\. Choma, F\. Liu, A\. Barua, C\. Ji, J\. H\. Park, V\. Hellendoorn, A\. Bailey, T\. Bilal, H\. Zhou, M\. Khatir, C\. Sutton, W\. Rzadkowski, F\. Macintosh, R\. Vij, K\. Shagin, P\. Medina, C\. Liang, J\. Zhou, P\. Shah, Y\. Bi, A\. Dankovics, S\. Banga, S\. Lehmann, M\. Bredesen, Z\. Lin, J\. E\. Hoffmann, J\. Lai, R\. Chung, K\. Yang, N\. Balani, A\. Bražinskas, A\. Sozanschi, M\. Hayes, H\. F\. Alcalde, P\. Makarov, W\. Chen, A\. Stella, L\. Snijders, M\. Mandl, A\. Kärrman, P\. Nowak, X\. Wu, A\. Dyck, K\. Vaidyanathan, R\. R, J\. Mallet, M\. Rudominer, E\. Johnston, S\. Mittal, A\. Udathu, J\. Christensen, V\. Verma, Z\. Irving, A\. Santucci, G\. Elsayed, E\. Davoodi, M\. Georgiev, I\. Tenney, N\. Hua, G\. Cideron, E\. Leurent, M\. Alnahlawi, I\. Georgescu, N\. Wei, I\. Zheng, D\. Scandinaro, H\. Jiang, J\. Snoek, M\. Sundararajan, X\. Wang, Z\. Ontiveros, I\. Karo, J\. Cole, V\. Rajashekhar, L\. Tumeh, E\. Ben\-David, R\. Jain, J\. Uesato, R\. Datta, O\. 
Bunyan, S\. Wu, J\. Zhang, P\. Stanczyk, Y\. Zhang, D\. Steiner, S\. Naskar, M\. Azzam, M\. Johnson, A\. Paszke, C\. Chiu, J\. S\. Elias, A\. Mohiuddin, F\. Muhammad, J\. Miao, A\. Lee, N\. Vieillard, J\. Park, J\. Zhang, J\. Stanway, D\. Garmon, A\. Karmarkar, Z\. Dong, J\. Lee, A\. Kumar, L\. Zhou, J\. Evens, W\. Isaac, G\. Irving, E\. Loper, M\. Fink, I\. Arkatkar, N\. Chen, I\. Shafran, I\. Petrychenko, Z\. Chen, J\. Jia, A\. Levskaya, Z\. Zhu, P\. Grabowski, Y\. Mao, A\. Magni, K\. Yao, J\. Snaider, N\. Casagrande, E\. Palmer, P\. Suganthan, A\. Castaño, I\. Giannoumis, W\. Kim, M\. Rybiński, A\. Sreevatsa, J\. Prendki, D\. Soergel, A\. Goedeckemeyer, W\. Gierke, M\. Jafari, M\. Gaba, J\. Wiesner, D\. G\. Wright, Y\. Wei, H\. Vashisht, Y\. Kulizhskaya, J\. Hoover, M\. Le, L\. Li, C\. Iwuanyanwu, L\. Liu, K\. Ramirez, A\. Khorlin, A\. Cui, T\. LIN, M\. Wu, R\. Aguilar, K\. Pallo, A\. Chakladar, G\. Perng, E\. A\. Abellan, M\. Zhang, I\. Dasgupta, N\. Kushman, I\. Penchev, A\. Repina, X\. Wu, T\. van der Weide, P\. Ponnapalli, C\. Kaplan, J\. Simsa, S\. Li, O\. Dousse, F\. Yang, J\. Piper, N\. Ie, R\. Pasumarthi, N\. Lintz, A\. Vijayakumar, D\. Andor, P\. Valenzuela, M\. Lui, C\. Paduraru, D\. Peng, K\. Lee, S\. Zhang, S\. Greene, D\. D\. Nguyen, P\. Kurylowicz, C\. Hardin, L\. Dixon, L\. Janzer, K\. Choo, Z\. Feng, B\. Zhang, A\. Singhal, D\. Du, D\. McKinnon, N\. Antropova, T\. Bolukbasi, O\. Keller, D\. Reid, D\. Finchelstein, M\. A\. Raad, R\. Crocker, P\. Hawkins, R\. Dadashi, C\. Gaffney, K\. Franko, A\. Bulanova, R\. Leblond, S\. Chung, H\. Askham, L\. C\. Cobo, K\. Xu, F\. Fischer, J\. Xu, C\. Sorokin, C\. Alberti, C\. Lin, C\. Evans, A\. Dimitriev, H\. Forbes, D\. Banarse, Z\. Tung, M\. Omernick, C\. Bishop, R\. Sterneck, R\. Jain, J\. Xia, E\. Amid, F\. Piccinno, X\. Wang, P\. Banzal, D\. J\. Mankowitz, A\. Polozov, V\. Krakovna, S\. Brown, M\. Bateni, D\. Duan, V\. Firoiu, M\. Thotakuri, T\. Natan, M\. Geist, S\. tan Girgin, H\. Li, J\. Ye, O\. 
Roval, R\. Tojo, M\. Kwong, J\. Lee\-Thorp, C\. Yew, D\. Sinopalnikov, S\. Ramos, J\. Mellor, A\. Sharma, K\. Wu, D\. Miller, N\. Sonnerat, D\. Vnukov, R\. Greig, J\. Beattie, E\. Caveness, L\. Bai, J\. Eisenschlos, A\. Korchemniy, T\. Tsai, M\. Jasarevic, W\. Kong, P\. Dao, Z\. Zheng, F\. Liu, F\. Yang, R\. Zhu, T\. H\. Teh, J\. Sanmiya, E\. Gladchenko, N\. Trdin, D\. Toyama, E\. Rosen, S\. Tavakkol, L\. Xue, C\. Elkind, O\. Woodman, J\. Carpenter, G\. Papamakarios, R\. Kemp, S\. Kafle, T\. Grunina, R\. Sinha, A\. Talbert, D\. Wu, D\. Owusu\-Afriyie, C\. Du, C\. Thornton, J\. Pont\-Tuset, P\. Narayana, J\. Li, S\. Fatehi, J\. Wieting, O\. Ajmeri, B\. Uria, Y\. Ko, L\. Knight, A\. Héliou, N\. Niu, S\. Gu, C\. Pang, Y\. Li, N\. Levine, A\. Stolovich, R\. Santamaria\-Fernandez, S\. Goenka, W\. Yustalim, R\. Strudel, A\. Elqursh, C\. Deck, H\. Lee, Z\. Li, K\. Levin, R\. Hoffmann, D\. Holtmann\-Rice, O\. Bachem, S\. Arora, C\. Koh, S\. H\. Yeganeh, S\. Põder, M\. Tariq, Y\. Sun, L\. Ionita, M\. Seyedhosseini, P\. Tafti, Z\. Liu, A\. Gulati, J\. Liu, X\. Ye, B\. Chrzaszcz, L\. Wang, N\. Sethi, T\. Li, B\. Brown, S\. Singh, W\. Fan, A\. Parisi, J\. Stanton, V\. Koverkathu, C\. A\. Choquette\-Choo, Y\. Li, T\. Lu, A\. Ittycheriah, P\. Shroff, M\. Varadarajan, S\. Bahargam, R\. Willoughby, D\. Gaddy, G\. Desjardins, M\. Cornero, B\. Robenek, B\. Mittal, B\. Albrecht, A\. Shenoy, F\. Moiseev, H\. Jacobsson, A\. Ghaffarkhah, M\. Rivière, A\. Walton, C\. Crepy, A\. Parrish, Z\. Zhou, C\. Farabet, C\. Radebaugh, P\. Srinivasan, C\. van der Salm, A\. Fidjeland, S\. Scellato, E\. Latorre\-Chimoto, H\. Klimczak\-Plucińska, D\. Bridson, D\. de Cesare, T\. Hudson, P\. Mendolicchio, L\. Walker, A\. Morris, M\. Mauger, A\. Guseynov, A\. Reid, S\. Odoom, L\. Loher, V\. Cotruta, M\. Yenugula, D\. Grewe, A\. Petrushkina, T\. Duerig, A\. Sanchez, S\. Yadlowsky, A\. Shen, A\. Globerson, L\. Webb, S\. Dua, D\. Li, S\. Bhupatiraju, D\. Hurt, H\. Qureshi, A\. Agarwal, T\. Shani, M\. 
Eyal, A\. Khare, S\. R\. Belle, L\. Wang, C\. Tekur, M\. S\. Kale, J\. Wei, R\. Sang, B\. Saeta, T\. Liechty, Y\. Sun, Y\. Zhao, S\. Lee, P\. Nayak, D\. Fritz, M\. R\. Vuyyuru, J\. Aslanides, N\. Vyas, M\. Wicke, X\. Ma, E\. Eltyshev, N\. Martin, H\. Cate, J\. Manyika, K\. Amiri, Y\. Kim, X\. Xiong, K\. Kang, F\. Luisier, N\. Tripuraneni, D\. Madras, M\. Guo, A\. Waters, O\. Wang, J\. Ainslie, J\. Baldridge, H\. Zhang, G\. Pruthi, J\. Bauer, F\. Yang, R\. Mansour, J\. Gelman, Y\. Xu, G\. Polovets, J\. Liu, H\. Cai, W\. Chen, X\. Sheng, E\. Xue, S\. Ozair, C\. Angermueller, X\. Li, A\. Sinha, W\. Wang, J\. Wiesinger, E\. Koukoumidis, Y\. Tian, A\. Iyer, M\. Gurumurthy, M\. Goldenson, P\. Shah, M\. Blake, H\. Yu, A\. Urbanowicz, J\. Palomaki, C\. Fernando, K\. Durden, H\. Mehta, N\. Momchev, E\. Rahimtoroghi, M\. Georgaki, A\. Raul, S\. Ruder, M\. Redshaw, J\. Lee, D\. Zhou, K\. Jalan, D\. Li, B\. Hechtman, P\. Schuh, M\. Nasr, K\. Milan, V\. Mikulik, J\. Franco, T\. Green, N\. Nguyen, J\. Kelley, A\. Mahendru, A\. Hu, J\. Howland, B\. Vargas, J\. Hui, K\. Bansal, V\. Rao, R\. Ghiya, E\. Wang, K\. Ye, J\. M\. Sarr, M\. M\. Preston, M\. Elish, S\. Li, A\. Kaku, J\. Gupta, I\. Pasupat, D\. Juan, M\. Someswar, T\. M\., X\. Chen, A\. Amini, A\. Fabrikant, E\. Chu, X\. Dong, A\. Muthal, S\. Buthpitiya, S\. Jauhari, N\. Hua, U\. Khandelwal, A\. Hitron, J\. Ren, L\. Rinaldi, S\. Drath, A\. Dabush, N\. Jiang, H\. Godhia, U\. Sachs, A\. Chen, Y\. Fan, H\. Taitelbaum, H\. Noga, Z\. Dai, J\. Wang, C\. Liang, J\. Hamer, C\. Ferng, C\. Elkind, A\. Atias, P\. Lee, V\. Listík, M\. Carlen, J\. van de Kerkhof, M\. Pikus, K\. Zaher, P\. Müller, S\. Zykova, R\. Stefanec, V\. Gatsko, C\. Hirnschall, A\. Sethi, X\. F\. Xu, C\. Ahuja, B\. Tsai, A\. Stefanoiu, B\. Feng, K\. Dhandhania, M\. Katyal, A\. Gupta, A\. Parulekar, D\. Pitta, J\. Zhao, V\. Bhatia, Y\. Bhavnani, O\. Alhadlaq, X\. Li, P\. Danenberg, D\. Tu, A\. Pine, V\. Filippova, A\. Ghosh, B\. Limonchik, B\. Urala, C\. K\. 
Lanka, D\. Clive, Y\. Sun, E\. Li, H\. Wu, K\. Hongtongsak, I\. Li, K\. Thakkar, K\. Omarov, K\. Majmundar, M\. Alverson, M\. Kucharski, M\. Patel, M\. Jain, M\. Zabelin, P\. Pelagatti, R\. Kohli, S\. Kumar, J\. Kim, S\. Sankar, V\. Shah, L\. Ramachandruni, X\. Zeng, B\. Bariach, L\. Weidinger, T\. Vu, A\. Andreev, A\. He, K\. Hui, S\. Kashem, A\. Subramanya, S\. Hsiao, D\. Hassabis, K\. Kavukcuoglu, A\. Sadovsky, Q\. Le, T\. Strohman, Y\. Wu, S\. Petrov, J\. Dean, and O\. Vinyals \(2025\)Gemini: a family of highly capable multimodal models\.External Links:2312\.11805,[Link](https://arxiv.org/abs/2312.11805)Cited by:[§1](https://arxiv.org/html/2607.00447#S1.p1.1)\.
- A\. Wang, K\. Cho, and M\. Lewis \(2020\)Asking and answering questions to evaluate the factual consistency of summaries\.External Links:2004\.04228,[Link](https://arxiv.org/abs/2004.04228)Cited by:[§C\.2](https://arxiv.org/html/2607.00447#A3.SS2.p1.1)\.
- J\. Wei, N\. Karina, H\. W\. Chung, Y\. J\. Jiao, S\. Papay, A\. Glaese, J\. Schulman, and W\. Fedus \(2024a\)Measuring short\-form factuality in large language models\.External Links:2411\.04368,[Link](https://arxiv.org/abs/2411.04368)Cited by:[§C\.2](https://arxiv.org/html/2607.00447#A3.SS2.p1.1),[§2](https://arxiv.org/html/2607.00447#S2.p3.1)\.
- J\. Wei, C\. Yang, X\. Song, Y\. Lu, N\. Hu, J\. Huang, D\. Tran, D\. Peng, R\. Liu, D\. Huang, C\. Du, and Q\. V\. Le \(2024b\)Long\-form factuality in large language models\.External Links:2403\.18802,[Link](https://arxiv.org/abs/2403.18802)Cited by:[§C\.2](https://arxiv.org/html/2607.00447#A3.SS2.p1.1),[§2](https://arxiv.org/html/2607.00447#S2.p3.1)\.
- Z\. Wei, X\. Yang, K\. Sun, J\. Wang, R\. Shao, S\. Chen, M\. Kachuee, T\. Gollapudi, T\. Liao, N\. Scheffer, R\. Wanga, A\. Kumar, Y\. Meng, W\. Yih, and X\. L\. Dong \(2025\)TruthRL: incentivizing truthful llms via reinforcement learning\.External Links:2509\.25760,[Link](https://arxiv.org/abs/2509.25760)Cited by:[§C\.1](https://arxiv.org/html/2607.00447#A3.SS1.SSS0.Px4.p1.1)\.
- Wikidata contributors \(2026\)Wikidata: The Free Knowledge Base\.Note:[https://www\.wikidata\.org/](https://www.wikidata.org/)Cited by:[Appendix A](https://arxiv.org/html/2607.00447#A1.p1.1)\.
- Wikimedia Foundation \(2026\)Wikimedia Analytics API: Pageviews\.Note:[https://doc\.wikimedia\.org/generated\-data\-platform/aqs/analytics\-api/concepts/page\-views\.html](https://doc.wikimedia.org/generated-data-platform/aqs/analytics-api/concepts/page-views.html)Cited by:[§H\.5](https://arxiv.org/html/2607.00447#A8.SS5.p2.1)\.
- Wikipedia contributors \(2026\)Wikipedia, The Free Encyclopedia\.Note:[https://www\.wikipedia\.org/](https://www.wikipedia.org/)Cited by:[Appendix A](https://arxiv.org/html/2607.00447#A1.p1.1),[§4](https://arxiv.org/html/2607.00447#S4.SS0.SSS0.Px1.p1.1)\.
- S\. Wiseman, S\. Shieber, and A\. Rush \(2017\)Challenges in data\-to\-document generation\.InProceedings of the 2017 Conference on Empirical Methods in Natural Language Processing,M\. Palmer, R\. Hwa, and S\. Riedel \(Eds\.\),Copenhagen, Denmark,pp\. 2253–2263\.External Links:[Link](https://aclanthology.org/D17-1239/),[Document](https://dx.doi.org/10.18653/v1/D17-1239)Cited by:[§C\.1](https://arxiv.org/html/2607.00447#A3.SS1.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2607.00447#S2.p1.1)\.
- S\. M\. Xie, A\. Raghunathan, P\. Liang, and T\. Ma \(2022\)An explanation of in\-context learning as implicit bayesian inference\.InInternational Conference on Learning Representations\.External Links:[Link](https://openreview.net/pdf?id=H1g9bA4FvS)Cited by:[§3\.2](https://arxiv.org/html/2607.00447#S3.SS2.p2.2)\.
- Z\. Xu, S\. Jain, and M\. Kankanhalli \(2025\)Hallucination is inevitable: an innate limitation of large language models\.External Links:2401\.11817,[Link](https://arxiv.org/abs/2401.11817)Cited by:[§C\.1](https://arxiv.org/html/2607.00447#A3.SS1.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2607.00447#S2.p2.1)\.
- H\. Zhang, S\. Diao, Y\. Lin, Y\. R\. Fung, Q\. Lian, X\. Wang, Y\. Chen, H\. Ji, and T\. Zhang \(2024\)R\-tuning: instructing large language models to say ‘i don’t know’\.External Links:2311\.09677,[Link](https://arxiv.org/abs/2311.09677)Cited by:[§C\.1](https://arxiv.org/html/2607.00447#A3.SS1.SSS0.Px4.p1.1)\.
- M\. Zhang, O\. Press, W\. Merrill, A\. Liu, and N\. A\. Smith \(2023\)How language model hallucinations can snowball\.External Links:2305\.13534,[Link](https://arxiv.org/abs/2305.13534)Cited by:[§C\.1](https://arxiv.org/html/2607.00447#A3.SS1.SSS0.Px3.p3.1),[§1](https://arxiv.org/html/2607.00447#S1.p3.1),[§2](https://arxiv.org/html/2607.00447#S2.p2.1)\.
- S\. Zhang, F\. Gotti, F\. Mo, and J\. Nie \(2025a\)Measuring the impact of lexical training data coverage on hallucination detection in large language models\.External Links:2511\.17946,[Link](https://arxiv.org/abs/2511.17946)Cited by:[§C\.1](https://arxiv.org/html/2607.00447#A3.SS1.SSS0.Px3.p1.1),[§2](https://arxiv.org/html/2607.00447#S2.p2.1)\.
- Y\. Zhang, Y\. Li, L\. Cui, D\. Cai, L\. Liu, T\. Fu, X\. Huang, E\. Zhao, Y\. Zhang, C\. Xu, Y\. Chen, L\. Wang, A\. T\. Luu, W\. Bi, F\. Shi, and S\. Shi \(2025b\)Siren’s song in the ai ocean: a survey on hallucination in large language models\.External Links:2309\.01219,[Link](https://arxiv.org/abs/2309.01219)Cited by:[§1](https://arxiv.org/html/2607.00447#S1.p2.1)\.
- W\. Zhao, T\. Goyal, Y\. Y\. Chiu, L\. Jiang, B\. Newman, A\. Ravichander, K\. Chandu, R\. L\. Bras, C\. Cardie, Y\. Deng, and Y\. Choi \(2024\)WildHallucinations: evaluating long\-form factuality in llms with real\-world entity queries\.External Links:2407\.17468,[Link](https://arxiv.org/abs/2407.17468)Cited by:[§C\.2](https://arxiv.org/html/2607.00447#A3.SS2.p1.1)\.
- S\. Zheng, J\. Huang, and K\. C\. Chang \(2023\)Why does chatgpt fall short in providing truthful answers?\.External Links:2304\.10513,[Link](https://arxiv.org/abs/2304.10513)Cited by:[§C\.1](https://arxiv.org/html/2607.00447#A3.SS1.SSS0.Px3.p3.1)\.
- D\. M\. Ziegler, N\. Stiennon, J\. Wu, T\. B\. Brown, A\. Radford, D\. Amodei, P\. Christiano, and G\. Irving \(2020\)Fine\-tuning language models from human preferences\.External Links:1909\.08593,[Link](https://arxiv.org/abs/1909.08593)Cited by:[§C\.1](https://arxiv.org/html/2607.00447#A3.SS1.SSS0.Px4.p1.1)\.

## Appendix A Artifact licenses and terms

Scientist QA uses public Wikipedia\-linked scientist profiles and Wikidata QIDs\. Wikipedia text is available under CC BY\-SA 4\.0 unless otherwise noted, while Wikidata structured data is released under CC0 \(Wikipedia contributors, [2026](https://arxiv.org/html/2607.00447#bib.bib75); Wikidata contributors, [2026](https://arxiv.org/html/2607.00447#bib.bib77)\)\. Real\-Life Constrained QA uses SWOW only for seed selection; we cite the original SWOW resource and do not redistribute raw SWOW participant records or cue–response tables\. The released benchmark package contains derived QA items, labels, prompts, and saved evaluation outputs, with the final redistribution license stated in the repository README\.

## Appendix B Artifact use and intended use

We use existing artifacts only for research and diagnostic evaluation\. Wikipedia\- and Wikidata\-derived scientist information is used to construct public\-profile disambiguation questions; SWOW is used only for non\-commercial seed selection, and we do not redistribute raw SWOW data\. The new TrapQA artifacts are intended for research on hallucination, knowledge deployment, and constraint\-sensitive reasoning, not for deployment certification, individual assessment, or commercial redistribution of source\-derived data\.

## Appendix C Additional Related Work

This appendix expands the related\-work discussion from Section [2](https://arxiv.org/html/2607.00447#S2), covering theoretical accounts, pipeline\-stage sources of hallucination, reinforcement learning from feedback, and evaluation benchmarks\.

### C\.1 Mechanisms and Sources of Hallucination

#### Pre\-LLM hallucination\.

Hallucination has long been studied in language generation\. In data\-to\-text generation, Wiseman et al\. \([2017](https://arxiv.org/html/2607.00447#bib.bib3)\) showed that neural models can produce fluent outputs that nevertheless fail to faithfully reflect the underlying records\. In neural machine translation, Lee et al\. \([2019](https://arxiv.org/html/2607.00447#bib.bib2)\) analyzed hallucinations as spurious translations unrelated to the source text, a phenomenon further studied by Raunak et al\. \([2021](https://arxiv.org/html/2607.00447#bib.bib26)\)\. Similarly, Maynez et al\. \([2020](https://arxiv.org/html/2607.00447#bib.bib1)\) found that neural summarization models frequently generate content that is not faithful to the source document\. Although these settings differ in task formulation, they share the common problem that models may produce plausible text that is insufficiently grounded in the conditioning input\.

#### Theoretical and mechanistic accounts\.

Several studies examine hallucination from the perspective of fundamental limitations\. Xu et al\. \([2025](https://arxiv.org/html/2607.00447#bib.bib15)\) show, in a formal learning\-theoretic setting, that LLMs cannot learn all computable functions and therefore cannot completely avoid hallucination when used as general\-purpose problem solvers\. Kalai and Vempala \([2024](https://arxiv.org/html/2607.00447#bib.bib32)\) derive a statistical lower bound on hallucination for calibrated language models on certain classes of facts, suggesting that hallucination cannot be eliminated solely through better calibration\. From a mechanistic perspective, Sun et al\. \([2025a](https://arxiv.org/html/2607.00447#bib.bib16)\) propose a subsequence\-association framework for tracing hallucinations, arguing that hallucinations can arise when dominant hallucinatory associations outweigh faithful ones during generation\. More recently, Cherukuri and Varshney \([2026](https://arxiv.org/html/2607.00447#bib.bib22)\) analyze hallucination through a dynamical\-systems view of hidden\-state trajectories, in which hallucination behavior is characterized by task\-dependent latent\-space basin structure\.

These accounts are closely related to our work in treating hallucination as a competition between faithful and unfaithful associations\. Our framework differs by focusing on prompts in which a decisive local constraint conflicts with a statistically salient shortcut\. This setting allows us to study not only whether a model knows the relevant facts, but also whether it retrieves and applies the constraint\-sensitive inference path required by the prompt\.

#### Training, fine\-tuning, and inference\.

Hallucinations can arise from multiple stages of the LLM pipeline, including pretraining, post\-training, and inference\. At the pretraining level, distributional imbalance can make false or misleading continuations more probable than correct ones, particularly when correct facts are rare or expressed inconsistently \(Zhang et al\., [2025a](https://arxiv.org/html/2607.00447#bib.bib17); Liu et al\., [2026](https://arxiv.org/html/2607.00447#bib.bib35)\)\. More broadly, noisy, outdated, or contradictory training data may contribute to unsupported generations \(Ji et al\., [2023](https://arxiv.org/html/2607.00447#bib.bib73); Huang et al\., [2025](https://arxiv.org/html/2607.00447#bib.bib72)\)\. Kalai et al\. \([2025](https://arxiv.org/html/2607.00447#bib.bib14)\) further argue that modern training and evaluation procedures can reward guessing over acknowledging uncertainty, causing models to produce plausible answers even when they should abstain\.

Hallucinations may also persist or emerge during fine\-tuning\. Several studies find that language models struggle to acquire new factual knowledge through fine\-tuning \(Gekhman et al\., [2024](https://arxiv.org/html/2607.00447#bib.bib18); Kang et al\., [2024](https://arxiv.org/html/2607.00447#bib.bib19); Lin et al\., [2024](https://arxiv.org/html/2607.00447#bib.bib37); Ren et al\., [2024](https://arxiv.org/html/2607.00447#bib.bib38)\)\. In particular, fine\-tuning examples that introduce new knowledge may be learned more slowly than examples consistent with the model’s pre\-existing knowledge, and once learned, may increase hallucination on previously acquired facts \(Gekhman et al\., [2024](https://arxiv.org/html/2607.00447#bib.bib18)\)\. Related work also shows that fine\-tuning can differentially affect popular and unpopular factual knowledge, with models fine\-tuned on more widely known facts tending to achieve higher factual accuracy than those fine\-tuned on less popular facts \(Ghosal et al\., [2024](https://arxiv.org/html/2607.00447#bib.bib36)\)\.

At inference time, hallucinations can be amplified by prompt ambiguity, decoding behavior, and reliance on memorized or frequency\-biased patterns \(Zhang et al\., [2023](https://arxiv.org/html/2607.00447#bib.bib6); Huang et al\., [2025](https://arxiv.org/html/2607.00447#bib.bib72)\)\. McKenna et al\. \([2023](https://arxiv.org/html/2607.00447#bib.bib41)\) identify attestation and relative\-frequency biases in natural language inference as sources of hallucination\-like errors, showing that models may rely on whether a hypothesis is attested in pretraining data rather than on the provided premise\. Berglund et al\. \([2024](https://arxiv.org/html/2607.00447#bib.bib42)\) expose a related limitation, the reversal curse, in which models trained on facts in one direction fail to reliably answer semantically equivalent queries in the reverse direction\. Zheng et al\. \([2023](https://arxiv.org/html/2607.00447#bib.bib43)\) analyze ChatGPT failures in open\-domain question answering and identify factuality, knowledge memorization, and knowledge recall as central sources of error\. Together, these studies suggest that hallucination is not merely a matter of missing knowledge; it can also reflect failures in retrieving, comparing, or applying knowledge under the constraints of a particular prompt\.

#### Reinforcement learning from feedback\.

Reinforcement learning from human feedback \(RLHF\) builds on preference\-based reinforcement learning and human\-preference fine\-tuning of language models \(Christiano et al\., [2023](https://arxiv.org/html/2607.00447#bib.bib46); Ziegler et al\., [2020](https://arxiv.org/html/2607.00447#bib.bib47)\)\. Its effect on hallucination remains debated\. On one hand, Ouyang et al\. \([2022](https://arxiv.org/html/2607.00447#bib.bib29)\) report that instruction tuning with human feedback improves truthfulness on several evaluations\. On the other hand, reward models can be imperfect proxies for truthfulness, and optimizing against them may encourage reward hacking or plausible\-sounding responses that satisfy human preferences without being fully faithful \(Casper et al\., [2023](https://arxiv.org/html/2607.00447#bib.bib30)\)\. Recent work therefore proposes reinforcement\-learning objectives that explicitly penalize hallucination or reward truthful abstention \(Lin et al\., [2025](https://arxiv.org/html/2607.00447#bib.bib20); Wei et al\., [2025](https://arxiv.org/html/2607.00447#bib.bib39); Zhang et al\., [2024](https://arxiv.org/html/2607.00447#bib.bib40)\)\. These approaches complement our work: rather than proposing a mitigation objective, we construct diagnostic settings that expose when models select a shortcut inference path despite having access to the relevant facts in closed\-book probes\.

### C\.2 Hallucination Evaluation Benchmarks

Evaluating hallucination in language generation has attracted sustained attention, with benchmarks spanning different tasks, grounding sources, and annotation strategies\. For text\-only generation, early resources established task\-specific foundations in summarization faithfulness, table\-to\-text fidelity, and open\-domain factuality \(Kryściński et al\., [2019](https://arxiv.org/html/2607.00447#bib.bib48); Wang et al\., [2020](https://arxiv.org/html/2607.00447#bib.bib49); Pagnoni et al\., [2021](https://arxiv.org/html/2607.00447#bib.bib24); Fabbri et al\., [2021](https://arxiv.org/html/2607.00447#bib.bib50); Parikh et al\., [2020](https://arxiv.org/html/2607.00447#bib.bib51); Honovich et al\., [2022](https://arxiv.org/html/2607.00447#bib.bib52); Lin et al\., [2022](https://arxiv.org/html/2607.00447#bib.bib8); Li et al\., [2023](https://arxiv.org/html/2607.00447#bib.bib12); Cheng et al\., [2023](https://arxiv.org/html/2607.00447#bib.bib53); Dale et al\., [2023](https://arxiv.org/html/2607.00447#bib.bib54)\)\. More recent work has shifted toward large\-scale and increasingly automated factuality evaluation\. FActScore \(Min et al\., [2023](https://arxiv.org/html/2607.00447#bib.bib62)\) decomposes long\-form outputs into atomic facts and evaluates whether each is supported by a reliable knowledge source\. LongFact \(Wei et al\., [2024b](https://arxiv.org/html/2607.00447#bib.bib61)\) targets factuality in extended, open\-domain responses, while SimpleQA \(Wei et al\., [2024a](https://arxiv.org/html/2607.00447#bib.bib60)\) provides short, fact\-seeking questions with single, unambiguous answers and explicit grading of correct, incorrect, and not\-attempted responses\. WildHallucinations \(Zhao et al\., [2024](https://arxiv.org/html/2607.00447#bib.bib59)\) evaluates long\-form factuality on real\-world entity queries, including many entities outside Wikipedia coverage\.

A parallel line of work examines hallucination in retrieval\-augmented generation \(RAG\)\. RAGTruth \(Niu et al\., [2024](https://arxiv.org/html/2607.00447#bib.bib58)\) provides human annotations of hallucinations in naturally generated RAG outputs, including word\-level labels\. FRAMES \(Krishna et al\., [2025](https://arxiv.org/html/2607.00447#bib.bib57)\) evaluates factuality, retrieval accuracy, and reasoning in end\-to\-end RAG scenarios, especially under multi\-hop reasoning demands\. In multimodal settings, researchers study hallucination in vision\-language models, including object\-level, relation\-level, and broader multimodal hallucination detection settings \(Guan et al\., [2024](https://arxiv.org/html/2607.00447#bib.bib55); Chen et al\., [2024](https://arxiv.org/html/2607.00447#bib.bib56)\)\. Collectively, these benchmarks reflect a progression from task\-specific faithfulness evaluation toward broader, multi\-task, and more automated hallucination assessment\.

Our benchmark is complementary to these evaluation efforts\. Existing benchmarks primarily measure whether a model’s output is factual, faithful, or grounded in evidence\. By contrast, Scientist QA and Real\-Life Constrained QA are designed to isolate why a model fails in controlled forced\-choice settings\. Scientist QA tests whether models can deploy candidate\-specific facts under disambiguation, while Real\-Life Constrained QA tests whether models can override SWOW\-derived associative cues with prompt\-grounded physical, spatial, procedural, or medium\-specific constraints \(De Deyne et al\., [2019](https://arxiv.org/html/2607.00447#bib.bib13)\)\. This design allows us to distinguish simple ignorance from knowledge\-deployment failures in which the relevant information is available to the model but not used in the task\-relevant inference path\.

## Appendix D Scientist QA Construction Details

This appendix gives the full Scientist QA construction pipeline: profile collection, name\-removed profile linearization, hard\-pair mining, question generation, and filtering\.

### D\.1 Scientist Profiles

We collect 9,090 scientists with dedicated Wikipedia pages\. Each scientist is represented as a structured profile containing attributes such as occupation, field, notable work, awards, education, and a Wikidata QID\. The QID is used only for bookkeeping, deduplication, and linking candidates across processing stages\. Appendix [F\.1](https://arxiv.org/html/2607.00447#A6.SS1) shows an example profile\.

For pair mining, we remove the scientist’s name from each profile and linearize the remaining attributes into a short paragraph, which we call the *name\-removed profile*\. This prevents the embedding model from matching scientists by name while preserving semantic information from their profile attributes\. Appendix [F\.2](https://arxiv.org/html/2607.00447#A6.SS2) shows an example\.
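
The linearization step can be illustrated with a minimal sketch\. The field order, the display labels, the separators, the alphabetical sorting of values, and the function name linearize\_name\_removed are all assumptions made for this example; the text does not specify them\. The sample values are an abridged copy of the Wolfgang Pauli profile shown in Appendix F\.1\.

```python
from typing import Dict, List, Tuple

# Hypothetical (key, label) order; the paper does not state the exact layout.
FIELD_LABELS: List[Tuple[str, str]] = [
    ("occupation", "occupation"),
    ("field", "field"),
    ("notable_work", "notable work"),
    ("award_received", "awards"),
    ("education", "education"),
]


def linearize_name_removed(attributes: Dict[str, object]) -> str:
    """Join list-valued attribute fields into one paragraph.

    The scientist's name is never passed in (it is the outer key of the
    structured profile), and the QID is skipped because it is not listed
    in FIELD_LABELS.
    """
    parts = []
    for key, label in FIELD_LABELS:
        values = attributes.get(key)
        if isinstance(values, list) and values:
            parts.append(f"{label}: {', '.join(sorted(values))}")
    return "; ".join(parts) + "."


# Abridged attributes from the Appendix F.1 example profile.
pauli_attributes = {
    "occupation": ["theoretical physicist", "university teacher", "chemist", "physicist"],
    "field": ["quantum mechanics", "particle physics"],
    "notable_work": ["Pauli exclusion principle", "Pauli matrices", "Pauli equation"],
    "qid": "Q65989",
}

text = linearize_name_removed(pauli_attributes)
```

This sketch drops only the name key and the QID; whether surname mentions inside attribute values \(for example in notable works\) are also scrubbed is not stated in the text\.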

### D\.2 Hard\-Pair Mining

Let $\mathbf{e}_A$ and $\mathbf{e}_B$ denote the embeddings of the name\-removed profiles of scientists $A$ and $B$, obtained using text\-embedding\-3\-small \(OpenAI, [2024](https://arxiv.org/html/2607.00447#bib.bib63)\)\. Let $\mathrm{TC}(A)$ be the number of retained attribute fields for scientist $A$, and let $\lambda$ be the median tag count across all scientists; in our data, $\lambda=7$\. We score each pair by

$$
s(A,B)=\frac{\mathbf{e}_A^{\top}\mathbf{e}_B}{\|\mathbf{e}_A\|\,\|\mathbf{e}_B\|}\cdot\frac{\min(\mathrm{TC}(A),\mathrm{TC}(B))}{\min(\mathrm{TC}(A),\mathrm{TC}(B))+\lambda} \tag{1}
$$

The cosine term measures semantic similarity, while the penalty term downweights pairs whose similarity may be driven by sparse metadata\. We rank all scientist pairs by Eq\. [1](https://arxiv.org/html/2607.00447#A4.E1) and retain the top $0.01\%$ highest\-scoring pairs, yielding 2,958 candidate pairs\.
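
A minimal sketch of the pair score in Eq\. 1, written in plain Python, follows\. The function name pair\_score and the toy two\-dimensional vectors are illustrative assumptions; in the paper the vectors are text\-embedding\-3\-small embeddings of name\-removed profiles, and the default of 7 is the reported median tag count\.

```python
import math
from typing import Sequence


def pair_score(
    e_a: Sequence[float],
    e_b: Sequence[float],
    tc_a: int,
    tc_b: int,
    lam: float = 7.0,
) -> float:
    """Cosine similarity scaled by a sparsity penalty, as in Eq. 1.

    tc_a and tc_b are the numbers of retained attribute fields (tag counts);
    lam is the median tag count across all scientists.
    """
    dot = sum(x * y for x, y in zip(e_a, e_b))
    norm_a = math.sqrt(sum(x * x for x in e_a))
    norm_b = math.sqrt(sum(y * y for y in e_b))
    cosine = dot / (norm_a * norm_b)
    m = min(tc_a, tc_b)
    penalty = m / (m + lam)  # approaches 1 as both profiles get richer
    return cosine * penalty


# Toy vectors (not real embeddings) to show the effect of the penalty term.
rich_pair = pair_score([1.0, 0.0], [1.0, 0.0], tc_a=7, tc_b=9)
sparse_pair = pair_score([1.0, 0.0], [1.0, 0.0], tc_a=1, tc_b=9)
orthogonal_pair = pair_score([1.0, 0.0], [0.0, 1.0], tc_a=5, tc_b=5)
```

With identical vectors and tag counts of 7 and 9, the penalty term is 7/\(7\+7\)=0\.5, so the score is half the cosine similarity; when one profile has a single retained field, the penalty drops to 1/8\.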

### D\.3 Question Generation and Filtering

For each candidate pair $(A,B)$, we prompt Gemini to generate a third\-person biographical paragraph broadly compatible with both scientists, followed by a single decisive constraint that rules out exactly one candidate\. Each question ends with *“Who is this person?”* We then use ChatGPT to filter malformed or unverifiable outputs, including cases where the decisive constraint does not distinguish the pair, contradicts the candidate profiles, or cannot be verified against the paired candidates\. After filtering and excluding the 33 invalid items identified in the final proofreading pass, the final evaluation set contains 2,925 questions\.

Each retained question is evaluated in two variants:

1. prepend\_names, which prepends only the two candidate names;
2. prepend\_profiles, which prepends the full profiles of both candidates\.

Each question is also paired with two supplementary probe questions derived from the decisive constraint, one for each candidate\.

![Refer to caption](https://arxiv.org/html/2607.00447v1/sections/fig/scientist_pipeline_overview.png)

Figure 3: Overview of the Scientist QA construction pipeline\. Starting from Wikipedia\-linked scientist profiles, we construct highly confusable scientist pairs, generate pairwise disambiguation questions, and attach two supplementary probes to each primary question\.

## Appendix E Implementation Details

This appendix specifies the prompt formats, answer parsing rules, and failure handling used in evaluation\.

### E\.1 Model versions

For reproducibility, Table [3](https://arxiv.org/html/2607.00447#A5.T3) reports the model identifiers and inference settings used in evaluation\. Exact access dates, provider\-side parameters, and run identifiers should be preserved with the released run logs\.

Table 3: Model identifiers and inference settings used in the evaluation\. Low\- and high\-thinking settings are provider\-specific controls; for DeepSeek, the two columns correspond to non\-reasoning and reasoning model aliases\.

| Family | Non\-reasoning API ID | Reasoning API ID |
| --- | --- | --- |
| GPT\-5\.2 | gpt\-5\.2\-2025\-12\-11 | gpt\-5\.2\-2025\-12\-11 |
| Claude Sonnet 4\.6 | claude\-sonnet\-4\-6 | claude\-sonnet\-4\-6 |
| Gemini 3\.1 Pro Preview | gemini\-3\.1\-pro\-preview | gemini\-3\.1\-pro\-preview |
| DeepSeek V3\.2 | deepseek\-chat | deepseek\-reasoner |

### E\.2 Prompt Templates

We evaluate each benchmark item in separate conversations: one for the primary question and one for each supplementary probe\.

#### Names\-only prompt \(prepend\_names\)\.

> Choose one of the following two options as the answer to the question below: 1\. $A$ 2\. $B$ Question: $\textit{question}_i$

Here $A$ and $B$ are the two candidate scientists\. Their order is randomized across items\.
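
The names\-only template and the per\-item order randomization can be sketched as follows\. The line breaks inside the template, the function name build\_names\_prompt, and the use of a seeded random\.Random instance are assumptions for illustration; the text only states the wording of the prompt and that candidate order is randomized across items\.

```python
import random

# Wording follows the names-only prompt above; the line layout is an assumption.
TEMPLATE = (
    "Choose one of the following two options as the answer to the question below:\n"
    "1. {first}\n"
    "2. {second}\n"
    "Question: {question}"
)


def build_names_prompt(
    correct: str, distractor: str, question: str, rng: random.Random
) -> str:
    """Fill the template with the two candidates in a random order."""
    options = [correct, distractor]
    rng.shuffle(options)  # randomize which candidate is listed first
    return TEMPLATE.format(first=options[0], second=options[1], question=question)


rng = random.Random(0)  # seeded only so that a run is repeatable
prompt = build_names_prompt(
    "Edward Witten", "Albert Einstein", "Who is this person?", rng
)
```

Passing the random generator explicitly keeps the shuffling reproducible, which matters when the same items are re\-run across model settings\.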

#### Profiles\-in\-context prompt \(prepend\_profiles\)\.

> Given two profiles of two persons: $\textit{profile}_A$ $\textit{profile}_B$ Choose exactly one profile from the two, and output the name of the person as the answer to the following question: $\textit{question}_i$

#### Supplementary probes\.

Each supplementary probe is asked independently as a binary factual question about one candidate and the decisive relation\. For example:

> Did Albert Einstein receive the Nobel Prize in Physics?

### E\.3 Answer Matching and Failure Handling

For primary questions, the model is instructed to output exactly one of the two candidate names\. We normalize whitespace, capitalization, and minor formatting differences before matching\. If the normalized response matches the correct candidate, it is counted as correct; if it matches the distractor, it is counted as a hallucination\. If the response matches neither candidate, it is also counted as a hallucination\. Across the primary prepend\_names Scientist QA experiments, only two unmatched primary\-question responses remain after normalization, both from Claude\-low\.

For supplementary probes, binary answers are normalized to true/false labels\. Each primary\-question outcome is then paired with its two probe outcomes to determine whether the model *knows both*, *knows one*, or *knows neither* of the relevant probe facts\.
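
The matching and bucketing rules above can be sketched as follows\. The exact normalization \(lowercasing, replacing punctuation with spaces, collapsing whitespace\) is an assumption about what "minor formatting differences" covers, and the function names normalize, score\_primary, and probe\_state are hypothetical\.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def score_primary(response: str, correct: str, distractor: str) -> str:
    """Return 'correct', 'distractor', or 'unmatched' for a primary answer."""
    r = normalize(response)
    if r == normalize(correct):
        return "correct"
    if r == normalize(distractor):
        return "distractor"
    return "unmatched"


def is_hallucination(label: str) -> bool:
    """Distractor and unmatched responses both count as hallucinations."""
    return label != "correct"


def probe_state(pred_1: bool, gold_1: bool, pred_2: bool, gold_2: bool) -> str:
    """Bucket an item by how many of its two probes were answered correctly."""
    n_correct = int(pred_1 == gold_1) + int(pred_2 == gold_2)
    return {2: "knows both", 1: "knows one", 0: "knows neither"}[n_correct]


label = score_primary("  **Edward Witten.** ", "Edward Witten", "Albert Einstein")
state = probe_state(pred_1=True, gold_1=True, pred_2=True, gold_2=False)
```

Keeping the unmatched label separate from the distractor label makes it possible to report how many responses failed to parse, even though both are scored as hallucinations\.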

## Appendix F Additional Examples

### F\.1 Profile Example

Figure [4](https://arxiv.org/html/2607.00447#A6.F4) provides a complete profile of Wolfgang Pauli\.

Example structured profile: Wolfgang Pauli

```
{
  "Wolfgang Pauli": {
    "occupation": [
      "theoretical physicist",
      "university teacher",
      "chemist",
      "physicist"
    ],
    "award_received": [
      "Nobel Prize in Physics",
      "Max Planck Medal",
      "Lorentz Medal",
      "Foreign Member of the Royal Society",
      "honorary doctor of the University of Vienna"
    ],
    "field": [
      "quantum mechanics",
      "particle physics"
    ],
    "notable_work": [
      "Pauli exclusion principle",
      "Pauli matrices",
      "Pauli equation"
    ],
    "education": [
      "Ludwig-Maximilians-Universität München",
      "Bundesgymnasium Döbling"
    ],
    "qid": "Q65989"
  }
}
```

Figure 4: Example structured scientist profile used in Scientist QA\.

### F\.2 Name\-Removed Profile Example

Example name\-removed profile: Albert Einstein

> occupation: inventor, mathematician, philosopher of science, physicist, professor, science writer, theoretical physicist, university teacher; field: theoretical physics; notable work: general relativity, mass–energy equivalence, photoelectric effect, quantum mechanics, special relativity, theory of Brownian motion, theory of relativity; awards: Copley Medal, Foreign Member of the Royal Society, Franklin Medal, Max Planck Medal, Nobel Prize in Physics, Pour le Mérite; education: ETH Zurich, Luitpold\-Gymnasium, University of Zurich; positions: professor\.

### F\.3 Question Example

Primary question example:

> This prominent theoretical physicist, mathematician, and university teacher made significant contributions to science\. In recognition of their work, they delivered the Josiah Willard Gibbs Lectureship and were elected a Foreign Member of the Royal Society\. However, this scientist never received the Nobel Prize in Physics\. Who is this person?

Paired supplementary probes:

> Did Albert Einstein receive the Nobel Prize in Physics?
>
> Did Edward Witten receive the Nobel Prize in Physics?

prepend\_names variant:

> Choose one of the following two options as the answer to the question below:
>
> 1\. Edward Witten
>
> 2\. Albert Einstein
>
> Question: This prominent theoretical physicist, mathematician, and university teacher made significant contributions to science\. In recognition of their work, they delivered the Josiah Willard Gibbs Lectureship and were elected a Foreign Member of the Royal Society\. However, this scientist never received the Nobel Prize in Physics\. Who is this person?

prepend\_profiles variant:

> Given two profiles of two persons:
>
> name: Edward Witten
> occupation: mathematician; physicist; university teacher; theoretical physicist
> award\_received: Fields Medal; MacArthur Fellowship; Isaac Newton Medal; \.\.\.
> field: physics; mathematical physics; string theory
> education: Princeton University; University of Wisconsin\-\-Madison; \.\.\.
>
> name: Albert Einstein
> occupation: theoretical physicist; philosopher of science; science writer; \.\.\.
> award\_received: Nobel Prize in Physics; Copley Medal; Franklin Medal; \.\.\.
> field: theoretical physics
> notable\_work: general relativity; special relativity; photoelectric effect; \.\.\.
> education: ETH Zurich; University of Zurich; \.\.\.
>
> Choose exactly one profile from the two, and output the name of the person as the answer to the following question: This prominent theoretical physicist, mathematician, and university teacher made significant contributions to science\. In recognition of their work, they delivered the Josiah Willard Gibbs Lectureship and were elected a Foreign Member of the Royal Society\. However, this scientist never received the Nobel Prize in Physics\. Who is this person?

Figure 5: Example prepend\_profiles prompt variant\. The ellipses in the profile example indicate omitted attributes for readability\.

## Appendix G Real\-Life Constrained QA Construction Details

This appendix describes Real\-Life Constrained QA, a collection of realistic two\-option questions in which a locally plausible shortcut conflicts with a physical, spatial, procedural, or medium\-specific constraint\. Unlike Scientist QA, which tests entity disambiguation among highly similar scientists, this component targets shortcut\-driven failures in everyday scenarios\. Each item presents a short first\-person situation and two candidate actions or media\. One option is superficially attractive because it matches a strong prior association, while the other is correct because it satisfies the constraint implied by the scenario\. The final collection contains 500 questions covering 13 aspects of daily life\.

### G\.1 Association Mining from SWOW

We begin from SWOW \(De Deyne et al\., [2019](https://arxiv.org/html/2607.00447#bib.bib13)\), a large\-scale psycholinguistic resource of human word associations\. For each cue word, we use high\-probability first responses as candidate shortcut associations\. We lightly normalize and filter these associations by lowercasing, lemmatizing, merging obvious duplicates, and removing generic or noisy responses\. The result is a cleaned bank of human\-salient cue–response pairs suitable for question generation\. Seed preprocessing relies on three packages: spaCy with the en\_core\_web\_sm English pipeline for tokenization, POS/stopword checks, and lemmatization; NLTK WordNet for coarse lexical\-type labels from synsets and lexnames; and wordfreq Zipf frequencies to filter overly common or rare responses\. We disable the spaCy parser for this preprocessing step and use heuristic frequency thresholds of high\_zipf=6\.5 and low\_zipf=1\.0 in the first\-pass filter\.
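
The first\-pass frequency filter can be sketched as below\. Only the two threshold values come from the text; the function name filter\_by\_zipf, the inclusive treatment of the bounds, and the toy frequency table are assumptions\. The frequency lookup is passed in as a callable so the sketch runs without external packages; with wordfreq installed, a lookup such as lambda w: zipf\_frequency\(w, "en"\) could be supplied instead\.

```python
from typing import Callable, Iterable, List

HIGH_ZIPF = 6.5  # responses above this are treated as overly common
LOW_ZIPF = 1.0   # responses below this are treated as overly rare


def filter_by_zipf(
    responses: Iterable[str],
    zipf: Callable[[str], float],
    low: float = LOW_ZIPF,
    high: float = HIGH_ZIPF,
) -> List[str]:
    """Keep responses whose Zipf frequency lies in [low, high].

    Whether the paper's thresholds are inclusive is not stated; inclusive
    bounds are an assumption of this sketch.
    """
    return [w for w in responses if low <= zipf(w) <= high]


# Toy Zipf values for illustration only (not taken from wordfreq).
toy_zipf = {"the": 7.7, "wrench": 3.9, "zyzzyva": 0.0}

kept = filter_by_zipf(["the", "wrench", "zyzzyva"], lambda w: toy_zipf.get(w, 0.0))
```

In this toy run the function word and the very rare word are both dropped, leaving only the concrete mid\-frequency response\.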

### G\.2 Template Families and Seed Selection

We organize cleaned associations into eight seed template families corresponding to recurring hidden\-constraint patterns\. Examples include vehicle\_required, where the task requires bringing a vehicle rather than merely reaching a location; delivery\_medium, where a physical item cannot be replaced by a digital surrogate; recording\_medium, where the correct action depends on the required recording modality; and tool\_required, where a specific tool is necessary for task completion\.

For each seed, we annotate structured metadata, including the scenario role, latent constraint type, and intended shapes of the correct and shortcut options\. We prioritize seeds whose associations are concrete, whose constraints are easy to instantiate in everyday settings, and whose shortcut options are plausible without being absurd\. We also cap overrepresented lemmas within each family to maintain scenario diversity\.

### G\.3 Generation, Filtering, and Augmentation

For each selected seed, we use GPT to augment seed questions into multiple first\-person scenarios following the corresponding family template\. Each generated item must be realistic, self\-contained, and unambiguous: the incorrect option should be a plausible shortcut, while the correct option should be determined by a recoverable constraint in the scenario\. Claude is then used to proofread the resulting questions for ambiguity, plausibility, and constraint validity\. We manually filter malformed or weak items, including cases where both options are arguably valid, the constraint is too explicit, the scenario depends on niche expertise, or the shortcut option is implausible\.

### G\.4 Benchmark Format

Each Real\-Life Constrained QA item consists of a short scenario, two candidate options and a gold label\.

## Appendix H Extended Empirical Results

This appendix provides the full empirical breakdowns supporting Section [5](https://arxiv.org/html/2607.00447#S5)\. Unless otherwise stated, all tables refer to the retrieval\-sensitive prepend\_names condition over 2,925 Scientist QA questions\.

### H\.1 Full Probe\-State Breakdowns

For each pairwise question, we use two closed\-book supplementary probes targeting the decisive relation\. We group examples into three probe\-defined knowledge states:

- Knows both: both supplementary probes are answered correctly;
- Knows one: exactly one supplementary probe is answered correctly;
- Knows neither: both supplementary probes are answered incorrectly\.

Table [4](https://arxiv.org/html/2607.00447#A8.T4) reports correct and incorrect pairwise outcomes within each state\. Correct answers are typically concentrated in the *knows\-both* bucket, while incorrect answers shift toward the *knows\-one* and *knows\-neither* buckets\. However, the *knows\-both* rows still contain nonzero error rates, showing that correct probe\-level knowledge does not guarantee correct comparative deployment\.
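
As a worked check of how the within\-bucket wrong rates are computed, the following sketch reproduces the Claude Sonnet 4\.6 \(high\) row of Table 4 from its counts\. The function name wrong\_rate and the choice to return None for an empty bucket \(shown as a dash in the table\) are assumptions of this sketch\.

```python
from typing import Optional


def wrong_rate(correct: int, wrong: int) -> Optional[float]:
    """Percentage of wrong pairwise answers within one probe-state bucket."""
    total = correct + wrong
    if total == 0:
        return None  # empty bucket, reported as a dash in the table
    return 100.0 * wrong / total


# Counts for Claude Sonnet 4.6 (high) as reported in Table 4.
buckets = {
    "knows both": (2457, 66),
    "knows one": (279, 107),
    "knows neither": (7, 9),
}

rates = {name: wrong_rate(c, w) for name, (c, w) in buckets.items()}
total_questions = sum(c + w for c, w in buckets.values())
```

The three buckets partition the evaluation set, so their counts sum to the 2,925 questions of the names\-only condition\.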

| Model | Mode | Knows both: correct | Knows both: wrong | Knows both: wrong rate | Knows one: correct | Knows one: wrong | Knows one: wrong rate | Knows neither: correct | Knows neither: wrong | Knows neither: wrong rate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Sonnet 4\.6 | high | 2457 | 66 | 2\.62% | 279 | 107 | 27\.72% | 7 | 9 | 56\.25% |
| Claude Sonnet 4\.6 | low | 1809 | 424 | 18\.99% | 401 | 266 | 39\.88% | 16 | 9 | 36\.00% |
| DeepSeek V3\.2 Chat | high | 2179 | 140 | 6\.04% | 410 | 152 | 27\.05% | 27 | 17 | 38\.64% |
| DeepSeek V3\.2 Reasoner | low | 1141 | 600 | 34\.46% | 665 | 468 | 41\.31% | 30 | 21 | 41\.18% |
| Gemini 3\.1 Pro Preview | high | 2785 | 74 | 2\.59% | 48 | 18 | 27\.27% | 0 | 0 | – |
| Gemini 3\.1 Pro Preview | low | 2784 | 57 | 2\.01% | 68 | 16 | 19\.05% | 0 | 0 | – |
| GPT\-5\.2 | high | 2387 | 159 | 6\.25% | 229 | 133 | 36\.74% | 9 | 8 | 47\.06% |
| GPT\-5\.2 | low | 2304 | 198 | 7\.91% | 260 | 134 | 34\.01% | 17 | 12 | 41\.38% |

Table 4: Probe\-conditioned breakdown of primary\-question outcomes in the names\-only Scientist QA condition\. Each bucket is defined by the number of supplementary probes answered correctly\. “Correct” and “wrong” count pairwise disambiguation outcomes within that bucket, and “Wrong rate” is computed within the corresponding bucket\.

### H\.2 Eliminative\-Probe Asymmetry

The two supplementary probes play different diagnostic roles\. One tests the fact that should eliminate the distractor; the other tests the compatibility of the correct candidate with the decisive constraint\. Table [5](https://arxiv.org/html/2607.00447#A8.T5) focuses on one\-probe\-correct cases\. Across all eight model settings, hallucination is higher when the model misses the eliminative probe than when it misses the compatibility probe\. This asymmetry supports the latent key–task account in Section [3](https://arxiv.org/html/2607.00447#S3): the decisive relation is often not merely a fact about the correct candidate, but the fact that suppresses the shortcut candidate\. When this eliminative fact is not retrieved, the high\-salience candidate remains available as a plausible continuation\.

| Model | Mode | $n$ missing elim\. | Hall\. when elim\. missed | $n$ missing compat\. | Hall\. when compat\. missed | Gap \(pp\) |
| --- | --- | --- | --- | --- | --- | --- |
| Claude Sonnet 4\.6 | high | 128 | 28\.91% | 258 | 27\.13% | 1\.77 |
| Claude Sonnet 4\.6 | low | 489 | 41\.72% | 178 | 34\.83% | 6\.89 |
| DeepSeek V3\.2 Chat | high | 272 | 29\.04% | 290 | 25\.17% | 3\.87 |
| DeepSeek V3\.2 Reasoner | low | 298 | 46\.64% | 835 | 39\.40% | 7\.24 |
| Gemini 3\.1 Pro Preview | high | 40 | 40\.00% | 26 | 7\.69% | 32\.31 |
| Gemini 3\.1 Pro Preview | low | 35 | 31\.43% | 49 | 10\.20% | 21\.22 |
| GPT\-5\.2 | high | 126 | 46\.03% | 236 | 31\.78% | 14\.25 |
| GPT\-5\.2 | low | 153 | 37\.91% | 241 | 31\.54% | 6\.37 |

Table 5: Hallucination rates in one\-probe\-correct cases for the names\-only Scientist QA condition\. “Elim\.” denotes the probe whose correct answer eliminates the distractor; “compat\.” denotes the probe whose correct answer confirms the correct candidate’s compatibility with the decisive constraint\. The gap is the difference between the two hallucination rates, in percentage points\.

### H\.3 Consensus Failures

To distinguish idiosyncratic model errors from shared shortcut directions, we identify questions missed by multiple model settings\. Of the 2,925 questions, 1,489 are missed by at least one model setting, and 10 are missed by all eight settings\. Table [6](https://arxiv.org/html/2607.00447#A8.T6) lists these all\-setting consensus failures\. They concentrate on high\-frequency biographical relation families such as education, awards, professional roles, and offices\. In these cases, the distractor satisfies a salient affirmative association, while the correct answer is determined by an explicit incompatibility or non\-possession constraint\. These examples support the common\-shortcut assumption in Theorem [3\.4](https://arxiv.org/html/2607.00447#S3.Thmtheorem4): different model families can be biased toward the same incorrect answer when a dominant association conflicts with the prompt’s decisive constraint\.
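
The two counts in this paragraph correspond to a set union and a set intersection over per\-setting error sets\. A minimal sketch follows; the function name consensus\_failures, the setting names, and the question IDs in the toy input are hypothetical and do not reproduce the paper's data\.

```python
from typing import Dict, Set, Tuple


def consensus_failures(
    missed_by_setting: Dict[str, Set[str]],
) -> Tuple[Set[str], Set[str]]:
    """Return (questions missed by at least one setting, missed by all settings)."""
    sets = list(missed_by_setting.values())
    missed_by_any: Set[str] = set().union(*sets)
    missed_by_all: Set[str] = set.intersection(*sets) if sets else set()
    return missed_by_any, missed_by_all


# Hypothetical per-setting error sets, for illustration only.
toy_errors = {
    "model_x_high": {"q1", "q2"},
    "model_x_low": {"q2", "q3"},
    "model_y_high": {"q2"},
}

missed_any, missed_all = consensus_failures(toy_errors)
```

Applied to the eight model settings, the sizes of these two sets would correspond to the 1,489 and 10 questions reported above\.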

| Question ID | Correct candidate | Distractor | Decisive relation family |
| --- | --- | --- | --- |
| question\_0214 | Klaus von Klitzing | Rudolf Mössbauer | Education / institution |
| question\_0596 | Glenn T\. Seaborg | Mildred Dresselhaus | Education / institution |
| question\_0797 | Jennifer Doudna | Frances Arnold | Award / honor |
| question\_1092 | Fritz Lipmann | Otto Heinrich Warburg | Education / institution |
| question\_1161 | Norman Foster Ramsey, Jr\. | Carl Wieman | Award / honor |
| question\_1517 | Joseph\-Louis Lagrange | François Arago | Political office / role |
| question\_1772 | Alexander R\. Todd, Baron Todd | Svante Arrhenius | Occupation / role |
| question\_1981 | Harold Clayton Urey | Mildred Dresselhaus | Education / institution |
| question\_2183 | Steven Weinberg | Leon Max Lederman | Award / honor |
| question\_2370 | Robert Aumann | Gérard Debreu | Education / institution |

Table 6: The 10 Scientist QA questions missed by all eight model settings\. Question IDs and candidate names are produced by the analysis notebook; relation\-family labels are manual annotations based on the decisive constraint\.

### H\.4 Probe\-Underdetermined Cases

We refer to the one\-probe\-correct subset as probe\-underdetermined from the model’s local probe state\. Since the two gold probe labels are complementary, answering exactly one probe correctly is equivalent to predicting the same value for both probes, either both true or both false\. In this regime, the two probe predictions alone do not determine the correct pairwise answer for the model\. Table [7](https://arxiv.org/html/2607.00447#A8.T7) reports the size and behavior of this subset\.
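The equivalence between "exactly one probe correct" and "both probes predicted the same value" can be checked by enumerating all cases. The sketch below is illustrative only: the helper name `n_probes_correct` and the boolean encoding of probe labels are assumptions, not part of the authors' analysis code.

```python
from itertools import product

def n_probes_correct(gold, pred):
    """Count how many of the two probe predictions match the gold labels."""
    return sum(g == p for g, p in zip(gold, pred))

# Gold probe labels are complementary: one probe is true, the other false.
complementary_golds = [(True, False), (False, True)]

for gold in complementary_golds:
    for pred in product([True, False], repeat=2):
        one_correct = n_probes_correct(gold, pred) == 1
        same_prediction = pred[0] == pred[1]
        # Exactly one probe correct <=> both predictions take the same value.
        assert one_correct == same_prediction
```

Because the loop covers every gold/prediction combination, the assertions passing is a complete proof of the equivalence for two binary probes.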

| Model | Mode | $n$ probe\-underdet\. | Pct\. of questions | Accuracy | Hall\. rate | Both predicted false | Both predicted true |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Sonnet 4\.6 | high | 386 | 13\.20% | 72\.28% | 27\.72% | 33\.16% | 66\.84% |
| Claude Sonnet 4\.6 | low | 667 | 22\.80% | 60\.12% | 39\.88% | 73\.31% | 26\.69% |
| DeepSeek V3\.2 Chat | high | 562 | 19\.21% | 72\.95% | 27\.05% | 48\.40% | 51\.60% |
| DeepSeek V3\.2 Reasoner | low | 1133 | 38\.74% | 58\.69% | 41\.31% | 26\.30% | 73\.70% |
| Gemini 3\.1 Pro Preview | high | 66 | 2\.26% | 72\.73% | 27\.27% | 60\.61% | 39\.39% |
| Gemini 3\.1 Pro Preview | low | 84 | 2\.87% | 80\.95% | 19\.05% | 41\.67% | 58\.33% |
| GPT\-5\.2 | high | 362 | 12\.38% | 63\.26% | 36\.74% | 34\.81% | 65\.19% |
| GPT\-5\.2 | low | 394 | 13\.47% | 65\.99% | 34\.01% | 38\.83% | 61\.17% |

Table 7: Results on probe\-underdetermined cases in the names\-only Scientist QA condition\. “Pct\. of questions” uses 2,925 as the denominator\. “Both predicted false” and “Both predicted true” describe the model’s two probe predictions within this subset\. These cases are common for weaker settings, especially DeepSeek\-low and Claude\-low, but rare for Gemini\.

Table [8](https://arxiv.org/html/2607.00447#A8.T8) further shows that behavior in this regime is not explained by simply choosing the more famous scientist\. In several settings, the model chooses the more famous candidate less than half the time\.

| Model | Mode | $n$ non\-tie | Chooses more famous | $p$ vs\. 50% |
| --- | --- | --- | --- | --- |
| Claude Sonnet 4\.6 | high | 386 | 46\.89% | 0\.242 |
| Claude Sonnet 4\.6 | low | 667 | 44\.68% | 0\.007 |
| DeepSeek V3\.2 Chat | high | 562 | 49\.47% | 0\.833 |
| DeepSeek V3\.2 Reasoner | low | 1133 | 46\.07% | 0\.009 |
| Gemini 3\.1 Pro Preview | high | 66 | 50\.00% | 1\.000 |
| Gemini 3\.1 Pro Preview | low | 84 | 46\.43% | 0\.586 |
| GPT\-5\.2 | high | 362 | 45\.86% | 0\.127 |
| GPT\-5\.2 | low | 394 | 46\.70% | 0\.208 |

Table 8: Choice of the more famous candidate within probe\-underdetermined, non\-tie cases\. “Chooses more famous” is the fraction of such cases in which the pairwise answer is the candidate with the higher fame score\. The binomial test compares the observed rate to a 50% baseline\.
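As a rough illustration of how a test against a 50% baseline can be computed, the sketch below implements an exact two-sided binomial test using only the standard library. It assumes the test is two-sided and exact, which the text does not state, and the success counts are back-computed from the rounded percentages in Table 8 rather than taken from the authors' data.

```python
from math import comb

def binom_two_sided_p(k, n):
    """Exact two-sided binomial p-value against a 50% baseline.

    With p = 0.5 every outcome i has probability comb(n, i) / 2**n, so we sum
    the outcomes that are no more likely than the observed count k.
    """
    observed = comb(n, k)
    tail = sum(comb(n, i) for i in range(n + 1) if comb(n, i) <= observed)
    return tail / 2**n

# Counts are back-computed from rounded percentages (assumption):
# 181 / 386 = 46.89% (Claude Sonnet 4.6, high); 33 / 66 = 50.00% (Gemini, high).
p_claude_high = binom_two_sided_p(181, 386)
p_gemini_high = binom_two_sided_p(33, 66)
print(round(p_claude_high, 3), round(p_gemini_high, 3))
```

Integer arithmetic is used for the tail sum so that large binomial coefficients do not lose precision before the final division.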
### H\.5 Fame\-Based Analyses

We examine whether hallucinations can be explained by a simple prior toward the more famous scientist\. For each scientist $s$, we define

$$\mathrm{Fame}(s)=\frac{1}{3}\left[\mathrm{norm}(\mathrm{pageLength}_{s})+\mathrm{norm}(\mathrm{pageViews}_{s})+\mathrm{norm}(\mathrm{externalLinks}_{s})\right],$$

where $\mathrm{norm}(\cdot)$ denotes corpus\-level normalization across scientists\. The fame rank is induced by this score\.
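A minimal sketch of this score follows. The text does not specify the form of the corpus-level normalization, so min-max scaling is used here purely as an assumption; the function names and the toy numbers are hypothetical and are not benchmark data.

```python
def min_max(values):
    """Scale a list of numbers to [0, 1]; constant lists map to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def fame_scores(page_length, page_views, external_links):
    """Average the three normalized signals for each scientist."""
    columns = [min_max(page_length), min_max(page_views), min_max(external_links)]
    return [sum(triple) / 3 for triple in zip(*columns)]

# Hypothetical toy corpus of three scientists (not benchmark data).
scores = fame_scores(
    page_length=[10_000, 40_000, 25_000],
    page_views=[1_000, 9_000, 5_000],
    external_links=[20, 100, 60],
)
# Fame rank: indices sorted from most to least famous.
rank = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Any monotone normalization (for example z-scores or rank percentiles) could be substituted for `min_max` without changing the structure of the computation, though the induced rank may differ.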

We use the 2020\-01\-01 to 2025\-12\-31 calendar\-year window for the page view count because it is the most recent complete multi\-year window before our evaluation period \(Wikimedia Foundation, [2026](https://arxiv.org/html/2607.00447#bib.bib76)\)\. This window balances recency with robustness to short\-term spikes in public attention and avoids using page\-view data generated after the benchmark evaluation\.

These analyses serve as negative controls\. The wrong candidate is more famous in 61\.30% of benchmark questions, but among hallucinated cases this fraction is lower, ranging from 44\.64% to 57\.12% depending on the model setting\. Table [9](https://arxiv.org/html/2607.00447#A8.T9) shows that hallucination rates are lower, not higher, when the incorrect candidate is more famous\.

| Model | Mode | Hall\. when wrong not more famous | Hall\. when wrong more famous | Wrong more famous among hallucinations |
| --- | --- | --- | --- | --- |
| Claude Sonnet 4\.6 | high | 8\.83% | 4\.57% | 45\.05% |
| Claude Sonnet 4\.6 | low | 34\.19% | 17\.40% | 44\.64% |
| DeepSeek V3\.2 Reasoner | high | 13\.16% | 8\.92% | 51\.78% |
| DeepSeek V3\.2 Chat | low | 41\.25% | 34\.69% | 57\.12% |
| Gemini 3\.1 Pro Preview | high | 4\.24% | 2\.45% | 47\.83% |
| Gemini 3\.1 Pro Preview | low | 3\.53% | 1\.84% | 45\.21% |
| GPT\-5\.2 | high | 13\.52% | 8\.20% | 49\.00% |
| GPT\-5\.2 | low | 15\.37% | 9\.48% | 49\.42% |

Table 9: Fame\-direction negative control for the names\-only Scientist QA condition\. The first two numeric columns condition on whether the wrong candidate is more famous by fame score\. The final column reports, among hallucinated cases, the fraction in which the wrong candidate is more famous\. Hallucination rates are consistently lower when the wrong candidate is more famous, indicating that the observed shortcut is not a simple more\-famous\-name prior\.
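The three numeric columns of Table 9 are linked by Bayes' rule through the 61.30% base rate. The sketch below reproduces the final column from the first two for one row; it assumes that "not more famous" is the exact complement of "more famous" and that all percentages share the same question set. The function name is hypothetical.

```python
def share_wrong_more_famous(base_rate, hall_if_more_famous, hall_if_not):
    """Fraction of hallucinations in which the wrong candidate is more famous."""
    more = base_rate * hall_if_more_famous
    not_more = (1.0 - base_rate) * hall_if_not
    return more / (more + not_more)

# Claude Sonnet 4.6 (high) row of Table 9, with the 61.30% base rate.
share = share_wrong_more_famous(0.6130, 0.0457, 0.0883)
print(f"{share:.2%}")
```

Under these assumptions the result lands close to the 45.05% reported for that row, which illustrates why a share below the 61.30% base rate implies a lower hallucination rate when the wrong candidate is more famous.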
### H\.6 Confidence Diagnostics

Accuracy and confidence are imperfect certificates of faithful reasoning\. First, a model can sometimes answer the pairwise question correctly without answering both probes correctly; for example, only 62\.15% of DeepSeek\-low’s correct pairwise answers occur in the both\-probe\-correct regime\. This suggests that some correct answers may be supported by shortcuts that happen to point to the correct candidate\.

Second, self\-reported confidence does not reliably separate correct from hallucinated answers across model families\. Table [10](https://arxiv.org/html/2607.00447#A8.T10) reports confidence for correct and hallucinated pairwise answers\. Hallucinated answers are usually less confident than correct answers, but the absolute confidence remains high in many settings\. DeepSeek\-low is especially poorly separated: hallucinated and correct answers have nearly identical mean confidence\.

| Model | Mode | Mean conf\. correct | Mean conf\. hallucinated | Gap |
| --- | --- | --- | --- | --- |
| Claude Sonnet 4\.6 | high | 88\.06 | 74\.19 | \-13\.87 |
| Claude Sonnet 4\.6 | low | 72\.16 | 62\.46 | \-9\.70 |
| DeepSeek V3\.2 Chat | high | 92\.79 | 86\.48 | \-6\.31 |
| DeepSeek V3\.2 Reasoner | low | 84\.89 | 84\.93 | 0\.04 |
| Gemini 3\.1 Pro Preview | high | 99\.23 | 92\.55 | \-6\.68 |
| Gemini 3\.1 Pro Preview | low | 97\.38 | 76\.23 | \-21\.15 |
| GPT\-5\.2 | high | 91\.10 | 81\.35 | \-9\.76 |
| GPT\-5\.2 | low | 90\.81 | 83\.55 | \-7\.27 |

Table 10: Mean self\-reported confidence for correct and hallucinated pairwise answers in the names\-only Scientist QA condition\. The gap is hallucinated confidence minus correct confidence\. Confidence separates correct and incorrect answers for some models, but not reliably across model families\.

### H\.7 Real\-Life Constrained QA Results

Real\-Life Constrained QA contains 500 two\-option scenarios covering 13 aspects of daily life\. Table [11](https://arxiv.org/html/2607.00447#A8.T11) reports the final error counts and rates for the evaluated models\.

| Model | Errors / rate |
| --- | --- |
| Claude Sonnet 4\.6 | 81 / 16\.20% |
| DeepSeek\-chat | 182 / 36\.40% |
| GPT\-5\.2 | 44 / 8\.80% |
| Gemini 3\.1 Pro Preview | 18 / 3\.60% |

Table 11: Real\-Life Constrained QA results over 500 questions covering 13 aspects of daily life\. Entries report the number and percentage of incorrect shortcut selections\.

## Appendix I Potential risks

TrapQA is intended as a diagnostic benchmark, not as a training set or a broad certificate of hallucination robustness\. Public release may enable overfitting or contaminate future model training/evaluation, so later results should be interpreted with this risk in mind\. We are working with the community to expand Real\-Life Constrained QA and extend entity disambiguation beyond scientists to domains such as sports players and music composers; such extensions should be reported separately unless results are recomputed\.

## Appendix J Data Contains Personally Identifying Info Or Offensive Content

ScientistQA uses public Wikipedia/Wikidata\-linked scientist profiles, which makes the task verifiable but introduces coverage biases toward scientists with richer public or English\-language records\. Because the task distinguishes real scientists, names and public biographical facts are not anonymized\. We release only public attributes needed for the diagnostic task and exclude private contact information, images, surveillance data, and other private personal data\. Real\-Life Constrained QA is synthetic and filtered for ambiguity, plausibility, and inappropriate or offensive content\.

## Appendix K Proof for Section [3](https://arxiv.org/html/2607.00447#S3)

### K\.1 Posterior decomposition under the latent key–task model

In this appendix, we make explicit the hierarchical posterior structure implicit in the latent key–task model\. Recall that the pretraining prior over latent pairs factorizes as

$$\pi(k,t)=\pi^{(k)}(k)\,\pi^{(t)}(t\mid k).$$

Accordingly, for a prompt $\bm{z}$, we consider the hierarchical posterior decomposition

$$P(k,t\mid\bm{z})=P(k\mid\bm{z})\,P(t\mid k,\bm{z}),$$

where

$$P(k\mid\bm{z})=\frac{P(\bm{z}\mid k)\,\pi^{(k)}(k)}{\sum_{k^{\prime}\in\mathcal{K}}P(\bm{z}\mid k^{\prime})\,\pi^{(k)}(k^{\prime})},$$

and

$$P(t\mid k,\bm{z})=\frac{P(\bm{z}\mid k,t)\,\pi^{(t)}(t\mid k)}{\sum_{t^{\prime}\in\mathcal{T}}P(\bm{z}\mid k,t^{\prime})\,\pi^{(t)}(t^{\prime}\mid k)}.$$

Therefore,

$$P(k,t\mid\bm{z})=\frac{P(\bm{z}\mid k)\,\pi^{(k)}(k)}{\sum_{k^{\prime}\in\mathcal{K}}P(\bm{z}\mid k^{\prime})\,\pi^{(k)}(k^{\prime})}\cdot\frac{P(\bm{z}\mid k,t)\,\pi^{(t)}(t\mid k)}{\sum_{t^{\prime}\in\mathcal{T}}P(\bm{z}\mid k,t^{\prime})\,\pi^{(t)}(t^{\prime}\mid k)}.$$
For brevity in the proof below, we write

$$\pi^{(k)}_{\star}:=\pi^{(k)}(k^{\ast}),\qquad\pi^{(k)}_{(s)}:=\pi^{(k)}(k_{s}),$$

and

$$\pi^{(t)}_{\star}:=\pi^{(t)}(t^{\ast}\mid k^{\ast}),\qquad\pi^{(t)}_{(s)}:=\pi^{(t)}(t_{s}\mid k_{s}).$$
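The decomposition can be sanity-checked numerically on a toy latent space. The sketch below assumes that the key-level likelihood marginalizes over tasks, $P(\bm{z}\mid k)=\sum_{t}P(\bm{z}\mid k,t)\,\pi^{(t)}(t\mid k)$, which is the condition under which the hierarchical product coincides with the joint Bayes posterior. All numbers and variable names are hypothetical.

```python
# Toy latent space: two keys, two tasks (all numbers are hypothetical).
keys, tasks = ["k_star", "k_s"], ["t_star", "t_s"]
prior_k = {"k_star": 0.2, "k_s": 0.8}          # pi^(k)(k)
prior_t = {                                     # pi^(t)(t | k)
    "k_star": {"t_star": 0.3, "t_s": 0.7},
    "k_s": {"t_star": 0.4, "t_s": 0.6},
}
lik = {                                         # P(z | k, t) for one fixed prompt z
    ("k_star", "t_star"): 0.9, ("k_star", "t_s"): 0.2,
    ("k_s", "t_star"): 0.3, ("k_s", "t_s"): 0.5,
}

# Assumption: P(z | k) marginalizes the task, sum_t P(z | k, t) pi^(t)(t | k).
lik_k = {k: sum(lik[k, t] * prior_t[k][t] for t in tasks) for k in keys}

evidence = sum(lik_k[k] * prior_k[k] for k in keys)
post_k = {k: lik_k[k] * prior_k[k] / evidence for k in keys}
post_t = {k: {t: lik[k, t] * prior_t[k][t] / lik_k[k] for t in tasks} for k in keys}

# Hierarchical product P(k | z) P(t | k, z) versus the joint posterior via Bayes.
hierarchical = {(k, t): post_k[k] * post_t[k][t] for k in keys for t in tasks}
direct = {(k, t): lik[k, t] * prior_k[k] * prior_t[k][t] / evidence
          for k in keys for t in tasks}
```

In this toy example the key prior favors `k_s`, so the shortcut key carries substantial posterior mass even though the likelihood is highest on the correct pair.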

### K\.2 Proof of Theorem [3\.4](https://arxiv.org/html/2607.00447#S3.Thmtheorem4)

###### Proof K\.1

We factorize the joint posterior into key and task components:

$$\frac{P(k_{s},t_{s}\mid\bm{z})}{P(k^{\ast},t^{\ast}\mid\bm{z})}\;=\;\frac{P(k_{s}\mid\bm{z})}{P(k^{\ast}\mid\bm{z})}\cdot\frac{P(t_{s}\mid\bm{z},k_{s})}{P(t^{\ast}\mid\bm{z},k^{\ast})}.$$

By Assumption [3\.1](https://arxiv.org/html/2607.00447#S3.Thmtheorem1)\(i\), $P(k\in\{k^{\ast},k_{s}\}\mid\bm{z})\approx 1$, so for $k\in\{k^{\ast},k_{s}\}$,

$$P(k\mid\bm{z})\;\approx\;P(k\mid\bm{z},\,k\in\{k^{\ast},k_{s}\}).$$

By Assumption [3\.1](https://arxiv.org/html/2607.00447#S3.Thmtheorem1)\(ii\), $\bm{z}$ is independent of $k$ within the candidate pair, hence

$$P(k\mid\bm{z},\,k\in\{k^{\ast},k_{s}\})\;=\;P(k\mid k\in\{k^{\ast},k_{s}\})\;=\;\frac{\pi^{(k)}(k)}{\pi^{(k)}(k^{\ast})+\pi^{(k)}(k_{s})}.$$

Taking the ratio at $k=k_{s}$ and $k=k^{\ast}$,

$$\frac{P(k_{s}\mid\bm{z})}{P(k^{\ast}\mid\bm{z})}\;\approx\;\frac{\pi^{(k)}(k_{s})}{\pi^{(k)}(k^{\ast})}.$$
By Assumption [3\.2](https://arxiv.org/html/2607.00447#S3.Thmtheorem2)\(i\), conditional on the activated key $k$, the task posterior concentrates on $\{t^{\ast},t_{s}\}$, so for $t\in\{t^{\ast},t_{s}\}$ and $k\in\{k^{\ast},k_{s}\}$,

$$P(t\mid\bm{z},k)\;\approx\;P(t\mid\bm{z},k,\,t\in\{t^{\ast},t_{s}\}).$$

By Assumption [3\.2](https://arxiv.org/html/2607.00447#S3.Thmtheorem2)\(ii\), $\bm{z}$ is independent of $t$ given $k$ within the candidate task pair, so

$$P(t\mid\bm{z},k,\,t\in\{t^{\ast},t_{s}\})\;=\;P(t\mid k,\,t\in\{t^{\ast},t_{s}\})\;=\;\frac{\pi^{(t)}(t\mid k)}{\pi^{(t)}(t^{\ast}\mid k)+\pi^{(t)}(t_{s}\mid k)}.$$

Evaluating at $(t,k)=(t_{s},k_{s})$ and $(t^{\ast},k^{\ast})$ and taking the ratio,

$$\frac{P(t_{s}\mid\bm{z},k_{s})}{P(t^{\ast}\mid\bm{z},k^{\ast})}\;\approx\;\frac{\pi^{(t)}(t_{s}\mid k_{s})}{\pi^{(t)}(t^{\ast}\mid k^{\ast})}\cdot\frac{\pi^{(t)}(t^{\ast}\mid k^{\ast})+\pi^{(t)}(t_{s}\mid k^{\ast})}{\pi^{(t)}(t^{\ast}\mid k_{s})+\pi^{(t)}(t_{s}\mid k_{s})}.$$

The second factor is a ratio of normalization constants over the candidate task pair, which we absorb into the $\approx$ symbol as it is bounded and does not depend on $\bm{z}$\. Multiplying the key and task ratios then gives

$$\frac{P(k_{s},t_{s}\mid\bm{z})}{P(k^{\ast},t^{\ast}\mid\bm{z})}\;\approx\;\frac{\pi^{(k)}(k_{s})}{\pi^{(k)}(k^{\ast})}\cdot\frac{\pi^{(t)}(t_{s}\mid k_{s})}{\pi^{(t)}(t^{\ast}\mid k^{\ast})}.$$
By the law of total probability over key–task pairs, the marginal output probability decomposes as

$$P(y\mid\bm{z})\;=\;\sum_{k,t}P(y\mid\bm{z};k,t)\,P(k,t\mid\bm{z}).$$

By Assumption [3\.3](https://arxiv.org/html/2607.00447#S3.Thmtheorem3), only the shortcut path contributes non\-negligibly to $y_{s}$ and only the correct path contributes non\-negligibly to $y^{\ast}$:

$$P(y_{s}\mid\bm{z};k^{\ast},t^{\ast})\ll 1,\qquad P(y^{\ast}\mid\bm{z};k_{s},t_{s})\ll 1.$$

Hence,

$$P(y_{s}\mid\bm{z})\;\approx\;P(y_{s}\mid\bm{z};k_{s},t_{s})\,P(k_{s},t_{s}\mid\bm{z}),$$

$$P(y^{\ast}\mid\bm{z})\;\approx\;P(y^{\ast}\mid\bm{z};k^{\ast},t^{\ast})\,P(k^{\ast},t^{\ast}\mid\bm{z}).$$

Taking the ratio,

$$\frac{P(y_{s}\mid\bm{z})}{P(y^{\ast}\mid\bm{z})}\;\approx\;\frac{P(k_{s},t_{s}\mid\bm{z})}{P(k^{\ast},t^{\ast}\mid\bm{z})}\cdot\frac{P(y_{s}\mid\bm{z};k_{s},t_{s})}{P(y^{\ast}\mid\bm{z};k^{\ast},t^{\ast})}.$$

Substituting the approximation for the joint posterior ratio derived above gives

$$\frac{P(y_{s}\mid\bm{z})}{P(y^{\ast}\mid\bm{z})}\;\gtrsim\;\frac{\pi^{(k)}(k_{s})}{\pi^{(k)}(k^{\ast})}\cdot\frac{\pi^{(t)}(t_{s}\mid k_{s})}{\pi^{(t)}(t^{\ast}\mid k^{\ast})}\cdot\frac{P(y_{s}\mid\bm{z};k_{s},t_{s})}{P(y^{\ast}\mid\bm{z};k^{\ast},t^{\ast})}.$$

The second inequality in the theorem statement follows from the shortcut\-frequency dominance condition \(both pretraining\-prior ratios are $\geq 1$ by the definition of the shortcut path\) together with Assumption [3\.3](https://arxiv.org/html/2607.00447#S3.Thmtheorem3), which gives $P(y_{s}\mid\bm{z};k_{s},t_{s})\geq P(y^{\ast}\mid\bm{z};k^{\ast},t^{\ast})$\.
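To make the bound concrete, the right-hand side can be evaluated as a product of three ratios. The sketch below is a toy calculation with hypothetical numbers; the function name `shortcut_odds` and its argument names are illustrative and do not come from the paper.

```python
def shortcut_odds(pi_k_s, pi_k_star, pi_t_s, pi_t_star, out_s, out_star):
    """Approximate P(y_s | z) / P(y* | z) as a product of three ratios.

    pi_k_*  : key priors pi^(k)(k_s) and pi^(k)(k*)
    pi_t_*  : task priors pi^(t)(t_s | k_s) and pi^(t)(t* | k*)
    out_*   : path-conditional output probabilities P(y_s | z; k_s, t_s)
              and P(y* | z; k*, t*)
    """
    return (pi_k_s / pi_k_star) * (pi_t_s / pi_t_star) * (out_s / out_star)

# Hypothetical values: the shortcut key is 4x and the shortcut task 2x as
# frequent as the correct ones, with equal path-conditional output probabilities.
odds = shortcut_odds(0.8, 0.2, 0.6, 0.3, 0.9, 0.9)
```

With each ratio at least 1, the product is at least 1, so under these assumed values the shortcut answer is favored roughly eight to one even though the model can produce the correct answer along the correct path.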

### K\.3 Proof of Theorem [3\.6](https://arxiv.org/html/2607.00447#S3.Thmtheorem6)

###### Proof K\.2

By the latent key–task decomposition, the model prediction can be written as

$$P(y\mid\bm{z})=\sum_{k,t}P(k,t\mid\bm{z})\,P(y\mid\bm{z};k,t).$$

Restricting to the two relevant paths gives the contributions

$$P(y^{\ast}\mid\bm{z})\geq q^{\ast}P(y^{\ast}\mid\bm{z};k^{\ast},t^{\ast})+q_{s}P(y^{\ast}\mid\bm{z};k_{s},t_{s}),$$

and

$$P(y_{s}\mid\bm{z})\geq q_{s}P(y_{s}\mid\bm{z};k_{s},t_{s})+q^{\ast}P(y_{s}\mid\bm{z};k^{\ast},t^{\ast}).$$

Under Assumption [3\.3](https://arxiv.org/html/2607.00447#S3.Thmtheorem3),

$$P(y^{\ast}\mid\bm{z};k_{s},t_{s})=0,\qquad P(y_{s}\mid\bm{z};k^{\ast},t^{\ast})=0.$$

Thus, the two dominant contributions reduce to

$$P(y^{\ast}\mid\bm{z})\approx q^{\ast}P(y^{\ast}\mid\bm{z};k^{\ast},t^{\ast}),$$

and

$$P(y_{s}\mid\bm{z})\approx q_{s}P(y_{s}\mid\bm{z};k_{s},t_{s}).$$

Since

$$q_{s}>q^{\ast}$$

and

$$P(y_{s}\mid\bm{z};k_{s},t_{s})\geq P(y^{\ast}\mid\bm{z};k^{\ast},t^{\ast}),$$

we obtain

$$P(y_{s}\mid\bm{z})>P(y^{\ast}\mid\bm{z}).$$

Therefore,

$$\gamma(\bm{z}):=P(y_{s}\mid\bm{z})-P(y^{\ast}\mid\bm{z})>0.$$
It remains to lower bound the total variation distance\. By definition,

$$\ell(\bm{z})=\frac{1}{2}\sum_{y}\left|P(y\mid\bm{z})-P_{\star}(y\mid\bm{z})\right|.$$

Keeping only the two coordinates $y_{s}$ and $y^{\ast}$, we have

$$\ell(\bm{z})\geq\frac{1}{2}\left|P(y_{s}\mid\bm{z})-P_{\star}(y_{s}\mid\bm{z})\right|+\frac{1}{2}\left|P(y^{\ast}\mid\bm{z})-P_{\star}(y^{\ast}\mid\bm{z})\right|.$$

Since the model prefers the shortcut answer,

$$P(y_{s}\mid\bm{z})-P(y^{\ast}\mid\bm{z})=\gamma(\bm{z})>0,$$

whereas the target distribution prefers the correct answer,

$$P_{\star}(y^{\ast}\mid\bm{z})-P_{\star}(y_{s}\mid\bm{z})=\gamma_{\star}(\bm{z})>0.$$

Adding these two identities gives

$$\gamma(\bm{z})+\gamma_{\star}(\bm{z})=\left[P(y_{s}\mid\bm{z})-P_{\star}(y_{s}\mid\bm{z})\right]+\left[P_{\star}(y^{\ast}\mid\bm{z})-P(y^{\ast}\mid\bm{z})\right].$$

Because $|a|+|b|\geq a+b$ for any real $a$ and $b$, the two\-coordinate contribution to the total variation distance is at least

$$\frac{\gamma(\bm{z})+\gamma_{\star}(\bm{z})}{2}.$$

Hence,

$$\ell(\bm{z})\geq\frac{\gamma(\bm{z})+\gamma_{\star}(\bm{z})}{2}.$$

This completes the proof\.
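The lower bound can be checked on a small example. The sketch below uses a hypothetical three-answer output space (shortcut, correct, and a catch-all "other") with made-up probabilities; it illustrates the inequality and is not a reproduction of any experiment in the paper.

```python
def tv_distance(p, q):
    """Total variation distance between two distributions on the same support."""
    return 0.5 * sum(abs(p[y] - q[y]) for y in p)

# Hypothetical three-answer example: shortcut, correct, and everything else.
model = {"y_s": 0.60, "y_star": 0.30, "other": 0.10}    # P(y | z)
target = {"y_s": 0.05, "y_star": 0.90, "other": 0.05}   # P_star(y | z)

gamma = model["y_s"] - model["y_star"]           # model margin toward the shortcut
gamma_star = target["y_star"] - target["y_s"]    # target margin toward the correct answer
lower_bound = (gamma + gamma_star) / 2
loss = tv_distance(model, target)
```

Here the bound is not tight because the "other" coordinate also contributes to the total variation distance; it becomes tight when the two distributions agree outside $y_{s}$ and $y^{\ast}$ and both bracketed differences are non-negative.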