Do LLMs Really Know What They Don't Know? Internal States Mainly Reflect Knowledge Recall Rather Than Truthfulness
Summary
This paper challenges the assumption that LLMs can reliably distinguish between hallucinated and factual outputs through internal signals, arguing that internal states primarily reflect knowledge recall rather than truthfulness. The authors propose a taxonomy of hallucinations (associated vs. unassociated) and show that associated hallucinations exhibit hidden-state geometries overlapping with factual outputs, making standard detection methods ineffective.
Cached at: 04/20/26, 08:31 AM
# Do LLMs Really Know What They Don't Know? Internal States Mainly Reflect Knowledge Recall Rather Than Truthfulness

Source: https://arxiv.org/html/2510.09033

Chi Seng Cheang¹, Hou Pong Chan², Wenxuan Zhang³, Yang Deng¹

¹Singapore Management University, ²DAMO Academy, Alibaba Group, ³Singapore University of Technology and Design

## Abstract

Recent work suggests that LLMs "know what they don't know", positing that hallucinated and factually correct outputs arise from distinct internal processes and can therefore be distinguished using internal signals. However, hallucinations have multifaceted causes: beyond simple knowledge gaps, they can emerge from training incentives that encourage models to exploit statistical shortcuts or spurious associations learned during pretraining. In this paper, we argue that when LLMs rely on such learned associations to produce hallucinations, their internal processes are mechanistically similar to those of factual recall, as both stem from strong statistical correlations encoded in the model's parameters. To verify this, we propose a novel taxonomy categorizing hallucinations into Unassociated Hallucinations (UHs), where outputs lack parametric grounding, and Associated Hallucinations (AHs), which are driven by spurious associations. Through mechanistic analysis, we compare their computational processes and hidden-state geometries with factually correct outputs. Our results show that hidden states primarily reflect whether the model is recalling parametric knowledge rather than the truthfulness of the output itself. Consequently, AHs exhibit hidden-state geometries that largely overlap with factual outputs, rendering standard detection methods ineffective. In contrast, UHs exhibit distinctive, clustered representations that facilitate reliable detection.
https://github.com/AndyCheang/knowledge-recall-vs-truthfulness

## 1 Introduction

**Figure 1:** Illustration of three categories of knowledge. Associated hallucinations follow internal knowledge recall processes similar to those of factual associations, while unassociated hallucinations arise when the model's output is detached from the input.

Large language models (LLMs) are notorious for producing hallucinations (Zhang et al., 2023b; Huang et al., 2025), where generated outputs appear plausible but are factually incorrect. Recent research suggests that LLMs' internal states contain signals correlated with factual correctness, enabling hallucination detection using internal representations, such as residual streams (Azaria and Mitchell, 2023; Gottesman and Geva, 2024), attention weights (Yüksekgönül et al., 2024), and output token logits (Orgad et al., 2025; Varshney et al., 2023a). However, since LLMs are not explicitly trained to represent truthfulness, it remains unclear whether these signals genuinely reflect truthfulness or instead capture other confounding factors. Understanding what these signals actually encode is critical for reliably deploying LLMs in real-world applications.

In this work, we argue that such internal signals primarily reflect the model's internal process of recalling parametric knowledge, rather than truthfulness itself. Consequently, these signals can reliably detect hallucinations only when hallucinated and factually correct outputs are produced by distinct internal mechanisms. For example, as shown in Figure 1, given the prompt "Brenda Johnston was born in the city of", a model that lacks the relevant factual knowledge about the subject ("Brenda Johnston") may hallucinate a completion such as "Portland". In contrast, given the prompt "Barack Obama studied in the city of", the model can leverage knowledge encoded about the subject ("Barack Obama") to produce the factually correct output "Chicago".
These two cases are likely supported by different internal mechanisms: the former lacks knowledge about the subject entity, while the latter relies on encoded knowledge relevant to the queried subject. As a result, internal representations can reflect this difference in how the model processes the subject entity, enabling these cases to be distinguished.

However, hallucinations do not always arise from missing knowledge. When a model exploits learned statistical shortcuts or spurious correlations (Lin et al., 2022b; Kang and Choi, 2023; Cheang et al., 2023), the resulting hallucinations may be produced through mechanisms similar to those underlying factual recall. For instance, "Barack Obama" frequently co-occurs with "Chicago" in the model's pre-training corpora. The model can leverage this statistical association to produce a factually correct output (e.g., "Barack Obama studied in the city of Chicago."), but it may also leverage the same association to produce a hallucinated response (e.g., "Barack Obama was born in the city of Chicago."). In both cases, the model relies on the same encoded statistical association involving the subject entity. Consequently, the resulting internal representations may not provide reliable signals to distinguish hallucinated outputs from factual ones, limiting the effectiveness of existing representation-based hallucination detection methods.

Based on this observation, we hypothesize that the effectiveness of representation-based hallucination detection depends on how the model leverages its parametric knowledge when producing a response, particularly whether the generated output is driven by learned associations involving the subject entity. To investigate this hypothesis, we go beyond labeling outputs solely by factual correctness and instead categorize them according to their relationship with the subject entity through a causal intervention. Specifically, we label factually correct outputs as **Factual Associations (FAs)**.
For outputs that are factually incorrect, we further classify them as **Unassociated Hallucinations (UHs)**, whose outputs lack strong learned associations with the subject entity, and **Associated Hallucinations (AHs)**, which are driven by strong but spurious associations involving the subject entity. Using this taxonomy, we conduct mechanistic and empirical analyses of these knowledge categories, yielding three key observations:

**First, AHs and FAs share highly similar internal processes and representational geometries.** Building on the analytical framework of Geva et al. (2023), we examine the internal mechanisms underlying model predictions by tracing how information propagates across layers and token positions during inference. We observe that because AHs and FAs are both driven by learned associations with the subject, their hidden state representations overlap in the hidden space. In contrast, UHs lack this reliance on subject associations and are instead generated through a different internal process, allowing them to remain more separable in the representation space.

**Second, existing hallucination detection methods struggle to distinguish AHs from FAs.** Since these methods rely on internal states that primarily reflect the process of knowledge recall rather than truthfulness, their performance degrades significantly for AH samples (AUROC ≈ 0.48–0.69 for LLaMA). However, UHs are more reliably detected (AUROC ≈ 0.86–0.93) due to their more separable representational geometry.

**Third, representational overlap constrains the effectiveness of refusal tuning for AHs.** We compare tuning effectiveness under two settings: (i) training the model to refuse AHs, and (ii) training the model to refuse UHs. In both settings, the model is trained to maintain its original factual responses for FAs. Because UH representations are more separable from FAs, the model can successfully learn distinct generative behaviors, achieving an 82% refusal rate on UH samples.
Conversely, because AH representations overlap substantially with FAs, the model struggles to differentiate them to learn refusal behaviors, resulting in a refusal rate of only 33% for AH samples.

## 2 Related Work

Existing hallucination detection methods can be broadly categorized into two types: **representation-based** and **confidence-based**.

**Representation-based methods** assume that an LLM's internal hidden states can reflect the correctness of its generated responses. These approaches train a classifier (often a linear probe) using the hidden states from a set of labeled correct/incorrect responses to predict whether a new response is hallucinatory (Li et al., 2023; Azaria and Mitchell, 2023; Su et al., 2024; Ji et al., 2024; Chen et al., 2024; Ni et al., 2025; Xiao et al., 2025).

**Confidence-based methods**, in contrast, assume that lower confidence during generation indicates a higher probability of hallucination. These methods quantify uncertainty through various signals, including: (i) token-level output probabilities (Guerreiro et al., 2023; Varshney et al., 2023a; Orgad et al., 2025); (ii) directly querying the LLM to verbalize its own confidence (Lin et al., 2022a; Tian et al., 2023; Xiong et al., 2024; Yang et al., 2024b; Ni et al., 2024; Zhao et al., 2024); or (iii) measuring the semantic consistency across multiple outputs sampled from the same prompt (Manakul et al., 2023; Kuhn et al., 2023; Zhang et al., 2023a; Ding et al., 2024). A response is typically flagged as a hallucination if its associated confidence metric falls below a predetermined threshold.

However, a growing body of work reveals a critical limitation: even state-of-the-art LLMs are poorly calibrated, meaning their expressed confidence often fails to align with the factual accuracy of their generations (Kapoor et al., 2024; Xiong et al., 2024; Tian et al., 2023).
This miscalibration limits the effectiveness of confidence-based detectors and raises a fundamental question about the extent of LLMs' self-awareness of their knowledge boundary, i.e., whether they can reliably "know what they don't know" (Yin et al., 2023; Li et al., 2025). Despite recognizing this problem, prior work does not provide a mechanistic explanation for its occurrence. To this end, our work addresses this explanatory gap by employing mechanistic interpretability techniques to trace the internal computations underlying knowledge recall within LLMs.

## 3 Dataset Construction

In this section, we outline our dataset construction for mechanistic and empirical analyses under two conditions: hallucinations produced with and without leveraging the learned associations related to the subject entity. Given an input query q, the ground-truth answer y, and the model's response ŷ, standard evaluation of hallucination detection methods typically assigns a factual correctness label by comparing ŷ with y. To study hallucinations produced through different internal mechanisms, we go beyond factual correctness: for each hallucinatory sample, we perform a causal intervention to estimate its reliance on learned subject associations and categorize it accordingly.

### 3.1 Data Collection

#### Factual Query Prompt Creation

We focus on a knowledge-based question answering setting, where each example corresponds to a knowledge triple (subject, relation, object), denoted (s, r, o). To construct factual query prompts, we first collect knowledge triples from Wikidata (Vrandečić and Krötzsch, 2014). Each (s, r) pair is then converted into a cloze-style factual query q using a handcrafted prompt template for each relation r. The corresponding object o is treated as the ground-truth answer y. To ensure a well-defined evaluation setting, we follow Gekhman et al. (2025) and select only relations for which the correct object is objectively verifiable.
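
The triple-to-query conversion described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the relation keys, the `TEMPLATES` table, the `Triple` class, and the `build_query` helper are all hypothetical names, and the two template strings simply mirror the example prompts quoted in the introduction rather than the paper's actual templates.

```python
from dataclasses import dataclass

# Hypothetical per-relation templates (assumption). The wording mirrors the
# example prompts in the introduction; the relation keys are made up here.
TEMPLATES = {
    "place_of_birth": "{subject} was born in the city of",
    "educated_in_city": "{subject} studied in the city of",
}


@dataclass(frozen=True)
class Triple:
    """A knowledge triple (s, r, o)."""
    subject: str
    relation: str
    obj: str


def build_query(triple: Triple) -> tuple[str, str]:
    """Turn the (s, r) pair into a cloze-style query q and return (q, y),
    where the object o serves as the ground-truth answer y."""
    template = TEMPLATES[triple.relation]  # KeyError for unsupported relations
    query = template.format(subject=triple.subject)
    return query, triple.obj


if __name__ == "__main__":
    q, y = build_query(Triple("Barack Obama", "educated_in_city", "Chicago"))
    print(q)  # the cloze prompt given to the model
    print(y)  # the ground-truth answer used for comparison with the response
```

Keeping one template per relation means every query for that relation ends right before the object slot, so the model's continuation can be compared directly with y.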
Details on relation selection and prompt templates are provided in Appendix A.1.

#### Generating Model Responses

For each query, we prompt LLMs to generate a response ŷ using greedy decoding. We conduct experiments on two widely adopted open-source LLMs: LLaMA-3 (Dubey et al., 2024) and Mistral-v0.3 (Jiang et al., 2023). Due to space constraints, full implementation details are provided in Appendix A.2.

### 3.2 Categorization of Knowledge

We categorize each response based on two criteria: (1) **factual correctness** and (2) **reliance on subject representations**. Each sample is then categorized into one of the following categories:

- **Factual Associations (FA)** refer to factual knowledge that is reliably stored in the parameters or internal states of an LLM and can be recalled to produce correct, verifiable outputs.
- **Associated Hallucinations (AH)** refer to non-factual content produced when an LLM relies on input-triggered parametric associations.
- **Unassociated Hallucinations (UH)** refer to non-factual content produced without reliance on parametric associations to the input.

**Figure 2:** Effect of interventions across layers of LLaMA-3-8B. The heatmap shows JS divergence between the output distribution before and after intervention. Darker color indicates that the intervened hidden states are more causally influential on the model's predictions. Top row: patching representations of subject tokens. Middle row: blocking attention flow from subject to the last token. Bottom row: patching representations of the last token.

### 3.3 Labeling Procedure

We detail the labeling procedure as follows:
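
The labeling steps themselves are not reproduced in this excerpt. As a side illustration of the metric named in the Figure 2 caption only, the sketch below computes the Jensen-Shannon (JS) divergence between two output distributions, such as next-token probabilities before and after an intervention. The function names and the choice of a base-2 logarithm (which bounds the value to [0, 1]) are assumptions made here; the source does not state which base is used.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions.

    Both inputs are sequences of probabilities over the same vocabulary.
    Base-2 logs are an assumption of this sketch, giving a value in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Mixture distribution m = (p + q) / 2; m_i > 0 wherever p_i or q_i > 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    before = [0.7, 0.2, 0.1]  # toy output distribution before intervention
    after = [0.1, 0.2, 0.7]   # toy output distribution after intervention
    print(js_divergence(before, after))
```

A value near 0 means the intervention barely changed the prediction, while a larger value indicates that the intervened hidden states had more causal influence on the output.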
Similar Articles
Understanding New-Knowledge-Induced Factual Hallucinations in LLMs: Analysis and Interpretation
This paper investigates how fine-tuning LLMs on new knowledge induces factual hallucinations, showing that unfamiliarity within specific knowledge types drives hallucinations through weakened attention to key entities. The authors propose mitigating this by reintroducing known knowledge during later training stages.
Why LLMs Hallucinate on Structured Knowledge: A Mechanistic Analysis of Reasoning over Linearized Representations
This paper presents a mechanistic analysis of why LLMs hallucinate when reasoning over linearized structured knowledge, finding that hallucinations stem from systematic internal dynamics such as attention on shortcut cues and failures in semantic grounding in feed-forward layers, rather than random noise.
Can LLMs Introspect? A Reality Check
This paper argues that recent claims about LLMs' ability to introspect are not justified, as behavioral evidence alone cannot distinguish genuine introspection from pattern matching on surface-level cues. The authors re-examine two evaluation paradigms and find that models rely on input-level features rather than genuine access to internal states.
Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment
This paper investigates whether standard benchmarks underestimate LLM performance by re-evaluating hallucination detection datasets using an LLM-first, human-adjudicated assessment method. The study finds that incorporating LLM reasoning into the adjudication process improves agreement and suggests that model-assisted re-evaluation yields more reliable benchmarks for ambiguity-prone tasks.
PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts
This paper reveals that much of the reported progress in LLM hallucination detection is due to benchmark construction artifacts, where ground-truth answers are embedded in prompts, allowing a simple text-similarity baseline to achieve near-perfect scores. Through a large-scale controlled evaluation, the authors show that most methods perform near chance under proper controls, except for supervised probes on upper-layer hidden states such as SAPLMA and their proposed DRIFT.