Do LLMs Really Know What They Don't Know? Internal States Mainly Reflect Knowledge Recall Rather Than Truthfulness

arXiv cs.CL 04/20/26, 04:00 AM Papers

Summary

This paper challenges the assumption that LLMs can reliably distinguish between hallucinated and factual outputs through internal signals, arguing that internal states primarily reflect knowledge recall rather than truthfulness. The authors propose a taxonomy of hallucinations (associated vs. unassociated) and show that associated hallucinations exhibit hidden-state geometries overlapping with factual outputs, making standard detection methods ineffective.

arXiv:2510.09033v3 Announce Type: replace


# Do LLMs Really Know What They Don't Know? Internal States Mainly Reflect Knowledge Recall Rather Than Truthfulness

Source: https://arxiv.org/html/2510.09033

Chi Seng Cheang¹ Hou Pong Chan² Wenxuan Zhang³ Yang Deng¹

¹Singapore Management University
²DAMO Academy, Alibaba Group
³Singapore University of Technology and Design


## Abstract

Recent work suggests that LLMs "know what they don't know", positing that hallucinated and factually correct outputs arise from distinct internal processes and can therefore be distinguished using internal signals. However, hallucinations have multifaceted causes: beyond simple knowledge gaps, they can emerge from training incentives that encourage models to exploit statistical shortcuts or spurious associations learned during pretraining. In this paper, we argue that when LLMs rely on such learned associations to produce hallucinations, their internal processes are mechanistically similar to those of factual recall, as both stem from strong statistical correlations encoded in the model's parameters. To verify this, we propose a novel taxonomy categorizing hallucinations into Unassociated Hallucinations (UHs), where outputs lack parametric grounding, and Associated Hallucinations (AHs), which are driven by spurious associations. Through mechanistic analysis, we compare their computational processes and hidden-state geometries with factually correct outputs. Our results show that hidden states primarily reflect whether the model is recalling parametric knowledge rather than the truthfulness of the output itself. Consequently, AHs exhibit hidden-state geometries that largely overlap with factual outputs, rendering standard detection methods ineffective. In contrast, UHs exhibit distinctive, clustered representations that facilitate reliable detection.

https://github.com/AndyCheang/knowledge-recall-vs-truthfulness

## 1 Introduction

**Figure 1:** Illustration of three categories of knowledge. Associated hallucinations follow internal knowledge recall processes similar to those of factual associations, while unassociated hallucinations arise when the model's output is detached from the input.

Large language models (LLMs) are notorious for producing hallucinations (Zhang et al., 2023b; Huang et al., 2025), where generated outputs appear plausible but are factually incorrect. Recent research suggests that LLMs' internal states contain signals correlated with factual correctness, enabling hallucination detection using internal representations, such as residual streams (Azaria and Mitchell, 2023; Gottesman and Geva, 2024), attention weights (Yüksekgönül et al., 2024), and output token logits (Orgad et al., 2025; Varshney et al., 2023a). However, since LLMs are not explicitly trained to represent truthfulness, it remains unclear whether these signals genuinely reflect truthfulness or instead capture other confounding factors. Understanding what these signals actually encode is critical for reliably deploying LLMs in real-world applications.

In this work, we argue that such internal signals primarily reflect the model's internal process of recalling parametric knowledge, rather than truthfulness itself. Consequently, these signals can reliably detect hallucinations only when hallucinated and factually correct outputs are produced by distinct internal mechanisms. For example, as shown in Figure 1, given the prompt "Brenda Johnston was born in the city of", a model that lacks the relevant factual knowledge about the subject ("Brenda Johnston") may hallucinate a completion such as "Portland". In contrast, given the prompt "Barack Obama studied in the city of", the model can leverage knowledge encoded about the subject ("Barack Obama") to produce the factually correct output "Chicago". These two cases are likely supported by different internal mechanisms: the former lacks knowledge about the subject entity, while the latter relies on encoded knowledge relevant to the queried subject. As a result, internal representations can reflect this difference in how the model processes the subject entity, enabling these cases to be distinguished.

However, hallucinations do not always arise from missing knowledge. When a model exploits learned statistical shortcuts or spurious correlations (Lin et al., 2022b; Kang and Choi, 2023; Cheang et al., 2023), the resulting hallucinations may be produced through mechanisms similar to those underlying factual recall. For instance, "Barack Obama" frequently co-occurs with "Chicago" in the model's pre-training corpora. The model can leverage this statistical association to produce a factually correct output (e.g., "Barack Obama studied in the city of Chicago."), but it may also leverage the same association to produce a hallucinated response (e.g., "Barack Obama was born in the city of Chicago."). In both cases, the model relies on the same encoded statistical association involving the subject entity. Consequently, the resulting internal representations may not provide reliable signals to distinguish hallucinated outputs from factual ones, limiting the effectiveness of existing representation-based hallucination detection methods.

Based on this observation, we hypothesize that the effectiveness of representation-based hallucination detection depends on how the model leverages its parametric knowledge when producing a response, particularly whether the generated output is driven by learned associations involving the subject entity. To investigate this hypothesis, we go beyond labeling outputs solely by factual correctness and instead categorize them according to their relationship with the subject entity through a causal intervention. Specifically, we label factually correct outputs as **Factual Associations (FAs)**. For outputs that are factually incorrect, we further classify them as **Unassociated Hallucinations (UHs)**, whose outputs lack strong learned associations with the subject entity, and **Associated Hallucinations (AHs)**, which are driven by strong but spurious associations involving the subject entity.

Using this taxonomy, we conduct mechanistic and empirical analyses of these knowledge categories, yielding three key observations:

**First, AHs and FAs share highly similar internal processes and representational geometries.** Building on the analytical framework of Geva et al. (2023), we examine the internal mechanisms underlying model predictions by tracing how information propagates across layers and token positions during inference. We observe that because AHs and FAs are both driven by learned associations with the subject, their hidden-state representations overlap in the representation space. In contrast, UHs do not rely on subject associations and are instead generated through a different internal process, which leaves them more separable in the representation space.

**Second, existing hallucination detection methods struggle to distinguish AHs from FAs.** Since these methods rely on internal states that primarily reflect the process of knowledge recall rather than truthfulness, their performance degrades significantly for AH samples (AUROC ≈ 0.48–0.69 for LLaMA). However, UHs are more reliably detected (AUROC ≈ 0.86–0.93) due to their more separable representational geometry.
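For readers less familiar with the metric, AUROC can be read as the probability that a detector scores a randomly chosen hallucinated sample above a randomly chosen factual one, so a value near 0.5 means the detector ranks the two groups no better than chance. The following is a minimal sketch of that definition in plain Python; the function name and the convention that higher scores mean "more likely hallucinated" are assumptions made for illustration, not the evaluation code used in the paper.

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative.

    scores: detector scores, where higher means "more likely hallucinated"
            (an assumed convention for this sketch).
    labels: 1 for a hallucinated sample, 0 for a factual one.
    Ties count as half a win (the Mann-Whitney U formulation).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The pairwise loop is quadratic in the number of samples, which keeps the definition visible; a sort-based implementation would be preferable for large evaluation sets.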

**Third, representational overlap constrains the effectiveness of refusal tuning for AHs.** We compare tuning effectiveness under two settings: (i) training the model to refuse AHs, and (ii) training the model to refuse UHs. In both settings, the model is trained to maintain its original factual responses for FAs. Because UH representations are more separable from FAs, the model can successfully learn distinct generative behaviors, achieving an 82% refusal rate on UH samples. Conversely, because AH representations overlap substantially with FAs, the model struggles to differentiate them to learn refusal behaviors, resulting in a refusal rate of only 33% for AH samples.

## 2 Related Work

Existing hallucination detection methods can be broadly categorized into two types: **representation-based** and **confidence-based**.

**Representation-based methods** assume that an LLM's internal hidden states can reflect the correctness of its generated responses. These approaches train a classifier (often a linear probe) using the hidden states from a set of labeled correct/incorrect responses to predict whether a new response is hallucinatory (Li et al., 2023; Azaria and Mitchell, 2023; Su et al., 2024; Ji et al., 2024; Chen et al., 2024; Ni et al., 2025; Xiao et al., 2025).
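A linear probe of this kind can be sketched as a logistic-regression classifier over hidden-state vectors. The example below is an illustrative toy in plain Python, not the setup of any cited method: the "hidden states" are synthetic Gaussian clusters produced by the hypothetical helper `make_toy_states`, and the learning rate and epoch count are arbitrary assumptions.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_probe(features, labels, lr=0.1, epochs=200):
    """Fit logistic-regression weights with full-batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            # Gradient of the log loss with respect to the logit.
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_score(w, b, x):
    """Probe's estimated probability that hidden state x is a hallucination."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def make_toy_states(rng, center, n, dim=4, noise=0.5):
    """Hypothetical stand-in for real hidden states: a Gaussian cluster."""
    return [[center + rng.gauss(0.0, noise) for _ in range(dim)]
            for _ in range(n)]
```

In practice the feature vectors would be hidden states extracted from a chosen layer and token position of the LLM, and the probe would be evaluated on held-out responses rather than on its training data.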

**Confidence-based methods**, in contrast, assume that lower confidence during generation indicates a higher probability of hallucination. These methods quantify uncertainty through various signals, including: (i) token-level output probabilities (Guerreiro et al., 2023; Varshney et al., 2023a; Orgad et al., 2025); (ii) directly querying the LLM to verbalize its own confidence (Lin et al., 2022a; Tian et al., 2023; Xiong et al., 2024; Yang et al., 2024b; Ni et al., 2024; Zhao et al., 2024); or (iii) measuring the semantic consistency across multiple outputs sampled from the same prompt (Manakul et al., 2023; Kuhn et al., 2023; Zhang et al., 2023a; Ding et al., 2024). A response is typically flagged as a hallucination if its associated confidence metric falls below a predetermined threshold.
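As a concrete illustration of signal (i), the sketch below averages the log-probabilities of the generated tokens and flags the response when that average falls below a threshold. The function names and the default threshold of log(0.5) are hypothetical choices for this example; the cited methods define their own scoring functions and thresholds.

```python
import math

def mean_token_logprob(token_probs):
    """Average log-probability of the generated tokens.

    token_probs: probabilities the model assigned to each generated
    token; every value must be strictly positive.
    """
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    return sum(math.log(p) for p in token_probs) / len(token_probs)

def flag_hallucination(token_probs, threshold=math.log(0.5)):
    """Flag a response whose mean token log-probability is below threshold.

    The default threshold is an arbitrary value chosen for illustration.
    """
    return mean_token_logprob(token_probs) < threshold
```

For example, `flag_hallucination([0.9, 0.95])` evaluates to `False`, while `flag_hallucination([0.1, 0.2])` evaluates to `True`.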

However, a growing body of work reveals a critical limitation: even state-of-the-art LLMs are poorly calibrated, meaning their expressed confidence often fails to align with the factual accuracy of their generations (Kapoor et al., 2024; Xiong et al., 2024; Tian et al., 2023). This miscalibration limits the effectiveness of confidence-based detectors and raises a fundamental question about the extent of LLMs' self-awareness of their knowledge boundary, i.e., whether they can reliably "know what they don't know" (Yin et al., 2023; Li et al., 2025).
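One common way to quantify such miscalibration is the expected calibration error (ECE), which bins responses by stated confidence and averages the gap between confidence and accuracy within each bin. ECE is used here only as an illustration and is not a metric named in this section; the function name and the choice of ten equal-width bins are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy per bin.

    confidences: model confidence in [0, 1] for each response.
    correct: True if the corresponding response is factually correct.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece
```

A perfectly calibrated model would have an ECE of zero, whereas a model that is confidently wrong on most responses would have an ECE close to its average confidence.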

Although prior work recognizes this problem, it does not provide a mechanistic explanation for why it occurs. Our work addresses this explanatory gap by employing mechanistic interpretability techniques to trace the internal computations underlying knowledge recall within LLMs.

## 3 Dataset Construction

In this section, we outline our dataset construction for mechanistic and empirical analyses under two conditions: hallucinations produced with and without leveraging the learned associations related to the subject entity.

Given an input query q, the ground-truth answer y, and the model's response ŷ, standard evaluation of hallucination detection methods typically assigns a factual correctness label by comparing ŷ with y. To study hallucinations produced through different internal mechanisms, we go beyond factual correctness: for each hallucinatory sample, we perform a causal intervention to estimate its reliance on learned subject associations and categorize it accordingly.

### 3.1 Data Collection

#### Factual Query Prompt Creation

We focus on a knowledge-based question answering setting, where each example corresponds to a knowledge triple (s, r, o) consisting of a subject, a relation, and an object. To construct factual query prompts, we first collect knowledge triples from Wikidata (Vrandečić and Krötzsch, 2014). Each (s, r) pair is then converted into a cloze-style factual query q using a handcrafted prompt template for each relation r. The corresponding object o is treated as the ground-truth answer y.
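The conversion from a triple to a cloze-style query can be sketched as follows. The relation keys and template strings below are hypothetical, modeled on the example prompts in the introduction; the actual templates are those listed in Appendix A.1.

```python
from dataclasses import dataclass

# Hypothetical per-relation templates; the real ones are in Appendix A.1.
TEMPLATES = {
    "place_of_birth": "{subject} was born in the city of",
    "studied_in_city": "{subject} studied in the city of",
}

@dataclass(frozen=True)
class FactualQuery:
    prompt: str  # cloze-style query q
    answer: str  # ground-truth answer y (the object o)

def build_query(subject, relation, obj):
    """Turn a (subject, relation, object) triple into a cloze query."""
    template = TEMPLATES[relation]
    return FactualQuery(prompt=template.format(subject=subject), answer=obj)
```

For instance, `build_query("Barack Obama", "studied_in_city", "Chicago")` yields the prompt "Barack Obama studied in the city of" paired with the answer "Chicago".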

To ensure a well-defined evaluation setting, we follow Gekhman et al. (2025) and select only relations for which the correct object is objectively verifiable. Details on relation selection and prompt templates are provided in Appendix A.1.

#### Generating Model Responses

For each query, we prompt LLMs to generate a response ŷ using greedy decoding. We conduct experiments on two widely adopted open-source LLMs: LLaMA-3 (Dubey et al., 2024) and Mistral-v0.3 (Jiang et al., 2023). Due to space constraints, full implementation details are provided in Appendix A.2.
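Greedy decoding selects the highest-scoring token at every step, which makes the response deterministic for a given prompt. The sketch below shows the loop over a toy interface: `next_token_logits` is a hypothetical callable that maps the current token sequence to a dictionary of token scores, standing in for a real model's forward pass, and the token limit is an arbitrary assumption.

```python
def greedy_decode(next_token_logits, prompt_tokens, eos_token,
                  max_new_tokens=16):
    """Generate tokens by always taking the highest-scoring next token.

    next_token_logits: hypothetical callable mapping a token list to a
    dict of {token: score}; a real model's forward pass would go here.
    """
    tokens = list(prompt_tokens)
    generated = []
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)
        best = max(logits, key=logits.get)  # argmax over the vocabulary
        if best == eos_token:
            break
        generated.append(best)
        tokens.append(best)
    return generated
```

Because no sampling is involved, repeated runs on the same prompt produce the same response, which keeps the correctness label of each sample stable.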

### 3.2 Categorization of Knowledge

We categorize each response based on two criteria: (1) **factual correctness** and (2) **reliance on subject representations**. Each sample is then assigned to one of the following categories:

- **Factual Associations (FA)** refer to factual knowledge that is reliably stored in the parameters or internal states of an LLM and can be recalled to produce correct, verifiable outputs.

- **Associated Hallucinations (AH)** refer to non-factual content produced when an LLM relies on input-triggered parametric associations.

- **Unassociated Hallucinations (UH)** refer to non-factual content produced without reliance on parametric associations to the input.
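The two criteria combine into a simple decision rule, sketched below. This is only an illustration of the taxonomy's logic: `subject_reliance` is a hypothetical scalar standing in for the outcome of the causal intervention, and the threshold of 0.5 is an arbitrary assumption rather than a value taken from the paper.

```python
def categorize(is_correct, subject_reliance, threshold=0.5):
    """Assign a response to FA, AH, or UH.

    is_correct: whether the response matches the ground-truth answer.
    subject_reliance: hypothetical score for how strongly the output
        depends on the subject representation (higher = more reliant).
    threshold: arbitrary cut-off chosen for illustration only.
    """
    if is_correct:
        return "FA"  # factual association
    if subject_reliance >= threshold:
        return "AH"  # associated hallucination
    return "UH"      # unassociated hallucination
```

Under this rule, an incorrect response is split into AH or UH solely by its reliance on the subject, which is the distinction that factual-correctness labels alone cannot capture.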

**Figure 2:** Effect of interventions across layers of LLaMA-3-8B. The heatmap shows JS divergence between the output distribution before and after intervention. Darker color indicates that the intervened hidden states are more causally influential on the model's predictions. Top row: patching representations of subject tokens. Middle row: blocking attention flow from subject to the last token. Bottom row: patching representations of the last token.
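The Jensen-Shannon (JS) divergence reported in the heatmap compares two probability distributions over the vocabulary: it is zero when the intervention leaves the output distribution unchanged and grows as the distributions diverge. A minimal sketch in plain Python follows; the function names are illustrative, and the use of the natural logarithm (giving a maximum of ln 2) is an assumption, since the base is not stated here.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions.

    p, q: sequences of probabilities over the same vocabulary, each
    summing to 1. The result is symmetric and bounded by ln 2.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Mixture distribution; m_i > 0 whenever p_i > 0 or q_i > 0.
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Here `p` would be the model's next-token distribution before the intervention and `q` the distribution after it, so a larger value indicates that the intervened hidden states matter more for the prediction.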

### 3.3 Labeling Procedure

We detail the labeling procedure as follows:
