Why Fine-Tuning Encourages Hallucinations and How to Fix It
Summary
This paper investigates how supervised fine-tuning (SFT) increases hallucinations in LLMs by causing knowledge degradation and proposes a self-distillation-based method to mitigate this issue while preserving pre-existing factual knowledge. The authors identify semantic interference among overlapping representations as the primary mechanism behind SFT-induced hallucinations and demonstrate solutions including parameter freezing and self-distillation.
# Why Fine-Tuning Encourages Hallucinations and How to Fix It

Source: https://arxiv.org/html/2604.15574

Guy Kaplan♡, Zorik Gekhman♣, Zhen Zhu♢, Lotem Rozner♡, Yuval Reif♡, Swabha Swayamdipta♠, Derek Hoiem♢, Roy Schwartz♡

♡Hebrew University of Jerusalem, ♢University of Illinois Urbana-Champaign, ♣Technion – Israel Institute of Technology, ♠University of Southern California

###### Abstract

Large language models are prone to hallucinating factually incorrect statements. A key source of these errors is exposure to new factual information through supervised fine-tuning (SFT), which can increase hallucinations w.r.t. knowledge acquired during pre-training. In this work, we explore whether SFT-induced hallucinations can be mitigated using established tools from the continual learning literature, since they arise as a by-product of knowledge degradation during training. We propose a self-distillation–based SFT method that facilitates effective factual learning while minimizing hallucinations w.r.t. pre-existing knowledge by regularizing output-distribution drift. We also show that, in settings where new knowledge acquisition is unnecessary, suppressing factual plasticity by freezing parameter groups can preserve task performance while reducing hallucinations. Lastly, we investigate the mechanism behind SFT-induced hallucinations through three hypotheses: capacity limitations, behavior cloning, and localized interference. Our experiments show that a main driver is interference among overlapping semantic representations, and that self-distillation succeeds by mitigating this interference.

Figure 1: Left: SFT-induced hallucinations as factual forgetting in parameter space, starting from θ₀*. The regions denote subspaces with low error on preexisting facts (A), the task (T) (e.g., QA), and new facts (B). Standard SFT acquires new facts but forgets existing ones. Parameter freeze preserves existing facts at the cost of new ones. Self-distillation achieves both. Right: SFT on semantically overlapping entities causes hallucinations on related existing ones. E.g., after learning that "Bergadena" (a city-like fictional name) is in Greece, the model hallucinates about real cities like Milan, while mapping the random identifier Loc_fcfb42 to Greece causes no such effect, even with many new facts.

## 1 Introduction

Recent studies show that when models learn new factual knowledge via supervised fine-tuning (SFT), they also start to produce incorrect answers to questions that they previously answered correctly (Gekhman et al., 2024; Kalai et al., 2025). This is particularly concerning, as SFT is standard practice in LLM development and may further aggravate hallucinations, which pose a significant challenge for application reliability (Huang et al., 2025). In parallel, the continual learning literature has extensively studied how sequential training can interfere with previously acquired knowledge (Kirkpatrick et al., 2017; Sarfraz et al., 2023; Kim et al., 2025), and has proposed a range of mitigation strategies (Lange et al., 2019; Mai et al., 2021; Guo et al., 2025; Wang et al., 2023a; Lin et al., 2025). In this work, we explore whether SFT-induced hallucinations can be mitigated using established tools from continual learning.

In continual learning, forgetting typically arises as a byproduct of acquiring new information: parameter updates during fine-tuning alter the model in ways that degrade previously encoded knowledge. Analogously, we propose that *factual forgetting* occurs when parameter updates introduced during SFT inadvertently distort representations of facts learned during pre-training. This behavior reflects a stability–plasticity tradeoff (Kim et al., 2023): increasing *factual plasticity* (the ability to acquire new facts) may come at the expense of *factual stability* (the ability to preserve existing facts).
Consequently, factual plasticity can induce factual forgetting, which manifests as SFT-induced hallucinations.

Motivated by this, we perform controlled experiments with the goal of disentangling factual learning from task learning. We adopt the experimental setup of Gekhman et al. (2024), reproducing their finding that hallucinations increase when exposing models to new factual knowledge through SFT. Building on the observation that different parameter groups play distinct roles in factual storage and task learning (Geva et al., 2021; Dar et al., 2023; Zhu et al., 2025), we first demonstrate that reducing factual plasticity (e.g., by freezing parameter groups) enables the model to learn the downstream task while limiting new fact acquisition and reducing hallucinations.

However, this setting deliberately suppresses factual plasticity, whereas in practice we may want SFT to support both task learning *and* the acquisition of new factual knowledge without inducing hallucinations. We hypothesize that continual learning methods designed to mitigate forgetting should help achieve this objective. To test this, we apply *self-distillation* (Li and Hoiem, 2017), a continual learning technique in which the model is regularized to stay close to its own earlier output distribution during fine-tuning, recently shown to reduce forgetting in LLMs (Shenfeld et al., 2026; Zhu et al., 2025). Our results show that this approach reduces SFT-induced hallucinations while still enabling effective acquisition of newly introduced facts (see left panel of Fig. 1).

We next investigate the mechanisms underlying SFT-induced hallucinations. Specifically, we ask whether these errors stem from global capacity limitations (Allen-Zhu and Li, 2024), behavior cloning induced by SFT (Zhang et al., 2024; Schulman, 2023), or localized interference, whereby new facts corrupt existing ones when they share representational structure with them.
To disentangle these, we fine-tune models on synthetic facts while varying the scale and surface form of entity names: either name-like strings, hypothesized to share representational neighborhoods with existing entities, or random UUID-style identifiers that do not (see Fig. 1, right). Forgetting appears highly sensitive to surface-form similarity: name-like entities induce substantially more forgetting as scale increases, while UUID-based entities induce near-zero forgetting even at 1M new facts, suggesting representational overlap as a primary driver. Consistent with this, we show that self-distillation prevents representational drift of the held-out facts, suggesting its effectiveness stems from mitigating precisely this interference.

In summary, this work (1) reframes SFT-induced hallucinations as factual forgetting arising from continual learning dynamics, distinct from hallucinations stemming from pre-training knowledge gaps or arising at inference time; (2) provides two complementary mitigations: reducing factual plasticity (e.g., via selective parameter freezing) is beneficial when new fact acquisition is undesirable (e.g., SFT on a private domain or alignment fine-tuning), while self-distillation is beneficial when new fact acquisition is also desired (e.g., domain adaptation with new factual content); both reduce factual forgetting from ~15% to ~3%; and (3) characterizes the mechanism underlying both the forgetting and its mitigation: factual forgetting appears selective, driven by interference among overlapping semantic representations, and self-distillation succeeds because it mitigates this interference.

## 2 Fine-Tuning with Unknown Facts Leads to Factual Forgetting

Supervised fine-tuning (SFT) can inadvertently increase factual hallucinations (Gekhman et al., 2024; Ovadia et al., 2024; Zucchetti et al., 2025), a phenomenon we reinterpret through the lens of continual learning as factual forgetting.
To study this in a controlled manner, we reconstruct the experimental setting of Gekhman et al. (2024), which explicitly disentangles task learning (learning how to perform the task) from factual learning (learning facts).

| Split | SLiCK | Train | Role |
|---|---|---|---|
| D_Known | Highly Known | Yes | Task learning |
| D_Unk | Unknown | Yes | Factual plasticity |
| D_Held | Highly Known | No | Factual stability |

Figure 2: Factual forgetting is caused by new fact acquisition, not fine-tuning itself. The model starts below ceiling as it has not yet adapted to the QA format, then rapidly learns it, achieving high accuracy on known facts. As training continues and unknown facts are acquired, accuracy on held-out facts declines, indicating that forgetting is driven by new factual knowledge, not fine-tuning per se. When unknown facts are excluded, held-out performance remains stable throughout training (Only Known Held-out curve), suggesting that new fact acquisition is a source of interference. Right: summary of data split roles.

### 2.1 Preliminary: SLiCK Method and Factual Learning Setting

To determine the model's preexisting knowledge for each question used in training and evaluation, we apply the SLiCK method (Gekhman et al., 2024). SLiCK categorizes questions into four levels based on the model's predictions under multiple randomized few-shot prompting configurations: *HighlyKnown*, *MaybeKnown*, *WeaklyKnown*, and *Unknown*. A factual relation is classified as *HighlyKnown* if the model consistently produces the correct answer across all configurations and as *Unknown* if it never does, with the intermediate categories capturing varying degrees of consistency. To focus on factual learning and forgetting, we retain only *HighlyKnown* and *Unknown* examples, filtering out *MaybeKnown* and *WeaklyKnown* facts. We denote by D_Known a subset of *HighlyKnown* facts and by D_Unk a subset of *Unknown* facts, and form the training set as D_train = D_Known ∪ D_Unk.
A disjoint subset of *HighlyKnown* facts, denoted D_Held, is reserved for evaluation. Since the data consists of sparse, semantically isolated relational facts (e.g., birthplace, spouse), generalization across examples is unlikely, and accuracy on each split reflects a distinct and isolated aspect of the model's behavior.

Performance on D_Known reflects the model's ability to learn the QA task format and style. These entity-relational facts are already present in the model's pre-trained knowledge, so accuracy gains during fine-tuning reflect task format adaptation rather than new factual learning (task learning). Accuracy on D_Unk measures how well and how quickly the model acquires new factual knowledge (factual plasticity), while performance on D_Held captures the stability of previously acquired factual knowledge and directly quantifies fine-tuning–induced factual hallucinations (factual stability). This setup enables a clean disentanglement between task learning and factual learning, providing a controlled framework for analyzing factual forgetting during SFT (Fig. 2, right).

### 2.2 Methodology and Experimental Setup

#### Data

We use the EntityQuestions dataset (Sciavolino et al., 2021) as our primary benchmark. EntityQuestions consists of QA pairs derived from relational triplets in Wikipedia (e.g., Q: "What is the capital of France?", A: "Paris"), covering a wide range of entity–relation types. We apply the SLiCK classification using 20 evaluation runs per question, each with three randomly sampled few-shot exemplars, and retain only relations for which at least 30% of examples are classified as *HighlyKnown*. From the remaining relations, we sample 8,000 examples to form D_Known and 8,000 examples to form D_Unk, both drawn from the training split. The held-out set D_Held is drawn from the development split and contains only *HighlyKnown* facts from the same relations.
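The split construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the example dictionary keys (`relation`, `split`, `flags`), the function names, and the single `Intermediate` label (standing in for both *MaybeKnown* and *WeaklyKnown*, whose exact definitions are not given here) are all hypothetical. It also assumes the 30% relation filter is computed over all supplied examples, which the text does not specify.

```python
import random
from collections import defaultdict


def slick_category(correct_flags):
    """Collapse per-configuration correctness flags into a coarse label."""
    if not correct_flags:
        raise ValueError("need at least one evaluation run")
    if all(correct_flags):
        return "HighlyKnown"
    if not any(correct_flags):
        return "Unknown"
    # Assumption: one stand-in label for MaybeKnown / WeaklyKnown,
    # which are filtered out of every split anyway.
    return "Intermediate"


def build_splits(examples, n_known, n_unk, min_hk_frac=0.30, seed=0):
    """Return (d_known, d_unk, d_held) from QA examples.

    Each example is a dict with hypothetical keys:
    'relation', 'split' ('train' or 'dev'), 'flags' (list of bools,
    one per few-shot evaluation run).
    """
    labelled = [dict(ex, category=slick_category(ex["flags"])) for ex in examples]

    # Keep relations where at least min_hk_frac of examples are HighlyKnown.
    per_relation = defaultdict(list)
    for ex in labelled:
        per_relation[ex["relation"]].append(ex["category"])
    kept = {
        rel for rel, cats in per_relation.items()
        if cats.count("HighlyKnown") / len(cats) >= min_hk_frac
    }

    def pool(split, category):
        return [ex for ex in labelled
                if ex["relation"] in kept
                and ex["split"] == split
                and ex["category"] == category]

    rng = random.Random(seed)
    known_pool = pool("train", "HighlyKnown")
    unk_pool = pool("train", "Unknown")
    # Cap the sample size so that small toy inputs do not raise.
    d_known = rng.sample(known_pool, min(n_known, len(known_pool)))
    d_unk = rng.sample(unk_pool, min(n_unk, len(unk_pool)))
    d_held = pool("dev", "HighlyKnown")  # evaluation only, never trained on
    return d_known, d_unk, d_held
```

With the settings reported above, one would call `build_splits(examples, n_known=8000, n_unk=8000)` on examples carrying 20 correctness flags each; the training set is then the concatenation of `d_known` and `d_unk`.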
#### Models

We conduct experiments with several non-reasoning LLMs: Qwen2.5 (1.5B and 8B parameters) (Yang et al., 2024) and LLaMA3.1 (8B) (Grattafiori et al., 2024).
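To make concrete what fine-tuning such models with the self-distillation regularizer from the Introduction might look like, here is a minimal sketch of one common form of the objective for a single next-token prediction. It combines cross-entropy on the SFT target with a KL penalty that keeps the fine-tuned (student) distribution close to the pre-fine-tuning (teacher) distribution. The weight `lam`, the KL direction, and the function names are assumptions for illustration; the exact formulation used in the paper is not given in this section.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def self_distillation_loss(student_logits, teacher_logits, target_index, lam=1.0):
    """Return (total, cross_entropy, kl) for one token position.

    total = CE(student, target) + lam * KL(teacher || student).
    `teacher_logits` come from a frozen copy of the model taken before
    fine-tuning; `lam` is a hypothetical weighting hyperparameter.
    """
    if len(student_logits) != len(teacher_logits):
        raise ValueError("student and teacher must share a vocabulary")
    log_p_student = log_softmax(student_logits)
    log_p_teacher = log_softmax(teacher_logits)

    # Standard SFT term: negative log-likelihood of the target token.
    ce = -log_p_student[target_index]

    # Drift penalty: KL divergence from teacher to student distribution.
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    return ce + lam * kl, ce, kl
```

When the student has not moved from the teacher, the KL term is zero and the loss reduces to ordinary SFT cross-entropy; the penalty grows as fine-tuning shifts the output distribution, which is the drift the regularizer is meant to limit.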