FADE: Mitigating Hallucinations by Reducing Language-Prior Dominance in Large Vision-Language Models

arXiv cs.AI 06/30/26, 04:00 AM Papers
hallucination vision-language-models decoding attention ffn training-free large-vision-language-models
Summary
This paper proposes FADE, a training-free method that mitigates hallucinations in Large Vision-Language Models by attenuating FFN outputs at critical layers to reduce language-prior dominance, demonstrating effectiveness across multiple benchmarks.
arXiv:2606.29431v1 Announce Type: new Abstract: Despite the impressive capabilities of Large Vision-Language Models (LVLMs), they remain susceptible to hallucination, generating content inconsistent with the input image. Recent studies attribute this to the dominance of language priors over visual inputs and employ contrastive decoding methods to mitigate this dominance, but the mechanistic origin remains unexplored. We investigate the information flow through each transformer layer and find that attention modules consistently aggregate visual evidence, while FFN modules at critical layers act as the source of language priors. These priors can override visual evidence, causing correct predictions in intermediate layers to drift toward incorrect outputs. Based on this insight, we propose FADE (FFN Attenuation for DEcoding), a training-free method that attenuates FFN outputs to reduce language-prior dominance. Evaluations on POPE, CHAIR, and MME benchmarks across LLaVA-1.5, mPLUG-Owl2, and InstructBLIP show that FADE effectively mitigates hallucinations while preserving inference efficiency.
Original Article
View Cached Full Text
Cached at: 06/30/26, 05:33 AM
# FADE: Mitigating Hallucinations by Reducing Language-Prior Dominance in Large Vision-Language Models
Source: [https://arxiv.org/html/2606.29431](https://arxiv.org/html/2606.29431)
Yichen Guo1,2,\*Kai Tang1,2,\*Fenglai Lin1Yiding Sun2 Dongshuo Zhang3Wenya Wang1Lin William Cong1Shanghang Zhang2,† 1Nanyang Technological University 2State Key Laboratory of Multimedia Information Processing School of Computer Science, Peking University 3Tsinghua University \*Equal contribution\.†Corresponding author:[shanghang@pku\.edu\.cn](https://arxiv.org/html/2606.29431v1/mailto:[email protected])

###### Abstract

Despite the impressive capabilities of Large Vision\-Language Models \(LVLMs\), they remain susceptible to hallucination, generating content inconsistent with the input image\. Recent studies attribute this to the dominance of language priors over visual inputs and employ contrastive decoding methods to mitigate this dominance, but the mechanistic origin remains unexplored\. We investigate the information flow through each transformer layer and find that attention modules consistently aggregate visual evidence, while FFN modules at critical layers act as the source of language priors\. These priors can override visual evidence, causing correct predictions in intermediate layers to drift toward incorrect outputs\. Based on this insight, we proposeFADE\(FFNAttenuation forDEcoding\), a training\-free method that attenuates FFN outputs to reduce language\-prior dominance\. Evaluations on POPE, CHAIR, and MME benchmarks across LLaVA\-1\.5, mPLUG\-Owl2, and InstructBLIP show that FADE effectively mitigates hallucinations while preserving inference efficiency\.

FADE: Mitigating Hallucinations by Reducing Language\-Prior Dominance in Large Vision\-Language Models

Yichen Guo1,2,\*Kai Tang1,2,\*Fenglai Lin1Yiding Sun2Dongshuo Zhang3Wenya Wang1Lin William Cong1Shanghang Zhang2,†1Nanyang Technological University2State Key Laboratory of Multimedia Information ProcessingSchool of Computer Science, Peking University3Tsinghua University\*Equal contribution\.†Corresponding author:[shanghang@pku\.edu\.cn](https://arxiv.org/html/2606.29431v1/mailto:[email protected])

## 1Introduction

![Refer to caption](https://arxiv.org/html/2606.29431v1/x1.png)Figure 1:Analyzing information flow through transformer layers\. Attention consistently aggregates visual evidence, while FFN at critical layers \(16–22\) introduces language priors that can override visual evidence\.Large Vision\-Language Models \(LVLMs\) have achieved remarkable progress in recent years, bridging the gap between vision and language through effective multimodal alignmentRadfordet al\.\([2021](https://arxiv.org/html/2606.29431#bib.bib41)\); Liet al\.\([2022](https://arxiv.org/html/2606.29431#bib.bib26),[2023a](https://arxiv.org/html/2606.29431#bib.bib25)\); Liuet al\.\([2023](https://arxiv.org/html/2606.29431#bib.bib34)\); Chenet al\.\([2024b](https://arxiv.org/html/2606.29431#bib.bib7)\); Baiet al\.\([2023](https://arxiv.org/html/2606.29431#bib.bib64)\); Wanget al\.\([2024b](https://arxiv.org/html/2606.29431#bib.bib50)\)\. These models have achieved significant success across diverse applications including visual question answering \(VQA\), image captioning, and multimodal reasoning over structured visual contentZhanget al\.\([2026b](https://arxiv.org/html/2606.29431#bib.bib60)\)\. However, a persistent challenge remains: LVLMs often generate text that is not consistent with the visual content of the input image, known as hallucinationLiet al\.\([2023b](https://arxiv.org/html/2606.29431#bib.bib29)\); Rohrbachet al\.\([2018](https://arxiv.org/html/2606.29431#bib.bib42)\); Liuet al\.\([2024b](https://arxiv.org/html/2606.29431#bib.bib18)\); Baiet al\.\([2025](https://arxiv.org/html/2606.29431#bib.bib17)\)\. This phenomenon can cause serious risks in critical applications, including medical diagnosis, autonomous drivingCuiet al\.\([2024](https://arxiv.org/html/2606.29431#bib.bib9)\), and embodied agentsDriesset al\.\([2023](https://arxiv.org/html/2606.29431#bib.bib11)\), where precision and reliability are essential\.

![Refer to caption](https://arxiv.org/html/2606.29431v1/figures/introduction.png)Figure 2:Overview of our approach\.Left:LVLMs suffer from hallucinations where language priors override visual evidence, causing prediction drift from correct to incorrect outputs\.Middle:Our mechanistic analysis reveals that attention modules aggregate visual evidence toward correct answers, while FFN modules at critical layers introduce language priors that can override visual evidence\.Right:FADE attenuates FFN outputs at critical layers to suppress language priors while preserving visual evidence, enabling training\-free hallucination mitigation\.Recent research on mitigating hallucinations can be divided into two categories\. Training\-based approaches employ instruction tuningLiuet al\.\([2024c](https://arxiv.org/html/2606.29431#bib.bib32)\), RLHFSunet al\.\([2024](https://arxiv.org/html/2606.29431#bib.bib43)\)or DPOZhaoet al\.\([2023](https://arxiv.org/html/2606.29431#bib.bib62)\)to reduce hallucinations at the source, but they require expensive data collection and retraining\. Training\-free methods intervene during inference without modifying model parameters\. Attention modification approachesLiuet al\.\([2024e](https://arxiv.org/html/2606.29431#bib.bib36)\); Huanget al\.\([2024](https://arxiv.org/html/2606.29431#bib.bib19)\)amplify visual token weights to enhance visual grounding\. Layer\-wise intervention methodsChuanget al\.\([2024](https://arxiv.org/html/2606.29431#bib.bib8)\); Wanget al\.\([2025](https://arxiv.org/html/2606.29431#bib.bib49)\)exploit cross\-layer differences to improve output quality\. Contrastive decoding methodsLenget al\.\([2024](https://arxiv.org/html/2606.29431#bib.bib23)\); Manevich and Tsarfaty \([2024](https://arxiv.org/html/2606.29431#bib.bib37)\)attribute hallucination to the dominance of language priors over visual inputs and attempt to suppress this dominance by contrasting output distributions\. However, these methods operate at the output level without understanding where language priors originate within the model\. Understanding this origin is crucial for developing more targeted and efficient solutions\.

In this work, we investigate the mechanistic origin of language\-prior dominance\. We decompose transformer computations into attention and FFN contributions using the residual stream perspectiveElhageet al\.\([2021](https://arxiv.org/html/2606.29431#bib.bib13)\), and measure their effects on predictions through logit lens projectionsGevaet al\.\([2022](https://arxiv.org/html/2606.29431#bib.bib15)\); Belroseet al\.\([2023](https://arxiv.org/html/2606.29431#bib.bib4)\)\. As illustrated in Figure[1](https://arxiv.org/html/2606.29431#S1.F1), our analysis reveals two key findings: \(1\)*Attention Aggregates Visual Evidence\.*Attention mechanisms consistently aggregate visual features to generate correct predictions\. \(2\)*FFN Introduces Language Priors\.*FFN modules at critical layers act as the source of language priors that can override visual evidence, causing hallucinations\.

Based on this insight, we proposeFADE\(FFNAttenuation forDEcoding\), a training\-free method that attenuates FFN outputs at critical layers to reduce language\-prior dominance \(Figure[2](https://arxiv.org/html/2606.29431#S1.F2)\)\. By weakening FFN contributions, FADE preserves the visual evidence while suppressing the language priors that cause hallucination\. Unlike contrastive decoding, FADE operates in a single forward pass with minimal overhead\.

Our contributions can be summarized as follows:

- •We conduct a mechanistic analysis revealing the origin of language prior dominance in LVLMs: attention aggregates visual evidence, while FFN at critical layers acts as the source of language priors that can override it\.
- •We propose FADE, a training\-free method that attenuates FFN outputs at critical layers to reduce language\-prior dominance while preserving visual evidence\.
- •Extensive experiments across diverse architectures \(LLaVA\-1\.5\-7B/13B, mPLUG\-Owl2, InstructBLIP, InternVL3\-8B, Qwen2\.5/3\-VL\) and six benchmarks \(POPE, CHAIR, MME, MMHal\-Bench, HalBench, MMBench\) demonstrate that FADE effectively mitigates hallucinations while maintaining inference efficiency and general capabilities\.

## 2Related Work

### 2\.1Large Vision\-Language Models

Large Vision\-Language Models \(LVLMs\) have evolved from early BERT\-based decodersChenet al\.\([2020](https://arxiv.org/html/2606.29431#bib.bib5)\); Liet al\.\([2020](https://arxiv.org/html/2606.29431#bib.bib28)\); Zhanget al\.\([2021](https://arxiv.org/html/2606.29431#bib.bib57)\); Liet al\.\([2021](https://arxiv.org/html/2606.29431#bib.bib27)\); Wanget al\.\([2021](https://arxiv.org/html/2606.29431#bib.bib52)\); Liet al\.\([2022](https://arxiv.org/html/2606.29431#bib.bib26)\)designed to integrate visual and textual information into a paradigm driven by large language models \(LLMs\)Touvronet al\.\([2023a](https://arxiv.org/html/2606.29431#bib.bib46),[b](https://arxiv.org/html/2606.29431#bib.bib47)\); Jianget al\.\([2023](https://arxiv.org/html/2606.29431#bib.bib21)\); Grattafioriet al\.\([2024](https://arxiv.org/html/2606.29431#bib.bib12)\); Wanget al\.\([2024b](https://arxiv.org/html/2606.29431#bib.bib50)\)\. The emergence of LLMs has sustainably enhanced the capabilities and performance of LVLMs\. In this process, supported by end\-to\-end training techniquesAlayracet al\.\([2022](https://arxiv.org/html/2606.29431#bib.bib1)\); Daiet al\.\([2023](https://arxiv.org/html/2606.29431#bib.bib10)\), LVLMs have achieved unified decoding of visual and textual tokens, indicating that both their expressiveness and adaptability have significantly improved\. Recent works, such as LLaVALiuet al\.\([2023](https://arxiv.org/html/2606.29431#bib.bib34),[2024c](https://arxiv.org/html/2606.29431#bib.bib32),[2024d](https://arxiv.org/html/2606.29431#bib.bib33)\)and InstructBLIPDaiet al\.\([2023](https://arxiv.org/html/2606.29431#bib.bib10)\), have further refined these models through visual instruction tuning, enhancing their performance in various vision\-language tasks\. More recently, models such as the Qwen\-VL seriesBaiet al\.\([2023](https://arxiv.org/html/2606.29431#bib.bib64)\); Wanget al\.\([2024b](https://arxiv.org/html/2606.29431#bib.bib50)\)and InternVL seriesChenet al\.\([2024b](https://arxiv.org/html/2606.29431#bib.bib7),[a](https://arxiv.org/html/2606.29431#bib.bib6)\)have further scaled up through improved alignment strategies and large\-scale joint training\.

### 2\.2Hallucination Mitigation in LVLMs

Hallucination in LVLMs refers to generating content that is linguistically plausible but inconsistent with visual inputRohrbachet al\.\([2018](https://arxiv.org/html/2606.29431#bib.bib42)\); Liet al\.\([2023b](https://arxiv.org/html/2606.29431#bib.bib29)\)\. Training\-based approaches mitigate it via additional fine\-tuning—robust instruction tuningLiuet al\.\([2024a](https://arxiv.org/html/2606.29431#bib.bib31)\), post\-hoc revisionZhouet al\.\([2024](https://arxiv.org/html/2606.29431#bib.bib63)\), RLHFYuet al\.\([2024](https://arxiv.org/html/2606.29431#bib.bib55)\), or DPOZhaoet al\.\([2023](https://arxiv.org/html/2606.29431#bib.bib62)\); Wanget al\.\([2024a](https://arxiv.org/html/2606.29431#bib.bib48)\)—but incur substantial training costs\.

Training\-free methods instead operate during inference\.*Attention\-based methods*re\-weight attention to strengthen visual grounding \(PAILiuet al\.\([2024e](https://arxiv.org/html/2606.29431#bib.bib36)\), OPERAHuanget al\.\([2024](https://arxiv.org/html/2606.29431#bib.bib19)\), AGLA/All\-PathAnet al\.\([2025](https://arxiv.org/html/2606.29431#bib.bib2)\); Qianet al\.\([2026](https://arxiv.org/html/2606.29431#bib.bib40)\)\)\.*Contrastive decoding*suppresses hallucinated content by contrasting distributions from original and perturbed inputsLenget al\.\([2024](https://arxiv.org/html/2606.29431#bib.bib23)\); Manevich and Tsarfaty \([2024](https://arxiv.org/html/2606.29431#bib.bib37)\), instructionsWanget al\.\([2024c](https://arxiv.org/html/2606.29431#bib.bib51)\), or self\-generated descriptionsKimet al\.\([2024](https://arxiv.org/html/2606.29431#bib.bib22)\)\.*Layer\-wise intervention*exploits the transformer hierarchy: DAMOWanget al\.\([2025](https://arxiv.org/html/2606.29431#bib.bib49)\)accumulates activation momentum, while others contrast logits across layersChuanget al\.\([2024](https://arxiv.org/html/2606.29431#bib.bib8)\)or enforce inter\-layer consistencyHuoet al\.\([2025](https://arxiv.org/html/2606.29431#bib.bib20)\); Liet al\.\([2025a](https://arxiv.org/html/2606.29431#bib.bib24)\); Tanget al\.\([2025](https://arxiv.org/html/2606.29431#bib.bib45)\)\.*Representation engineering*manipulates hidden states via pre\-computed steering vectors \(VISTALiet al\.\([2025b](https://arxiv.org/html/2606.29431#bib.bib30)\), VTILiuet al\.\([2025](https://arxiv.org/html/2606.29431#bib.bib35)\), FlexACLyuet al\.\([2026](https://arxiv.org/html/2606.29431#bib.bib56)\)\)\.

Concurrent work on layer\-wise transformer dynamics includesNeoet al\.\([2025](https://arxiv.org/html/2606.29431#bib.bib39)\), which analyzes visual\-token processing via attention knockouts but proposes no hallucination mitigation, and ReDeEPSunet al\.\([2025](https://arxiv.org/html/2606.29431#bib.bib44)\), which targets retrieval\-augmented generation and requires*dual*intervention because attention fails to retain external context there\. In contrast, our contrastive analysis on vision\-language hallucination shows that attention remains reliable across correct and hallucinated samples, while FFN at critical layers is the divergence point—motivatingFADE\(FFNAttenuation forDEcoding\), a training\-free single\-component intervention that attenuates FFN outputs at those layers to reduce language\-prior dominance\.

## 3Method

### 3\.1Preliminaries

A transformer\-based LVLM processes inputs throughLLdecoder layers\. Each layerllapplies attention and FFN with residual connections:

𝐡~\(l\)\\displaystyle\\tilde\{\\mathbf\{h\}\}^\{\(l\)\}=𝐡\(l\)\+Attn\(l\)\(𝐡\(l\)\)\\displaystyle=\\mathbf\{h\}^\{\(l\)\}\+\\mathrm\{Attn\}^\{\(l\)\}\(\\mathbf\{h\}^\{\(l\)\}\)\(1\)𝐡\(l\+1\)\\displaystyle\\mathbf\{h\}^\{\(l\+1\)\}=𝐡~\(l\)\+FFN\(l\)\(𝐡~\(l\)\)\\displaystyle=\\tilde\{\\mathbf\{h\}\}^\{\(l\)\}\+\\mathrm\{FFN\}^\{\(l\)\}\(\\tilde\{\\mathbf\{h\}\}^\{\(l\)\}\)\(2\)From the residual stream perspectiveElhageet al\.\([2021](https://arxiv.org/html/2606.29431#bib.bib13)\), attention aggregates information across positions while FFN performs per\-position transformations\. Prior work shows FFN layers function as key\-value memories storing factual knowledgeGevaet al\.\([2021](https://arxiv.org/html/2606.29431#bib.bib16)\); Menget al\.\([2022](https://arxiv.org/html/2606.29431#bib.bib38)\)\.

### 3\.2Motivation: Prediction Drift in LVLMs

We begin by examining how predictions evolve across layers in LVLMs\. Using logit lens projections on LLaVA\-1\.5\-7B, we track the probability of correct answer tokens at each layer for samples from POPE\-Adversarial\.

Figure[3](https://arxiv.org/html/2606.29431#S3.F3)reveals a striking pattern: for hallucinated samples, predictions drift from high to low P\(Correct Answer\) in later layers, while correct samples maintain stable high probability throughout\. This observation raises a critical question:*what causes this prediction drift?*We address this through mechanistic analysis in the following sections\.

![Refer to caption](https://arxiv.org/html/2606.29431v1/x2.png)Figure 3:P\(Correct Answer\) trajectories across layers for hallucinated \(red, dashed\) and correct \(green, solid\) samples\. Correct samples maintain high probability throughout, while hallucinated samples drift to low probability in later layers\. The shaded region indicates critical layers \(16–22\)\.
### 3\.3Mechanistic Analysis

To understand what causes the prediction drift observed in Figure[3](https://arxiv.org/html/2606.29431#S3.F3), we decompose the contributions of attention and FFN modules at each layer\. We analyze LLaVA\-1\.5\-7B on 50 samples from POPE\-Adversarial\.

Contribution Analysis\.To measure each component’s contribution, we use a differential logit lens approach\. For attention at layerll:

ΔAttn\(l\)\(t\)=LMhead\(𝐡~\(l\)\)t−LMhead\(𝐡\(l\)\)t\\Delta\_\{\\mathrm\{Attn\}\}^\{\(l\)\}\(t\)=\\mathrm\{LM\}\_\{\\mathrm\{head\}\}\(\\tilde\{\\mathbf\{h\}\}^\{\(l\)\}\)\_\{t\}\-\\mathrm\{LM\}\_\{\\mathrm\{head\}\}\(\\mathbf\{h\}^\{\(l\)\}\)\_\{t\}\(3\)wherettis the target token\. We computeΔFFN\(l\)\\Delta\_\{\\mathrm\{FFN\}\}^\{\(l\)\}analogously\. This differential approach accounts for the nonlinearity of layer normalization\.

Correct\-Direction Metric\.To enable comparison across samples with different ground truths, we define a*correct\-direction*metric:

C\(l\)=Δ\(l\)\(tcorrect\)−Δ\(l\)\(tincorrect\)C^\{\(l\)\}=\\Delta^\{\(l\)\}\(t\_\{\\mathrm\{correct\}\}\)\-\\Delta^\{\(l\)\}\(t\_\{\\mathrm\{incorrect\}\}\)\(4\)Under this metric,C\(l\)\>0C^\{\(l\)\}\>0indicates the component pushes toward the correct answer, whileC\(l\)<0C^\{\(l\)\}<0indicates it pushes toward the wrong answer\.

Table 1:Mean contributions toward correct answer \(correct\-direction metric\)\. Values are summed across layers and averaged across samples\. Positive values indicate pushing toward ground truth\. FFN at layers 16–22 shows the largest difference between correct and wrong predictions\.OBS\-1: Attention Aggregates Visual Evidence\.Attention contributions are positive and comparable for both correct \(\+1\.2\+1\.2\) and hallucinated \(\+0\.8\+0\.8\) samples \(Table[1](https://arxiv.org/html/2606.29431#S3.T1)\)\. This indicates that attention consistently aggregates visual features toward correct predictions across all samples\.

OBS\-2: FFN Introduces Language Priors\.In contrast, FFN at layers 16–22 shows a striking difference:\+8\.4\+8\.4for correct predictions and−3\.5\-3\.5for wrong predictions\. For correct samples, FFN reinforces the prediction; for hallucinated samples, FFN actively pushes toward the wrong answer\. We identify FFN as the source of language priors—when these priors conflict with visual evidence, they can override attention’s correct predictions\.

This directly explains the drift in Figure[3](https://arxiv.org/html/2606.29431#S3.F3): attention establishes correct predictions in intermediate layers, but language priors from FFN at layers 16–22 override the visual evidence, causing the prediction to drift toward incorrect outputs\.

### 3\.4FADE: FFN Attenuation for Decoding

Based on our analysis, we proposeFADE, which attenuates FFN outputs at critical layers to reduce language\-prior dominance:

𝐡\(l\+1\)=𝐡~\(l\)\+\(1−α\)⋅FFN\(l\)\(𝐡~\(l\)\)\\mathbf\{h\}^\{\(l\+1\)\}=\\tilde\{\\mathbf\{h\}\}^\{\(l\)\}\+\(1\-\\alpha\)\\cdot\\mathrm\{FFN\}^\{\(l\)\}\(\\tilde\{\\mathbf\{h\}\}^\{\(l\)\}\)\(5\)where𝐡~\(l\)\\tilde\{\\mathbf\{h\}\}^\{\(l\)\}is the post\-attention hidden state andα∈\[0,1\]\\alpha\\in\[0,1\]is the attenuation strength\. The method is training\-free, requires no additional parameters, and introduces negligible overhead\. By reducing FFN contributions at selected critical layers, FADE suppresses language priors while preserving visual evidence aggregated by attention\. We identify layers 16–22 as the critical band on LLaVA\-1\.5\-7B and select task\-specific intervention layers within this band; for other architectures, we transfer the band to proportionally equivalent mid\-to\-late layers, with full per\-model configurations reported in Appendix[D](https://arxiv.org/html/2606.29431#A4)\.

## 4Experiments

Table 2:POPE benchmark results across three VLMs\. We evaluate across three sampling strategies \(Random, Popular, Adversarial\) and three datasets \(MSCOCO, A\-OKVQA, GQA\)\. Best results are inbold, second best areunderlined\.### 4\.1Experimental Setup

##### Models\.

We organize the evaluation into a core cross\-architecture suite and an extended robustness suite\. The core suite contains three 7B\-scale LVLMs spanning distinct vision\-language interfaces:LLaVA\-1\.5\-7BLiuet al\.\([2024c](https://arxiv.org/html/2606.29431#bib.bib32)\), which uses a two\-stage training with visual instruction tuning;mPLUG\-Owl2\-7BYeet al\.\([2024](https://arxiv.org/html/2606.29431#bib.bib54)\), which employs modality\-adaptive modules for vision\-language alignment; andInstructBLIP\-7BDaiet al\.\([2023](https://arxiv.org/html/2606.29431#bib.bib10)\), which introduces instruction\-aware visual feature extraction via Q\-Former\. Beyond these core models, we evaluate whether the same FFN\-level intervention transfers to stronger and larger systems:LLaVA\-v1\.5\-13Bin Appendix[C](https://arxiv.org/html/2606.29431#A3),InternVL3\-8BChenet al\.\([2024b](https://arxiv.org/html/2606.29431#bib.bib7)\), and theQwen\-VLfamily, includingQwen2\.5\-VL\-7B\-InstructWanget al\.\([2024b](https://arxiv.org/html/2606.29431#bib.bib50)\)andQwen3\-VL\-8B\-InstructYanget al\.\([2025](https://arxiv.org/html/2606.29431#bib.bib53)\), in Section[4\.2\.5](https://arxiv.org/html/2606.29431#S4.SS2.SSS5)\. Together, this suite spans projector\-based visual instruction tuning, modality\-adaptive fusion, Q\-Former querying, InternVL\-style large\-scale alignment, and recent Qwen\-VL models, enabling evaluation across both architecture and scale\.

##### Benchmarks\.

We adopt three widely\-used benchmarks:POPELiet al\.\([2023b](https://arxiv.org/html/2606.29431#bib.bib29)\)probes object hallucination via binary \(Yes/No\) questions across three sampling strategies \(Random, Popular, Adversarial\) on MSCOCO, A\-OKVQA, and GQA;CHAIRRohrbachet al\.\([2018](https://arxiv.org/html/2606.29431#bib.bib42)\)measures hallucination in image captioning, where CHAIRSand CHAIRIdenote sentence\-level and instance\-level hallucination rates \(lower is better\) and Recall measures coverage \(higher is better\);MMEFuet al\.\([2025](https://arxiv.org/html/2606.29431#bib.bib14)\)evaluates perception and cognition across 14 subtasks, and we report perception scores across its ten perception subtasks\.

##### Baselines\.

We compare against representative training\-free methods from each category:PAILiuet al\.\([2024e](https://arxiv.org/html/2606.29431#bib.bib36)\)amplifies attention on image tokens;VCDLenget al\.\([2024](https://arxiv.org/html/2606.29431#bib.bib23)\)contrasts outputs from original and distorted images;DAMOWanget al\.\([2025](https://arxiv.org/html/2606.29431#bib.bib49)\)applies momentum\-based activation stabilization;VISTALiet al\.\([2025b](https://arxiv.org/html/2606.29431#bib.bib30)\)steers representations using pre\-computed visual vectors; andDCLATanget al\.\([2025](https://arxiv.org/html/2606.29431#bib.bib45)\)enforces inter\-layer consistency via layer aggregation\. All baselines use official implementations with recommended hyperparameters\. VISTA relies on a model\-specific visual steering vector computation that is only officially supported on its three released architectures; we therefore evaluate it on LLaVA\-1\.5, mPLUG\-Owl2, and InstructBLIP, and substitute DCLA on the advanced models \(InternVL3\-8B, Qwen2\.5/3\-VL\) as a representative baseline from the layer\-aggregation family\.

##### Implementation\.

All experiments use greedy decoding on 8 NVIDIA H100 80GB GPUs\. FADE attenuates FFN outputs at task\-specific critical layers selected from the mid\-to\-late critical\-layer band \(Section[3\.4](https://arxiv.org/html/2606.29431#S3.SS4)\)\. On LLaVA\-1\.5\-7B, we useα=0\.6\\alpha=0\.6at layer 18 based on the ablation in Section[4\.5](https://arxiv.org/html/2606.29431#S4.SS5); per\-model layer indices for mPLUG\-Owl2, InstructBLIP, InternVL3\-8B, and the Qwen\-VL series are obtained by mapping the LLaVA critical band to the proportionally\-equivalent mid\-to\-late layers, with all hyperparameters and baseline configurations detailed in Appendix[A](https://arxiv.org/html/2606.29431#A1)and[D](https://arxiv.org/html/2606.29431#A4)\.

Table 3:CHAIR benchmark results across three VLMs\. CS/CI: sentence/instance\-level hallucination rates \(lower is better\)\. Rec: recall \(higher is better\)\. Best results are inbold, second best areunderlined\.Table 4:MME perception scores across 10 subtasks on LLaVA\-1\.5\-7B and mPLUG\-Owl2\-7B\. Higher is better\.Bold: best per model\.Underline: second best\.

### 4\.2Main Results

#### 4\.2\.1Results on POPE

Table[2](https://arxiv.org/html/2606.29431#S4.T2)presents results under random, popular, and adversarial settings\. FADE is strongest or tied under the challenging GQA adversarial setting on LLaVA\-1\.5 and mPLUG\-Owl2, achieving 82\.5% and 79\.0% F1, respectively\. On LLaVA\-1\.5, FADE surpasses VCD by 5\.0% F1 and DAMO by 1\.7% F1 on the GQA adversarial subset\. Notably, VCD shows limited generalization on LLaVA\-1\.5 GQA \(72\.0% accuracy under adversarial\), likely because its contrastive decoding with noisy images disrupts fine\-grained spatial reasoning required for GQA’s scene graph questions\. DAMO and VISTA improve over greedy decoding but exhibit inconsistent behavior—DAMO gains on A\-OKVQA but plateaus on GQA, while VISTA shows marginal improvements that do not consistently exceed the baseline across all settings\. Results on LLaVA\-v1\.5\-13B, reported in Appendix[C](https://arxiv.org/html/2606.29431#A3), show that FADE maintains its lead over all baselines at larger scales, while VCD, DAMO, and VISTA all degrade below greedy decoding\.

#### 4\.2\.2Results on CHAIR

Table[3](https://arxiv.org/html/2606.29431#S4.T3)reports image captioning results\. Among existing methods, we observe a clear accuracy\-coverage trade\-off: VISTA achieves the lowest CHAIRS\(19\.2%\) on LLaVA\-1\.5 but sacrifices Recall significantly \(62\.6% vs\. 80\.6% for greedy\), indicating over\-aggressive suppression of generation\. Conversely, VCD and DAMO increase hallucination rates on most models—VCD raises CHAIRSfrom 49\.8% to 58\.6% on LLaVA\-1\.5, suggesting that their uniform intervention strategies disrupt fluent generation\.

Relative to greedy decoding, FADE reduces both CHAIRSand CHAIRIacross all three models\. On mPLUG\-Owl2, FADE achieves the lowest CHAIRS\(55\.0%\) among all methods\. On InstructBLIP, FADE substantially reduces instance\-level hallucination to CHAIRI=14\.0%, compared to 37\.9% for PAI and 38\.5% for greedy, while maintaining competitive Recall \(72\.9%\)\. All methods are evaluated under identical decoding configurations and on the same caption pool, ensuring a fair comparison \(full settings in Appendix[A](https://arxiv.org/html/2606.29431#A1)\)\. This pronounced effect suggests that FFN attenuation is particularly well\-matched to Q\-Former\-based visual encoders, where pooled visual queries tend to leave more residual capacity for FFN\-stored priors to dominate\.

#### 4\.2\.3Results on MME

Table[4](https://arxiv.org/html/2606.29431#S4.T4)reports MME perception scores across 10 subtasks\. On LLaVA\-1\.5, FADE achieves 1519\.0 total perception score, improving over greedy decoding \(1505\.7\) by \+13\.3 points and outperforming all baselines including PAI \(1508\.9\)\. The improvement is particularly notable on counting \(\+5\.0 over greedy\) and celebrity recognition \(\+1\.7\), subtasks that require precise object grounding\. Interestingly, different methods show architecture\-dependent behavior: PAI improves LLaVA\-1\.5 \(\+3\.2\) but slightly degrades mPLUG\-Owl2 \(−\-15\.6\), while DAMO gains substantially on mPLUG\-Owl2’s counting subtask \(\+10\.0\) but loses on LLaVA\-1\.5 \(−\-6\.7\)\. This architecture sensitivity suggests that attention\-based and contrastive methods may interact differently with each model’s vision\-language alignment mechanism\. FADE’s FFN\-level intervention provides a more architecture\-agnostic approach by targeting the representation drift phenomenon that is common across transformer\-based LVLMs\.

#### 4\.2\.4Results on MMHal\-Bench

We further evaluate on MMHal\-Bench, where GPT\-4 judges open\-ended responses across eight categories, testing whether mitigation methods generalize beyond binary Yes/No questions to free\-form generationSunet al\.\([2024](https://arxiv.org/html/2606.29431#bib.bib43)\)\. Table[5](https://arxiv.org/html/2606.29431#S4.T5)shows that FADE achieves the highest overall GPT\-4 judged score \(2\.09 vs\. 2\.05 for greedy, 1\.83 for PAI, 1\.92 for VCD\), indicating that FFN attenuation preserves—rather than degrades—generation quality in open\-ended settings\.

Table 5:MMHal\-Bench results on LLaVA\-1\.5\-7B\. Scores range 0\-4 \(higher is better\)\. GPT\-4 evaluates both hallucination rate and informativeness\.
#### 4\.2\.5Generalization to Advanced Architectures

To evaluate architecture\-agnostic robustness, we extend FADE to next\-generation models:InternVL3\-8B\(Table[6](https://arxiv.org/html/2606.29431#S4.T6)\) and theQwen\-VL series\(Qwen2\.5\-VL\-7B, Table[7](https://arxiv.org/html/2606.29431#S4.T7); Qwen3\-VL\-8B, Table[8](https://arxiv.org/html/2606.29431#S4.T8)\)\. On InternVL3\-8B, FADE attains the top MMBenchEN\{\}^\{\\text\{EN\}\}score \(69\.24%, \+3\.1% over greedy\) and the highest MME Perception \(1734\.6\), while remaining competitive on the adversarial POPE split \(88\.2 F1, matching PAI\); notably, on the open\-ended CHAIR task all training\-free interventions fail to beat greedy \(29\.2\), suggesting that stronger models possess highly optimized internal language priors that aggressive modifications can easily disrupt\. On Qwen2\.5\-VL FADE matches the highest MME \(1694\.1\) and ties greedy for the second\-best CHAIRS\(36\.6\), with POPE\-Adv \(86\.8\) marginally improving over greedy \(86\.7\); on Qwen3\-VL it preserves the peak MMBench score \(86\.5\), reaches the second\-best POPE\-Adv \(88\.2, behind DAMO’s 88\.4\), and reduces CHAIRSfrom 57\.4 to 55\.8\. Across all three architectures, FADE delivers the most balanced trade\-off, whereas alternatives such as DAMO on Qwen3\-VL push HalBench to 58\.3 but spike CHAIRSto 61\.0, and VCD on Qwen3\-VL reaches the lowest CHAIRS\(26\.6\) at the cost of POPE\-Adv \(87\.4, the lowest among all methods\)—supporting FADE as a stable, architecture\-agnostic intervention\. A sensitivity sweep over the attenuation strengthα\\alphaon Qwen3\-VL\-8B is reported in Appendix[C\.2](https://arxiv.org/html/2606.29431#A3.SS2), showing that the gains are stable acrossα∈\[0\.3,0\.8\]\\alpha\\in\[0\.3,0\.8\]and not the product of cherry\-picked tuning\.

Table 6:Performance on InternVL3\-8B\. POPE is reported as Random/Popular/Adversarial F1 and their Average\. Bestbold, second bestunderlined\.Table 7:Performance on Qwen2\.5\-VL\-7B\-Instruct\. POPE is reported as Random/Popular/Adversarial F1 and their Average\. Bestbold, second bestunderlined\.Table 8:Performance on Qwen3\-VL\-8B\-Instruct\. POPE is reported as Random/Popular/Adversarial F1 and their Average\. Bestbold, second bestunderlined\.

### 4\.3Efficiency Study

We analyze FADE’s computational efficiency compared to existing methods\. Table[9](https://arxiv.org/html/2606.29431#S4.T9)compares inference efficiency\. FADE adds only 3% latency overhead compared to greedy decoding \(122ms vs 118ms\), while achieving substantial speedups over all comparison methods: 19% faster than DAMO, 34% faster than PAI, 57% faster than VCD, and 73% faster than VISTA\. VCD requires a second forward pass with distorted images, resulting in 2\.4×\\timestotal latency\. VISTA incurs the highest overhead \(3\.9×\\times\) due to steering vector computation during inference\. FADE’s efficiency stems from: \(1\) FFN attenuation requiring only element\-wise scaling at a single layer, not additional forward passes; and \(2\) no memory overhead \(14\.5 GB, identical to greedy decoding\)\. This single\-pass design is complementary to recent efforts on efficient multimodal and LLM reasoning that compress chain\-of\-thought tracesZhanget al\.\([2026c](https://arxiv.org/html/2606.29431#bib.bib58)\), address late\-stage fragility in reasoning chainsZhanget al\.\([2025](https://arxiv.org/html/2606.29431#bib.bib59)\), or adaptively allocate compute via coarse\-to\-fine refinementZhanget al\.\([2026a](https://arxiv.org/html/2606.29431#bib.bib61)\), suggesting that FADE can be combined with such orthogonal acceleration techniques\.

Table 9:Inference efficiency comparison on LLaVA\-1\.5\-7B\. Measured on POPE \(500 samples\) with H100 GPU\.
### 4\.4Case Study

![Refer to caption](https://arxiv.org/html/2606.29431v1/x3.png)Figure 4:Qualitative comparison of hallucination correction\. Case Study 1: Greedy decoding incorrectly identifies which cat opens its mouth, while FADE provides the correct answer\. Case Study 2: Greedy decoding hallucinates a non\-existent dog in the skiing scene, while FADE correctly denies its presence\.Figure[4](https://arxiv.org/html/2606.29431#S4.F4)illustrates how FFN attenuation mitigates language prior dominance: FADE resolves spatial reasoning errors \(Case 1\) and suppresses non\-existent object hallucinations \(Case 2\), yielding more visually grounded responses\.

### 4\.5Ablation Study

![Refer to caption](https://arxiv.org/html/2606.29431v1/x4.png)\(a\)LLaVA\-1\.5: Strength
![Refer to caption](https://arxiv.org/html/2606.29431v1/x5.png)\(b\)LLaVA\-1\.5: Layer
![Refer to caption](https://arxiv.org/html/2606.29431v1/x6.png)\(c\)mPLUG\-Owl2: Strength
![Refer to caption](https://arxiv.org/html/2606.29431v1/x7.png)\(d\)mPLUG\-Owl2: Layer

Figure 5:Ablation on POPE\. \(a\)\(c\) Strength sensitivity: optimal range\[0\.5,0\.7\]\[0\.5,0\.7\]\. \(b\)\(d\) Layer sensitivity: Layer 18 provides the best or near\-best trade\-off on LLaVA\-1\.5 and mPLUG\-Owl2\. Shaded regions indicate recommended hyperparameter ranges\.We conduct ablations on POPE to analyze hyperparameter sensitivity across different models\.

##### Strength and Layer\.

Varyingα∈\[0\.1,0\.8\]\\alpha\\in\[0\.1,0\.8\]\(Figure[5\(a\)](https://arxiv.org/html/2606.29431#S4.F5.sf1),[5\(c\)](https://arxiv.org/html/2606.29431#S4.F5.sf3)\) yields optimal F1 atα=0\.6\\alpha\{=\}0\.6on LLaVA\-1\.5 \(variation within 0\.3% across\[0\.55,0\.7\]\[0\.55,0\.7\]\) andα=0\.5\\alpha\{=\}0\.5–0\.70\.7on mPLUG\-Owl2, showing consistent low sensitivity across architectures\. Sweeping intervention layers 14–22 \(Figure[5\(b\)](https://arxiv.org/html/2606.29431#S4.F5.sf2),[5\(d\)](https://arxiv.org/html/2606.29431#S4.F5.sf4)\) identifies Layer 18 as optimal on both models, matching our analysis that mid\-to\-late layers exhibit the highest directional drift; mPLUG\-Owl2 shows mild dataset\-dependent variation, with A\-OKVQA preferring layers 14/20 and COCO/GQA favoring 18\.

##### Task\-Specific Tuning\.

Optimal hyperparameters vary by task \(Appendix[B](https://arxiv.org/html/2606.29431#A2)\): discriminative POPE prefersα=0\.6\\alpha\{=\}0\.6, generative CHAIR benefits from stronger attenuation \(α=1\.0\\alpha\{=\}1\.0\) at later layers \(L20\), while MME’s diverse reasoning requires gentler intervention \(α=0\.02\\alpha\{=\}0\.02\)\.

## 5Conclusion

We presented a mechanistic analysis showing that while attention modules consistently aggregate visual evidence toward correct predictions, FFN modules at critical layers \(16–22\) inject language priors that can override visual evidence in LVLMs\. Based on this insight, we introduced FADE, a training\-free method that attenuates FFN outputs at those layers within a single forward pass, mitigating hallucinations with minimal overhead\. Experiments span diverse architectures—LLaVA\-1\.5\-7B/13B, mPLUG\-Owl2, InstructBLIP, InternVL3\-8B, and the Qwen2\.5/3\-VL series—and six benchmarks \(POPE, CHAIR, MME, MMHal\-Bench, HalBench, MMBench\), demonstrating that FADE provides a favorable hallucination\-efficiency trade\-off while preserving general capabilities\.

## Limitations

Our main experiments focus on 7B\-scale models; while we report 13B results in Appendix[C](https://arxiv.org/html/2606.29431#A3)showing consistent generalization, extending to 30B\+ scales remains future work\. Our evaluation emphasizes hallucination\-specific benchmarks \(POPE, CHAIR, MME\), so performance on broader VQA or reasoning tasks is untested\. The critical layer is fixed per model architecture; exploring adaptive layer selection is a promising direction\.

## Acknowledgments

This work was supported by the National Natural Science Foundation of China \(62476011\), and by the Beijing Natural Science Foundation \(L252060\)\.

## References

- J\. Alayrac, J\. Donahue, P\. Luc, A\. Miech, I\. Barr, Y\. Hasson, K\. Lenc, A\. Mensch, K\. Millican, M\. Reynolds,et al\.\(2022\)Flamingo: a visual language model for few\-shot learning\.Advances in neural information processing systems35,pp\. 23716–23736\.Cited by:[§2\.1](https://arxiv.org/html/2606.29431#S2.SS1.p1.1)\.
- Mitigating object hallucinations in large vision\-language models with assembly of global and local attention\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 29915–29926\.Cited by:[§2\.2](https://arxiv.org/html/2606.29431#S2.SS2.p2.1)\.
- J\. Bai, S\. Bai, S\. Yang, S\. Wang, S\. Tan, P\. Wang, J\. Lin, C\. Zhou, and J\. Zhou \(2023\)Qwen\-vl: a versatile vision\-language model for understanding, localization, text reading, and beyond\.arXiv preprint arXiv:2308\.12966\.Cited by:[§1](https://arxiv.org/html/2606.29431#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.29431#S2.SS1.p1.1)\.
- Z\. Bai, P\. Wang, T\. Xiao, T\. He, Z\. Han, Z\. Zhang, and M\. Z\. Shou \(2025\)Hallucination of multimodal large language models: a survey\.External Links:2404\.18930,[Link](https://arxiv.org/abs/2404.18930)Cited by:[§1](https://arxiv.org/html/2606.29431#S1.p1.1)\.
- N\. Belrose, I\. Ostrovsky, L\. McKinney, Z\. Furman, L\. Smith, D\. Halawi, S\. Biderman, and J\. Steinhardt \(2023\)Eliciting latent predictions from transformers with the tuned lens\.arXiv preprint arXiv:2303\.08112\.Cited by:[§1](https://arxiv.org/html/2606.29431#S1.p3.1)\.
- Y\. Chen, L\. Li, L\. Yu, A\. El Kholy, F\. Ahmed, Z\. Gan, Y\. Cheng, and J\. Liu \(2020\)Uniter: universal image\-text representation learning\.InEuropean conference on computer vision,pp\. 104–120\.Cited by:[§2\.1](https://arxiv.org/html/2606.29431#S2.SS1.p1.1)\.
- Z\. Chen, W\. Wang, H\. Tian, S\. Ye, Z\. Gao, E\. Cui, W\. Tong, K\. Hu, J\. Luo, Z\. Ma,et al\.\(2024a\)How far are we to gpt\-4v? closing the gap to commercial multimodal models with open\-source suites\.Science China Information Sciences67\(12\),pp\. 220101\.Cited by:[§2\.1](https://arxiv.org/html/2606.29431#S2.SS1.p1.1)\.
- Z\. Chen, J\. Wu, W\. Wang, W\. Su, G\. Chen, S\. Xing, M\. Zhong, Q\. Zhang, X\. Zhu, L\. Lu,et al\.\(2024b\)Internvl: scaling up vision foundation models and aligning for generic visual\-linguistic tasks\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 24185–24198\.Cited by:[§1](https://arxiv.org/html/2606.29431#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.29431#S2.SS1.p1.1),[§4\.1](https://arxiv.org/html/2606.29431#S4.SS1.SSS0.Px1.p1.1)\.
- Y\. Chuang, Y\. Xie, H\. Luo, Y\. Kim, J\. R\. Glass, and P\. He \(2024\)Dola: decoding by contrasting layers improves factuality in large language models\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 54158–54183\.Cited by:[§1](https://arxiv.org/html/2606.29431#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.29431#S2.SS2.p2.1)\.
- C\. Cui, Y\. Ma, X\. Cao, W\. Ye, Y\. Zhou, K\. Liang, J\. Chen, J\. Lu, Z\. Yang, K\. Liao,et al\.\(2024\)A survey on multimodal large language models for autonomous driving\.InProceedings of the IEEE/CVF winter conference on applications of computer vision,pp\. 958–979\.Cited by:[§1](https://arxiv.org/html/2606.29431#S1.p1.1)\.
- W\. Dai, J\. Li, D\. Li, A\. Tiong, J\. Zhao, W\. Wang, B\. Li, P\. N\. Fung, and S\. Hoi \(2023\)Instructblip: towards general\-purpose vision\-language models with instruction tuning\.Advances in neural information processing systems36,pp\. 49250–49267\.Cited by:[§A\.1](https://arxiv.org/html/2606.29431#A1.SS1.SSS0.Px2),[§2\.1](https://arxiv.org/html/2606.29431#S2.SS1.p1.1),[§4\.1](https://arxiv.org/html/2606.29431#S4.SS1.SSS0.Px1.p1.1)\.
- D\. Driess, F\. Xia, M\. S\. Sajjadi, C\. Lynch, A\. Chowdhery, B\. Ichter, A\. Wahid, J\. Tompson, Q\. Vuong, T\. Yu,et al\.\(2023\)Palm\-e: an embodied multimodal language model\.arXiv preprint arXiv:2303\.03378\.Cited by:[§1](https://arxiv.org/html/2606.29431#S1.p1.1)\.
- N\. Elhage, N\. Nanda, C\. Olsson, T\. Henighan, N\. Joseph, B\. Mann, A\. Askell, Y\. Bai, A\. Chen, T\. Conerly,et al\.\(2021\)A mathematical framework for transformer circuits\.Transformer Circuits Thread1\(1\),pp\. 12\.Cited by:[§1](https://arxiv.org/html/2606.29431#S1.p3.1),[§3\.1](https://arxiv.org/html/2606.29431#S3.SS1.p1.3)\.
- C\. Fu, P\. Chen, Y\. Shen, Y\. Qin, M\. Zhang, X\. Lin, J\. Yang, X\. Zheng, K\. Li, X\. Sun, Y\. Wu, R\. Ji, C\. Shan, and R\. He \(2025\)MME: a comprehensive evaluation benchmark for multimodal large language models\.InAdvances in Neural Information Processing Systems,Vol\.38\.Cited by:[§A\.2](https://arxiv.org/html/2606.29431#A1.SS2.SSS0.Px4),[§4\.1](https://arxiv.org/html/2606.29431#S4.SS1.SSS0.Px2.p1.2)\.
- M\. Geva, A\. Caciularu, K\. Wang, and Y\. Goldberg \(2022\)Transformer feed\-forward layers build predictions by promoting concepts in the vocabulary space\.InProceedings of the 2022 conference on empirical methods in natural language processing,pp\. 30–45\.Cited by:[§1](https://arxiv.org/html/2606.29431#S1.p3.1)\.
- M\. Geva, R\. Schuster, J\. Berant, and O\. Levy \(2021\)Transformer feed\-forward layers are key\-value memories\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,pp\. 5484–5495\.Cited by:[§3\.1](https://arxiv.org/html/2606.29431#S3.SS1.p1.3)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan,et al\.\(2024\)The llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.Cited by:[§2\.1](https://arxiv.org/html/2606.29431#S2.SS1.p1.1)\.
- Q\. Huang, X\. Dong, P\. Zhang, B\. Wang, C\. He, J\. Wang, D\. Lin, W\. Zhang, and N\. Yu \(2024\)Opera: alleviating hallucination in multi\-modal large language models via over\-trust penalty and retrospection\-allocation\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 13418–13427\.Cited by:[§1](https://arxiv.org/html/2606.29431#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.29431#S2.SS2.p2.1)\.
- F\. Huo, W\. Xu, Z\. Zhang, H\. Wang, Z\. Chen, and P\. Zhao \(2025\)Self\-introspective decoding: alleviating hallucinations for large vision\-language models\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 24272–24295\.Cited by:[§2\.2](https://arxiv.org/html/2606.29431#S2.SS2.p2.1)\.
- A\. Q\. Jiang, A\. Sablayrolles, A\. Mensch, C\. Bamford, D\. S\. Chaplot, D\. de las Casas, F\. Bressand, G\. Lengyel, G\. Lample, L\. Saulnier, L\. R\. Lavaud, M\. Lachaux, P\. Stock, T\. L\. Scao, T\. Lavril, T\. Wang, T\. Lacroix, and W\. E\. Sayed \(2023\)Mistral 7b\.External Links:2310\.06825,[Link](https://arxiv.org/abs/2310.06825)Cited by:[§2\.1](https://arxiv.org/html/2606.29431#S2.SS1.p1.1)\.
- J\. Kim, H\. J\. Kim, Y\. J\. Kim, and Y\. M\. Ro \(2024\)Code: contrasting self\-generated description to combat hallucination in large multi\-modal models\.Advances in Neural Information Processing Systems37,pp\. 133571–133599\.Cited by:[§2\.2](https://arxiv.org/html/2606.29431#S2.SS2.p2.1)\.
- S\. Leng, H\. Zhang, G\. Chen, X\. Li, S\. Lu, C\. Miao, and L\. Bing \(2024\)Mitigating object hallucinations in large vision\-language models through visual contrastive decoding\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 13872–13882\.Cited by:[§A\.3](https://arxiv.org/html/2606.29431#A1.SS3.SSS0.Px2),[§1](https://arxiv.org/html/2606.29431#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.29431#S2.SS2.p2.1),[§4\.1](https://arxiv.org/html/2606.29431#S4.SS1.SSS0.Px3.p1.1)\.
- C\. Li, Y\. Guo, B\. Qian, J\. You, K\. Tang, Y\. Du, Z\. Zhang, and X\. Huang \(2025a\)MAP: mitigating hallucinations in large vision\-language models with map\-level attention processing\.arXiv preprint arXiv:2508\.01653\.Cited by:[§2\.2](https://arxiv.org/html/2606.29431#S2.SS2.p2.1)\.
- J\. Li, D\. Li, S\. Savarese, and S\. Hoi \(2023a\)Blip\-2: bootstrapping language\-image pre\-training with frozen image encoders and large language models\.InInternational conference on machine learning,pp\. 19730–19742\.Cited by:[§1](https://arxiv.org/html/2606.29431#S1.p1.1)\.
- J\. Li, D\. Li, C\. Xiong, and S\. Hoi \(2022\)Blip: bootstrapping language\-image pre\-training for unified vision\-language understanding and generation\.InInternational conference on machine learning,pp\. 12888–12900\.Cited by:[§1](https://arxiv.org/html/2606.29431#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.29431#S2.SS1.p1.1)\.
- J\. Li, R\. Selvaraju, A\. Gotmare, S\. Joty, C\. Xiong, and S\. C\. H\. Hoi \(2021\)Align before fuse: vision and language representation learning with momentum distillation\.Advances in neural information processing systems34,pp\. 9694–9705\.Cited by:[§2\.1](https://arxiv.org/html/2606.29431#S2.SS1.p1.1)\.
- X\. Li, X\. Yin, C\. Li, P\. Zhang, X\. Hu, L\. Zhang, L\. Wang, H\. Hu, L\. Dong, F\. Wei,et al\.\(2020\)Oscar: object\-semantics aligned pre\-training for vision\-language tasks\.InEuropean conference on computer vision,pp\. 121–137\.Cited by:[§2\.1](https://arxiv.org/html/2606.29431#S2.SS1.p1.1)\.
- Y\. Li, Y\. Du, K\. Zhou, J\. Wang, X\. Zhao, and J\. Wen \(2023b\)Evaluating object hallucination in large vision\-language models\.InProceedings of the 2023 conference on empirical methods in natural language processing,pp\. 292–305\.Cited by:[§A\.2](https://arxiv.org/html/2606.29431#A1.SS2.SSS0.Px1),[§1](https://arxiv.org/html/2606.29431#S1.p1.1),[§2\.2](https://arxiv.org/html/2606.29431#S2.SS2.p1.1),[§4\.1](https://arxiv.org/html/2606.29431#S4.SS1.SSS0.Px2.p1.2)\.
- Z\. Li, H\. Shi, Y\. Gao, D\. Liu, Z\. Wang, Y\. Chen, T\. Liu, L\. Zhao, H\. Wang, and D\. N\. Metaxas \(2025b\)The hidden life of tokens: reducing hallucination of large vision\-language models via visual information steering\.arXiv preprint arXiv:2502\.03628\.Cited by:[§A\.3](https://arxiv.org/html/2606.29431#A1.SS3.SSS0.Px4),[§2\.2](https://arxiv.org/html/2606.29431#S2.SS2.p2.1),[§4\.1](https://arxiv.org/html/2606.29431#S4.SS1.SSS0.Px3.p1.1)\.
- F\. Liu, K\. Lin, L\. Li, J\. Wang, Y\. Yacoob, and L\. Wang \(2024a\)Mitigating hallucination in large multi\-modal models via robust instruction tuning\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 57689–57733\.Cited by:[§2\.2](https://arxiv.org/html/2606.29431#S2.SS2.p1.1)\.
- H\. Liu, W\. Xue, Y\. Chen, D\. Chen, X\. Zhao, K\. Wang, L\. Hou, R\. Li, and W\. Peng \(2024b\)A survey on hallucination in large vision\-language models\.External Links:2402\.00253,[Link](https://arxiv.org/abs/2402.00253)Cited by:[§1](https://arxiv.org/html/2606.29431#S1.p1.1)\.
- H\. Liu, C\. Li, Y\. Li, and Y\. J\. Lee \(2024c\)Improved baselines with visual instruction tuning\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 26296–26306\.Cited by:[§A\.1](https://arxiv.org/html/2606.29431#A1.SS1.SSS0.Px1),[§1](https://arxiv.org/html/2606.29431#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.29431#S2.SS1.p1.1),[§4\.1](https://arxiv.org/html/2606.29431#S4.SS1.SSS0.Px1.p1.1)\.
- H\. Liu, C\. Li, Y\. Li, B\. Li, Y\. Zhang, S\. Shen, and Y\. J\. Lee \(2024d\)LLaVA\-next: improved reasoning, ocr, and world knowledge\.External Links:[Link](https://llava-vl.github.io/blog/2024-01-30-llava-next/)Cited by:[§2\.1](https://arxiv.org/html/2606.29431#S2.SS1.p1.1)\.
- H\. Liu, C\. Li, Q\. Wu, and Y\. J\. Lee \(2023\)Visual instruction tuning\.Advances in neural information processing systems36,pp\. 34892–34916\.Cited by:[§1](https://arxiv.org/html/2606.29431#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.29431#S2.SS1.p1.1)\.
- S\. Liu, H\. Ye, and J\. Y\. Zou \(2025\)Reducing hallucinations in large vision\-language models via latent space steering\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 72402–72419\.Cited by:[§2\.2](https://arxiv.org/html/2606.29431#S2.SS2.p2.1)\.
- S\. Liu, K\. Zheng, and W\. Chen \(2024e\)Paying more attention to image: a training\-free method for alleviating hallucination in lvlms\.InEuropean Conference on Computer Vision,pp\. 125–140\.Cited by:[§A\.3](https://arxiv.org/html/2606.29431#A1.SS3.SSS0.Px1),[§1](https://arxiv.org/html/2606.29431#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.29431#S2.SS2.p2.1),[§4\.1](https://arxiv.org/html/2606.29431#S4.SS1.SSS0.Px3.p1.1)\.
- X\. Lyu, S\. Wang, B\. Chen, J\. Song, L\. Gao,et al\.\(2026\)FlexAC: towards flexible control of associative reasoning in multimodal large language models\.Advances in Neural Information Processing Systems38,pp\. 2473–2514\.Cited by:[§2\.2](https://arxiv.org/html/2606.29431#S2.SS2.p2.1)\.
- A\. Manevich and R\. Tsarfaty \(2024\)Mitigating hallucinations in large vision\-language models \(lvlms\) via language\-contrastive decoding \(lcd\)\.InFindings of the Association for Computational Linguistics: ACL 2024,pp\. 6008–6022\.Cited by:[§1](https://arxiv.org/html/2606.29431#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.29431#S2.SS2.p2.1)\.
- K\. Meng, D\. Bau, A\. Andonian, and Y\. Belinkov \(2022\)Locating and editing factual associations in gpt\.Advances in neural information processing systems35,pp\. 17359–17372\.Cited by:[§3\.1](https://arxiv.org/html/2606.29431#S3.SS1.p1.3)\.
- C\. Neo, L\. Ong, P\. Torr, M\. Geva, D\. Krueger, and F\. Barez \(2025\)Towards interpreting visual information processing in vision\-language models\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 57172–57189\.Cited by:[§2\.2](https://arxiv.org/html/2606.29431#S2.SS2.p3.1)\.
- J\. Qian, G\. Zheng, Y\. Zhu, and S\. Yang \(2026\)Intervene\-all\-paths: unified mitigation of lvlm hallucinations across alignment formats\.External Links:2511\.17254,[Link](https://arxiv.org/abs/2511.17254)Cited by:[§2\.2](https://arxiv.org/html/2606.29431#S2.SS2.p2.1)\.
- A\. Radford, J\. W\. Kim, C\. Hallacy, A\. Ramesh, G\. Goh, S\. Agarwal, G\. Sastry, A\. Askell, P\. Mishkin, J\. Clark,et al\.\(2021\)Learning transferable visual models from natural language supervision\.InInternational conference on machine learning,pp\. 8748–8763\.Cited by:[§1](https://arxiv.org/html/2606.29431#S1.p1.1)\.
- A\. Rohrbach, L\. A\. Hendricks, K\. Burns, T\. Darrell, and K\. Saenko \(2018\)Object hallucination in image captioning\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,pp\. 4035–4045\.Cited by:[§A\.2](https://arxiv.org/html/2606.29431#A1.SS2.SSS0.Px2),[§1](https://arxiv.org/html/2606.29431#S1.p1.1),[§2\.2](https://arxiv.org/html/2606.29431#S2.SS2.p1.1),[§4\.1](https://arxiv.org/html/2606.29431#S4.SS1.SSS0.Px2.p1.2)\.
- Z\. Sun, S\. Shen, S\. Cao, H\. Liu, C\. Li, Y\. Shen, C\. Gan, L\. Gui, Y\. Wang, Y\. Yang,et al\.\(2024\)Aligning large multimodal models with factually augmented rlhf\.InFindings of the Association for Computational Linguistics: ACL 2024,pp\. 13088–13110\.Cited by:[§A\.2](https://arxiv.org/html/2606.29431#A1.SS2.SSS0.Px3),[§1](https://arxiv.org/html/2606.29431#S1.p2.1),[§4\.2\.4](https://arxiv.org/html/2606.29431#S4.SS2.SSS4.p1.1)\.
- Z\. Sun, X\. Zang, K\. Zheng, J\. Xu, X\. Zhang, W\. Yu, Y\. Song, and H\. Li \(2025\)Redeep: detecting hallucination in retrieval\-augmented generation via mechanistic interpretability\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 50250–50279\.Cited by:[§2\.2](https://arxiv.org/html/2606.29431#S2.SS2.p3.1)\.
- K\. Tang, J\. You, X\. Ge, H\. Li, Y\. Guo, and X\. Huang \(2025\)Mitigating hallucinations via inter\-layer consistency aggregation in large vision\-language models\.arXiv preprint arXiv:2505\.12343\.Cited by:[§2\.2](https://arxiv.org/html/2606.29431#S2.SS2.p2.1),[§4\.1](https://arxiv.org/html/2606.29431#S4.SS1.SSS0.Px3.p1.1)\.
- H\. Touvron, T\. Lavril, G\. Izacard, X\. Martinet, M\. Lachaux, T\. Lacroix, B\. Rozière, N\. Goyal, E\. Hambro, F\. Azhar,et al\.\(2023a\)Llama: open and efficient foundation language models\.arXiv preprint arXiv:2302\.13971\.Cited by:[§2\.1](https://arxiv.org/html/2606.29431#S2.SS1.p1.1)\.
- H\. Touvron, L\. Martin, K\. Stone, P\. Albert, A\. Almahairi, Y\. Babaei, N\. Bashlykov, S\. Batra, P\. Bhargava, S\. Bhosale,et al\.\(2023b\)Llama 2: open foundation and fine\-tuned chat models\.arXiv preprint arXiv:2307\.09288\.Cited by:[§2\.1](https://arxiv.org/html/2606.29431#S2.SS1.p1.1)\.
- B\. Wang, F\. Wu, X\. Han, J\. Peng, H\. Zhong, P\. Zhang, X\. Dong, W\. Li, W\. Li, J\. Wang,et al\.\(2024a\)Vigc: visual instruction generation and correction\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.38,pp\. 5309–5317\.Cited by:[§2\.2](https://arxiv.org/html/2606.29431#S2.SS2.p1.1)\.
- K\. Wang, H\. Gu, M\. Gao, and K\. Zhou \(2025\)Damo: decoding by accumulating activations momentum for mitigating hallucinations in vision\-language models\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§A\.3](https://arxiv.org/html/2606.29431#A1.SS3.SSS0.Px3),[§1](https://arxiv.org/html/2606.29431#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.29431#S2.SS2.p2.1),[§4\.1](https://arxiv.org/html/2606.29431#S4.SS1.SSS0.Px3.p1.1)\.
- P\. Wang, S\. Bai, S\. Tan, S\. Wang, Z\. Fan, J\. Bai, K\. Chen, X\. Liu, J\. Wang, W\. Ge,et al\.\(2024b\)Qwen2\-vl: enhancing vision\-language model’s perception of the world at any resolution\.arXiv preprint arXiv:2409\.12191\.Cited by:[§1](https://arxiv.org/html/2606.29431#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.29431#S2.SS1.p1.1),[§4\.1](https://arxiv.org/html/2606.29431#S4.SS1.SSS0.Px1.p1.1)\.
- X\. Wang, J\. Pan, L\. Ding, and C\. Biemann \(2024c\)Mitigating hallucinations in large vision\-language models with instruction contrastive decoding\.InFindings of the Association for Computational Linguistics: ACL 2024,pp\. 15840–15853\.Cited by:[§2\.2](https://arxiv.org/html/2606.29431#S2.SS2.p2.1)\.
- Z\. Wang, J\. Yu, A\. W\. Yu, Z\. Dai, Y\. Tsvetkov, and Y\. Cao \(2021\)Simvlm: simple visual language model pretraining with weak supervision\.arXiv preprint arXiv:2108\.10904\.Cited by:[§2\.1](https://arxiv.org/html/2606.29431#S2.SS1.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[§4\.1](https://arxiv.org/html/2606.29431#S4.SS1.SSS0.Px1.p1.1)\.
- Q\. Ye, H\. Xu, J\. Ye, M\. Yan, A\. Hu, H\. Liu, Q\. Qian, J\. Zhang, and F\. Huang \(2024\)Mplug\-owl2: revolutionizing multi\-modal large language model with modality collaboration\.InProceedings of the ieee/cvf conference on computer vision and pattern recognition,pp\. 13040–13051\.Cited by:[§A\.1](https://arxiv.org/html/2606.29431#A1.SS1.SSS0.Px3),[§4\.1](https://arxiv.org/html/2606.29431#S4.SS1.SSS0.Px1.p1.1)\.
- T\. Yu, Y\. Yao, H\. Zhang, T\. He, Y\. Han, G\. Cui, J\. Hu, Z\. Liu, H\. Zheng, M\. Sun,et al\.\(2024\)Rlhf\-v: towards trustworthy mllms via behavior alignment from fine\-grained correctional human feedback\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 13807–13816\.Cited by:[§2\.2](https://arxiv.org/html/2606.29431#S2.SS2.p1.1)\.
- D\. Zhang, H\. Lin, Y\. Sun, P\. Wang, Q\. Wang, N\. Yang, and J\. Zhu \(2026a\)Not all queries need deep thought: coficot for adaptive coarse\-to\-fine stateful refinement\.InAnn\. Conf\. Uncertain\. Artif\. Intell\.,Cited by:[§4\.3](https://arxiv.org/html/2606.29431#S4.SS3.p1.2)\.
- D\. Zhang, Y\. Sun, P\. Li, Y\. Liu, H\. Lin, H\. Xu, X\. Mu, L\. Lin, W\. Yan, N\. Yang,et al\.\(2026b\)PointCoT: a multi\-modal benchmark for explicit 3d geometric reasoning\.arXiv Prepr\. arXiv:2602\.23945\.Cited by:[§1](https://arxiv.org/html/2606.29431#S1.p1.1)\.
- D\. Zhang, Y\. Sun, C\. Tan, W\. Yan, N\. Yang, J\. Zhu, and H\. Zhang \(2026c\)Chain\-of\-thought compression should not be blind: v\-skip for efficient multimodal reasoning via dual\-path anchoring\.InAnn\. Meet\. Assoc\. Comput\. Linguist\.,Cited by:[§4\.3](https://arxiv.org/html/2606.29431#S4.SS3.p1.2)\.
- D\. Zhang, Y\. Wu, Y\. Sun, J\. Zhu, J\. Yang, M\. Xin, and B\. Tian \(2025\)Not all errors are created equal: ascot addresses late\-stage fragility in efficient llm reasoning\.arXiv Prepr\. arXiv:2508\.05282\.Cited by:[§4\.3](https://arxiv.org/html/2606.29431#S4.SS3.p1.2)\.
- P\. Zhang, X\. Li, X\. Hu, J\. Yang, L\. Zhang, L\. Wang, Y\. Choi, and J\. Gao \(2021\)Vinvl: revisiting visual representations in vision\-language models\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 5579–5588\.Cited by:[§2\.1](https://arxiv.org/html/2606.29431#S2.SS1.p1.1)\.
- Z\. Zhao, B\. Wang, L\. Ouyang, X\. Dong, J\. Wang, and C\. He \(2023\)Beyond hallucinations: enhancing lvlms through hallucination\-aware direct preference optimization\.arXiv preprint arXiv:2311\.16839\.Cited by:[§1](https://arxiv.org/html/2606.29431#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.29431#S2.SS2.p1.1)\.
- Y\. Zhou, C\. Cui, J\. Yoon, L\. Zhang, Z\. Deng, C\. Finn, M\. Bansal, and H\. Yao \(2024\)Analyzing and mitigating object hallucination in large vision\-language models\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 56969–56998\.Cited by:[§2\.2](https://arxiv.org/html/2606.29431#S2.SS2.p1.1)\.

## Appendix ADetailed Experimental Settings

### A\.1Model Descriptions

##### LLaVA\-1\.5Liuet al\.\([2024c](https://arxiv.org/html/2606.29431#bib.bib32)\)\.

An improved version of LLaVA that achieves strong performance through simple architectural modifications and better training recipes\. It uses a two\-stage training process with visual instruction tuning on high\-quality data\.

##### InstructBLIPDaiet al\.\([2023](https://arxiv.org/html/2606.29431#bib.bib10)\)\.

A vision\-language model that leverages instruction tuning on top of the BLIP\-2 architecture\. It uses a Q\-Former to bridge frozen image encoders and LLMs with instruction\-aware visual feature extraction\.

##### mPLUG\-Owl2Yeet al\.\([2024](https://arxiv.org/html/2606.29431#bib.bib54)\)\.

Introduces modality collaboration through a shared module that enables better interaction between visual and textual modalities, achieving strong performance on various multimodal benchmarks\.

### A\.2Benchmark Descriptions

##### POPELiet al\.\([2023b](https://arxiv.org/html/2606.29431#bib.bib29)\)\.

The Polling\-based Object Probing Evaluation is designed to evaluate object hallucination in LVLMs\. It contains 27,000 Yes/No questions about object existence in MSCOCO images, where the task is to judge whether the given object is present in the image\. The benchmark includes three sampling strategies: random, popular, and adversarial\. We compute accuracy, precision, recall, and F1 score for comprehensive evaluation\.

##### CHAIRRohrbachet al\.\([2018](https://arxiv.org/html/2606.29431#bib.bib42)\)\.

Caption Hallucination Assessment with Image Relevance quantifies object hallucinations in image captions by comparing generated objects to ground\-truth annotations\. We randomly select 500 images from the MSCOCO dataset and use three metrics: CHAIRI\(instance\-level hallucination rate\), CHAIRS\(sentence\-level hallucination rate\), and Recall \(coverage of ground\-truth objects\)\.

##### MMHal\-BenchSunet al\.\([2024](https://arxiv.org/html/2606.29431#bib.bib43)\)\.

This benchmark evaluates LVLMs beyond simple object hallucination and contains eight diverse question types: object attributes, adversarial objects, comparisons, counting, spatial relations, environment, holistic description, and others\. We evaluate both the hallucination rate and response informativeness using GPT\-4 as the judge\.

##### MMEFuet al\.\([2025](https://arxiv.org/html/2606.29431#bib.bib14)\)\.

A comprehensive evaluation benchmark covering both perception and cognition abilities across 14 subtasks\. The perception tasks include existence, count, position, color, poster, celebrity, scene, landmark, artwork, and OCR\. The cognition tasks cover commonsense reasoning, numerical calculation, text translation, and code reasoning\.

### A\.3Comparison Method Descriptions

##### PAILiuet al\.\([2024e](https://arxiv.org/html/2606.29431#bib.bib36)\)\.

Pays more attention to image tokens by amplifying the attention weights on visual features during decoding, ensuring that generated content is more grounded in the actual image content\.

##### VCDLenget al\.\([2024](https://arxiv.org/html/2606.29431#bib.bib23)\)\.

Visual Contrastive Decoding contrasts the output logits from original visual inputs with those from distorted visual inputs \(e\.g\., Gaussian noise\), suppressing hallucinated content that appears regardless of visual quality\.

##### DAMOWanget al\.\([2025](https://arxiv.org/html/2606.29431#bib.bib49)\)\.

Applies momentum\-based activation stabilization to reduce hallucination by smoothing hidden state transitions during autoregressive generation\.

##### VISTALiet al\.\([2025b](https://arxiv.org/html/2606.29431#bib.bib30)\)\.

Introduces visual steering vectors combined with self\-logits augmentation\. It computes steering directions from contrastive image pairs and applies them during decoding to enhance visual grounding\.

### A\.4Implementation Details

All experiments are conducted on 8 NVIDIA H100 80GB GPUs\. We use greedy decoding \(temperature=0\) for all methods to ensure reproducibility\. The detailed hyperparameters for each comparison method are listed in Table[10](https://arxiv.org/html/2606.29431#A1.T10)\.

Table 10:Hyperparameter settings for comparison methods\. All hyperparameters follow the official implementations\.For our method FADE, we use the following hyperparameters:

Table 11:Hyperparameter settings for FADE across different models\.ModelStrengthα\\alphaLayerTaskLLaVA\-1\.5\-7B0\.618POPE1\.020CHAIR0\.0217MMEmPLUG\-Owl2\-7B0\.518POPE\-COCO0\.718POPE\-GQA0\.514POPE\-A\-OKVQA0\.620CHAIR0\.051MMEInstructBLIP\-7B0\.514POPE

Note:MME requires significantly smaller attenuation strength \(α\\alpha=0\.02–0\.05\) compared to POPE/CHAIR \(α\\alpha=0\.5–0\.7\), as shown in Section[B](https://arxiv.org/html/2606.29431#A2)\. This is because MME’s diverse question types are more sensitive to FFN modifications\.

## Appendix BDetailed Ablation Study

We provide comprehensive ablation analysis on the FFN attenuation strength \(α\\alpha\) and layer selection on POPE benchmark across three datasets \(COCO, GQA, A\-OKVQA\)\.

### B\.1Strength Ablation \(Fixed at Layer 18\)

Table[12](https://arxiv.org/html/2606.29431#A2.T12)shows the sensitivity analysis of the attenuation strengthα\\alphawhile fixing the intervention layer at 18\. We test 10 different strength values ranging from 0\.1 to 0\.8\.

Table 12:FFN attenuation strength ablation on POPE benchmark\. Layer is fixed at 18\. F1 scores are averaged across Random/Popular/Adversarial settings\.StrengthCOCO F1GQA F1A\-OKVQA F1Avg F10\.185\.9885\.5286\.6686\.050\.286\.0185\.6186\.6286\.080\.386\.2085\.6786\.5986\.150\.486\.2185\.6386\.6786\.170\.4586\.3385\.6186\.6886\.210\.586\.4585\.6486\.5786\.220\.5586\.4785\.6586\.6786\.260\.686\.6185\.6886\.6486\.310\.786\.6285\.5686\.5686\.250\.886\.5985\.5586\.5286\.22Baseline: Greedy decoding achieves 85\.97% average F1

Key Findings:The optimal strength range is 0\.6–0\.7, achieving \+0\.34% to \+0\.44% improvement over greedy baseline\. Weaker attenuation \(α\\alpha<0\.5\) provides insufficient correction, while stronger attenuation \(α\\alpha\>0\.7\) shows diminishing returns\. The sweet spot atα\\alpha=0\.6 suggests that moderate FFN suppression is sufficient to mitigate directional noise without over\-correcting\.

### B\.2Layer Ablation \(Fixed Strength at 0\.5\)

Table[13](https://arxiv.org/html/2606.29431#A2.T13)analyzes the impact of layer selection while fixingα\\alpha=0\.5\. We test 8 layers around the critical region identified in our analysis \(layers 14–22\)\.

Table 13:Layer selection ablation on POPE benchmark\. Attenuation strength is fixed at 0\.5\. F1 scores are averaged across Random/Popular/Adversarial settings\.Key Findings:Layer 18 consistently provides the best results across all three datasets\. Layers 15 and 17 show significant degradation, suggesting these layers may serve different functional roles where FFN outputs should not be attenuated\. The localized effectiveness around layer 18 supports our mechanistic analysis that directional noise is concentrated in specific critical layers rather than distributed uniformly\.

### B\.3MME Ablation Results

Table[14](https://arxiv.org/html/2606.29431#A2.T14)and Table[15](https://arxiv.org/html/2606.29431#A2.T15)show ablation results on MME Perception benchmark\. Note that MME requires much smaller attenuation strength compared to POPE\.

Table 14:MME Perception: Strength ablation with Layer=18 fixed\.StrengthPerceptionΔ\\Deltavs BaselineCognition0\.011512\.63\+6\.91363\.210\.021506\.58\+0\.86363\.210\.031499\.58−\-6\.14367\.500\.041495\.08−\-10\.64368\.210\.051494\.04−\-11\.68360\.710\.11493\.70−\-12\.02363\.570\.21483\.97−\-21\.75355\.710\.31464\.46−\-41\.26328\.210\.51431\.43−\-74\.29290\.71Baseline: Greedy achieves 1505\.72 Perception score

Table 15:MME Perception: Layer ablation with Strength=0\.02 fixed\.Key Findings:MME requires dramatically different hyperparameters compared to POPE: optimal strength is 0\.02–0\.05 \(vs 0\.5–0\.7 for POPE\), representing 2–5% attenuation vs 50–70%\. This 10×\\times–35×\\timesdifference suggests that the diverse question types in MME are more sensitive to FFN modification, requiring gentler intervention\. Layer 17 is optimal for MME \(vs Layer 18 for POPE\), indicating task\-dependent critical layers\.

### B\.4mPLUG\-Owl2 Ablation on POPE

We also conduct ablation studies on mPLUG\-Owl2 to validate the generalizability of our findings across different model architectures\.

Table 16:mPLUG\-Owl2: FFN attenuation strength ablation on POPE benchmark\. Layer is fixed at 18\.StrengthCOCO F1GQA F1A\-OKVQA F1Avg F10\.185\.5681\.7784\.3183\.880\.285\.6881\.7984\.2983\.920\.385\.7181\.7484\.2383\.890\.485\.7681\.7284\.1983\.890\.585\.8681\.7184\.1883\.920\.685\.7581\.7384\.1583\.880\.785\.8081\.8184\.1383\.910\.885\.8281\.7984\.1083\.90Baseline: Greedy decoding achieves 83\.84% average F1

Table 17:mPLUG\-Owl2: Layer selection ablation on POPE benchmark\. Attenuation strength is fixed at 0\.5\.Key Findings for mPLUG\-Owl2:Unlike LLaVA\-1\.5 which has a clear optimal configuration, mPLUG\-Owl2 shows dataset\-dependent optimal hyperparameters: \(1\) COCO benefits most from Layer 18 with strength 0\.5; \(2\) GQA achieves best results at Layer 18 with strength 0\.7; \(3\) A\-OKVQA prefers Layer 14 or 20 with strength 0\.5\. This suggests that different model architectures may have different critical layer distributions, and dataset\-specific tuning can further improve performance\. The overall improvement is more modest \(\+0\.08–0\.11%\) compared to LLaVA\-1\.5 \(\+0\.34%\), indicating that mPLUG\-Owl2’s modality collaboration mechanism may already partially address the directional noise issue\.

### B\.5mPLUG\-Owl2 MME Ablation Results

Table[18](https://arxiv.org/html/2606.29431#A2.T18)shows the MME ablation results for mPLUG\-Owl2, revealing notably different optimal layers compared to POPE\.

Table 18:mPLUG\-Owl2: Best configurations per layer on MME Perception benchmark\.Key Findings for mPLUG\-Owl2 on MME:Unlike POPE where middle layers \(14–20\) are optimal, MME benefits most from early layer intervention\. Layer 1 withα\\alpha=0\.05 achieves the best result \(\+4\.25\), followed by Layer 10 \(\+2\.22\) and Layer 7 \(\+1\.64\)\. This suggests that for diverse question types in MME, suppressing language priors at the earliest layers is most effective\. Notably, the optimal strength for early layers \(0\.02–0\.05\) is higher than for middle layers \(0\.005–0\.01\)\.

### B\.6InstructBLIP Ablation on POPE

We validate FADE’s effectiveness on InstructBLIP, which uses a Q\-Former architecture with 32 visual tokens \(vs 576 for LLaVA\)\. Table[19](https://arxiv.org/html/2606.29431#A2.T19)shows the ablation results\.

Table 19:InstructBLIP: Best configurations on POPE benchmark\. Results averaged across Random/Popular/Adversarial settings\.Key Findings for InstructBLIP:The optimal configuration is layer 14 withα\\alpha=0\.5, achieving modest improvement on A\-OKVQA \(\+0\.5% F1\) and GQA \(\+0\.2% F1\)\. Layer 17 achieves slightly better COCO performance but worse on other datasets\. The smaller improvement compared to LLaVA\-1\.5 suggests that InstructBLIP’s Q\-Former architecture may already provide some robustness against hallucination through its learnable query\-based visual feature extraction\. Notably, the optimal layer \(14\) is earlier than LLaVA’s optimal layer \(18\), possibly due to architectural differences in how visual information is integrated\.

### B\.7CHAIR Ablation Results

We provide comprehensive ablation analysis on the CHAIR benchmark, which evaluates caption hallucination through object\-level metrics\.

#### B\.7\.1LLaVA\-1\.5 Layer Ablation on CHAIR

Table[20](https://arxiv.org/html/2606.29431#A2.T20)shows the impact of layer selection on CHAIR metrics while fixing the attenuation strength atα\\alpha=1\.0\. We test layers 10–22 to identify the optimal intervention point\.

Table 20:LLaVA\-1\.5: Layer ablation on CHAIR benchmark\. Strength is fixed atα\\alpha=1\.0\. CS/CI: lower is better\. Rec: higher is better\.Key Findings:Layer 20 achieves the lowest hallucination rates \(CS=46\.6, CI=14\.08\) with only a modest decrease in recall \(78\.69 vs 80\.6 baseline\)\. Earlier layers \(10–16\) either provide insufficient correction or increase hallucination\. This differs from POPE where layer 18 is optimal, suggesting that discriminative and generative tasks have slightly different critical layers\.

#### B\.7\.2LLaVA\-1\.5 Strength Ablation on CHAIR

Table[21](https://arxiv.org/html/2606.29431#A2.T21)shows the sensitivity to attenuation strength while fixing the intervention at layer 20\.

Table 21:LLaVA\-1\.5: Strength ablation on CHAIR benchmark\. Layer is fixed at 20\.Strengthα\\alphaCS↓\\downarrowCI↓\\downarrowRec↑\\uparrowLen0\.151\.214\.9480\.23101\.20\.251\.614\.8979\.85101\.00\.351\.015\.1879\.72100\.30\.450\.414\.7679\.91100\.20\.549\.014\.5979\.7899\.40\.648\.814\.9878\.9599\.20\.747\.414\.5478\.5798\.50\.847\.414\.7078\.8297\.90\.946\.814\.2878\.6998\.01\.046\.614\.0878\.6998\.6Baseline \(Greedy\): CS=49\.8, CI=14\.8, Rec=80\.6, Len=101\.2

Key Findings:Unlike POPE whereα\\alpha=0\.6 is optimal, CHAIR benefits from stronger attenuation \(α\\alpha=1\.0\), achieving CS=46\.6 \(−\-3\.2 vs baseline\)\. This suggests that caption generation tasks require more aggressive FFN suppression to reduce hallucinated objects\. The recall\-hallucination trade\-off is favorable: CSdrops by 6\.4% while recall only decreases by 2\.4%\.

#### B\.7\.3mPLUG\-Owl2 Ablation on CHAIR

Table[22](https://arxiv.org/html/2606.29431#A2.T22)presents ablation results for mPLUG\-Owl2, showing layer and strength combinations\.

Table 22:mPLUG\-Owl2: Ablation on CHAIR benchmark across different layer and strength combinations\.Key Findings:For mPLUG\-Owl2, the optimal configuration is layer 20 withα\\alpha=0\.6, achieving CS=55\.0 \(−\-2\.8 vs baseline\) and CI=16\.60 \(−\-0\.5\)\. The improvement is more modest compared to LLaVA\-1\.5, consistent with our observation that mPLUG\-Owl2’s modality collaboration mechanism partially addresses hallucination\. Notably, layer 18 \(optimal for POPE\) shows minimal improvement on CHAIR, supporting task\-dependent optimal layers\.

## Appendix CGeneralization to Larger Model Scale

In this section, we present additional experiments testing whether FADE generalizes beyond the 7B scale\. Results on advanced architectures \(InternVL3\-8B and the Qwen\-VL series\) are reported in the main text \(Section[4\.2\.5](https://arxiv.org/html/2606.29431#S4.SS2.SSS5)\)\.

### C\.1Evaluation on LLaVA\-v1\.5\-13B

To test whether FADE generalizes beyond the 7B scale, we evaluate all comparison methods onLLaVA\-v1\.5\-13B\(40 transformer layers\) using the POPE MSCOCO benchmark across the three standard sampling settings \(Random, Popular, Adversarial\), for a total of 9000 evaluation samples\.

##### Main Results\.

Table[23](https://arxiv.org/html/2606.29431#A3.T23)reports F1 scores on LLaVA\-v1\.5\-13B\. FADE \(L34,α=0\.7\\alpha\{=\}0\.7\) achieves the best overall performance, with an average F1 of86\.15, outperforming all training\-free baselines including PAI \(86\.10\)\. More importantly, several contrastive and attention\-based methods—VCD \(81\.74\), DAMO \(84\.12\), VISTA \(84\.05\)—*degrade below greedy decoding*\(85\.70\) at 13B, whereas FADE and PAI are the only methods that improve over greedy\. This suggests that FFN\-level intervention is more robust to model scale than methods relying on contrastive distorted inputs or attention steering, which become less reliable as model capacity grows\.

Table 23:POPE results on LLaVA\-v1\.5\-13B \(MSCOCO, 3 settings, 9000 samples\)\. F1 scores; best inbold, second bestunderlined\. FADE uses layer 34 andα=0\.7\\alpha=0\.7\.
##### Layer Sensitivity at 13B\.

We sweep the intervention layer withα\\alphafixed at 0\.6 \(Table[24](https://arxiv.org/html/2606.29431#A3.T24)\)\. All tested layers from L20 to L34 exceed greedy \(85\.70\), with the optimum at layer 34\. Note that layer 34 in a 40\-layer 13B model corresponds proportionally to layer 18 in a 32\-layer 7B model \(34/40≈18/3234/40\\approx 18/32\); both locate the optimal intervention point in the mid\-to\-late region of the network\.

Table 24:Layer search on LLaVA\-v1\.5\-13B POPE \(fixedα=0\.6\\alpha=0\.6, Avg F1\)\.
##### Strength Sensitivity at 13B\.

Fixing the intervention layer at L34, we varyα\\alphafrom 0\.3 to 0\.8 \(Table[25](https://arxiv.org/html/2606.29431#A3.T25)\)\. All tested values outperform greedy, with the optimum atα=0\.7\\alpha\{=\}0\.7\. The stable plateau observed aroundα∈\[0\.5,0\.8\]\\alpha\\in\[0\.5,0\.8\]mirrors the 7B behavior and indicates that a narrow mid\-range attenuation strength transfers robustly across model sizes\.

Table 25:Strength search on LLaVA\-v1\.5\-13B POPE \(fixed layer 34, Avg F1\)\.
##### Takeaway\.

Two observations emerge: \(i\) FADE’s mechanism generalizes cleanly to 13B, attaining the best average F1 among compared methods while contrastive and attention\-based baselines regress; \(ii\) the optimal hyperparameters transfer in a principled way, locating the critical layer in the mid\-to\-late region and the attenuation strength in the\[0\.5,0\.7\]\[0\.5,0\.7\]range for both 7B and 13B\. This provides a practical prior for future deployment on new model scales: begin the search around the proportionally\-equivalent mid\-to\-late layer with moderate attenuation strength\.

### C\.2Strength Sweep on Advanced Architectures

A natural concern with any single\-hyperparameter design is whether the reported gain depends on a carefully cherry\-picked attenuation strength\. To address this, we sweepα∈\{0\.3,0\.5,0\.7,0\.8\}\\alpha\\in\\\{0\.3,0\.5,0\.7,0\.8\\\}on Qwen3\-VL\-8B—one of the strongest architectures in our evaluation—while keeping the critical\-layer band fixed\. We report the full POPE breakdown \(Random / Popular / Adversarial / Avg F1\) and include the Vanilla \(greedy\) baseline for reference\.

Table 26:Sensitivity of FADE to attenuation strengthα\\alphaon Qwen3\-VL\-8B \(POPE F1, higher is better\)\. Variation acrossα∈\{0\.3,0\.5,0\.7,0\.8\}\\alpha\\in\\\{0\.3,0\.5,0\.7,0\.8\\\}stays within 0\.2 points on every split\. Defaultα=0\.5\\alpha=0\.5shaded\.##### Takeaway\.

Across the entire swept range, every POPE split varies by at most 0\.2 F1, and FADE matches or exceeds the Vanilla baseline at all four values ofα\\alpha\. Two implications follow: \(i\) the gains reported in Section[4\.2\.5](https://arxiv.org/html/2606.29431#S4.SS2.SSS5)are not the product of cherry\-picked tuning—any moderate attenuation strength yields essentially the same outcome; and \(ii\) the relatively modest absolute improvement on Qwen3\-VL\-8B is a property of the model, not ofα\\alpha\. Stronger architectures embed highly optimized internal language priors, leaving a smaller residual margin for any training\-free intervention to recover; this is consistent with the broader trend observed across the three advanced models in the main text\.

## Appendix DHyperparameter Transfer Across Models

We investigate whether optimal hyperparameters transfer across different VLM architectures\.

Table 27:Optimal hyperparameters across different VLMs\.Key Finding:For LLaVA\-style models \(LLaVA\-1\.5, mPLUG\-Owl2\), layer 18 is consistently optimal with strength in the 0\.5–0\.7 range\. InstructBLIP, which uses a different Q\-Former architecture, shows optimal performance at an earlier layer \(14\) with lower strength \(0\.5\)\. This suggests that the critical layers for textual bias are architecturally determined, with Q\-Former\-based models showing different layer distributions\.

## Appendix ELimitations and Future Work

##### Task\-Specific Tuning\.

While FADE achieves strong results with moderate attenuation for discriminative tasks \(POPE\) and caption generation \(CHAIR\), the MME benchmark requires much smaller attenuation strength \(α=0\.02\\alpha=0\.02–0\.050\.05vs\.0\.50\.5–0\.70\.7\)\. This suggests that different task types may require task\-specific tuning, which we leave for future work on adaptive strength selection\.

##### Larger Models\.

Our main experiments focus on 7B\-scale models\. As reported in Appendix[C](https://arxiv.org/html/2606.29431#A3), FADE generalizes to LLaVA\-v1\.5\-13B with consistent scale\-invariant hyperparameter patterns \(mid\-to\-late layers,α≈0\.6−0\.7\\alpha\\approx 0\.6\-0\.7\)\. Preliminary experiments on larger models \(e\.g\., InternVL3\-8B\) further suggest that the optimal layer shifts proportionally with model depth, but comprehensive evaluation at 30B\+ scale remains needed\.

##### Training\-Time Integration\.

FADE operates at inference time without model modification\. Future work could explore training\-time regularization that explicitly minimizes directional drift during instruction tuning\.
FADE: Mitigating Hallucinations by Reducing Language-Prior Dominance in Large Vision-Language Models

Similar Articles

Dismantling Pathological Shortcuts: A Causal Framework for Faithful LVLM Decoding

From Architecture to Output: Structural Origins of Hallucination in Large Language Models and the Amplifying Role of Data

Mechanisms of Prompt-Induced Hallucination in Vision-Language Models

Mitigating Manifold Departure: Uncertainty-Aware Subspace Rectification for Trustworthy MLLM Decoding

Mitigating Multimodal Hallucination via Phase-wise Self-reward

Submit Feedback

Similar Articles

Dismantling Pathological Shortcuts: A Causal Framework for Faithful LVLM Decoding
From Architecture to Output: Structural Origins of Hallucination in Large Language Models and the Amplifying Role of Data
Mechanisms of Prompt-Induced Hallucination in Vision-Language Models
Mitigating Manifold Departure: Uncertainty-Aware Subspace Rectification for Trustworthy MLLM Decoding
Mitigating Multimodal Hallucination via Phase-wise Self-reward