Breaking the Illusion: When Positive Meets Negative in Multimodal Decoding
Summary
This paper introduces Positive-and-Negative Decoding (PND), a training-free inference framework that reduces object hallucination in Vision-Language Models by contrasting positive visual evidence with negative counterfactuals during decoding.
# Breaking the Illusion: When Positive Meets Negative in Multimodal Decoding
Source: [https://arxiv.org/html/2605.06679](https://arxiv.org/html/2605.06679)
Yubo Jiang$^{1,2}$, Yitong An$^{1}$, Xin Yang$^{2}$, Abudukelimu Wuerkaixi$^{2}$, Xuxin Cheng$^{2}$, Fengying Xie$^{1,3}$, Zhiguo Jiang$^{3}$, Cao Liu$^{2}$, Ke Zeng$^{2\dagger}$, Haopeng Zhang$^{1,3\dagger}$

$^{1}$School of Astronautics, Beihang University, Beijing 102206, China

$^{2}$Longcat Interaction Team, Meituan, Beijing 100102, China

$^{3}$Tianmushan Laboratory, Beihang University, Hangzhou 311115, China

\{jbond0409, zhanghaopeng\}@buaa\.edu\.cn \(Y\.J\., H\.Z\.\)
###### Abstract
Vision\-Language Models \(VLMs\) are frequently undermined by object hallucination—generating content that contradicts visual reality—due to an over\-reliance on linguistic priors\. We introduce Positive\-and\-Negative Decoding \(PND\), a training\-free inference framework that intervenes directly in the decoding process to enforce visual fidelity\. PND is motivated by our key finding of a critical attention deficit in VLMs, where visual features are empirically under\-weighted\. Our framework corrects this via a dual\-path contrast: The positive path amplifies salient visual evidence using multi\-layer attention to encourage faithful descriptions, directly counteracting the attention deficit\. Simultaneously, the negative path identifies and degrades the core object’s features to create a strong counterfactual, which penalizes ungrounded, prior\-dominant generation\. By contrasting the model’s outputs from these two perspectives at each step, PND steers generation towards text that is not just linguistically probable, but visually factual\. Extensive experiments on benchmarks like POPE, MME, and CHAIR show that PND achieves state\-of\-the\-art performance with up to 6\.5% accuracy improvement, substantially reducing object hallucination while also enhancing descriptive detail—all without requiring any model retraining\. The method generalizes effectively across diverse VLM architectures including LLaVA, InstructBLIP, InternVL, and Qwen\-VL\. Project Page: https://github\.com/JiangYubo4399/PND\.
$\dagger$ indicates the corresponding authors\.

## 1 Introduction
Large\-scale vision\-language models \(VLMs\) have achieved remarkable success in multimodal tasks \[[45](https://arxiv.org/html/2605.06679#bib.bib45),[43](https://arxiv.org/html/2605.06679#bib.bib43),[7](https://arxiv.org/html/2605.06679#bib.bib7)\]\. However, a critical failure mode persists: these models frequently hallucinate, generating plausible but factually incorrect content that contradicts the visual input \[[14](https://arxiv.org/html/2605.06679#bib.bib14),[20](https://arxiv.org/html/2605.06679#bib.bib20),[11](https://arxiv.org/html/2605.06679#bib.bib11)\]\. We argue this failure is, in essence, a Bayesian reasoning imbalance \[[38](https://arxiv.org/html/2605.06679#bib.bib38),[4](https://arxiv.org/html/2605.06679#bib.bib4)\]\. From a Bayesian perspective, a VLM’s generative process is determined by two competing forces: a language prior, referring to the model’s learned co\-occurrence biases from pre\-training that encode how words and visual concepts are statistically aligned; and a visual likelihood, referring to the image\-grounded evidence that directly constrains what objects and attributes are actually present\. Hallucination occurs when this balance fails and the generation becomes “prior\-dominant” \[[33](https://arxiv.org/html/2605.06679#bib.bib33),[22](https://arxiv.org/html/2605.06679#bib.bib22)\]\. This imbalance manifests in two primary ways:
- Object Confabulation \(Positive Hallucination\)\. Positive hallucination refers to the model *inventing* objects that are not present, typically when a dominant language prior overrules visual evidence\. Existing perturbation\-only approaches, such as VCD\-style negative\-pair construction, attempt to suppress this behavior but often degrade the image too heavily, removing crucial semantics needed for reliable grounding\.
- Object Omission \(Negative Hallucination\)\. In contrast, omission occurs when a real object receives insufficient grounding\. As shown in [Fig\. 1](https://arxiv.org/html/2605.06679#S1.F1)\(b\), the frisbee is clearly visible, yet the model answers “No” to the question “Is there a frisbee in the image?”\. Perturbation\-only contrastive methods further suppress the already weak frisbee region and, because they operate in a single destructive path, VCD\-style methods cannot recover from such loss of evidence, again causing the model to deny its presence\.
Overcoming this phenomenon requires a mechanism that can dynamically and reliably intervene during decoding\. Such an approach must continually steer the model toward visually grounded predictions and prevent it from reverting to descriptions dominated by an incorrect language prior \[[40](https://arxiv.org/html/2605.06679#bib.bib40),[46](https://arxiv.org/html/2605.06679#bib.bib46)\]\. As illustrated in [Fig\. 1](https://arxiv.org/html/2605.06679#S1.F1)\(c\), our method effectively resolves this imbalance\. To truly break the illusion of understanding, we introduce Positive\-and\-Negative Decoding \(PND\), a training\-free, plug\-and\-play framework that performs real\-time Bayesian belief adjustment during inference\. PND achieves this by injecting a carefully designed dual\-path contrast mechanism \[[6](https://arxiv.org/html/2605.06679#bib.bib6)\] that dynamically re\-balances the prior and likelihood\.
Figure 1: PND suppresses object hallucination via dual\-path contrast\. \(a\) A standard VLM fails to identify the frisbee due to weak linguistic priors\. \(b\) Existing negative\-only methods are insufficient\. \(c\) Our PND’s dual pathways overcome this: the positive path reinforces the object’s presence, while the negative path creates a counterfactual, leading to correct identification\.

1. The Positive Pathway \(Amplifying the Likelihood\): This pathway uses multi\-layer cross\-modal attention to gather relevant visual evidence and amplify it\. By strengthening high\-salience visual features, it boosts the visual likelihood and steers the model toward grounded, faithful descriptions\.
2. The Negative Pathway \(Isolating the Prior\): This pathway builds a targeted counterfactual by identifying the core evidence regions and selectively degrading them\. By removing only the minimal visual cues the model still relies on, it induces an “evidence\-blind” state while preserving useful prior information, leading the model to rely more heavily on its language prior\. The resulting output reveals the model’s underlying, prior\-driven hallucination tendencies\.
The “meeting” of these pathways is orchestrated by our decoding objective\. By contrasting the model’s outputs from the enhanced\-likelihood \(Positive\) path against the prior\-dominant \(Negative\) path, PND applies symmetrical pressure\. It steers the generation trajectory towards object\-level truth \(high visual likelihood\) and away from misleading contextual beliefs \(dominant language prior\), thus resolving the Bayesian imbalance that leads to hallucinations \[[8](https://arxiv.org/html/2605.06679#bib.bib8),[46](https://arxiv.org/html/2605.06679#bib.bib46)\]\. Our contributions can be summarized as follows:
- We present PND, a training\-free, plug\-and\-play decoding framework that uses a dual\-path contrastive mechanism to suppress hallucinations during inference\.
- Our method uniquely leverages multi\-layer attention to dynamically generate positive \(likelihood\-amplifying\) and negative \(prior\-isolating\) guidance, implementing a robust Bayesian belief adjustment at inference time\.
- Extensive experiments across multiple VLMs and benchmark datasets demonstrate that PND achieves state\-of\-the\-art performance in suppressing object hallucinations, significantly outperforming existing methods\.
## 2 Related Work
### 2\.1 Vision\-Language Models \(VLMs\)
The current paradigm of Large Multimodal Models \(LMMs\) enhances powerful LLMs with visual understanding\. This is typically achieved by connecting a pre\-trained vision encoder \[[9](https://arxiv.org/html/2605.06679#bib.bib9)\] to the LLM via a lightweight adapter \[[24](https://arxiv.org/html/2605.06679#bib.bib24),[25](https://arxiv.org/html/2605.06679#bib.bib25)\]\. This architecture, particularly when refined with large\-scale visual instruction tuning \[[24](https://arxiv.org/html/2605.06679#bib.bib24)\], has proven highly effective\. It has produced a wave of influential open\-source models, such as LLaVA\-1\.5 \[[25](https://arxiv.org/html/2605.06679#bib.bib25)\], Qwen\-VL \[[2](https://arxiv.org/html/2605.06679#bib.bib2)\], and Deepseek\-VL \[[27](https://arxiv.org/html/2605.06679#bib.bib27)\], as well as cutting\-edge proprietary systems like OpenAI’s GPT models and Google’s Gemini \[[36](https://arxiv.org/html/2605.06679#bib.bib36)\]\. These models demonstrate unprecedented conversational skills and performance \[[13](https://arxiv.org/html/2605.06679#bib.bib13)\]\.
While this design enables powerful fluency, it also causes the models to inherit the vast parametric knowledge of their underlying LLM\. This reliance substantially contributes to the Bayesian imbalance we address: their strong language priors can easily override factual visual evidence, leading to the pervasive problem of hallucination \[[26](https://arxiv.org/html/2605.06679#bib.bib26),[11](https://arxiv.org/html/2605.06679#bib.bib11),[17](https://arxiv.org/html/2605.06679#bib.bib17),[46](https://arxiv.org/html/2605.06679#bib.bib46)\]\.
### 2\.2 Hallucination in Vision\-Language Models
Hallucination in VLMs denotes generated content inconsistent with visual input \[[44](https://arxiv.org/html/2605.06679#bib.bib44)\]\. Object hallucination, comprising false positives \(describing non\-existent objects\) and false negatives \(omitting present ones\), is the most studied and practically significant form \[[11](https://arxiv.org/html/2605.06679#bib.bib11),[3](https://arxiv.org/html/2605.06679#bib.bib3)\]\. These errors are closely linked to strong language priors \[[29](https://arxiv.org/html/2605.06679#bib.bib29)\] and attention misalignment, where models rely on contextual cues instead of object\-level evidence \[[34](https://arxiv.org/html/2605.06679#bib.bib34)\]\. Existing mitigation strategies fall into two categories\. Training\-based methods modify parameters through RLHF \[[28](https://arxiv.org/html/2605.06679#bib.bib28),[15](https://arxiv.org/html/2605.06679#bib.bib15)\], curated datasets \[[3](https://arxiv.org/html/2605.06679#bib.bib3)\], or architectural changes \[[1](https://arxiv.org/html/2605.06679#bib.bib1)\]\. Although effective, they are computationally costly and often degrade other multimodal abilities\. Inference\-time methods provide a practical alternative\. Recent approaches such as Visual Contrastive Decoding \(VCD\) \[[17](https://arxiv.org/html/2605.06679#bib.bib17)\], AGLA \[[1](https://arxiv.org/html/2605.06679#bib.bib1)\], and VAF \[[41](https://arxiv.org/html/2605.06679#bib.bib41)\] contrast predictions under perturbed visual inputs to detect prior\-dominant tokens\. From a Bayesian perspective, these techniques perform single\-path perturbation control: they weaken visual likelihood and down\-weight tokens that remain unchanged\. Despite their effectiveness, this mechanism is inherently one\-sided; it neither amplifies evidence\-bearing regions nor separates the influence of the language prior\.
Figure 2: Overview of the PND framework for belief\-adjusted decoding\. Given an input image, we first extract multi\-layer cross\-modal attention maps to estimate query\-aligned visual evidence\. These maps guide the construction of two perturbed visual representations: a *positive* view $\mathbf{V}_{\mathrm{pos}}$ that amplifies evidence, and a *negative* view $\mathbf{V}_{\mathrm{neg}}$ that suppresses it\. Passing the original image and each view through the VLM yields three sets of logits \(original, positive, and negative\), whose contrast reveals whether a token is driven by visual likelihood or by the language prior\. The final next\-token probability is obtained by a belief\-adjusted combination of these logits, enabling the model to recover visually grounded predictions and reduce hallucination\.

Our approach offers a complementary, structured perspective\. Rather than relying on a single perturbed view, PND introduces a dual\-path formulation that probes both belief sources\. The negative path builds a controlled counterfactual that removes multi\-layer consensus evidence to approximate the prior, while the positive path reinforces salient regions via attention\-guided enhancement\. This design treats cross\-modal attention maps \(CAMs\) as a differentiable, architecture\-agnostic proxy for separating likelihood\-dominant and prior\-dominant regions, enabling principled, model\-agnostic belief\-adjusted decoding\. The symmetric formulation provides a clearer approximation of Bayesian decomposition than perturbation\-only methods\.
## 3 Method
This section introduces PND, our inference\-time framework guided by Bayesian belief adjustment\. Rather than assuming a fully parametric Bayesian model, we use this viewpoint to describe an empirically observed imbalance: modern VLMs rely heavily on linguistic self\-consistency while progressively under\-utilizing visual evidence in deeper decoding layers \(see [Fig\. 3](https://arxiv.org/html/2605.06679#S3.F3)\)\. Under this imbalance, PND aims to dynamically re\-weight the language prior and visual likelihood during token generation\.
To achieve this, PND contrasts the model’s behavior under two visual representations\. A positive representation $\mathbf{V}_{\mathrm{pos}}$ amplifies salient visual evidence, while a negative representation $\mathbf{V}_{\mathrm{neg}}$ attenuates or removes such evidence to isolate the model’s language prior\. The premise is simple: tokens dominated by the prior remain insensitive to visual perturbations, whereas likelihood\-driven tokens exhibit strong shifts\. As illustrated in [Fig\. 2](https://arxiv.org/html/2605.06679#S2.F2), our framework consists of two components: attention\-derived salience maps for constructing $\mathbf{V}_{\mathrm{pos}}$ and $\mathbf{V}_{\mathrm{neg}}$, and a belief\-adjusted decoding objective that integrates logits from all three paths into a single next\-token distribution\.
### 3\.1 Disentangling Evidence and Context via Attention
We revisit multimodal decoding through a conceptual Bayesian lens, where the next\-token distribution is jointly influenced by linguistic expectations and dynamically evolving image\-derived evidence:
$$p(y \mid x_v, x_t) \,\propto\, \underbrace{p(y \mid x_t)}_{\text{language prior}} \cdot \underbrace{p(x_v \mid y)}_{\text{visual likelihood}}. \tag{1}$$

While this factorization usefully interprets hallucination, explicitly decomposing a VLM’s hidden features into these components is intractable\. Instead, we seek a practical proxy aligning with observable model behavior \[[19](https://arxiv.org/html/2605.06679#bib.bib19),[4](https://arxiv.org/html/2605.06679#bib.bib4)\]\.
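As a small illustration of how the factorization in Eq\. \(1\) can become prior\-dominant, the sketch below normalizes prior times likelihood over two candidate answers\. All numbers are hypothetical and chosen only for illustration; `posterior` is an illustrative helper, not part of PND\.

```python
def posterior(prior, likelihood):
    """Normalize prior[y] * likelihood[y] over the candidate answers y."""
    unnorm = {y: prior[y] * likelihood[y] for y in prior}
    z = sum(unnorm.values())
    return {y: v / z for y, v in unnorm.items()}

# Hypothetical numbers for the question "Is there a frisbee in the image?"
prior = {"yes": 0.2, "no": 0.8}        # p(y | x_t): the language prior leans to "no"
likelihood = {"yes": 0.9, "no": 0.3}   # p(x_v | y): the image favours "yes"

post = posterior(prior, likelihood)
answer = max(post, key=post.get)       # the prior still wins despite the visual evidence
```

With these toy values the visual likelihood favours "yes" three to one, yet the normalized product still selects "no", which mirrors the omission case described in the introduction\.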
##### Cross\-modal attention as an empirical proxy\.
We interpret the visual embedding $\mathbf{V}$ as comprising an evidence\-bearing part $\mathbf{V}_{\mathrm{evidence}}$ \(object features supporting the likelihood\) and a contextual part $\mathbf{V}_{\mathrm{context}}$ \(semantics reinforcing language priors\)\. Hallucinations emerge when models overweight $\mathbf{V}_{\mathrm{context}}$ and underweight $\mathbf{V}_{\mathrm{evidence}}$ during decoding\. Even under the early\-aggregation hypothesis \(visual evidence is integrated early and accessed indirectly later\), deeper layers increasingly favor language priors over direct visual evidence\. Thus, the observed attention decline empirically indicates reduced direct visual grounding rather than definitive information loss\.
To approximate this decomposition, we extract cross\-modal attention maps from an external vision–language model \(BLIP\-ITM \[[18](https://arxiv.org/html/2605.06679#bib.bib18)\]\)\. These maps quantify relevance between textual queries $\mathbf{Q}_{\text{text}}$ and visual patches $\mathbf{K}_{\text{vis}}$, estimating where visual evidence resides\. Given textual queries and visual keys in layer $i$, the attention map $\mathbf{A}_i$ is
$$\mathbf{A}_i = \mathrm{softmax}\!\left(\mathbf{Q}_{\text{text}} \left(\mathbf{K}_{\text{vis}}^{(i)}\right)^{\!\top} / \sqrt{d_k}\right). \tag{2}$$
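The scaled dot\-product attention in Eq\. \(2\) can be sketched in a few lines\. This is a dependency\-free illustration using plain Python lists; the function names and the toy query and key values are assumptions, and a real pipeline would read these tensors from the attention layers of the external model\.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_attention(q_text, k_vis):
    """Return A = softmax(Q K^T / sqrt(d_k)), one row per text token.

    q_text: list of text-query vectors, each of length d_k
    k_vis:  list of visual-key vectors (one per patch), each of length d_k
    """
    d_k = len(k_vis[0])
    scale = math.sqrt(d_k)
    attn = []
    for q in q_text:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in k_vis]
        attn.append(softmax(scores))
    return attn

# Toy example (hypothetical values): 2 text tokens, 3 visual patches, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
A = cross_modal_attention(Q, K)
```

Each row of `A` is a distribution over visual patches for one text token, so the first token attends most to the first patch and the second token to the second patch\.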
##### Layerwise distinction between likelihood and prior\.
CAMs across layers exhibit systematic behavior: early layers emphasize fine\-grained object regions, whereas deeper layers shift toward global semantics\. Empirically \([Fig\. 3](https://arxiv.org/html/2605.06679#S3.F3)\), visual patches receive substantially less attention in deeper layers, where attention is dominated by user instructions and system prompts\. This trend aligns with our Bayesian interpretation: visual likelihood fades with depth, while language priors accumulate\. Motivated by this, we fuse multi\-layer CAMs $\{\mathbf{A}_1, \dots, \mathbf{A}_L\}$ into salience maps that highlight evidence\-bearing regions while suppressing context\-driven ones\. These fused maps provide a differentiable, architecture\-agnostic proxy for separating likelihood\-dominant and prior\-dominant regions, forming the foundation of the positive and negative visual pathways\. Supplementary Material IV details individual layer contributions\.
Figure 3: Empirical evidence for Bayesian imbalance in multimodal decoding\. We plot the layer\-wise allocation of cross\-modal attention in a representative VLM\. Early layers attend to visual evidence, but deeper layers shift attention almost entirely toward textual context, reflecting the accumulation of a strong language prior\. This progressive decline in visual contribution indicates that $p(x_v \mid y)$ is underweighted relative to $p(y \mid x_t)$ during token generation, providing a direct motivation for our *Bayesian belief adjustment* design in PND\.
### 3\.2 Positive and Negative Visual Augmentation
With cross\-modal attention maps providing a practical proxy for disentangling *evidence\-bearing* and *contextual* regions \([Sec\. 3\.1](https://arxiv.org/html/2605.06679#S3.SS1)\), we now describe how these signals are operationalized to construct the two visual pathways used for belief adjustment\. As illustrated in [Fig\. 2](https://arxiv.org/html/2605.06679#S2.F2), PND generates two complementary visual representations: a positive view $\mathbf{V}_{\mathrm{pos}}$ that *explicitly amplifies the visual likelihood*, and a negative view $\mathbf{V}_{\mathrm{neg}}$ that *strategically suppresses evidence to expose the language prior*\.
The goal of these augmentations is not to alter the semantics of the image, but to selectively modulate the strength of the model’s visual cues\. When the model processes $\mathbf{V}_{\mathrm{pos}}$, regions identified as high\-evidence receive proportionally greater emphasis, increasing the model’s sensitivity to grounded signals\. Conversely, $\mathbf{V}_{\mathrm{neg}}$ reduces or removes these same regions, via attention\-guided degradation, to approximate a counterfactual setting in which visual evidence is attenuated\. Tokens that genuinely rely on $p(x_v \mid y)$ exhibit strong shifts across the two views, whereas hallucinated \(prior\-dominant\) tokens remain comparatively stable\.
These augmented representations form the foundation for our dual\-path decoding strategy, enabling PND to contrast likelihood\-sensitive and prior\-sensitive behavior at inference time\. Details of the specific construction procedures for $\mathbf{V}_{\mathrm{pos}}$ and $\mathbf{V}_{\mathrm{neg}}$ follow in the next sections\.
#### 3\.2\.1 Positive Enhancement: Amplifying the visual likelihood
The positive pathway is designed to counteract the empirical Bayesian imbalance revealed in our attention analysis \([Fig\. 3](https://arxiv.org/html/2605.06679#S3.F3)\)\. Across layers, cross\-modal attention assigns only a small portion of its budget to visual patches \(13\.7% in early layers, decreasing to 6\.2% and 4\.9% in middle and late layers\), while textual context \(such as User Instructions and System Prompts\) dominates\. This attenuation of visually grounded cues indicates that the visual likelihood $p(x_v \mid y)$ is systematically underweighted, motivating an intervention that selectively strengthens evidence\-bearing regions\. We compute a token\-level evidence weight from multi\-layer cross\-modal attention, following the detailed formulation provided in Supplementary Material VII\. To implement this correction, we construct a salience map by aggregating multi\-layer cross\-modal attention, following attention\-fusion strategies explored in prior work \[[35](https://arxiv.org/html/2605.06679#bib.bib35),[21](https://arxiv.org/html/2605.06679#bib.bib21)\]:
$$\mathbf{M}_{\mathrm{fused}} = \frac{1}{L}\sum_{i=1}^{L}\hat{\mathbf{A}}_i, \tag{3}$$

where $\hat{\mathbf{A}}_i$ is the normalized attention map at layer $i$\. This fused map highlights the regions the model implicitly associates with the query and thus approximates the evidence component in our Bayesian interpretation\. We then amplify these regions using a multiplicative modulation inspired by feature\-boosting mechanisms in prior work \[[30](https://arxiv.org/html/2605.06679#bib.bib30)\]:

$$\mathbf{V}_{\mathrm{pos}} = \mathbf{V}_{\mathrm{orig}} \odot \left(1 + \lambda \cdot \mathbf{M}_{\mathrm{fused}}\right), \tag{4}$$

where $\lambda$ controls the strength of amplification and $\odot$ denotes element\-wise scaling\. This operation does not alter the semantics of the image; instead, it gradually increases the relative prominence of evidence\-bearing features, thereby encouraging the model to more faithfully reflect the visual likelihood during decoding\.
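A minimal sketch of Eqs\. \(3\) and \(4\) follows\. It assumes that each layer's attention map is a flat list with one score per visual patch, that "normalized" means per\-layer min\-max scaling \(the text does not specify the normalization\), and that the visual representation is a list of per\-patch feature vectors\. The function names and numbers are illustrative only\.

```python
def min_max_normalize(a):
    """Scale one layer's attention scores to [0, 1] (assumed normalization)."""
    lo, hi = min(a), max(a)
    if hi == lo:
        return [0.0 for _ in a]
    return [(x - lo) / (hi - lo) for x in a]

def fuse_attention(layer_maps):
    """Eq. (3): average the normalized attention maps over the L layers."""
    normed = [min_max_normalize(a) for a in layer_maps]
    n_layers = len(normed)
    return [sum(col) / n_layers for col in zip(*normed)]

def positive_view(v_orig, m_fused, lam):
    """Eq. (4): scale each patch's features by (1 + lambda * salience)."""
    return [[x * (1.0 + lam * m) for x in patch]
            for patch, m in zip(v_orig, m_fused)]

# Toy example (hypothetical values): 2 layers, 3 patches, 2 feature dims.
layer_maps = [[0.1, 0.6, 0.3], [0.2, 0.5, 0.3]]
v_orig = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
m_fused = fuse_attention(layer_maps)
v_pos = positive_view(v_orig, m_fused, lam=0.5)
```

In this toy case the second patch has the highest fused salience, so its features are scaled by 1\.5, while the least salient patch is left unchanged\.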
#### 3\.2\.2 Negative Degradation: Isolating the language prior
The negative pathway constructs a counterfactual visual input that *carefully isolates the language prior* without destroying its useful parts\. As shown in [Fig\. 3](https://arxiv.org/html/2605.06679#S3.F3), modern VLMs already allocate very little deep\-layer attention to visual patches \(e\.g\., 4\.9%\), making global perturbations wasteful and prone to erasing helpful priors\. Instead, we remove only the *minimal evidence* identified by multi\-layer CAM consensus, preserving most visual information while still strongly inducing hallucination\. This subtle intervention forces decoding to rely almost entirely on $p(y \mid x_t)$ and reveals the model’s prior\-driven bias\.
##### Identifying minimal evidence via attention consensus\.
Following insights from evidence localization \[[33](https://arxiv.org/html/2605.06679#bib.bib33)\], we compute a *consensus map* that highlights visual regions consistently attended to across layers\. Specifically, we take the pixel\-wise minimum over normalized attention maps:

$$\mathbf{M}_{\mathrm{consensus}} = \min(\hat{\mathbf{A}}_1, \ldots, \hat{\mathbf{A}}_L). \tag{5}$$

This soft intersection yields a conservative estimate of the evidence\-bearing regions that contribute to the visual likelihood\. A binary mask is then produced via thresholding:

$$\mathbf{M}_{\mathrm{mask}} = \mathbb{I}[\mathbf{M}_{\mathrm{consensus}} \geq \tau]. \tag{6}$$
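Eqs\. \(5\) and \(6\) amount to an element\-wise minimum followed by a threshold\. The sketch below assumes the normalized maps are flat lists with one value per patch; `consensus_mask` and the example values, including the threshold, are hypothetical\.

```python
def consensus_mask(normed_maps, tau):
    """Return (consensus, mask) for a list of normalized per-layer maps.

    consensus[j] is the minimum score of patch j across layers (Eq. 5);
    mask[j] is 1 where that minimum reaches the threshold tau (Eq. 6).
    """
    consensus = [min(col) for col in zip(*normed_maps)]
    mask = [1 if c >= tau else 0 for c in consensus]
    return consensus, mask

# Toy example (hypothetical values): 3 layers, 3 patches.
normed_maps = [
    [0.90, 0.20, 0.70],
    [0.80, 0.60, 0.10],
    [0.95, 0.30, 0.50],
]
consensus, mask = consensus_mask(normed_maps, tau=0.5)
```

Only the first patch is attended to in every layer, so it is the only one selected\. Taking the minimum rather than the mean is what makes the estimate conservative: a patch that a single layer ignores is excluded\.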
##### Semantic degradation through DDPM forward process\.
To meaningfully degrade these regions without introducing artifacts, we employ the forward noising process of a DDPM \[[12](https://arxiv.org/html/2605.06679#bib.bib12)\], which produces corrupted features that remain structurally coherent\. Unlike Gaussian noise, DDPM corruption yields distributionally plausible and semantically aligned patterns, avoiding the out\-of\-distribution artifacts that models tend to ignore\. For noise level $T$, the corrupted representation is sampled as:

$$\mathbf{V}_{\mathrm{noise}} = \sqrt{\bar{\alpha}_T}\,\mathbf{V}_{\mathrm{orig}} + \sqrt{1-\bar{\alpha}_T}\,\boldsymbol{\epsilon},\quad \boldsymbol{\epsilon} \sim \mathcal{N}(0, \mathbf{I}), \tag{7}$$

where $\bar{\alpha}_T$ controls the resulting signal\-to\-noise ratio\.
##### Constructing the negative counterfactual\.
The final negative input is assembled by carefully replacing only the masked evidence\-bearing regions with the corresponding DDPM\-corrupted features:

$$\mathbf{V}_{\mathrm{neg}} = \mathbf{V}_{\mathrm{orig}} \odot (1 - \mathbf{M}_{\mathrm{mask}}) + \mathbf{V}_{\mathrm{noise}} \odot \mathbf{M}_{\mathrm{mask}}. \tag{8}$$

This targeted degradation removes the model’s remaining access to visual likelihood while preserving overall image statistics, yielding a prior\-isolating counterfactual\. Such counterfactual inputs have been shown to expose hallucination tendencies in VLMs \[[5](https://arxiv.org/html/2605.06679#bib.bib5),[10](https://arxiv.org/html/2605.06679#bib.bib10)\], and here they form the negative component of our dual\-path decoding framework\.
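The noising step of Eq\. \(7\) and the masked replacement of Eq\. \(8\) can be sketched together\. This is an illustration under assumptions: the visual representation is taken to be a list of per\-patch feature vectors, the mask has one binary entry per patch, and the value of $\bar{\alpha}_T$ is arbitrary\. The helper names are hypothetical\.

```python
import math
import random

def ddpm_forward(v_orig, alpha_bar, rng):
    """Eq. (7): sqrt(alpha_bar) * V_orig + sqrt(1 - alpha_bar) * eps, eps ~ N(0, I)."""
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [[signal * x + noise * rng.gauss(0.0, 1.0) for x in patch]
            for patch in v_orig]

def negative_view(v_orig, v_noise, mask):
    """Eq. (8): use the noised features where mask == 1, the originals elsewhere."""
    return [
        [(1 - m) * x + m * n for x, n in zip(p_orig, p_noise)]
        for p_orig, p_noise, m in zip(v_orig, v_noise, mask)
    ]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
v_orig = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mask = [1, 0, 0]        # hypothetical mask: only the first patch carries evidence
v_noise = ddpm_forward(v_orig, alpha_bar=0.3, rng=rng)
v_neg = negative_view(v_orig, v_noise, mask)
```

Only the masked patch changes; the other patches pass through untouched, which is what keeps the perturbation local to the evidence\-bearing region\.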
### 3\.3 PND: Decoding as Bayesian Belief Adjustment
With $\mathbf{V}_{\mathrm{orig}}$, $\mathbf{V}_{\mathrm{pos}}$, and $\mathbf{V}_{\mathrm{neg}}$ prepared, we now describe the decoding step of PND in detail\. This stage performs a lightweight, inference\-only Bayesian belief adjustment, dynamically re\-balancing the relative influence of the language prior and visual likelihood at every token\.
We run three parallel forward passes to obtain logits $\mathbf{l}_{\mathrm{orig}}$, $\mathbf{l}_{\mathrm{pos}}$ \(likelihood\-amplified\), and $\mathbf{l}_{\mathrm{neg}}$ \(prior\-isolated\)\. PND combines these signals through a contrastive update:
$$\mathbf{l}_{\mathrm{PND}} = \mathbf{l}_{\mathrm{orig}} + \alpha\,\mathbf{l}_{\mathrm{pos}} - \gamma\,\mathbf{l}_{\mathrm{neg}}, \tag{9}$$

where $\alpha, \gamma \geq 0$ are balancing coefficients\. The positive term boosts visually grounded candidates, while the negative term suppresses prior\-driven tokens that persist without visual support, consistent with findings in prior hallucination\-control work \[[35](https://arxiv.org/html/2605.06679#bib.bib35),[10](https://arxiv.org/html/2605.06679#bib.bib10)\]\. To avoid introducing implausible candidates, we retain only tokens that are credible under the model’s original distribution:

$$\mathbf{l}_{\mathrm{final}} = \mathbf{l}_{\mathrm{PND}} \odot \mathbb{I}\!\left[\mathbf{l}_{\mathrm{orig}} \geq \log(\beta) + \max(\mathbf{l}_{\mathrm{orig}})\right], \tag{10}$$

where $\beta$ is a confidence threshold\. The final probability distribution is obtained by applying softmax to $\mathbf{l}_{\mathrm{final}}$\.
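One decoding step following Eqs\. \(9\) and \(10\) might look like the sketch below\. Two points are assumptions rather than details from the text: rejected tokens are assigned a logit of negative infinity \(a common convention in contrastive decoding, whereas Eq\. \(10\) is written as a product with the indicator\), and the logits and coefficient values are toy numbers\. The function name is hypothetical\.

```python
import math

def pnd_next_token_probs(l_orig, l_pos, l_neg, alpha, gamma, beta):
    """Combine three logit vectors and return next-token probabilities."""
    # Eq. (9): boost the positive path, subtract the negative path.
    l_pnd = [o + alpha * p - gamma * n for o, p, n in zip(l_orig, l_pos, l_neg)]
    # Eq. (10): keep tokens whose original logit is within log(beta) of the max.
    cutoff = math.log(beta) + max(l_orig)
    # Assumption: rejected tokens get -inf so that softmax gives them zero mass.
    masked = [l if o >= cutoff else float("-inf") for l, o in zip(l_pnd, l_orig)]
    m = max(masked)  # finite for beta <= 1: the original argmax always passes
    exps = [math.exp(l - m) for l in masked]
    z = sum(exps)
    return [e / z for e in exps]

# Toy vocabulary ["frisbee", "dog", "car"] with hypothetical logits.
l_orig = [2.0, 2.2, -3.0]   # original image: "dog" narrowly leads
l_pos = [3.0, 1.5, -3.0]    # amplified evidence favours "frisbee"
l_neg = [0.5, 2.5, -3.0]    # evidence removed: "dog" persists (prior-driven)
probs = pnd_next_token_probs(l_orig, l_pos, l_neg, alpha=1.0, gamma=1.0, beta=0.1)
```

In this toy case the original distribution prefers "dog", but the token that survives without visual evidence is penalized and "frisbee" becomes the most probable output, while the implausible "car" is removed by the threshold\.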
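Table 1: Performance comparison on the POPE benchmark, evaluated on LLaVA1\.5\-7B, InstructBLIP\-7B, and other models\. Our PND framework shows state\-of\-the\-art performance, with superior Accuracy and F1\-scores across most settings compared to the baseline and competing methods \(VCD \[[17](https://arxiv.org/html/2605.06679#bib.bib17)\], VAF \[[41](https://arxiv.org/html/2605.06679#bib.bib41)\], AGLA \[[1](https://arxiv.org/html/2605.06679#bib.bib1)\]\)\.

| Model | Category | Method | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- | --- |
| LLaVA1.5-7B \[[25](https://arxiv.org/html/2605.06679#bib.bib25)\] | adversarial | regular | 78.53 | 82.77 | 72.06 | 77.04 |
| LLaVA1.5-7B | adversarial | VCD | 81.13 | 84.52 | 76.72 | 80.43 |
| LLaVA1.5-7B | adversarial | VAF | 80.88 | 86.84 | 76.58 | 81.39 |
| LLaVA1.5-7B | adversarial | AGLA | 83.13 | 86.97 | 77.95 | 82.21 |
| LLaVA1.5-7B | adversarial | PND (ours) | 84.03 | 90.55 | 77.43 | 83.48 |
| LLaVA1.5-7B | popular | regular | 81.56 | 94.14 | 77.25 | 84.87 |
| LLaVA1.5-7B | popular | VCD | 84.89 | 93.86 | 77.92 | 85.16 |
| LLaVA1.5-7B | popular | VAF | 84.50 | 95.76 | 78.85 | 86.49 |
| LLaVA1.5-7B | popular | AGLA | 85.21 | 94.92 | 80.87 | 87.34 |
| LLaVA1.5-7B | popular | PND (ours) | 86.10 | 98.41 | 80.87 | 88.79 |
| LLaVA1.5-7B | random | regular | 83.00 | 96.23 | 82.99 | 89.12 |
| LLaVA1.5-7B | random | VCD | 86.81 | 96.11 | 85.38 | 90.43 |
| LLaVA1.5-7B | random | VAF | 85.68 | 95.33 | 88.65 | 91.87 |
| LLaVA1.5-7B | random | AGLA | 87.12 | 96.42 | 88.78 | 92.44 |
| LLaVA1.5-7B | random | PND (ours) | 87.33 | 98.29 | 88.99 | 93.41 |
| InstructBLIP-7B \[[7](https://arxiv.org/html/2605.06679#bib.bib7)\] | adversarial | regular | 75.66 | 74.09 | 78.93 | 76.43 |
| InstructBLIP-7B | adversarial | VCD | 78.43 | 79.04 | 79.92 | 79.48 |
| InstructBLIP-7B | adversarial | VAF | 78.84 | 78.49 | 79.47 | 78.98 |
| InstructBLIP-7B | adversarial | AGLA | 81.13 | 82.35 | 80.74 | 81.54 |
| InstructBLIP-7B | adversarial | PND (ours) | 82.20 | 82.99 | 81.00 | 81.98 |
| InstructBLIP-7B | popular | regular | 78.00 | 77.30 | 79.26 | 78.27 |
| InstructBLIP-7B | popular | VCD | 82.24 | 82.56 | 81.02 | 81.79 |
| InstructBLIP-7B | popular | VAF | 81.72 | 82.54 | 80.20 | 81.36 |
| InstructBLIP-7B | popular | AGLA | 83.21 | 85.96 | 80.47 | 83.12 |
| InstructBLIP-7B | popular | PND (ours) | 84.83 | 87.83 | 80.86 | 84.20 |
| InstructBLIP-7B | random | regular | 81.10 | 82.41 | 79.06 | 80.70 |
| InstructBLIP-7B | random | VCD | 85.12 | 88.57 | 80.20 | 84.18 |
| InstructBLIP-7B | random | VAF | 84.89 | 88.74 | 79.71 | 83.98 |
| InstructBLIP-7B | random | AGLA | 86.31 | 92.32 | 79.72 | 85.56 |
| InstructBLIP-7B | random | PND (ours) | 87.63 | 93.52 | 80.86 | 86.73 |
| Other models: LLAVA1.5-13B \[[25](https://arxiv.org/html/2605.06679#bib.bib25)\] | adversarial | regular | 79.80 | 83.60 | 74.13 | 78.58 |
| LLAVA1.5-13B | adversarial | PND (ours) | 84.96 | 91.99 | 76.60 | 83.59 |
| InstructBLIP-13B \[[7](https://arxiv.org/html/2605.06679#bib.bib7)\] | adversarial | regular | 76.03 | 76.22 | 75.66 | 75.94 |
| InstructBLIP-13B | adversarial | PND (ours) | 82.83 | 84.75 | 80.06 | 82.34 |
| QwenVL-7B \[[2](https://arxiv.org/html/2605.06679#bib.bib2)\] | adversarial | regular | 80.96 | 91.06 | 68.66 | 78.29 |
| QwenVL-7B | adversarial | PND (ours) | 82.46 | 93.55 | 69.73 | 79.90 |
| Qwen3VL-2B \[[39](https://arxiv.org/html/2605.06679#bib.bib39)\] | adversarial | regular | 86.53 | 86.29 | 86.86 | 86.57 |
| Qwen3VL-2B | adversarial | PND (ours) | 87.26 | 88.02 | 86.26 | 87.13 |
| InternVL2-2B \[[37](https://arxiv.org/html/2605.06679#bib.bib37)\] | adversarial | regular | 81.06 | 85.35 | 75.00 | 79.84 |
| InternVL2-2B | adversarial | PND (ours) | 83.33 | 88.22 | 76.93 | 82.19 |

Table 2: Performance comparison of different hallucination mitigation methods on the MME benchmark\.

| Model | Method | Existence (object-level) | Count (object-level) | Position (attribute-level) | Color (attribute-level) | Total Score |
| --- | --- | --- | --- | --- | --- | --- |
| LLaVA1.5-7B | regular | 160.00 | 121.67 | 105.00 | 145.00 | 531.67 |
| LLaVA1.5-7B | VCD | 185.00 | 125.33 | 105.00 | 150.00 | 565.33 |
| LLaVA1.5-7B | AGLA | 195.00 | 138.33 | 120.33 | 155.00 | 608.67 |
| LLaVA1.5-7B | VAF | 195.00 | 133.67 | 105.00 | 150.00 | 583.67 |
| LLaVA1.5-7B | PND (ours) | 195.00 | 143.33 | 123.33 | 160.00 | 621.67 |
| InstructBLIP-7B | regular | 170.00 | 50.00 | 53.33 | 113.33 | 386.66 |
| InstructBLIP-7B | VCD | 170.00 | 50.00 | 53.33 | 114.33 | 387.67 |
| InstructBLIP-7B | AGLA | 175.00 | 55.00 | 58.33 | 118.33 | 406.67 |
| InstructBLIP-7B | VAF | 175.00 | 51.67 | 57.33 | 113.33 | 397.33 |
| InstructBLIP-7B | PND (ours) | 180.00 | 60.00 | 63.33 | 120.00 | 423.33 |
| Other models: QwenVL-7B | regular | 150.00 | 145.00 | 113.33 | 160.00 | 568.33 |
| QwenVL-7B | PND (ours) | 170.00 | 150.00 | 123.33 | 173.33 | 616.67 |
| Qwen3VL-2B | regular | 190.00 | 155.00 | 143.33 | 150.00 | 638.33 |
| Qwen3VL-2B | PND (ours) | 190.00 | 160.00 | 158.33 | 160.00 | 668.33 |
| InternVL2-2B | regular | 168.33 | 88.33 | 103.33 | 90.00 | 450.00 |
| InternVL2-2B | PND (ours) | 185.00 | 93.33 | 141.67 | 110.00 | 530.00 |

Table 3: Performance comparison of different hallucination mitigation methods on the CHAIR benchmark\.

| Model | Method | $\mathcal{C}_s$ ↓ | $\mathcal{C}_i$ ↓ | Recall ↑ |
| --- | --- | --- | --- | --- |
| LLaVA1.5-7B | regular | 51.0 | 17.6 | 74.4 |
| LLaVA1.5-7B | VCD | 51.0 | 16.7 | 77.2 |
| LLaVA1.5-7B | AGLA | 47.0 | 14.2 | 77.8 |
| LLaVA1.5-7B | VAF | 53.0 | 16.5 | 76.9 |
| LLaVA1.5-7B | PND (ours) | 46.0 | 14.0 | 78.1 |
| InstructBLIP-7B | regular | 58.0 | 16.3 | 71.1 |
| InstructBLIP-7B | VCD | 59.0 | 14.8 | 72.0 |
| InstructBLIP-7B | AGLA | 46.0 | 12.3 | 71.5 |
| InstructBLIP-7B | VAF | 56.0 | 15.1 | 70.4 |
| InstructBLIP-7B | PND (ours) | 42.0 | 11.2 | 72.1 |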
## 4 Experiments
### 4\.1 Experiments Settings
##### Datasets
To avoid benchmark\-driven evaluation, we adopt a *structured, multi\-level* protocol that examines hallucination from complementary perspectives rather than relying on a single metric\.
- POPE \[[20](https://arxiv.org/html/2605.06679#bib.bib20)\]\. Probes *object\-level* hallucination via a Yes/No formulation\. We evaluate on MSCOCO \[[23](https://arxiv.org/html/2605.06679#bib.bib23)\], A\-OKVQA \[[32](https://arxiv.org/html/2605.06679#bib.bib32)\], and GQA \[[16](https://arxiv.org/html/2605.06679#bib.bib16)\], which cover random, popular, and adversarial sampling strategies\.
- MME \[[42](https://arxiv.org/html/2605.06679#bib.bib42)\]\. Measures *perceptual and attribute\-level* capabilities across more than ten dimensions\. We use MME primarily to assess whether PND reduces hallucination *without harming* standard multimodal competence\.
- CHAIR \[[31](https://arxiv.org/html/2605.06679#bib.bib31)\]\. Evaluates hallucination in *open\-ended captioning* by comparing object nouns in generated captions to ground\-truth sets\. We report CHAIR$_s$ and CHAIR$_i$\.
- GCCCE \(ours\)\. Targets *high\-level semantic and commonsense* consistency\. Using GPT\-4\.1 as a judge, GCCCE scores responses on Relevancy, Accuracy, Common Sense, and Fine\-grained Precision\.
Together, these four benchmarks form a concise evaluation suite covering object grounding, perceptual attributes, open\-ended generation, and semantic coherence\. They enable multi\-perspective hallucination assessment across perception and reasoning\.
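As a concrete illustration of the Yes/No scoring used by POPE-style evaluation, the sketch below computes Accuracy, Precision, Recall, and F1 with "yes" as the positive class. The function names `is_yes` and `pope_metrics`, and the rule that any answer beginning with "yes" counts as a Yes, are assumptions made for this example and are not taken from the paper or the official POPE code.

```python
from typing import Dict, Sequence


def is_yes(answer: str) -> bool:
    # Hypothetical parsing rule: an answer starting with "yes" counts as Yes,
    # everything else counts as No.
    return answer.strip().lower().startswith("yes")


def pope_metrics(preds: Sequence[str], labels: Sequence[str]) -> Dict[str, float]:
    """Return accuracy, precision, recall and F1 (in percent), "yes" = positive."""
    if len(preds) != len(labels):
        raise ValueError("preds and labels must have the same length")

    tp = fp = tn = fn = 0
    for pred, label in zip(preds, labels):
        p, y = is_yes(pred), is_yes(label)
        if p and y:
            tp += 1          # object present, model says yes
        elif p and not y:
            fp += 1          # object absent, model says yes (hallucination)
        elif not p and y:
            fn += 1          # object present, model says no
        else:
            tn += 1          # object absent, model says no

    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": 100 * accuracy,
        "precision": 100 * precision,
        "recall": 100 * recall,
        "f1": 100 * f1,
    }


if __name__ == "__main__":
    answers = ["Yes, there is a dog.", "yes", "No", "no", "Yes"]
    truth = ["yes", "no", "no", "yes", "yes"]
    print(pope_metrics(answers, truth))
```

Under this convention a false positive corresponds to a hallucinated object, so Precision is the metric most directly tied to hallucination, while Recall guards against a model that simply answers "no" more often.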
##### MLLM Backbones
To demonstrate PND's generality, we evaluate it on four widely used open-source MLLMs: LLaVA, InstructBLIP, InternVL, and Qwen-VL. These models span vision–language fusion designs from projection-based alignment to query-driven transformer architectures. Consistent gains across heterogeneous systems suggest PND acts as a model-agnostic decoding enhancement rather than an architecture-specific trick. PND trades a modest decoding-time overhead, mainly from BLIP-based attention extraction, for improved visual grounding. This overhead is far lighter than training-based mitigation, and PND can be enabled selectively when higher reliability is needed. Detailed efficiency analysis appears in Supplementary Material VI.
### 4.2 Experimental Results
##### Results on POPE
Our primary evaluation on the POPE benchmark ([Tab. 1](https://arxiv.org/html/2605.06679#S3.T1)) confirms PND's efficacy in suppressing object hallucination. Our framework yields substantial gains across LLaVA-1.5, InstructBLIP, and Qwen-VL, achieving an average improvement of 6.4% in Accuracy and 5.5% in F1-score over the greedy decoding baseline. The gains are most pronounced on the challenging popular and adversarial subsets. This supports our Bayesian hypothesis: these subsets are specifically designed to trigger hallucinations via strong but misleading linguistic priors. PND's success here indicates that it counteracts the over-weighted language prior and forces the model to rely on the visual likelihood. Detailed robustness and scalability analyses, including 7B/13B variants, are in Supplementary Material II.
##### Results on MME
To assess PND's impact on broader multimodal capabilities, we evaluated its performance on the comprehensive MME benchmark ([Tab. 2](https://arxiv.org/html/2605.06679#S3.T2)). Our framework achieves the best total score on every backbone and matches or exceeds all compared methods on each evaluated perception sub-task. Substantial improvements are observed not only in object-level perception (Existence, Count) but also in fine-grained attribute grounding (Position, Color). This suggests that PND functions as a holistic enhancement for visual fidelity, improving a wide spectrum of perception skills rather than being a narrow fix with potential side effects. Complete MME results are in Supplementary Material III.

Figure 4: Performance comparison of different hallucination mitigation methods on the GCCCE benchmark.

Figure 5: Qualitative comparison for "describe the scene captured in the image". The baseline model (\[regular\]) produces a description with significant factual errors (inventing an 'ottoman'), which are highlighted in red. In contrast, our PND-enhanced model successfully suppresses these errors and generates a visually faithful and accurate description.
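For readers unfamiliar with how MME sub-task scores on a 0 to 200 scale arise, the following sketch assumes the commonly described MME protocol, in which each image carries two Yes/No questions and a sub-task score is the sum of question-level accuracy and image-level "accuracy+" (both questions correct). The function name `mme_subtask_score` and the input layout are hypothetical, and the protocol details are stated here as an assumption rather than taken from this paper.

```python
from typing import List, Tuple


def mme_subtask_score(results: List[Tuple[bool, bool]]) -> float:
    """Score one sub-task from per-image (question_1_correct, question_2_correct).

    Assumed protocol: score = accuracy + accuracy+, each in percent,
    so the maximum per sub-task is 200.
    """
    if not results:
        return 0.0
    n_images = len(results)
    # Accuracy over individual questions (two per image).
    accuracy = 100.0 * sum(int(a) + int(b) for a, b in results) / (2 * n_images)
    # Accuracy+ over images: both questions must be answered correctly.
    accuracy_plus = 100.0 * sum(1 for a, b in results if a and b) / n_images
    return accuracy + accuracy_plus


def total_score(subtask_scores: List[float]) -> float:
    # The reported total is the sum of the sub-task scores.
    return round(sum(subtask_scores), 2)


if __name__ == "__main__":
    demo = [(True, True), (True, False), (False, False), (True, True)]
    print(mme_subtask_score(demo))
    # Sub-task scores of the LLaVA1.5-7B "regular" row in Tab. 2.
    print(total_score([160.00, 121.67, 105.00, 145.00]))
```

Summing the four sub-task scores of the LLaVA1.5-7B regular row reproduces its reported total of 531.67; a few other rows differ from their sums by 0.01, presumably because of rounding in the individual entries.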
##### Results on CHAIR
To validate PND beyond discriminative, polling-based tasks, we evaluated its performance on open-ended caption generation using the CHAIR benchmark ([Tab. 3](https://arxiv.org/html/2605.06679#S3.T3)). Our method reduces object hallucination on both tested backbones, as shown by lower CHAIR$_s$ and CHAIR$_i$ scores (for example, CHAIR$_s$ drops from 58.0 to 42.0 on InstructBLIP-7B). This result is consistent with the intended role of our negative pathway in penalizing object tokens that lack sufficient visual grounding.
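The two CHAIR scores can be illustrated with a short sketch. It follows the standard definitions: CHAIR$_i$ is the fraction of mentioned objects that are not in the ground truth, and CHAIR$_s$ is the fraction of captions containing at least one such object. The sketch assumes object mentions have already been extracted from each caption and mapped to the benchmark's category vocabulary, which is the harder part in practice; the function name `chair_scores` and the toy data are hypothetical.

```python
from typing import Dict, List, Set


def chair_scores(mentioned: List[List[str]],
                 ground_truth: List[Set[str]]) -> Dict[str, float]:
    """Compute CHAIR_s and CHAIR_i (in percent) from pre-extracted object mentions."""
    if len(mentioned) != len(ground_truth):
        raise ValueError("one ground-truth set is required per caption")

    total_mentions = 0
    hallucinated_mentions = 0
    hallucinated_captions = 0
    for objects, truth in zip(mentioned, ground_truth):
        # An object is hallucinated if it is mentioned but absent from the image.
        bad = [obj for obj in objects if obj not in truth]
        total_mentions += len(objects)
        hallucinated_mentions += len(bad)
        if bad:
            hallucinated_captions += 1

    n_captions = len(mentioned)
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = hallucinated_captions / n_captions if n_captions else 0.0
    return {"chair_s": 100 * chair_s, "chair_i": 100 * chair_i}


if __name__ == "__main__":
    captions = [["dog", "frisbee"], ["cat", "ottoman", "sofa"], ["person"]]
    truth = [{"dog", "frisbee", "grass"}, {"cat", "sofa"}, {"person", "bike"}]
    print(chair_scores(captions, truth))
```

In the toy data only the second caption mentions an object missing from its ground-truth set ("ottoman"), so one of three captions and one of six mentions count as hallucinated.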
##### Results on GCCCE
To assess higher-level cognitive effects, we apply GCCCE to InstructBLIP, LLaVA-1.5, and Qwen-VL. As shown in the radar plot ([Fig. 4](https://arxiv.org/html/2605.06679#S4.F4)), PND consistently improves performance across all dimensions. Gains in accuracy and fine-grained precision indicate reduced hallucinations (see [Fig. 5](https://arxiv.org/html/2605.06679#S4.F5)), with further improvements in relevancy and commonsense plausibility. These results suggest stronger visual grounding not only suppresses errors but also yields more coherent, on-topic responses. Full GCCCE results appear in Supplementary Material V.
##### Ablation Study
To dissect our dual pathways, we conducted an ablation study on the POPE benchmark ([Tab. 4](https://arxiv.org/html/2605.06679#S4.T4)). We compared our PND (Full) framework against the Baseline and two variants: P-only (amplifying the visual likelihood) and N-only (penalizing the language prior). The results show that both single-pathway variants outperform the baseline, validating their individual effectiveness. More importantly, the full PND model surpasses both, indicating that the two pathways are complementary rather than redundant. This synergy supports our hypothesis: the positive pathway acts as an enhancer, enriching visually grounded details, while the negative pathway acts as a suppressor, constraining ungrounded prior-driven hallucinations. Their interplay, which simultaneously encourages evidence-backed detail and discourages unsupported content, supports dual-pathway belief adjustment as important for achieving both fidelity and descriptive richness.
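The ablation variants can be pictured with a generic contrastive-decoding sketch. This is not the paper's update rule, which is defined in the method section and may differ. It only assumes that next-token logits are available from three forward passes (the original input, a positive branch with amplified visual evidence, and a negative branch with the core object degraded) and combines them linearly. The weights `w_pos` and `w_neg`, the flags `use_pos` and `use_neg`, and the toy logits are all hypothetical and are not the paper's $\alpha$, $\gamma$, or $\beta$.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    # Numerically stable softmax over a small vocabulary.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrast_logits(base: List[float], pos: List[float], neg: List[float],
                    w_pos: float = 1.0, w_neg: float = 1.0,
                    use_pos: bool = True, use_neg: bool = True) -> List[float]:
    """Generic dual-path contrast (illustrative only, not the paper's formula)."""
    adjusted = []
    for b, p, n in zip(base, pos, neg):
        score = b
        if use_pos:
            # Reward tokens that gain support when visual evidence is amplified.
            score += w_pos * (p - b)
        if use_neg:
            # Penalize tokens that stay likely even when the object is degraded,
            # i.e. tokens driven by the language prior.
            score += w_neg * (b - n)
        adjusted.append(score)
    return adjusted


def argmax(values: List[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


if __name__ == "__main__":
    # Toy vocabulary: token 0 is prior-driven, token 1 is visually grounded.
    base = [2.0, 1.0, 0.0]
    pos = [1.5, 2.0, 0.0]
    neg = [2.5, 0.0, 0.0]
    print("baseline:", argmax(base))
    print("P-only:  ", argmax(contrast_logits(base, pos, neg, use_neg=False)))
    print("N-only:  ", argmax(contrast_logits(base, pos, neg, use_pos=False)))
    print("full:    ", argmax(contrast_logits(base, pos, neg)))
    print("full probabilities:", softmax(contrast_logits(base, pos, neg)))
```

Setting `use_pos` or `use_neg` to `False` mirrors the N-only and P-only variants in spirit. In this toy case either pathway alone already flips the greedy choice, and the full contrast widens the margin between the grounded and the prior-driven token.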
##### Hyperparameter Analysis
Supplementary Material IV reports a full hyperparameter study. We examine decoding strategies (temperature, top-p, top-k) and key belief-adjustment parameters $\alpha$, $\gamma$, and $\beta$. Results show deterministic or near-deterministic decoding yields strongest visual grounding. Performance is most sensitive to the $\alpha$–$\gamma$ trade-off controlling Bayesian adjustment, while $\beta$ has milder impact. All main experiments use fixed hyperparameters.
Table 4: Ablation Study on Positive and Negative Components

| Original / Positive / Negative | Accuracy | F1 |
| --- | --- | --- |
| ✓ | 78.53 | 77.04 |
| ✓ ✓ | 83.14 | 82.21 |
| ✓ ✓ | 82.23 | 80.67 |
| ✓ ✓ | 83.60 | 82.35 |
| ✓ ✓ ✓ | 84.03 | 83.48 |
## 5Conclusion
In this work, we addressed object hallucination in MLLMs by identifying it as a Bayesian reasoning imbalance. We introduced Positive-and-Negative Decoding (PND), a training-free framework that performs real-time Bayesian belief adjustment during generation. PND's core mechanism establishes a dynamic contrast: a positive pathway amplifies the visual likelihood by reinforcing evidence-bearing regions, while a negative pathway isolates the language prior through a controlled, evidence-blind counterfactual. This symmetric design effectively suppresses prior-dominant hallucinations and steers generation toward visually grounded outputs. Extensive experiments on POPE, MME, and CHAIR demonstrate strong hallucination reduction, while GCCCE results show that PND improves not only grounding but also factual accuracy and commonsense plausibility. Importantly, although PND introduces modest inference overhead due to attention extraction, the gains in visual fidelity and reliability substantially outweigh this cost. Overall, this work demonstrates that dual-path Bayesian adjustment is a principled and practical strategy for improving the robustness of modern MLLMs.
## 6 Acknowledgement
This work is sponsored by Beijing Nova Program\.
## References
- An et al. [2025] Wenbin An, Feng Tian, Sicong Leng, Jiahao Nie, Haonan Lin, QianYing Wang, Ping Chen, Xiaoqin Zhang, and Shijian Lu. Mitigating object hallucinations in large vision-language models with assembly of global and local attention. In *Proceedings of the Computer Vision and Pattern Recognition Conference*, pages 29915–29926, 2025.
- Bai et al. [2023] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023.
- Bai et al. [2024] Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. Hallucination of multimodal large language models: A survey. *arXiv e-prints*, pages arXiv–2404, 2024.
- Cai [2008] T Tony Cai. Comment: Microarrays, empirical bayes and the two-group model. *Statist. Sci.*, 23(1):29–33, 2008.
- Chen et al. [2024] Xuweiyi Chen, Ziqiao Ma, Xuejun Zhang, Sihan Xu, Shengyi Qian, Jianing Yang, David Fouhey, and Joyce Chai. Multi-object hallucination in vision language models. *Advances in Neural Information Processing Systems*, 37:44393–44418, 2024.
- Chopin et al. [2015] Nicolas Chopin, Sébastien Gadat, Benjamin Guedj, Arnaud Guyader, and Elodie Vernet. On some recent advances on high dimensional bayesian statistics. *ESAIM: Proceedings and Surveys*, 51:293–319, 2015.
- Dai et al. [2023] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. *Advances in neural information processing systems*, 36:49250–49267, 2023.
- Dathathri et al. [2020] Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. Plug and play language models: A simple approach to controlled text generation. In *International Conference on Learning Representations*, 2020.
- Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, G Heigold, S Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In *International Conference on Learning Representations*, 2020.
- Guan et al. [2024] Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pages 14375–14385, 2024.
- Gunjal et al. [2024] Anisha Gunjal, Jihan Yin, and Erhan Bas. Detecting and preventing hallucinations in large vision language models. In *Proceedings of the AAAI Conference on Artificial Intelligence*, pages 18135–18143, 2024.
- Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. *Advances in neural information processing systems*, 33:6840–6851, 2020.
- Hu et al. [2024] Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, and Zhuowen Tu. Bliva: A simple multimodal llm for better handling of text-rich visual questions. In *Proceedings of the AAAI Conference on Artificial Intelligence*, pages 2256–2264, 2024.
- Huang et al. [2025] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. *ACM Transactions on Information Systems*, 43(2):1–55, 2025.
- Huang et al. [2024] Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pages 13418–13427, 2024.
- Hudson and Manning [2019] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In *Proceedings of the IEEE/CVF conference on computer vision and pattern recognition*, pages 6700–6709, 2019.
- Leng et al. [2023] Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. *arXiv preprint arXiv:2311.16922*, 2023.
- Li et al. [2022] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In *International conference on machine learning*, pages 12888–12900. PMLR, 2022.
- [19] Rui Li, Marcus Klasson, Arno Solin, and Martin Trapp. Streamlining prediction in bayesian deep learning. In *The Thirteenth International Conference on Learning Representations*.
- Li et al. [2023a] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 292–305, 2023a.
- Li et al. [2023b] Zhenwei Li, Mengying Xu, Xiaoli Yang, Yanqi Han, and Jiawen Wang. A multi-label detection deep learning model with attention-guided image enhancement for retinal images. *Micromachines*, 14(3):705, 2023b.
- Liang et al. [2022] Victor Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Y Zou. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. *Advances in Neural Information Processing Systems*, 35:17612–17625, 2022.
- Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In *European conference on computer vision*, pages 740–755. Springer, 2014.
- Liu et al. [2023] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. *Advances in neural information processing systems*, 36:34892–34916, 2023.
- Liu et al. [2024] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In *Proceedings of the IEEE/CVF conference on computer vision and pattern recognition*, pages 26296–26306, 2024.
- Liu et al. [2025] Sheng Liu, Haotian Ye, and James Zou. Reducing hallucinations in large vision-language models via latent space steering. In *The Thirteenth International Conference on Learning Representations*, 2025.
- Lu et al. [2024] Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision-language understanding. *arXiv preprint arXiv:2403.05525*, 2024.
- Ouyang et al. [2022] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. *Advances in neural information processing systems*, 35:27730–27744, 2022.
- Peng et al. [2024] Jiaren Peng, Wenzhong Yang, Fuyuan Wei, Liang He, Long Yao, and Hongzhen Lv. Event co-occurrences for prompt-based generative event argument extraction. *Scientific Reports*, 14(1):31377, 2024.
- Poudel and Lee [2021] Sahadev Poudel and Sang-Woong Lee. Deep multi-scale attentional features for medical image segmentation. *Applied Soft Computing*, 109:107445, 2021.
- Rohrbach et al. [2018] Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. In *Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing*, pages 4035–4045, 2018.
- Schwenk et al. [2022] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In *European conference on computer vision*, pages 146–162. Springer, 2022.
- Shi et al. [2023] Peiyang Shi, Michael C Welle, Mårten Björkman, and Danica Kragic. Towards understanding the modality gap in clip. In *ICLR 2023 workshop on multimodal representation learning: perks and pitfalls*, 2023.
- Shu et al. [2025] Dong Shu, Haiyan Zhao, Jingyu Hu, Weiru Liu, Ali Payani, Lu Cheng, and Mengnan Du. Large vision-language model alignment and misalignment: A survey through the lens of explainability. *arXiv preprint arXiv:2501.01346*, 2025.
- Tang et al. [2024] Shuyuan Tang, Yiqing Zhou, Jintao Li, Chang Liu, and Jinglin Shi. Attention-guided sample-based feature enhancement network for crowded pedestrian detection using vision sensors. *Sensors*, 24(19):6350, 2024.
- Team et al. [2023] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. *arXiv preprint arXiv:2312.11805*, 2023.
- Wang et al. [2024] Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, et al. Enhancing the reasoning ability of multimodal large language models via mixed preference optimization. *arXiv preprint arXiv:2411.10442*, 2024.
- Watanabe [2024] Sumio Watanabe. Recent advances in algebraic geometry and bayesian statistics. *Information Geometry*, 7(Suppl 1):187–209, 2024.
- Yang et al. [2025a] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, and Zihan Qiu. Qwen3 technical report, 2025a.
- Yang et al. [2025b] Tianyun Yang, Ziniu Li, Juan Cao, and Chang Xu. Mitigating hallucination in large vision-language models via modular attribution and intervention. In *Adaptive Foundation Models: Evolving AI for Personalized and Efficient Learning*, 2025b.
- Yin et al. [2025] Hao Yin, Guangzong Si, and Zilei Wang. Clearsight: Visual signal enhancement for object hallucination mitigation in multimodal large language models. In *Proceedings of the Computer Vision and Pattern Recognition Conference*, pages 14625–14634, 2025.
- Yin et al. [2024] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. *National Science Review*, 11(12):nwae403, 2024.
- Zhang et al. [2024] Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey. *IEEE transactions on pattern analysis and machine intelligence*, 46(8):5625–5644, 2024.
- Zhong et al. [2024] Weihong Zhong, Xiaocheng Feng, Liang Zhao, Qiming Li, Lei Huang, Yuxuan Gu, Weitao Ma, Yuan Xu, and Bing Qin. Investigating and mitigating the multimodal hallucination snowballing in large vision-language models. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 11991–12011, 2024.
- Zhou et al. [2022] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. *International Journal of Computer Vision*, 130(9):2337–2348, 2022.
- Zhu et al. [2025] Lanyun Zhu, Deyi Ji, Tianrun Chen, Peng Xu, Jieping Ye, and Jun Liu. Ibd: Alleviating hallucinations in large vision-language models via image-biased decoding. In *Proceedings of the Computer Vision and Pattern Recognition Conference*, pages 1624–1633, 2025.