TIGER: Traceable Inference with Graph-Based Evidence Routing for Mitigating Hallucinations in Multimodal Generation

arXiv cs.AI 06/02/26, 04:00 AM Papers
Summary
TIGER is an inference-time framework that mitigates hallucinations in multimodal generation by extracting observation and claim graphs and assigning risk scores to repair unsupported facts. It reduces unsupported content across image-to-text, image+text-to-text, audio-to-text, and video-to-text tasks.
arXiv:2606.00232v1 Announce Type: new Abstract: We study fact-level repair for multimodal generation, where a fluent output may contain specific facts that are not supported by the input. Existing inference-time repair methods often generate feedback by jointly conditioning on the input and the current output. This design has two limitations: hallucinated claims in the output can bias the model's interpretation of the input, and free-form feedback cannot be ranked or scheduled at the fact level. We present TIGER, an inference-time framework that redesigns feedback for localized repair. TIGER independently extracts an observation graph from the input and a claim graph from the current output, then assigns each claim a graph-conditioned risk score based on support and conflict. The model repairs selected high-risk claims while keeping the backbone frozen. We provide a convergence analysis showing that the expected total risk decreases geometrically to an explicit asymptotic bound under mild assumptions. Experiments across four cross-modal paths, including image-to-text, image+text-to-text, audio-to-text, and video-to-text, show that TIGER reduces unsupported content while preserving task quality. The gains hold across multiple backbones, and a CrisisFACTS case study suggests that the same repair mechanism can improve grounding in multi-source settings.
# TIGER: Traceable Inference with Graph-Based Evidence Routing for Mitigating Hallucinations in Multimodal Generation
Source: [https://arxiv.org/html/2606.00232](https://arxiv.org/html/2606.00232)
Kaixiang Zhao¹, Tianrun Yu¹, Shawn Huang¹, Porter Jenkins¹, Yushun Dong², Amanda Hughes¹

¹Brigham Young University, ²Florida State University

{kzhao2, tianruny, huang717, pjenkins, amanda\_hughes}@byu.edu, yushun.dong@fsu.edu

###### Abstract

We study fact-level repair for multimodal generation, where a fluent output may contain specific facts that are not supported by the input. Existing inference-time repair methods often generate feedback by jointly conditioning on the input and the current output. This design has two limitations: hallucinated claims in the output can bias the model’s interpretation of the input, and free-form feedback cannot be ranked or scheduled at the fact level. We present TIGER (code available at [https://github.com/kzhao5/tiger](https://github.com/kzhao5/tiger)), an inference-time framework that redesigns feedback for localized repair. TIGER independently extracts an observation graph from the input and a claim graph from the current output, then assigns each claim a graph-conditioned risk score based on support and conflict. The model repairs selected high-risk claims while keeping the backbone frozen. We provide a convergence analysis showing that the expected total risk decreases geometrically to an explicit asymptotic bound under mild assumptions. Experiments across four cross-modal paths, including image→text, image+text→text, audio→text, and video→text, show that TIGER reduces unsupported content while preserving task quality. The gains hold across multiple backbones, and a CrisisFACTS case study suggests that the same repair mechanism can improve grounding in multi-source settings.


## 1 Introduction

Recent unified multimodal language models support cross-modal generation with inputs from text, images, audio, and video (Wu et al., [2024](https://arxiv.org/html/2606.00232#bib.bib24); Lu et al., [2024](https://arxiv.org/html/2606.00232#bib.bib25); Team, [2024](https://arxiv.org/html/2606.00232#bib.bib26); Zhan et al., [2024](https://arxiv.org/html/2606.00232#bib.bib27); Wu et al., [2025](https://arxiv.org/html/2606.00232#bib.bib28); Chen et al., [2025](https://arxiv.org/html/2606.00232#bib.bib29); Xu et al., [2025](https://arxiv.org/html/2606.00232#bib.bib31); Deng et al., [2025](https://arxiv.org/html/2606.00232#bib.bib33); Yang et al., [2026](https://arxiv.org/html/2606.00232#bib.bib34)). This capability is useful for applications such as content creation, multimodal decision support, disaster response, medical triage, and journalism. In this paper, we focus on factual generation with textual outputs, where the input may contain one or more modalities. A key challenge in this setting is *faithfulness hallucination*, where the generated response contains facts that are not supported by the input (Zhang et al., [2024b](https://arxiv.org/html/2606.00232#bib.bib72); van Sprang et al., [2025](https://arxiv.org/html/2606.00232#bib.bib73)). The response is often fluent, and most of its content may be correct, but a small number of specific facts can contradict the evidence. For example, an image with two trucks may be described as containing three trucks, an intact bridge may be described as collapsed, or “light rain” in speech may be summarized as a “heavy storm.” These errors are local, but they can be harmful because downstream users may trust the generated content. Section [2](https://arxiv.org/html/2606.00232#S2) shows that such errors can arise from spurious correlations during joint autoregressive generation. *Our goal is to identify and repair unsupported facts at inference time on a frozen backbone, while preserving correct content.*

![Refer to caption](https://arxiv.org/html/2606.00232v1/x1.png)

Figure 1: Overview of TIGER. TIGER first generates an initial output, extracts fact graphs from the input and output, ranks claims by risk, and locally repairs selected high-risk facts while keeping the backbone frozen.

Iterative repair with feedback is a natural way to address this problem. In this setting, the model revises its current output by conditioning on the input, the current output, and a feedback signal. Existing multimodal self-correction methods, including Volcano (Lee et al., [2024](https://arxiv.org/html/2606.00232#bib.bib89)), Woodpecker (Yin et al., [2024](https://arxiv.org/html/2606.00232#bib.bib79)), and DeGF (Zhang et al., [2025](https://arxiv.org/html/2606.00232#bib.bib50)), follow this pattern; see Appendix [A](https://arxiv.org/html/2606.00232#A1). However, these methods generate feedback by jointly conditioning on the input and the current output. This design creates a failure mode: the feedback generator sees the hallucinated claims in the current output, so these claims can influence how the model interprets the input. As a result, the feedback may endorse unsupported content instead of flagging it (Fanous et al., [2025](https://arxiv.org/html/2606.00232#bib.bib90); Sun and Wang, [2026](https://arxiv.org/html/2606.00232#bib.bib91); Li et al., [2023](https://arxiv.org/html/2606.00232#bib.bib87)). Free-form feedback also provides no explicit score for each claim, which makes it difficult to decide which facts should be repaired under a limited compute budget. These limitations motivate our central question:

*Can we redesign the feedback mechanism in iterative multimodal repair so that it reduces spurious correlation during feedback generation and supports fact\-level scheduling?*

We answer this question with TIGER (Traceable Inference with Graph-based Evidence Routing), an inference-time framework for fact-level multimodal repair. As shown in Figure [1](https://arxiv.org/html/2606.00232#S1.F1), TIGER redesigns feedback in two stages. First, it applies *atomic projection*: the input and the current output are extracted independently into structured fact graphs. This separation reduces the direct channel through which unsupported claims in the output can influence the model’s reading of the input. Second, TIGER computes a *deterministic fact-level risk* for each output claim by measuring its support and conflict against the input graph. These scores make the feedback rankable, so TIGER can select a small set of high-risk facts and repair them locally. We also provide a convergence analysis which shows that the expected total risk decreases geometrically under mild assumptions.

We evaluate TIGER on four text-output cross-modal generation paths: image→text, image+text→text, audio→text, and video→text. The main experiments use Qwen2.5-Omni (Xu et al., [2025](https://arxiv.org/html/2606.00232#bib.bib31)) as the primary frozen backbone, with additional results on LLaVA-1.5 and other open-source and proprietary backbones. Across COCO, AMBER, MMHal-Bench, Clotho, and VideoHallucer, TIGER reduces unsupported claims while preserving or improving task quality. On external hallucination metrics such as CHAIR (Rohrbach et al., [2018](https://arxiv.org/html/2606.00232#bib.bib85)), TIGER mitigates the common failure mode where repair improves fluency but introduces new unsupported objects. A case study on CrisisFACTS (Buntain et al., [2023](https://arxiv.org/html/2606.00232#bib.bib70)) shows that TIGER can improve grounded situation reporting from noisy evidence collected from multiple sources.

Our contributions are summarized as follows:

- We identify a feedback-stage failure mode in iterative multimodal repair: when feedback is generated by jointly conditioning on the input and the current output, hallucinated claims in the output can bias the model’s interpretation of the input.
- We propose TIGER, an inference-time repair framework that replaces free-form joint feedback with independent atomic projection and deterministic fact-level risk ranking. This design makes feedback explicit, rankable, and suitable for localized repair under a fixed compute budget.
- We evaluate TIGER across four text-output cross-modal paths: image→text, image+text→text, audio→text, and video→text. Experiments show that TIGER reduces hallucination while preserving task quality, generalizes across multiple backbones, and remains effective in a real-world crisis-reporting case study.

## 2 A Motivating Observation

Figure [2](https://arxiv.org/html/2606.00232#S2.F2) (top) illustrates a simple failure mode. The input image shows a person wakeboarding on water, but no boat is visible. With a free-form description prompt, Qwen2.5-Omni generates a caption stating that the tow rope is attached to a boat. With an atomic enumeration prompt and the same decoding setting, the model describes the tow rope as connected to an unseen source and does not introduce the absent boat.

Figure [2](https://arxiv.org/html/2606.00232#S2.F2) (bottom) evaluates this effect distributionally. Across four cue-to-absent object pairs, we compute the co-occurrence hallucination rate (CHR), defined as the fraction of generations that mention the absent object $b$ when the cue $a$ is present. We sort the pairs by their COCO co-occurrence frequency $P(b \mid a)$. Free-form generation has a higher CHR than atomic enumeration for every pair, and atomic enumeration reduces CHR by about $2.6\times$. Prior work reports similar hallucination driven by co-occurrence on other backbones (Datta and Sundararaman, [2025](https://arxiv.org/html/2606.00232#bib.bib88); Li et al., [2023](https://arxiv.org/html/2606.00232#bib.bib87)), which suggests that this behavior is not specific to one model. A likely mechanism is that free-form generation conditions each token on the generated text, so early claims can activate training priors and lead to unsupported content (Rohrbach et al., [2018](https://arxiv.org/html/2606.00232#bib.bib85); Zhou et al., [2024](https://arxiv.org/html/2606.00232#bib.bib86)). Atomic enumeration constrains the model to produce short factual units, which reduces this prior-driven generation path.
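The CHR definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function name, the whole-word matching rule, and the toy captions are all assumptions made here, and the numbers it produces are not results from the paper.

```python
import re


def co_occurrence_hallucination_rate(generations, absent_object):
    """Fraction of generations that mention an object absent from the input.

    Every generation is assumed to come from an input where the cue object
    is present and `absent_object` is not.
    """
    texts = list(generations)
    if not texts:
        raise ValueError("need at least one generation")
    target = absent_object.lower()
    # Assumed matching rule: exact whole-word match (plurals are not handled).
    hits = sum(target in re.findall(r"[a-z]+", t.lower()) for t in texts)
    return hits / len(texts)


# Toy strings invented for illustration (cue: wakeboard, absent object: boat).
free_form = [
    "A wakeboarder is pulled by a boat.",
    "A person rides a wakeboard behind a boat.",
    "A person wakeboarding on a lake.",
    "A man holds a tow rope on the water.",
]
atomic = [
    "person, holds, tow rope",
    "person, rides, wakeboard",
    "tow rope, attached to, unseen source",
    "person, on, water",
]

chr_free = co_occurrence_hallucination_rate(free_form, "boat")
chr_atomic = co_occurrence_hallucination_rate(atomic, "boat")
print(chr_free, chr_atomic)
```

A real evaluation would need a more careful mention detector (synonyms, plurals, negation), which this sketch deliberately leaves out.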

![Refer to caption](https://arxiv.org/html/2606.00232v1/x2.png)

Figure 2: Free-form generation vs. atomic enumeration. Top: a qualitative example where free-form generation adds an unsupported object, while atomic enumeration avoids it. Bottom: co-occurrence hallucination rate (CHR) across cue-to-absent object pairs.

This observation motivates TIGER. If spurious correlation enters through joint generation, a feedback mechanism that also relies on joint generation can inherit the same bias. The next section uses this observation to redesign the feedback loop with independent fact extraction and risk-based repair.

## 3 Methodology

### 3.1 Problem Setup

We consider a multimodal-to-text generation task where the input $\mathbf{X} \in \mathcal{M}_X$ may be image, text, audio, or video, and the output $\mathbf{Y} \in \mathcal{T}$ is textual. Given a task prompt $\mathcal{P}_{\text{gen}}$, a multimodal model $\Phi$ realizes

$$\mathbf{Y} = \Phi(\mathcal{P}_{\text{gen}}, \mathbf{X}). \tag{1}$$

As shown in Section [2](https://arxiv.org/html/2606.00232#S2), single-shot generation can inject spurious correlations into $\mathbf{Y}$ through joint autoregressive decoding. Prior work often uses an iterative repair form:

$$\mathbf{Y}_0 = \Phi(\mathcal{P}_{\text{gen}}, \mathbf{X}), \qquad \mathbf{Y}_{t+1} = \Phi(\mathcal{P}_{\text{refine}}, \mathbf{X}, \mathbf{Y}_t, \mathcal{F}_t), \tag{2}$$

where $\mathcal{F}_t$ indicates what in $\mathbf{Y}_t$ needs correction. A common critique-then-revise instantiation, such as Volcano, generates natural-language feedback by jointly conditioning on the input and the current output:

$$\mathcal{F}_t = \Phi(\mathcal{P}_{\text{fb}}, \mathbf{X}, \mathbf{Y}_t). \tag{3}$$

Other correction methods use different response-conditioned signals, such as tool-based visual validation in Woodpecker and generative visual feedback in DeGF. However, existing methods do not independently construct an observation graph from $\mathbf{X}$ and a claim graph from $\mathbf{Y}_t$, nor do they compute deterministic per-claim risk scores.

This creates two limitations. First, critique feedback in Eq. ([3](https://arxiv.org/html/2606.00232#S3.E3)) can inherit spurious correlations because the feedback generator observes $\mathbf{Y}_t$, so unsupported claims may be endorsed rather than flagged. Second, existing feedback or correction signals are not explicit fact-level risk scores, making $\arg\max$ or top-$K$ scheduling difficult to enforce.

### 3.2 Redesigning $\mathcal{F}_t$

We address these limitations by replacing Eq. ([3](https://arxiv.org/html/2606.00232#S3.E3)) with two changes. First, $\mathbf{X}$ and $\mathbf{Y}_t$ are independently projected into fact graphs:

$$G_{\mathbf{X}} = \Phi(\mathcal{P}_{\text{ext}}, \mathbf{X}), \qquad G_{\mathbf{Y}_t} = \Phi(\mathcal{P}_{\text{ext}}, \mathbf{Y}_t). \tag{4}$$

Second, feedback is replaced by a graph-conditioned selection operator:

$$\mathcal{F}_t = \Psi_\alpha(G_{\mathbf{X}}, G_{\mathbf{Y}_t}). \tag{5}$$

Input isolation in Eq. ([4](https://arxiv.org/html/2606.00232#S3.E4)) mitigates the conditioning bias, while $\alpha$-budgeted selection in $\Psi_\alpha$ enables fact-level scheduling. The refine step in Eq. ([2](https://arxiv.org/html/2606.00232#S3.E2)) remains stochastic.

Atomic extraction. The prompt $\mathcal{P}_{\text{ext}}$ asks $\Phi$ to enumerate observable facts as triples $(e_1, \text{rel}, e_2)$, where $e_1$ and $e_2$ are the head and tail entities and $\text{rel}$ is drawn from a fixed relation vocabulary (Appendix [B](https://arxiv.org/html/2606.00232#A2)). Each fact is generated within the boundary of a single triple, encouraging $\Phi$ to operate in local context and reducing the co-occurrence channel documented in Section [2](https://arxiv.org/html/2606.00232#S2). The input graph $G_{\mathbf{X}}$ and the output graph $G_{\mathbf{Y}_t}$ are extracted in complete isolation. $G_{\mathbf{X}}$ is computed once before any $\mathbf{Y}$ exists and remains fixed, so errors in $\mathbf{Y}_t$ cannot directly affect the extraction of input facts. This blocks the direct conditioning path behind the first limitation in Section 3.1. Facts that share the same entity are connected by coreference edges, so each graph carries both node-level and edge-level structure.
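As a rough picture of the data structure this paragraph describes, the sketch below stores facts as triples and links any two facts that share an entity. The `Fact` class, the `coreference_edges` helper, and the exact-string matching rule are assumptions made for illustration; the paper's own coreference handling may differ.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Fact:
    """One atomic fact (e1, rel, e2); field names are chosen for this sketch."""
    head: str
    rel: str
    tail: str

    def entities(self):
        return frozenset({self.head, self.tail})


def coreference_edges(facts):
    """Link every pair of facts that share at least one entity string.

    Assumption: entities corefer only when their strings match exactly.
    """
    edges = {i: set() for i in range(len(facts))}
    for i, j in combinations(range(len(facts)), 2):
        if facts[i].entities() & facts[j].entities():
            edges[i].add(j)
            edges[j].add(i)
    return edges


# Toy claim graph invented for illustration.
facts = [
    Fact("person", "holds", "tow rope"),
    Fact("person", "on", "wakeboard"),
    Fact("tow rope", "attached to", "boat"),
    Fact("sky", "is", "clear"),
]
edges = coreference_edges(facts)
print(edges)
```

Here the first fact is linked to the second through "person" and to the third through "tow rope", while the fact about the sky stays isolated.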

Risk-based selection $\Psi_\alpha$. For each claim $f \in G_{\mathbf{Y}_t}$, $\Psi_\alpha$ computes a support score and a conflict score. The similarity between two facts is computed as the mean of per-field clipped cosine similarities using a frozen sentence transformer (Reimers and Gurevych, [2019](https://arxiv.org/html/2606.00232#bib.bib99)). The *local support* $s_0(f) = \max_{g \in G_{\mathbf{X}}} \mathrm{sim}(f, g)$ measures the best match in the observation graph. To compensate for extraction omissions, local support is propagated along coreference edges in $G_{\mathbf{Y}_t}$ with geometric decay:

$$s(f) = \max_{f' \in \{f\} \cup \mathcal{N}_K(f)} \gamma^{\,d(f, f')} \cdot s_0(f'), \tag{6}$$

where $\mathcal{N}_K(f)$ is the $K$-hop coreference neighborhood, $d(f, f')$ is the hop distance, and $\gamma \in (0, 1)$ is a decay factor. This propagation allows facts about the same entity to reinforce each other, reducing false positives caused by extraction omissions. The *conflict score* measures the strongest contradiction between a claim and the observation graph. The function $\mathrm{conflict}(f, g)$ is defined as the product of topic consistency (how well subjects and predicates match) and conclusion divergence (how much the objects differ), which acts as a soft gate: conflict is significant only when two facts discuss the same topic but reach different conclusions (Appendix [C](https://arxiv.org/html/2606.00232#A3)). The conflict score of a claim is:

$$c(f) = \max_{g \in G_{\mathbf{X}}} \mathrm{conflict}(f, g). \tag{7}$$

The risk combines lack of support and presence of conflict:

$$r(f) = (1 - s(f)) + \lambda \cdot c(f), \qquad \lambda > 0. \tag{8}$$

Since the scoring components are deterministic, a fixed pair $(G_{\mathbf{X}}, G_{\mathbf{Y}_t})$ yields the same per-claim risk scores $r(f)$ for all $f \in G_{\mathbf{Y}_t}$. With $N = |G_{\mathbf{Y}_t}|$, $\Psi_\alpha$ returns the $\lceil \alpha N \rceil$ highest-risk facts:

$$\Psi_\alpha(G_{\mathbf{X}}, G_{\mathbf{Y}_t}) = \operatorname*{arg\,max}_{\substack{\mathcal{S} \subseteq G_{\mathbf{Y}_t} \\ |\mathcal{S}| = \lceil \alpha N \rceil}} \sum_{f \in \mathcal{S}} r(f). \tag{9}$$

Each selected fact keeps its source position, so $\Phi$ repairs the corresponding span locally. The full design and property verification of $r(f)$ are given in Appendix [C](https://arxiv.org/html/2606.00232#A3).
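The scoring pipeline in Eqs. (6), (8), and (9) can be sketched as follows. This is a minimal illustration under stated assumptions: the local support values $s_0$ and conflict values $c$ are taken as precomputed inputs (the sentence-transformer similarity and the conflict function are not reimplemented), the function names are hypothetical, ties are broken by claim index as an arbitrary choice, and all numbers are invented.

```python
import math
from collections import deque


def propagate_support(s0, edges, K, gamma):
    """Eq. (6): best gamma**d * s0(f') over claims f' within K coreference hops."""
    propagated = []
    for start in range(len(s0)):
        best = s0[start]                  # d = 0 term, the claim itself
        dist = {start: 0}
        queue = deque([start])
        while queue:                      # breadth-first search gives hop distances
            node = queue.popleft()
            if dist[node] == K:
                continue
            for nxt in edges.get(node, ()):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    best = max(best, gamma ** dist[nxt] * s0[nxt])
                    queue.append(nxt)
        propagated.append(best)
    return propagated


def risk_scores(s, c, lam):
    """Eq. (8): r(f) = (1 - s(f)) + lambda * c(f)."""
    return [(1.0 - si) + lam * ci for si, ci in zip(s, c)]


def select_top(risks, alpha):
    """Eq. (9): indices of the ceil(alpha * N) highest-risk claims."""
    k = math.ceil(alpha * len(risks))
    order = sorted(range(len(risks)), key=lambda i: (-risks[i], i))
    return order[:k]


# Invented values: claim 2 is weakly supported and also conflicts with the input.
s0 = [0.9, 0.8, 0.2, 0.1]
c = [0.0, 0.0, 0.6, 0.0]
edges = {0: {1, 2}, 1: {0}, 2: {0}, 3: set()}

s = propagate_support(s0, edges, K=1, gamma=0.5)
risks = risk_scores(s, c, lam=1.0)
selected = select_top(risks, alpha=0.5)
print(s, risks, selected)
```

In this toy case, propagation lifts the support of claim 2 because it is linked to the well-supported claim 0, yet its conflict term still keeps it at the top of the ranking. Because the sum in Eq. (9) is maximized by taking the largest individual risks, a sort followed by a slice is enough.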

### 3.3 Algorithm and Convergence

The complete iterative repair procedure is summarized in Algorithm [1](https://arxiv.org/html/2606.00232#alg1).

Algorithm 1: Iterative Fact-Level Repair

1. Input: $\mathbf{X}$; backbone $\Phi$; rounds $T$; budget $\alpha \in (0, 1]$; risk weight $\lambda$
2. Output: repaired output $\mathbf{Y}_T$
3. $G_{\mathbf{X}} \leftarrow \Phi(\mathcal{P}_{\text{ext}}, \mathbf{X})$
4. $\mathbf{Y}_0 \leftarrow \Phi(\mathcal{P}_{\text{gen}}, \mathbf{X})$
5. for $t = 0, 1, \dots, T-1$ do
6. $G_{\mathbf{Y}_t} \leftarrow \Phi(\mathcal{P}_{\text{ext}}, \mathbf{Y}_t)$
7. for each fact $f \in G_{\mathbf{Y}_t}$ do
8. $s_0(f) \leftarrow \max_{g \in G_{\mathbf{X}}} \mathrm{sim}(f, g)$
9. $s(f) \leftarrow \max_{f' \in \{f\} \cup \mathcal{N}_K(f)} \gamma^{d(f, f')} \cdot s_0(f')$
10. $c(f) \leftarrow \max_{g \in G_{\mathbf{X}}} \mathrm{conflict}(f, g)$
11. $r(f) \leftarrow (1 - s(f)) + \lambda \cdot c(f)$
12. end for
13. $\mathcal{F}_t \leftarrow$ top-$\lceil \alpha N \rceil$ facts of $G_{\mathbf{Y}_t}$ by $r(\cdot)$
14. $\mathbf{Y}_{t+1} \leftarrow \Phi(\mathcal{P}_{\text{refine}}, \mathbf{X}, \mathbf{Y}_t, \mathcal{F}_t)$
15. end for
16. return $\mathbf{Y}_T$
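The control flow of Algorithm 1 can be sketched with the backbone calls abstracted away. In the sketch below, `generate`, `extract`, `score`, and `refine` are hypothetical callables standing in for the prompted calls to $\Phi$ and for the risk computation. The toy stand-ins treat facts as plain strings and "repair" by deleting unsupported ones, which is far simpler than real span-level rewriting. The early exit on an empty claim graph is a guard added here, not part of the algorithm.

```python
import math


def iterative_repair(x, generate, extract, score, refine, rounds, alpha):
    """Control flow of Algorithm 1 with every model call passed in as a callable."""
    g_x = extract(x)   # observation graph: built once, before any output exists
    y = generate(x)    # initial output Y_0
    for _ in range(rounds):
        g_y = extract(y)                      # claim graph of the current output
        if not g_y:                           # guard added for this sketch only
            break
        risks = score(g_x, g_y)               # one risk value per claim
        k = math.ceil(alpha * len(g_y))       # repair budget for this round
        ranked = sorted(range(len(g_y)), key=lambda i: (-risks[i], i))
        flagged = [g_y[i] for i in ranked[:k]]
        y = refine(x, y, flagged)             # refine sees the raw input x
    return y


# Toy stand-ins (hypothetical): facts are plain strings joined by "; ".
def toy_extract(obj):
    return obj.split("; ") if isinstance(obj, str) else sorted(obj)


def toy_generate(x):
    return "two trucks; bridge collapsed; heavy storm"


def toy_score(g_x, g_y):
    return [0.0 if f in g_x else 1.0 for f in g_y]


def toy_refine(x, y, flagged):
    drop = {f for f in flagged if f not in x}
    return "; ".join(f for f in y.split("; ") if f not in drop)


evidence = {"two trucks", "bridge intact"}
result = iterative_repair(evidence, toy_generate, toy_extract, toy_score,
                          toy_refine, rounds=2, alpha=0.3)
print(result)
```

With this budget only one claim is repaired per round, so the two unsupported claims are removed over two rounds while the supported one is left untouched.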

Let $R^{(t)} = \sum_{f \in G_{\mathbf{Y}_t}} r(f)$ denote the measured total risk. Under explicit assumptions (bounded graph size, per-fact repair progress $\varepsilon$, bounded side effects $\beta$, bounded extraction loss $\xi$; see Appendix [D](https://arxiv.org/html/2606.00232#A4)), the following conditional bound holds.

###### Theorem 3.1 (Conditional geometric risk bound).

After $T$ rounds of repair, the expected measured risk satisfies

$$\mathbb{E}\left[R^{(T)}\right] \le (1 - \alpha\varepsilon)^T R^{(0)} + \frac{\beta + \xi}{\alpha\varepsilon}. \tag{10}$$

The full proof is in Appendix [D](https://arxiv.org/html/2606.00232#A4). The asymptotic floor $(\beta + \xi)/(\alpha\varepsilon)$ reflects the residual risk caused by side effects and extraction loss. Notably, because each refine step accesses the raw input $\mathbf{X}$ rather than only $G_{\mathbf{X}}$, the repair loop can recover facts lost during extraction, so $\mathbf{Y}_T$ can exceed the coverage of $G_{\mathbf{X}}$. We verify this empirically in Section [4.5](https://arxiv.org/html/2606.00232#S4.SS5).
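To get a feel for the right-hand side of Eq. (10), the sketch below evaluates it for a few values of $T$. The function name and every parameter value are assumptions chosen for illustration; none of them are constants reported in the paper.

```python
def risk_bound(r0, alpha, eps, beta, xi, rounds):
    """Right-hand side of Eq. (10) for a given number of repair rounds."""
    rate = alpha * eps
    if not 0.0 < rate <= 1.0:
        raise ValueError("alpha * eps must lie in (0, 1]")
    return (1.0 - rate) ** rounds * r0 + (beta + xi) / rate


# Hypothetical values: initial risk 10, contraction factor 1 - 0.5 * 0.4 = 0.8.
params = dict(r0=10.0, alpha=0.5, eps=0.4, beta=0.02, xi=0.02)
bounds = [risk_bound(rounds=t, **params) for t in (0, 1, 5, 10, 50)]
floor = (params["beta"] + params["xi"]) / (params["alpha"] * params["eps"])
print(bounds, floor)
```

The first term shrinks by the same factor every round, so the bound approaches the floor $(\beta + \xi)/(\alpha\varepsilon)$ quickly. A larger budget $\alpha$ both speeds up the decay and lowers the floor in this formula, provided $\varepsilon$, $\beta$, and $\xi$ stay unchanged.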

## 4 Experimental Evaluation

We structure our evaluation around five research questions. RQ1 asks whether TIGER reduces hallucination across cross-modal generation paths. RQ2 tests whether each core component of TIGER is necessary, including iterative repair, atomic projection, and deterministic risk ranking. RQ3 studies the sensitivity of TIGER to its three main hyperparameters: the number of repair rounds $T$, the batch budget $\alpha$, and the conflict weight $\lambda$. RQ4 examines the mechanism behind TIGER by testing whether atomic projection reduces spurious correlation during feedback generation and whether repair can recover facts that are missing from $G_{\mathbf{X}}$. Finally, RQ5 uses a case study to assess whether TIGER generalizes from benchmark inputs to real-world settings with noisy data from multiple sources.

### 4.1 Experimental Setup

Datasets. We evaluate TIGER on five benchmarks that cover four cross-modal generation paths and four modalities: (1) COCO Captions (Chen et al., [2015](https://arxiv.org/html/2606.00232#bib.bib80)) for image→text generation; (2) MMHal-Bench (Sun et al., [2024](https://arxiv.org/html/2606.00232#bib.bib52)) for image+text→text generation; (3) AMBER (Wang et al., [2023](https://arxiv.org/html/2606.00232#bib.bib93)) for image→text hallucination evaluation; (4) Clotho (Drossos et al., [2020](https://arxiv.org/html/2606.00232#bib.bib69)) for audio→text generation; and (5) VideoHallucer (Wang et al., [2024](https://arxiv.org/html/2606.00232#bib.bib94)) for video→text hallucination evaluation. We report results as mean ± std over three random seeds. Additional dataset details are provided in Appendix [E.1](https://arxiv.org/html/2606.00232#A5.SS1).

Baselines. We compare TIGER with Frozen and three groups of baselines using the same frozen backbone whenever applicable. Frozen uses direct decoding from the frozen backbone without post-processing. The resampling baselines include BoN+CLIP, BoN+VisualPRM (Wang et al., [2025](https://arxiv.org/html/2606.00232#bib.bib40)), and BoN+CycleReward (Bahng et al., [2025](https://arxiv.org/html/2606.00232#bib.bib42)), which select one output from multiple candidates. The modality-specific decoding baselines include VCD (Leng et al., [2024](https://arxiv.org/html/2606.00232#bib.bib95)) for image inputs, AAD (Hsu et al., [2025](https://arxiv.org/html/2606.00232#bib.bib96)) for audio inputs, and TCD (Zhang et al., [2024a](https://arxiv.org/html/2606.00232#bib.bib97)) for video inputs. The iterative refinement baselines include Woodpecker (Yin et al., [2024](https://arxiv.org/html/2606.00232#bib.bib79)), DeGF (Zhang et al., [2025](https://arxiv.org/html/2606.00232#bib.bib50)), and Volcano (Lee et al., [2024](https://arxiv.org/html/2606.00232#bib.bib89)). The iterative refinement methods follow the joint-conditioning feedback template in Eq. ([3](https://arxiv.org/html/2606.00232#S3.E3)), while the resampling and decoding methods do not produce fact-level feedback. Prompt templates and implementation details are provided in Appendices [B.2](https://arxiv.org/html/2606.00232#A2.SS2) and [E.2](https://arxiv.org/html/2606.00232#A5.SS2).

Metrics. To avoid self-referential evaluation, every headline metric is computed independently of TIGER’s evidence graph. Table [1](https://arxiv.org/html/2606.00232#S4.T1) lists the external metrics used for each evaluation path.

Table 1: External metrics for each evaluation path.

Implementation details. The primary backbone $\Phi$ is Qwen2.5-Omni-7B (Xu et al., [2025](https://arxiv.org/html/2606.00232#bib.bib31)). The frozen model serves as both the generator of $\mathbf{Y}_t$ and the extractor that produces $G_{\mathbf{X}}$ and $G_{\mathbf{Y}_t}$ through Eq. ([4](https://arxiv.org/html/2606.00232#S3.E4)). The extraction prompt $\mathcal{P}_{\text{ext}}$ asks $\Phi$ to enumerate facts as $(e_1, \text{rel}, e_2)$ triples. It is applied once to $\mathbf{X}$ before the repair loop and then to each $\mathbf{Y}_t$ during repair. Per-dataset hyperparameter settings, including repair rounds $T$, batch budget $\alpha$, and conflict weight $\lambda$, are listed in Appendix [E.3](https://arxiv.org/html/2606.00232#A5.SS3). We report results with LLaVA-1.5-7B (Liu et al., [2024a](https://arxiv.org/html/2606.00232#bib.bib82)) to evaluate cross-backbone generalization. Additional results with GPT-5.5, Gemini 3.5 Flash, and Claude Haiku 4.5 are reported in Appendix [F.1](https://arxiv.org/html/2606.00232#A6.SS1). Appendix [F.2](https://arxiv.org/html/2606.00232#A6.SS2) compares TIGER against deterministic modality-specific extractors. Per-method computational cost is summarized in Appendix [F.3](https://arxiv.org/html/2606.00232#A6.SS3).

### 4.2 Main Results (RQ1)

To answer RQ1, we evaluate whether TIGER reduces hallucination across image→text, image+text→text, audio→text, and video→text generation paths. Table [2](https://arxiv.org/html/2606.00232#S4.T2) reports results on COCO, AMBER, and MMHal-Bench with Qwen2.5-Omni-7B and LLaVA-1.5-7B. Table [3](https://arxiv.org/html/2606.00232#S4.T3) reports audio→text results on Clotho, and Table [4](https://arxiv.org/html/2606.00232#S4.T4) reports video→text results on VideoHallucer.

As shown in Table [2](https://arxiv.org/html/2606.00232#S4.T2), TIGER reduces hallucination on image-based tasks while preserving semantic quality. On Qwen2.5-Omni-7B, it reduces COCO CHAIR$_s$ from 0.070 to 0.050 and CHAIR$_i$ from 0.069 to 0.035, while improving BERTScore from 0.588 to 0.643. On LLaVA-1.5-7B, it lowers COCO CHAIR$_s$ from 0.060 to 0.030 and achieves the highest BERTScore. The same trend appears on AMBER and MMHal-Bench, where TIGER improves faithfulness without degrading text quality. These results suggest that the repair strategy is not tied to a single backbone or benchmark.

Tables [3](https://arxiv.org/html/2606.00232#S4.T3) and [4](https://arxiv.org/html/2606.00232#S4.T4) show that the gains extend beyond image inputs. On Clotho, TIGER improves captioning quality and audio-text alignment, and it reduces AEHR from 0.803 to 0.757, which indicates fewer unsupported audio event mentions. On VideoHallucer, TIGER lowers HallucRate from 0.015 to 0.010 and improves Paired accuracy from 0.170 to 0.198. Overall, the three tables show that TIGER consistently reduces unsupported claims across visual, acoustic, and temporal inputs while maintaining task performance.

Table 2: Main results across two paths and two backbones. The image→text path uses COCO and AMBER; the image+text→text path uses MMHal-Bench. Bold: best per column within each backbone block. Blue: TIGER.

Table 3: Audio→text on Clotho with Qwen2.5-Omni-7B. CLAP is reference-free caption-audio similarity; AEHR is the audio analogue of CHAIR$_i$. Bold: best per column. Blue: TIGER.

Table 4: Video→text on VideoHallucer with Qwen2.5-Omni-7B. HallucRate is the analogue of CHAIR$_g$; Paired requires both halves of a pair correct. Bold: best per column. Blue: TIGER.

### 4.3 Ablation: Are All Three Components Necessary? (RQ2)

To answer RQ2, we conduct a component ablation that tests iterative repair, atomic projection, and deterministic risk ranking under the same refinement step $\mathbf{Y}_{t+1} = \Phi(\mathcal{P}_{\text{refine}}, \mathbf{X}, \mathbf{Y}_t, \mathcal{F}_t)$. We define four levels that differ only in how the feedback signal $\mathcal{F}_t$ is produced. L0 (Frozen) uses direct decoding from the backbone without repair, where $\mathbf{Y} = \Phi(\mathcal{P}_{\text{gen}}, \mathbf{X})$. L1 (Naive feedback) adds iterative repair with joint-conditioning feedback $\mathcal{F}_t = \Phi(\mathcal{P}_{\text{fb}}, \mathbf{X}, \mathbf{Y}_t)$ as in Eq. ([3](https://arxiv.org/html/2606.00232#S3.E3)). L2 (Text feedback) adds atomic projection by extracting $G_{\mathbf{X}}$ and $G_{\mathbf{Y}_t}$ independently through Eq. ([4](https://arxiv.org/html/2606.00232#S3.E4)), but it still uses text feedback without deterministic risk ranking. L3 (TIGER) adds deterministic risk ranking and sets $\mathcal{F}_t = \Psi_\alpha(G_{\mathbf{X}}, G_{\mathbf{Y}_t})$ as in Eq. ([5](https://arxiv.org/html/2606.00232#S3.E5)). Thus, L0→L1 tests iterative repair alone, L1→L2 isolates atomic projection, and L2→L3 isolates deterministic risk ranking.

Figure [3](https://arxiv.org/html/2606.00232#S4.F3) reports the ablation results on COCO. Iterative repair alone does not improve the output: L1 increases CHAIR$_s$ from 0.070 to 0.080 and decreases BERTScore from 0.588 to 0.528. This result supports our analysis that a repair loop can amplify hallucinations when its feedback is produced by joint conditioning on the input and output. Adding atomic projection in L2 reduces this bias and restores performance, raising BERTScore to 0.603 while bringing CHAIR$_s$ and CHAIR$_i$ back to the Frozen level. Adding deterministic risk ranking in L3 gives the best result on all three metrics, with CHAIR$_s = 0.050$, CHAIR$_i = 0.035$, and BERTScore $= 0.643$. These results show that the three components play different roles: iterative repair provides the correction mechanism, atomic projection improves the reliability of feedback, and deterministic risk ranking directs the repair budget to the facts most likely to be unsupported.

![Refer to caption](https://arxiv.org/html/2606.00232v1/x3.png) Figure 3: Component ablation on COCO.
### 4.4 Hyperparameter Sensitivity (RQ3)

To answer RQ3, we conduct a one-at-a-time sensitivity analysis on COCO for the three main hyperparameters of TIGER: the number of repair rounds $T$, the repair budget $\alpha$, and the conflict weight $\lambda$. For each sweep, we vary one hyperparameter and keep the other two fixed at the default setting. Figure [4](https://arxiv.org/html/2606.00232#S4.F4) reports CHAIR$_s$, which measures object hallucination at the sentence level and is independent of the internal risk score used by TIGER.

The results show that TIGER is stable near the default setting. Increasing $T$ reduces CHAIR$_s$ substantially at first, but the curve flattens after a small number of repair rounds. This suggests that most correctable high-risk facts are handled early in the repair process. The repair budget $\alpha$ shows a trade-off between coverage and stability. A small budget may leave some risky facts unrepaired, while a large budget can revise too many facts in one round. The conflict weight $\lambda$ also requires balance. When $\lambda=0$, the selector ignores direct contradictions; when $\lambda$ is too large, conflict signals can dominate the ranking. Overall, the default setting lies in a low-CHAIR$_s$ region across all three sweeps, which indicates that the method is not sensitive to small changes in its main hyperparameters.
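To make the roles of $\alpha$ and $\lambda$ concrete, the following is a minimal sketch of a deterministic top-budget selector. It is not the paper's Eq. (5), which is not reproduced in this section: the additive risk form `(1 - support) + lam * conflict`, the interpretation of `alpha` as a fraction of claims, and the `Claim` fields are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    claim_id: str
    support: float   # assumed in [0, 1]; higher = better supported by G_X
    conflict: float  # assumed in [0, 1]; higher = stronger contradiction with G_X


def risk(claim: Claim, lam: float) -> float:
    # Assumed illustrative form; the paper's actual risk score may differ.
    return (1.0 - claim.support) + lam * claim.conflict


def select_high_risk(claims, alpha: float, lam: float):
    """Return ids of the ceil(alpha * n) riskiest claims; ties are broken by id."""
    if not claims:
        return []
    k = min(len(claims), math.ceil(alpha * len(claims)))
    ranked = sorted(claims, key=lambda c: (-risk(c, lam), c.claim_id))
    return [c.claim_id for c in ranked[:k]]


# Hypothetical claims, not taken from the paper's data.
claims = [
    Claim("a_dog_on_sofa", support=0.9, conflict=0.0),
    Claim("b_sofa_is_red", support=0.6, conflict=0.9),
    Claim("c_cat_by_window", support=0.3, conflict=0.0),
]

no_conflict_term = select_high_risk(claims, alpha=0.3, lam=0.0)
with_conflict_term = select_high_risk(claims, alpha=0.3, lam=1.0)
print(no_conflict_term, with_conflict_term)
```

Under these assumptions, a budget of one claim goes to the weakly supported claim when `lam=0.0`, and to the directly contradicted claim when `lam=1.0`, which mirrors the behavior described above for $\lambda=0$.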

![Refer to caption](https://arxiv.org/html/2606.00232v1/x4.png) Figure 4: Hyperparameter sensitivity of TIGER on COCO. Each curve varies one hyperparameter while keeping the other two fixed. The vertical axis reports CHAIR$_s$; lower is better.
### 4.5 Mechanism: Projection and Refinement (RQ4)

To answer RQ4, we test two mechanisms that distinguish TIGER from joint-feedback baselines: whether atomic projection reduces spurious correlation in feedback, and whether the refinement step can recover correct facts that are missing from the extracted input graph $G_{\mathbf{X}}$. These probes focus on the feedback stage because this is where unsupported claims can either be removed or reinforced.

We first test whether atomic projection reduces the spurious correlation that joint feedback can inherit from the backbone. We select 1000 images from COCO val2014 where each image contains a scene cue $a$ (e.g., beach or kitchen) and does not contain an object $b$ that often co-occurs with $a$ in COCO captions. We compare three feedback channels under the same $T=5$ repair loop. L1 uses naive joint feedback, $\mathcal{F}_{t}=\Phi(\mathcal{P}_{\text{fb}},\mathbf{X},\mathbf{Y}_{t})$, as in Eq. ([3](https://arxiv.org/html/2606.00232#S3.E3)). L2 uses independently extracted graphs as text in the feedback prompt, but it does not perform deterministic risk ranking. L3 uses the atomic feedback of TIGER, $\Psi_{\alpha}(G_{\mathbf{X}},G_{\mathbf{Y}_{t}})$. We report the feedback mention rate, defined as the fraction of samples whose feedback mentions the absent object $b$. Figure [5](https://arxiv.org/html/2606.00232#S4.F5)(a) shows that L1 mentions $b$ in 5.0% of samples and L2 in 4.0%, while L3 mentions $b$ in only 2.0%. This result indicates that joint feedback can reintroduce co-occurrence priors, whereas atomic feedback reduces this channel.
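The feedback mention rate defined above can be computed with a few lines of code. The sketch below assumes a simple matching rule (case-insensitive, whole-word, optional plural "s"); the paper does not state its matching criterion in this section, and the sample feedback strings are hypothetical. Note that this rule also counts negated mentions such as "no surfboard is visible".

```python
import re


def mentions(feedback: str, obj: str) -> bool:
    # Assumed matching rule: whole word, case-insensitive, optional plural "s".
    pattern = r"\b" + re.escape(obj) + r"s?\b"
    return re.search(pattern, feedback, flags=re.IGNORECASE) is not None


def feedback_mention_rate(samples) -> float:
    """samples: iterable of (feedback_text, absent_object) pairs."""
    samples = list(samples)
    if not samples:
        return 0.0
    hits = sum(mentions(fb, obj) for fb, obj in samples)
    return hits / len(samples)


# Hypothetical feedback texts paired with the absent object b for each image.
samples = [
    ("The caption should mention the surfboard near the water.", "surfboard"),
    ("Remove the claim about a red umbrella.", "surfboard"),
    ("No oven is visible; drop that claim.", "microwave"),
    ("The kitchen has two microwaves on the counter.", "microwave"),
]

rate = feedback_mention_rate(samples)
print(f"feedback mention rate: {rate:.1%}")
```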

We next test whether refinement can recover correct content that the extractor misses. We label each fact in the repaired output as correct if it is supported by a reference graph $G_{\mathbf{GT}}$ extracted from the five captions of the image, and as in-extraction if it is supported by $G_{\mathbf{X}}$. Figure [5](https://arxiv.org/html/2606.00232#S4.F5)(b) reports the resulting fact composition. Compared with Frozen, TIGER keeps more correct facts, recovers additional correct facts beyond $G_{\mathbf{X}}$, and removes a large portion of wrong facts. This shows that refinement does not simply copy the extracted graph. Instead, it can still use the raw input $\mathbf{X}$ to recover facts that the extractor did not capture. Figure [5](https://arxiv.org/html/2606.00232#S4.F5)(c) gives a sample-level view of the same effect. The similarity from $G_{\mathbf{X}}$ to $G_{\mathbf{GT}}$ is concentrated around lower values, while the similarity from $G_{\mathbf{Y}_{T}}$ to $G_{\mathbf{GT}}$ shifts toward higher values. This shift shows that the repaired output covers more reference-supported content than the extracted input graph alone. Overall, these probes explain why TIGER improves over the L1 and L2 ablations: atomic projection reduces biased feedback, and input-grounded refinement preserves the ability to recover missing correct facts.
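The labeling scheme above can be sketched as a small counting routine. This is an illustration under simplifying assumptions: "supported by" is reduced to exact triple membership (the paper's support test may be softer, for example similarity based), the three-way partition and its bucket names are our own, and the triples are hypothetical.

```python
from collections import Counter

Triple = tuple[str, str, str]


def fact_composition(g_y: set[Triple], g_x: set[Triple], g_gt: set[Triple]) -> Counter:
    """Partition output facts into correct (inside or beyond G_X) and wrong."""
    counts = Counter()
    for fact in g_y:
        if fact in g_gt:
            # Correct facts are split by whether the extractor also captured them.
            if fact in g_x:
                counts["correct_in_extraction"] += 1
            else:
                counts["correct_beyond_extraction"] += 1
        else:
            counts["wrong"] += 1
    return counts


# Hypothetical graphs for one image.
g_gt = {("dog", "on", "sofa"), ("sofa", "is", "red"), ("lamp", "next to", "sofa")}
g_x = {("dog", "on", "sofa"), ("sofa", "is", "red")}
g_y = {("dog", "on", "sofa"), ("lamp", "next to", "sofa"), ("cat", "on", "floor")}

composition = fact_composition(g_y, g_x, g_gt)
print(dict(composition))
```

In this toy case the fact `("lamp", "next to", "sofa")` lands in the "correct beyond extraction" bucket: it is absent from `g_x` but present in `g_gt`, which is the kind of fact the refinement step is reported to recover from the raw input.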

![Refer to caption](https://arxiv.org/html/2606.00232v1/x5.png) (a) Feedback mention rate.
![Refer to caption](https://arxiv.org/html/2606.00232v1/x6.png) (b) Fact composition of $G_{\mathbf{Y}}$.
![Refer to caption](https://arxiv.org/html/2606.00232v1/x7.png) (c) Sample-level similarity to $G_{\mathbf{GT}}$.

Figure 5: Mechanism analysis on COCO val2014. (a) Joint feedback channels mention the absent object $b$ more often than the atomic channel. (b) Compared with Frozen, TIGER preserves correct facts, recovers additional correct facts beyond $G_{\mathbf{X}}$, and removes wrong facts. (c) The repaired output shifts toward higher similarity to $G_{\mathbf{GT}}$, which shows that refinement recovers content that the extractor misses.
### 4.6 Case Study: Multi-Source Crisis Reporting (RQ5)

To answer RQ5, we test whether TIGER transfers from clean benchmark inputs to a noisy real-world setting. We use *CrisisFACTS* Buntain et al. ([2023](https://arxiv.org/html/2606.00232#bib.bib70)), a NIST benchmark that contains Twitter, Reddit, Facebook, and online-news streams with assessor-annotated gold facts. We select one peak event-day from each of five disaster types: Hurricane, Wildfire, Flood, Explosion, and Tornado. For each event, the backbone receives text streams and six high-priority images, and it generates a situation report. Figure [6](https://arxiv.org/html/2606.00232#S4.F6) shows the Hurricane example, where the multi-source input is consolidated into an evidence graph $G_{\mathbf{X}}$ with 387 facts and 339 edges. This setting is substantially noisier and larger than the benchmark inputs, so it tests whether fact-level repair remains useful beyond controlled datasets.

![Refer to caption](https://arxiv.org/html/2606.00232v1/x8.png) Figure 6: Case-study pipeline on the Hurricane event. Multi-source inputs are consolidated into $G_{\mathbf{X}}$.

Table [5](https://arxiv.org/html/2606.00232#S4.T5) reports the Frozen and TIGER results on Qwen2.5-Omni-7B. This case study tests whether the repair mechanism can improve grounding in a realistic multi-source application. Across the five disasters, TIGER improves average Precision from 0.65 to 0.96, Recall from 0.60 to 0.65, and F1 from 0.66 to 0.74. The precision gain is especially important in this setting because situation reports should avoid claims that are not supported by the evidence. At the same time, the recall and F1 improvements show that TIGER does not simply remove uncertain content; it also preserves or recovers facts that are supported by the evidence graph. Qualitatively, the repaired reports replace vague statements with grounded entities and statistics, such as changing “in some areas” to “Creston, North Carolina” and “numerous road closures” to “a 9-mile stretch of Interstate 95 in Dillon County.” These results show that TIGER can support grounded summarization when the input contains noisy evidence from multiple sources.

Table 5: CrisisFACTS case study on five disaster event-days. P/R/F1 are gold-match precision, recall, and F1. Bold: better score. TIGER columns are shaded blue.
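For readers who want to reproduce this style of evaluation, the sketch below computes gold-match precision, recall, and F1 per event and then averages them across events. Exact set intersection stands in for the benchmark's matching procedure, per-event averaging is an assumption about how the averages are formed, and the events and fact ids are hypothetical.

```python
from statistics import mean


def prf1(generated: set, gold: set):
    """Precision, recall, and F1 of generated facts against gold facts."""
    matched = len(generated & gold)
    p = matched / len(generated) if generated else 0.0
    r = matched / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f1


# Hypothetical events: (facts in the generated report, gold facts).
events = {
    "event_1": ({"a", "b"}, {"a", "b", "c", "d"}),  # precise but incomplete
    "event_2": ({"a", "b", "c", "d"}, {"a", "b"}),  # complete but imprecise
}

scores = {name: prf1(gen, gold) for name, (gen, gold) in events.items()}
macro_p = mean(s[0] for s in scores.values())
macro_r = mean(s[1] for s in scores.values())
macro_f1 = mean(s[2] for s in scores.values())
print(f"P={macro_p:.2f} R={macro_r:.2f} F1={macro_f1:.2f}")
```

One property of per-event averaging is visible here: the mean F1 (about 0.67) differs from the harmonic mean of the averaged precision and recall (0.75), so averaged P, R, and F1 values need not satisfy the usual F1 identity.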

## 5 Conclusion

We presented TIGER, an inference\-time framework for fact\-level hallucination repair in multimodal generation\. TIGER separates input observation and output claim extraction, ranks claims by deterministic support/conflict risk, and locally repairs high\-risk claims with a frozen backbone\. Experiments across four cross\-modal paths show that TIGER reduces unsupported content while preserving task quality, and ablations confirm the importance of iterative repair, atomic projection, and risk\-based ranking\.

## Limitations

This work mainly focuses on four cross\-modal generation paths and evaluates TIGER on the benchmarks considered in the paper\. Broader settings, such as 3D scenes, streaming multimodal inputs, and highly specialized domains, may require additional evaluation because they contain richer temporal and spatial structure than the inputs studied here\. In addition, TIGER represents evidence as localized factual triples, which is well suited for objects, attributes, relations, and events, but may be less direct for abstract, causal, or subjective content\. Finally, iterative repair adds inference\-time computation, so future work should study more efficient scheduling strategies for real\-time deployment\.

## References

- P. Anderson, B. Fernando, M. Johnson, and S. Gould (2016) Spice: semantic propositional image caption evaluation. In European conference on computer vision, pp. 382–398.
- H. Bahng, C. Chan, F. Durand, and P. Isola (2025) Cycle consistency as reward: learning image-text alignment without human preferences. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22934–22946.
- Z. Bai, P. Wang, T. Xiao, T. He, Z. Han, Z. Zhang, and M. Z. Shou (2024) Hallucination of multimodal large language models: a survey. arXiv preprint arXiv:2404.18930.
- C. Buntain, A. L. Hughes, R. McCreadie, B. D. Horne, M. Imran, and H. Purohit (2023) CrisisFACTS 2023-overview paper. In TREC.
- X. Chen, Z. Wu, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, and C. Ruan (2025) Janus-pro: unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811.
- X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick (2015) Microsoft coco captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325.
- Z. Chen, J. Wu, Z. Lei, and C. W. Chen (2024) What makes a scene? scene graph-based evaluation and feedback for controllable generation. arXiv preprint arXiv:2411.15435.
- G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, et al. (2025) Gemini 2.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
- S. Datta and D. Sundararaman (2025) Evaluating hallucination in large vision-language models based on context-aware object similarities. arXiv preprint arXiv:2501.15046.
- C. Deng, D. Zhu, K. Li, C. Gou, F. Li, Z. Wang, S. Zhong, W. Yu, X. Nie, Z. Song, et al. (2025) Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683.
- K. Drossos, S. Lipping, and T. Virtanen (2020) Clotho: an audio captioning dataset. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 736–740.
- A. Fanous, J. Goldberg, A. Agarwal, J. Lin, A. Zhou, S. Xu, V. Bikia, R. Daneshjou, and S. Koyejo (2025) Syceval: evaluating llm sycophancy. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, Vol. 8, pp. 893–900.
- Z. Gao, W. Huang, J. Zhang, A. Kembhavi, and R. Krishna (2024) Generate any scene: scene graph driven data synthesis for visual generation training. arXiv preprint arXiv:2412.08221.
- T. Hsu, K. Lu, C. Chiang, and H. Lee (2025) Reducing object hallucination in large audio-language models via audio-aware decoding. arXiv preprint arXiv:2506.07233.
- S. Lee, S. H. Park, Y. Jo, and M. Seo (2024) Volcano: mitigating multimodal hallucination through self-feedback guided revision. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 391–404.
- S. Leng, H. Zhang, G. Chen, X. Li, S. Lu, C. Miao, and L. Bing (2024) Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13872–13882.
- Y. Li, Y. Du, K. Zhou, J. Wang, X. Zhao, and J. Wen (2023) Evaluating object hallucination in large vision-language models. In Proceedings of the 2023 conference on empirical methods in natural language processing, pp. 292–305.
- H. Liu, C. Li, Y. Li, and Y. J. Lee (2024a) Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 26296–26306.
- S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, Q. Jiang, C. Li, J. Yang, H. Su, et al. (2024b) Grounding dino: marrying dino with grounded pre-training for open-set object detection. In European conference on computer vision, pp. 38–55.
- J. Lu, C. Clark, S. Lee, Z. Zhang, S. Khosla, R. Marten, D. Hoiem, and A. Kembhavi (2024) Unified-io 2: scaling autoregressive multimodal models with vision language audio and action. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26439–26455.
- T. Nguyen, P. Nguyen, J. Cothren, A. Yilmaz, and K. Luu (2025) Hyperglm: hypergraph for video scene graph generation and anticipation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 29150–29160.
- Y. Park, D. Lee, J. Choe, and B. Chang (2025) Convis: contrastive decoding with hallucination visualization for mitigating hallucinations in multimodal large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 6434–6442.
- N. Reimers and I. Gurevych (2019) Sentence-bert: sentence embeddings using siamese bert-networks. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pp. 3982–3992.
- A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell, and K. Saenko (2018) Object hallucination in image captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4035–4045.
- Y. Sun and T. Wang (2026) Be friendly, not friends: how llm sycophancy shapes user trust. In Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems, pp. 1–15.
- Y. Sun, K. Li, P. Guo, J. Liu, and Q. Tan (2026) Mario: multimodal graph reasoning with large language models. arXiv preprint arXiv:2603.05181.
- Z. Sun, S. Shen, S. Cao, H. Liu, C. Li, Y. Shen, C. Gan, L. Gui, Y. Wang, Y. Yang, et al. (2024) Aligning large multimodal models with factually augmented rlhf. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 13088–13110.
- J. Sung, M. Kim, S. An, S. Lyu, A. Nagrani, and P. H. Seo (2025) Getting to the crux: graph-based data generation for advancing multi-hop cross-modal reasoning.
- C. Team (2024) Chameleon: mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818.
- A. van Sprang, L. Samson, A. Lucic, E. Acar, S. Ghebreab, and Y. M. Asano (2025) Same content, different answers: cross-modal inconsistency in mllms. arXiv preprint arXiv:2512.08923.
- J. Wang, Y. Wang, G. Xu, J. Zhang, Y. Gu, H. Jia, J. Wang, H. Xu, M. Yan, J. Zhang, et al. (2023) Amber: an llm-free multi-dimensional benchmark for mllms hallucination evaluation. arXiv preprint arXiv:2311.07397.
- W. Wang, Z. Gao, L. Chen, Z. Chen, J. Zhu, X. Zhao, Y. Liu, Y. Cao, S. Ye, X. Zhu, et al. (2025) Visualprm: an effective process reward model for multimodal reasoning. arXiv preprint arXiv:2503.10291.
- Y. Wang, Y. Wang, D. Zhao, C. Xie, and Z. Zheng (2024) Videohallucer: evaluating intrinsic and extrinsic hallucinations in large video-language models. arXiv preprint arXiv:2406.16338.
- C. Wu, X. Chen, Z. Wu, Y. Ma, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, C. Ruan, et al. (2025) Janus: decoupling visual encoding for unified multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 12966–12977.
- S. Wu, H. Fei, L. Qu, W. Ji, and T. Chua (2024) Next-gpt: any-to-any multimodal llm. In Forty-first International Conference on Machine Learning.
- J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang, B. Zhang, X. Wang, Y. Chu, and J. Lin (2025) Qwen2.5-omni technical report. External Links: 2503.20215, [Link](https://arxiv.org/abs/2503.20215).
- L. Yang, Y. Tian, B. Li, X. Zhang, K. Shen, Y. Tong, and M. Wang (2026) Mmada: multimodal large diffusion language models. Advances in Neural Information Processing Systems 38, pp. 138867–138907.
- S. Yin, C. Fu, S. Zhao, T. Xu, H. Wang, D. Sui, Y. Shen, K. Li, X. Sun, and E. Chen (2024) Woodpecker: hallucination correction for multimodal large language models. Science China Information Sciences 67(12), pp. 220105.
- J. Zhan, J. Dai, J. Ye, Y. Zhou, D. Zhang, Z. Liu, X. Zhang, R. Yuan, G. Zhang, L. Li, et al. (2024) Anygpt: unified multimodal llm with discrete sequence modeling. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 9637–9662.
- C. Zhang, Z. Wan, Z. Kan, M. Q. Ma, S. Stepputtis, D. Ramanan, R. Salakhutdinov, L. Morency, K. Sycara, and Y. Xie (2025) Self-correcting decoding with generative feedback for mitigating hallucinations in large vision-language models. arXiv preprint arXiv:2502.06130.
- J. Zhang, Y. Jiao, S. Chen, N. Zhao, Z. Tan, H. Li, X. Ma, and J. Chen (2024a) Eventhallusion: diagnosing event hallucinations in video llms. arXiv preprint arXiv:2409.16597.
- X. Zhang, S. Li, N. Shi, B. Hauer, Z. Wu, G. Kondrak, M. Abdul-Mageed, and L. V. Lakshmanan (2024b) Cross-modal consistency in multimodal large language models. arXiv preprint arXiv:2411.09273.
- Y. Zhou, C. Cui, J. Yoon, L. Zhang, Z. Deng, C. Finn, M. Bansal, and H. Yao (2024) Analyzing and mitigating object hallucination in large vision-language models. In International Conference on Learning Representations, Vol. 2024, pp. 56969–56998.

## Appendix

## Appendix A Related Work

Hallucination and verification. Unified multimodal models can process text, images, audio, and video, but their textual outputs may contain facts that are not supported by the input. This problem is closely related to cross-modal inconsistency, where models produce conflicting claims for the same content across modalities Zhang et al. ([2024b](https://arxiv.org/html/2606.00232#bib.bib72)); van Sprang et al. ([2025](https://arxiv.org/html/2606.00232#bib.bib73)). Multimodal hallucination has been studied through surveys, benchmarks, and decoding methods Bai et al. ([2024](https://arxiv.org/html/2606.00232#bib.bib51)); Sun et al. ([2024](https://arxiv.org/html/2606.00232#bib.bib52)); Wang et al. ([2023](https://arxiv.org/html/2606.00232#bib.bib93)); Park et al. ([2025](https://arxiv.org/html/2606.00232#bib.bib53)); Zhang et al. ([2025](https://arxiv.org/html/2606.00232#bib.bib50)). These studies provide useful evaluation tools, but many methods still score or revise the whole response, which makes it difficult to identify the specific facts that require correction. Structured verification addresses this limitation by representing objects, attributes, and relations explicitly. Scene-graph and graph-based methods have been used for fine-grained evaluation, controllable generation, and multimodal reasoning Chen et al. ([2024](https://arxiv.org/html/2606.00232#bib.bib45)); Gao et al. ([2024](https://arxiv.org/html/2606.00232#bib.bib46)); Nguyen et al. ([2025](https://arxiv.org/html/2606.00232#bib.bib47)); Sun et al. ([2026](https://arxiv.org/html/2606.00232#bib.bib74)); Sung et al. ([2025](https://arxiv.org/html/2606.00232#bib.bib75)). TIGER follows this direction, but uses structured representations for iterative repair. It extracts an observation graph from the input and a claim graph from the current output, then compares each claim against the input through support and conflict scores.

Repair at inference time. Several methods reduce hallucination without updating the backbone. Resampling methods select one output from multiple candidates using external signals such as visual alignment, process reward models, or cycle consistency Wang et al. ([2025](https://arxiv.org/html/2606.00232#bib.bib40)); Bahng et al. ([2025](https://arxiv.org/html/2606.00232#bib.bib42)). These methods can improve output quality, but they do not identify which facts in the response are unsupported. Feedback-based repair is closer to our setting. Woodpecker Yin et al. ([2024](https://arxiv.org/html/2606.00232#bib.bib79)), Volcano Lee et al. ([2024](https://arxiv.org/html/2606.00232#bib.bib89)), and DeGF Zhang et al. ([2025](https://arxiv.org/html/2606.00232#bib.bib50)) revise an initial output by conditioning on both the input and the current response. This design is simple, but the current response can affect how the model interprets the input, so hallucinated claims may be reinforced rather than removed. The feedback is also written in natural language, which makes it difficult to rank facts or enforce a repair budget. TIGER redesigns this step by extracting the input and output independently, assigning each claim a graph-conditioned risk score, and repairing only the selected high-risk facts. This design makes feedback quantifiable and reduces the dependence of repair on joint generation.

## Appendix B Prompts

This appendix lists the verbatim prompt templates used by TIGER and by the three text\-feedback baselines\. All prompt templates are applicable to every backbone model evaluated in this paper \(both open\-source and proprietary APIs\) and do not depend on any specific model architecture\.

Notation follows Section [3](https://arxiv.org/html/2606.00232#S3). All methods share the same generation step under a unified iterative repair framework. The generation step produces the initial output via $\mathcal{P}_{\text{gen}}$:

$$\mathbf{Y}_{0}=\Phi(\mathcal{P}_{\text{gen}},\,\mathbf{X}).$$

The methods differ in how the feedback signal $\mathcal{F}_{t}$ is produced and how the repair prompt $\mathcal{P}_{\text{refine}}^{(\cdot)}$ is instantiated. We describe each in turn.

#### TIGER\.

TIGER decomposes feedback generation into two stages: independent extraction and graph-conditioned risk computation. First, the input and the current output are each independently projected into a structured fact graph: $G_{\mathbf{X}}=\Phi(\mathcal{P}_{\text{ext}},\,\mathbf{X})$, $G_{\mathbf{Y}_{t}}=\Phi(\mathcal{P}_{\text{ext}},\,\mathbf{Y}_{t})$; $G_{\mathbf{X}}$ is extracted once before the repair loop and remains fixed. The deterministic operator $\Psi_{\alpha}$ then computes per-fact risk over the two graphs and selects the high-risk set as the feedback signal: $\mathcal{F}_{t}=\Psi_{\alpha}(G_{\mathbf{X}},\,G_{\mathbf{Y}_{t}})$. Finally, the repair prompt $\mathcal{P}_{\text{refine}}$ produces the revised output from the raw input, the current output, and the high-risk fact set:

$$\mathbf{Y}_{t+1}=\Phi(\mathcal{P}_{\text{refine}},\,\mathbf{X},\,\mathbf{Y}_{t},\,\mathcal{F}_{t}).$$

Per sample, TIGER calls the backbone $\Phi$ a total of $2+2T$ times: $\mathcal{P}_{\text{gen}}$ once, $\mathcal{P}_{\text{ext}}$ $1+T$ times (once for $\mathbf{X}$ and once per round for each $\mathbf{Y}_{t}$), and $\mathcal{P}_{\text{refine}}$ $T$ times.
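The control flow and call count described above can be sketched with a stub backbone. This is a structural illustration only: `CountingBackbone`, `psi_alpha`, and the prompt-name strings are hypothetical stand-ins, and the stub returns placeholder text rather than graphs or captions.

```python
class CountingBackbone:
    """Stand-in for the frozen backbone Phi; it only records the prompt used."""

    def __init__(self):
        self.calls = []

    def __call__(self, prompt_name, *inputs):
        self.calls.append(prompt_name)
        return f"<{prompt_name} output #{len(self.calls)}>"


def psi_alpha(g_x, g_y):
    """Placeholder for the deterministic risk operator; makes no backbone call."""
    return f"high-risk facts of {g_y} given {g_x}"


def tiger_repair(phi, x, num_rounds):
    y = phi("P_gen", x)                      # Y_0
    g_x = phi("P_ext", x)                    # extracted once, then kept fixed
    for _ in range(num_rounds):
        g_y = phi("P_ext", y)                # claim graph from the output alone
        feedback = psi_alpha(g_x, g_y)       # F_t, computed without the backbone
        y = phi("P_refine", x, y, feedback)  # Y_{t+1}
    return y


phi = CountingBackbone()
final_output = tiger_repair(phi, x="<input>", num_rounds=5)
print(len(phi.calls))
```

With `num_rounds=5` the stub records 12 calls, which matches $2+2T$: one generation call, six extraction calls, and five refinement calls.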

#### Volcano (Lee et al., [2024](https://arxiv.org/html/2606.00232#bib.bib89)).

We adapt Volcano’s critique-then-revise protocol to the shared frozen backbone. We do not use a separate Volcano checkpoint, so the comparison remains backbone-matched: each round first produces a natural-language critique conditioned on the input and the current response, then revises the response conditioned on its own critique. Formally, in round $t$:

$$
\begin{aligned}
\mathcal{F}_{t} &= \Phi(\mathcal{P}_{\text{fb}}^{\text{V}},\,\mathbf{X},\,\mathbf{Y}_{t}),\\
\mathbf{Y}_{t+1} &= \Phi(\mathcal{P}_{\text{refine}}^{\text{V}},\,\mathbf{Y}_{t},\,\mathcal{F}_{t}).
\end{aligned}
$$

We adopt the critique and revise prompt style released in the official Volcano repository (reproduced in Appendix [B.2](https://arxiv.org/html/2606.00232#A2.SS2)) and use the paper’s default $T=3$ critique-revise rounds. Per sample, Volcano calls the shared backbone $\Phi$ a total of $1+2T$ times: the initial generation once, then $T$ critique passes and $T$ revise passes. This adaptation applies uniformly across all modalities (image, audio, video); the procedure and prompts are unchanged, only the backbone is shared with the other methods.
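As a structural illustration of the adapted Volcano loop, the sketch below uses a stub backbone that records each call and its inputs. `RecordingBackbone` and the prompt-name strings are hypothetical; the only points taken from the equations above are that the critique step sees both the input and the response, while the revise step receives the response and critique but not the input.

```python
class RecordingBackbone:
    """Stand-in for the shared backbone Phi; records (prompt, inputs) per call."""

    def __init__(self):
        self.calls = []

    def __call__(self, prompt_name, *inputs):
        self.calls.append((prompt_name, inputs))
        return f"<{prompt_name} output #{len(self.calls)}>"


def volcano_repair(phi, x, num_rounds=3):
    y = phi("P_gen", x)                     # Y_0
    for _ in range(num_rounds):
        critique = phi("P_fb_V", x, y)      # F_t: joint conditioning on X and Y_t
        y = phi("P_refine_V", y, critique)  # Y_{t+1}: revise without access to X
    return y


phi = RecordingBackbone()
final_output = volcano_repair(phi, x="<input>", num_rounds=3)

revise_inputs = [inputs for name, inputs in phi.calls if name == "P_refine_V"]
print(len(phi.calls), len(revise_inputs))
```

With the default three rounds the stub records seven calls, matching $1+2T$.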

#### Woodpecker (Yin et al., [2024](https://arxiv.org/html/2606.00232#bib.bib79)).

We faithfully reimplement Woodpecker’s five-stage hallucination correction pipeline as described in the original paper. Each stage uses a dedicated prompt template adapted from the official Woodpecker codebase; the visual-validation stage uses Grounding DINO (Liu et al., [2024b](https://arxiv.org/html/2606.00232#bib.bib100)) (already loaded for our deterministic-extractor comparison in Appendix [F.2](https://arxiv.org/html/2606.00232#A6.SS2)) as the open-vocabulary visual detector. The pipeline runs once per sample ($T=1$); formally:

$$
\begin{aligned}
\mathbf{K} &= \Phi(\mathcal{P}^{\text{W}}_{\text{kce}},\,\mathbf{Y}_{0}),\\
\mathbf{Q} &= \Phi(\mathcal{P}^{\text{W}}_{\text{qf}},\,\mathbf{K}),\\
\mathbf{V} &= \mathrm{DINO}(\mathbf{Q},\,\mathbf{X}),\\
\mathbf{C} &= \Phi(\mathcal{P}^{\text{W}}_{\text{vcg}},\,\mathbf{V}),\\
\mathbf{Y}_{T} &= \Phi(\mathcal{P}^{\text{W}}_{\text{refine}},\,\mathbf{X},\,\mathbf{Y}_{0},\,\mathbf{C}),
\end{aligned}
$$

where $\mathbf{K}$ is the list of key concepts extracted from $\mathbf{Y}_{0}$, $\mathbf{Q}$ is the corresponding set of verification questions, $\mathbf{V}$ is the Grounding-DINO output (boxes and labels), and $\mathbf{C}$ is the visual-claim list that serves as the explicit feedback signal $\mathcal{F}$ for the refine step. The five stages decompose as follows:

1. Key concept extraction. A prompt extracts the list of object-level concepts mentioned in $\mathbf{Y}_{0}$ that need verification. Output: a JSON list of (entity, attribute) pairs.
2. Question formulation. For each extracted concept, a prompt formulates a yes/no verification question of the form “Does the image contain a {concept}?” or “Is the {entity} {attribute}?”.
3. Visual knowledge validation. For each verification question, Grounding DINO is invoked with a text query corresponding to the concept (confidence threshold $0.35$, top-$5$ detections per query). The detected bounding boxes and labels constitute the “visual evidence” for that question.
4. Visual claim generation. The detected evidence is formatted into a structured claim list, e.g. “GroundingDINO finds 2 boxes labelled {label} with scores {…}”. This list becomes the explicit feedback $\mathcal{F}$ for the correction stage.
5. Hallucination correction. The backbone $\Phi$ is called once more with the image, the original response, and the visual claim list to produce the corrected response $\mathbf{Y}_{T}$.

Per sample, Woodpecker calls $\Phi$ four times ($\mathcal{P}_{\text{gen}}$, key-concept extraction, question formulation, hallucination correction) plus one Grounding DINO call per extracted concept. The five-stage pipeline runs once per sample rather than iteratively, matching the original Woodpecker protocol. For modalities outside Grounding DINO’s coverage (audio, video), Woodpecker is not applicable and is therefore omitted from the audio/video benchmarks (Tables [3](https://arxiv.org/html/2606.00232#S4.T3), [4](https://arxiv.org/html/2606.00232#S4.T4)). Prompts for the five stages are reproduced verbatim in Appendix [B.2](https://arxiv.org/html/2606.00232#A2.SS2).

#### DeGF (Zhang et al., [2025](https://arxiv.org/html/2606.00232#bib.bib50)).

DeGF (Self-Correcting Decoding with Generative Feedback) uses a text-to-image diffusion model to generate an auxiliary visual reference from the model’s initial response, then performs contrastive decoding between the original image and the generated image. We follow the official protocol with two adjustments noted below. Unlike Volcano and TIGER, DeGF runs a single repair pass ($T=1$) and consumes the feedback at the logit level rather than the prompt level:

$$
\begin{aligned}
\mathbf{Y}_{0} &= \Phi(\mathcal{P}_{\text{gen}},\, \mathbf{X}), \\
\mathbf{X}_{\text{aux}} &= \mathrm{SD}(\mathbf{Y}_{0}), \\
\mathbf{Y}_{T} &= \Phi^{\text{CD}}\bigl(\mathcal{P}_{\text{gen}},\, \mathbf{X},\, \mathbf{X}_{\text{aux}};\, \alpha\bigr),
\end{aligned}
$$

where $\mathrm{SD}(\cdot)$ is a Stable-Diffusion text-to-image generator and $\Phi^{\text{CD}}$ decodes greedily from contrastive logits per DeGF:

$$
\tilde{s}^{(k)} = (1+\alpha)\,\Phi(\mathcal{P}_{\text{gen}}, \mathbf{X})^{(k)} - \alpha\,\Phi(\mathcal{P}_{\text{gen}}, \mathbf{X}_{\text{aux}})^{(k)},
$$

at every decoding step $k$.

The pipeline for each sample is:

1. Initial generation. Produce $\mathbf{Y}_{0}=\Phi(\mathcal{P}_{\text{gen}},\mathbf{X})$ as usual, where $\mathbf{X}$ is the original image $I_{\text{orig}}$.
2. Generative feedback via diffusion. Take the caption-like response $\mathbf{Y}_{0}$ as a prompt and synthesise an auxiliary image $I_{\text{gen}}=\mathrm{SD}(\mathbf{Y}_{0})$ using Stable Diffusion. We use SD-Turbo (stabilityai/sd-turbo) with $1$ denoising step and the default scheduler, matching the original paper’s “efficient” configuration.
3. Contrastive decoding. Run two forward passes of the backbone with the same text prompt: one conditioned on $I_{\text{orig}}$, producing logits $s_{\text{orig}}^{(k)}$ at decoding step $k$, the other conditioned on $I_{\text{gen}}$, producing $s_{\text{gen}}^{(k)}$. The corrected output $\mathbf{Y}_{T}$ is decoded greedily from the contrastive logits $\tilde{s}^{(k)}=(1+\alpha)\,s_{\text{orig}}^{(k)}-\alpha\,s_{\text{gen}}^{(k)}$ with $\alpha=0.5$, the same formula and the same $\alpha$ as in the original DeGF paper.

Per sample, DeGF runs one diffusion call plus $1+2L$ backbone forward passes (where $L$ is the number of decoded tokens), so it is comparable in cost to VCD-style contrastive decoding on the same backbone. The two protocol adjustments are: (i) we use SD-Turbo (1-step) instead of the original SD 1.5 (50-step) for tractability across $1{,}000$ images, and (ii) we use the same frozen Qwen2.5-Omni-7B / LLaVA-1.5-7B backbones as every other baseline rather than LLaVA-1.5-13B; both deviations are documented in our code release.
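The logit-level combination in step 3 can be illustrated on a toy vocabulary. The sketch below assumes the two logit vectors are already available as plain Python lists; the function names and the three-token example values are hypothetical and chosen only to show how the contrast can change the greedy choice.

```python
from typing import List, Sequence


def contrastive_logits(s_orig: Sequence[float], s_gen: Sequence[float],
                       alpha: float = 0.5) -> List[float]:
    """Elementwise s~ = (1 + alpha) * s_orig - alpha * s_gen over the vocabulary."""
    if len(s_orig) != len(s_gen):
        raise ValueError("logit vectors must share the vocabulary size")
    return [(1.0 + alpha) * a - alpha * b for a, b in zip(s_orig, s_gen)]


def greedy_token(logits: Sequence[float]) -> int:
    """Index of the largest logit (ties resolve to the first index)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Made-up logits for one decoding step k on a 3-token vocabulary.
s_orig = [2.0, 1.9, 0.1]   # conditioned on the original image
s_gen = [3.0, 1.0, 0.1]    # conditioned on the diffusion-generated image
s_tilde = contrastive_logits(s_orig, s_gen, alpha=0.5)

plain_choice = greedy_token(s_orig)         # token 0 without the contrast
contrastive_choice = greedy_token(s_tilde)  # token 1 after the contrast
```

In this toy case token 0 scores much higher under the auxiliary image than under the original one, so the subtraction lowers it below token 1.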

#### Key difference from TIGER\.

These baselines use response\-conditioned correction signals, but they do not independently extract an observation graph from the input and a claim graph from the current output, nor do they compute deterministic per\-claim support/conflict risk for budgeted repair\. TIGER differs by making the feedback signal explicit, fact\-level, and rankable\.

### B.1 TIGER Prompts

#### $\mathcal{P}_{\text{gen}}$.

TIGER does not rewrite the task prompt at the first round; the raw dataset prompt is fed to $\Phi$ verbatim. The actual prompts used in our experiments are listed below. For MMHal-Bench, Clotho, and VideoHallucer, the prompt varies per instance according to the benchmark default; we show the template form.

$\mathcal{P}_{\text{gen}}$ (COCO captions): Please describe this image in detail.

$\mathcal{P}_{\text{gen}}$ (MMHal-Bench): {per-instance question}

$\mathcal{P}_{\text{gen}}$ (Clotho): {per-instance audio question}

$\mathcal{P}_{\text{gen}}$ (VideoHallucer): {per-instance video question}

#### $\mathcal{P}_{\text{ext}}$.

A single extraction template is instantiated under modality conditioning, as described in Section [3.2](https://arxiv.org/html/2606.00232#S3.SS2). The multimodal form is applied to $\mathbf{X}$ to produce $G_{\mathbf{X}}$; the text form is applied to $\mathbf{Y}_{t}$ to produce $G_{\mathbf{Y}_{t}}$. Both forms share an identical schema (field roles, standardized predicate vocabulary, and triple output format) and differ only in their few-shot exemplars and the source-domain phrasing.

$\mathcal{P}_{\text{ext}}$ (multimodal form, applied to $\mathbf{X}$):

You are a multimodal fact extraction system. Extract ALL observable facts from the provided input across all modalities (image, video, audio, text). Each fact must be a triple (subject, predicate, object).

STRICT OUTPUT FORMAT — numbered list, one triple per line, parentheses required:

1. (subject, predicate, object)
2. (subject, predicate, object)

…

FIELD ROLES:

- subject: head entity, written as a bare noun phrase WITHOUT attributes. GOOD: (car, is, red). BAD: (red car, is, parked).
- predicate: the relation. Use the standardized forms below when applicable.
- object: tail entity, attribute value, or count.

STANDARDIZED PREDICATES:

- Attributes (color, size, material, shape): is. e.g., (car, is, red).
- Counts: count. e.g., (lights, count, 3).
- Spatial (directional; subject is the figure, object is the ground): on / under / above / below / left of / right of / in front of / behind / inside.
- Possession or wearing or carrying: holding / wearing / carrying / riding.
- Existence: exists in. e.g., (dog, exists in, image).
- Actions: bare verb (jogging / cooking / talking).

Example — given an image of a park with a dog and a man jogging:

1. (dog, exists in, image)
2. (dog, is, golden)
3. (man, exists in, image)
4. (man, jogging on, path)
5. (man, wearing, red shirt)
6. (dog, left of, man)
7. (trees, count, 5)
8. (sky, is, blue)

Example — given an audio clip of a busy street:

1. (cars, honking, loudly)
2. (people, talking, nearby)
3. (engine, running, idle)
4. (music, playing from, shop)

Example — given text ‘The president announced a new policy on Tuesday’:

1. (president, announced, new policy)
2. (announcement, happened on, Tuesday)

Rules:

- Extract every object, attribute, spatial relation, action, and count you can verify.
- One fact per triple. Do NOT combine multiple claims into one line.
- Always put the bare entity in subject; never bundle adjectives into subject.
- For spatial predicates, subject is the figure positioned relative to object.
- Aim for 8–20 triples. Be thorough but only include facts grounded in the input.
- Output ONLY the numbered triple list. No explanation, no headers, no prose.

$\mathcal{P}_{\text{ext}}$ (text form, applied to $\mathbf{Y}_{t}$):

You are a fact extraction system. Read the text below and decompose it into ALL factual claims as structured triples (subject, predicate, object).

STRICT OUTPUT FORMAT — numbered list, one triple per line, parentheses required:

1. (subject, predicate, object)
2. (subject, predicate, object)

…

FIELD ROLES:

- subject: head entity, written as a bare noun WITHOUT adjectives. GOOD: (car, is, red). BAD: (red car, is, parked).
- predicate: the relation. Prefer the standardized forms below.
- object: tail entity, attribute value, or count.

STANDARDIZED PREDICATES:

- Attributes: is. e.g., (car, is, red).
- Counts: count. e.g., (lights, count, 3).
- Existence: exists in. e.g., (dog, exists in, image).
- Spatial (directional; subject is figure, object is ground): on / under / above / below / left of / right of / in front of / behind / inside.
- Possession / action: holding / wearing / carrying / riding / using.
- Free actions: bare verb (cooking, jogging, talking, …).

Example 1 — input: ‘A red car is parked near a tall building on a sunny day.’

1. (car, is, red)
2. (car, parked near, building)
3. (building, is, tall)
4. (day, is, sunny)

Example 2 — input: ‘The fire hydrant cap is yellow.’

1. (fire hydrant cap, is, yellow)

Example 3 — input: ‘There are three traffic lights in the image.’

1. (traffic lights, exists in, image)
2. (traffic lights, count, 3)

Example 4 — input: ‘A man in a white shirt is cooking in the kitchen while holding a knife.’

1. (man, exists in, image)
2. (man, wearing, white shirt)
3. (man, cooking in, kitchen)
4. (man, holding, knife)

Example 5 — input: ‘The dog is on top of the car.’

1. (dog, on, car)

Rules:

- Extract every claim, even from a short single sentence.
- Keep the subject a bare noun; never bundle attributes into subject.
- Each claim = one triple. Do NOT combine multiple facts.
- For spatial predicates, subject is the figure positioned relative to object.
- Output ONLY the numbered triple list. No prose.

Now extract from this text: {<input text>}
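Both extraction prompts request the same numbered `(subject, predicate, object)` output, so one parser can serve both graphs. The sketch below is an assumption about how such output might be parsed, not the released code; `Triple` and `parse_triples` are hypothetical names, and a line whose fields contain extra commas is simply skipped.

```python
import re
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# One numbered line such as "3. (man, wearing, red shirt)".
_LINE = re.compile(r"^\s*\d+\.\s*\((.*)\)\s*$")


def parse_triples(raw: str) -> List[Triple]:
    """Parse a numbered triple list; lines that do not fit the format are skipped."""
    triples: List[Triple] = []
    for line in raw.splitlines():
        match = _LINE.match(line)
        if match is None:
            continue  # prose, headers, or blank lines
        parts = [p.strip() for p in match.group(1).split(",")]
        if len(parts) != 3 or not all(parts):
            continue  # wrong arity or an empty field
        triples.append(Triple(*parts))
    return triples


raw_output = (
    "1. (dog, exists in, image)\n"
    "2. (man, wearing, red shirt)\n"
    "Here is some stray prose.\n"
    "3. (broken, triple)\n"
)
parsed = parse_triples(raw_output)
```

Skipping malformed lines rather than raising is one possible design choice; a stricter pipeline could instead re-prompt the extractor when too few lines parse.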

#### $\mathcal{P}_{\text{refine}}$.

The high-risk set $\mathcal{F}_{t}=\Psi_{\alpha}(G_{\mathbf{X}},G_{\mathbf{Y}_{t}})$ is rendered as a natural-language list and inserted into the template below. Following the same deletion-permissive editing policy used for the text-feedback baselines in §[B.2](https://arxiv.org/html/2606.00232#A2.SS2), the prompt allows removal only when a flagged claim cannot be verified and no grounded replacement is available. The key distinction is that TIGER applies this policy only to the risk-ranked claims selected by $\Psi_{\alpha}$.

$\mathcal{P}_{\text{refine}}$ (repair $\mathbf{Y}_{t}\to\mathbf{Y}_{t+1}$):

Below is your previous response, followed by a list of claims that have been flagged as HIGH RISK under the current evidence. A claim is flagged when it has weak support, possible conflict, or cross-modal disagreement with the input.

Your goal: produce a revised response with high factual reliability while preserving all verified content. Only re-examine the flagged claims listed below. For each flagged claim:

(a) If the claim CONTRADICTS the input, correct it to match the input.

(b) If the claim is directly supported by the input, keep it.

(c) If the claim cannot be verified and no grounded replacement is available, remove it from the response.

Do not add new factual details. Do not modify unflagged claims except for minimal edits needed to keep the response coherent. A shorter response is acceptable only when unsupported flagged claims are removed. Preserve the tone and the overall answer to the task prompt.

Previous response: “{original_text}”

High-risk claims to re-examine: {facts_block}

Revised response:

The slot {original_text} is filled with $\mathbf{Y}_{t}$; {facts_block} is filled with the verbalized $\mathcal{F}_{t}$.
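Filling the two slots is plain string templating. The sketch below uses an abbreviated stand-in for the template (only the slot-bearing lines) and an assumed verbalization of each flagged triple as a numbered "subject predicate object" line; the paper does not specify the exact rendering, so both `REFINE_TEMPLATE` and `verbalize` are hypothetical.

```python
from typing import Iterable, Tuple

Fact = Tuple[str, str, str]

# Abbreviated stand-in for the refine template; only the two slots matter here.
REFINE_TEMPLATE = (
    "Previous response:\n\"{original_text}\"\n"
    "High-risk claims to re-examine:\n{facts_block}\n"
    "Revised response:"
)


def verbalize(facts: Iterable[Fact]) -> str:
    """Render flagged triples as a numbered list (assumed format)."""
    return "\n".join(
        f"{i}. {s} {p} {o}" for i, (s, p, o) in enumerate(facts, start=1)
    )


def build_refine_prompt(y_t: str, flagged: Iterable[Fact]) -> str:
    """Fill {original_text} with Y_t and {facts_block} with the verbalized F_t."""
    return REFINE_TEMPLATE.format(original_text=y_t, facts_block=verbalize(flagged))


prompt = build_refine_prompt(
    "A man rides a horse under a green sky.",
    [("man", "riding", "horse"), ("sky", "is", "green")],
)
```

Because the response text is passed as a format argument rather than spliced into the template string, braces inside the response do not interfere with the slots.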

### B.2 Baseline Prompts

The calling conventions of the three text\-feedback baselines are described at the beginning of this appendix\. Below we list the concrete prompt templates for each baseline, annotated with their symbolic roles\.

All baseline numbers in Tables [2](https://arxiv.org/html/2606.00232#S4.T2), [3](https://arxiv.org/html/2606.00232#S4.T3), and [4](https://arxiv.org/html/2606.00232#S4.T4) are produced by our own runs under a shared frozen-backbone protocol; they are not copied from the original papers, whose datasets, metrics, and backbones differ from ours. Within this protocol we strive for the strongest faithful reproduction available: (i) Volcano adapts the official critique-then-revise protocol with the released prompt templates, run on the same shared backbone as the other methods so the comparison is backbone-matched (we do not use a separate Volcano checkpoint); (ii) Woodpecker runs the full five-stage pipeline of Yin et al. ([2024](https://arxiv.org/html/2606.00232#bib.bib79)), with Grounding DINO as the open-vocabulary detector for the visual-validation stage and the same backbone $\Phi$ for the LLM stages; (iii) DeGF uses Stable Diffusion (SD-Turbo) for generative feedback and contrastive decoding per DeGF Eq. (4), differing from the original only by the SD variant (Turbo vs. 1.5) and by the backbone choice required to match our shared protocol; (iv) VCD, AAD, and TCD implement the published contrastive-decoding formula $\tilde{s}=(1+\alpha)\,s_{\text{with}}-\alpha\,s_{\text{without}}$ verbatim, with modality-specific neutralization (Gaussian-noised image / silent audio / time-shuffled video frames) taken from the corresponding original recipes; (v) BoN+CLIP, BoN+VisualPRM, and BoN+CycleReward use the officially released reward/scoring checkpoints as the BoN reranker. Where a baseline’s original pipeline depends on modules outside our shared protocol (e.g. modality-specific tools for audio or video), we note the adaptation inline; otherwise the implementation reproduces the original method on the shared backbone.

Volcano, critique prompt $\mathcal{P}_{\text{fb}}^{\text{V}}$ (verbatim from the official Volcano repository) (Lee et al., [2024](https://arxiv.org/html/2606.00232#bib.bib89)):

You are given an image and a candidate response describing it. Critique the response for any visual hallucinations, factual errors, or unsupported claims. Be specific: for each problem, name the object, attribute, count, or relation that is incorrect and explain why. Do not rewrite the response; only provide the critique.

Image: {image}

Candidate response: {response_text}

Critique:

Volcano, revise prompt $\mathcal{P}_{\text{refine}}^{\text{V}}$ (verbatim from the official Volcano repository) (Lee et al., [2024](https://arxiv.org/html/2606.00232#bib.bib89)):

You previously generated the candidate response below and received the following critique. Revise the response to address the critique while keeping all correctly described content. Output only the revised response.

Image: {image}

Candidate response: {response_text}

Critique: {critique_text}

Revised response:

Woodpecker stage 1, key concept extraction $\mathcal{P}_{\text{kce}}^{\text{W}}$ (Yin et al., [2024](https://arxiv.org/html/2606.00232#bib.bib79)):

Read the following response and list the visual concepts (objects, their attributes, counts, and pairwise relations) that it asserts about the image. Output as a JSON list of items with fields “entity”, “attribute”, “count”. Do not infer; copy only what the response explicitly claims.

Response: {response_text}

JSON list:

Woodpecker stage 2, question formulation $\mathcal{P}_{\text{qf}}^{\text{W}}$ (Yin et al., [2024](https://arxiv.org/html/2606.00232#bib.bib79)):

For each item in the JSON list below, formulate one short yes/no verification question of the form “Does the image contain a {entity}?” or “Is the {entity} {attribute}?” or “How many {entity} are in the image?”. Output as a JSON list of {“question”: …, “query”: …} where “query” is the open-vocabulary noun phrase to detect (e.g. “red car”).

JSON list of concepts: {concepts_json}

JSON list of questions:

Woodpecker stage 3, visual validation (Grounding DINO call, no LLM prompt) (Yin et al., [2024](https://arxiv.org/html/2606.00232#bib.bib79)):

For each {“query”: $q$} in the question list, call Grounding DINO with text query $q$ at confidence threshold $0.35$. Record the top-$5$ detected bounding boxes and their confidence scores. Skip an item if Grounding DINO returns no detection.

Woodpecker stage 4, visual claim generation $\mathcal{P}_{\text{vcg}}^{\text{W}}$ (Yin et al., [2024](https://arxiv.org/html/2606.00232#bib.bib79)):

Given the visual detections below, write a short factual statement about what the image contains. Use the detected counts and labels literally; do not introduce concepts that were not detected. Output as plain text, one sentence per detected concept.

Detections: {dino_detections}

Visual claims:

Woodpecker stage 5, hallucination correction $\mathcal{P}_{\text{refine}}^{\text{W}}$ (Yin et al., [2024](https://arxiv.org/html/2606.00232#bib.bib79)):

You previously generated the response below for the given image. A visual grounding tool has produced the following factual claims about what the image actually contains. Rewrite the response so that it agrees with the visual claims: correct any contradictions, drop unsupported assertions, and keep the original phrasing wherever it already matches the visual claims.

Image: {image}

Original response: {response_text}

Visual claims from grounding tool: {visual_claims}

Corrected response:

DeGF, diffusion feedback (Stable Diffusion call, no LLM prompt) (Zhang et al., [2025](https://arxiv.org/html/2606.00232#bib.bib50)):

Use the initial response $\mathbf{Y}_{0}$ as the text prompt to Stable Diffusion (SD-Turbo, $1$ denoising step, default scheduler, guidance scale $1.0$) and synthesise an auxiliary image $I_{\text{gen}}$ of resolution $512\times 512$. This auxiliary image serves as the generative-feedback visual reference for the contrastive-decoding step.

DeGF, contrastive decoding (no LLM prompt; logit-level operation) (Zhang et al., [2025](https://arxiv.org/html/2606.00232#bib.bib50)):

At each decoding step $k$, compute two backbone forward passes with the same text prompt: one conditioned on the original image $I_{\text{orig}}$, giving logits $s_{\text{orig}}^{(k)}$, and one conditioned on the diffusion-generated image $I_{\text{gen}}$, giving $s_{\text{gen}}^{(k)}$. Decode greedily from the contrastive logits

$$
\tilde{s}^{(k)}=(1+\alpha)\,s_{\text{orig}}^{(k)}-\alpha\,s_{\text{gen}}^{(k)},\qquad \alpha=0.5,
$$

matching DeGF Eq. (4). No further LLM prompt is used; the corrected response $\mathbf{Y}_{T}$ is the greedy decode under $\tilde{s}^{(k)}$.

## Appendix C Risk Function Design and Property Verification

This appendix presents the full design of the per-fact support $s(f)$, conflict $c(f)$, and risk $r(f)$ used in TIGER (§[C.1](https://arxiv.org/html/2606.00232#A3.SS1)), and verifies that the risk function satisfies three natural properties (§[C.2](https://arxiv.org/html/2606.00232#A3.SS2)).

### C.1 Support, Conflict, and Risk Design

#### Preliminaries\.

Every fact in the observation graph $G_{\mathbf{X}}$ and the claim graph $G_{\mathbf{Y}_{t}}$ is represented as a triple $f=(s_{f},p_{f},o_{f})$ of subject, predicate, and object text fields, following the scene-graph tuple structure of SPICE (Anderson et al., [2016](https://arxiv.org/html/2606.00232#bib.bib98)). The two graphs are not loose triple sets: facts that share the same entity are connected by coreference edges. For example, $(\text{man},\text{wearing},\text{red shirt})$ and $(\text{man},\text{holding},\text{coffee})$ share the subject “man” and are therefore linked.

We encode each text field with a frozen sentence transformer $\Phi_{\text{enc}}$ (Sentence-BERT; Reimers and Gurevych, [2019](https://arxiv.org/html/2606.00232#bib.bib99); checkpoint all-MiniLM-L6-v2) into an L2-normalized vector. We convert the raw cosine similarity into a clipped cosine similarity

$$
\kappa(t_{1},t_{2})=\max\{0,\;\Phi_{\text{enc}}(t_{1})^{\top}\Phi_{\text{enc}}(t_{2})\}\in[0,1], \tag{11}
$$

where $\|\Phi_{\text{enc}}(\cdot)\|_{2}=1$. All subsequent field similarities use $\kappa(\cdot,\cdot)$ instead of the raw cosine.

Given two facts $f=(s_{f},p_{f},o_{f})$ and $g=(s_{g},p_{g},o_{g})$, let $\sigma_{s}=\kappa(s_{f},s_{g})$, $\sigma_{p}=\kappa(p_{f},p_{g})$, and $\sigma_{o}=\kappa(o_{f},o_{g})$ denote the per-field bounded similarities. The similarity between $f$ and $g$ is defined as the equal-weight mean of the three fields:

$$
\mathrm{sim}(f,g)=\tfrac{1}{3}(\sigma_{s}+\sigma_{p}+\sigma_{o}). \tag{12}
$$

Since each field similarity lies in $[0,1]$, we have $\mathrm{sim}(f,g)\in[0,1]$. Computing similarity per field rather than encoding the concatenated triple preserves structural information and avoids the problem that whole-triple encoding assigns nearly identical similarity to “man riding horse” and “horse riding man.”
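The contrast between per-field and whole-triple similarity can be reproduced with a deliberately simple encoder. In the sketch below, `toy_encode` is a bag-of-words stand-in introduced only for illustration; it is an assumption and not the all-MiniLM-L6-v2 encoder used in the paper, so the numbers show the structural effect rather than real embedding values.

```python
import math
from collections import Counter
from typing import Dict, Tuple

Fact = Tuple[str, str, str]


def toy_encode(text: str) -> Dict[str, float]:
    """L2-normalized bag-of-words vector (hypothetical stand-in encoder)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {tok: v / norm for tok, v in counts.items()} if norm else {}


def kappa(t1: str, t2: str) -> float:
    """Clipped cosine similarity as in Eq. (11)."""
    e1, e2 = toy_encode(t1), toy_encode(t2)
    dot = sum(w * e2.get(tok, 0.0) for tok, w in e1.items())
    # The upper clip only guards against floating-point overshoot above 1.
    return min(1.0, max(0.0, dot))


def sim(f: Fact, g: Fact) -> float:
    """Equal-weight mean of subject, predicate, and object similarity, Eq. (12)."""
    return sum(kappa(a, b) for a, b in zip(f, g)) / 3.0


f = ("man", "riding", "horse")
g = ("horse", "riding", "man")
per_field = sim(f, g)                          # only the predicate field matches
whole_triple = kappa(" ".join(f), " ".join(g)) # order-insensitive encoding cannot tell them apart
```

Under this toy encoder the per-field score is one third, while the concatenated encoding scores the swapped triple as identical.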

#### Support $s(f)$.

Support measures how strongly a claim is backed by evidence in the observation graph. The computation has two steps.

First, the *local support* of each claim $f\in G_{\mathbf{Y}_{t}}$ is the maximum similarity over all facts in the observation graph:

$$
s_{0}(f)=\max_{g\in G_{\mathbf{X}}}\mathrm{sim}(f,g). \tag{13}
$$

The maximum is taken because a claim needs only one matching fact in $G_{\mathbf{X}}$ to be supported. Because $\mathrm{sim}(f,g)\in[0,1]$, we also have $s_{0}(f)\in[0,1]$.

Second, local support may be too low when the extractor misses a fact. For instance, if the input image shows a man wearing a red shirt and holding coffee but the extractor only recovers $(\text{man},\text{holding},\text{coffee})$, then the claim “wearing red shirt” receives a low $s_{0}$ and would be incorrectly flagged as high risk. To compensate for extraction omissions, we propagate local support along coreference edges in the claim graph $G_{\mathbf{Y}_{t}}$:

$$
s(f)=\max_{f'\in\{f\}\cup\mathcal{N}_{K}(f)}\gamma^{\,d(f,\,f')}\cdot s_{0}(f'), \tag{14}
$$

where $\mathcal{N}_{K}(f)$ is the $K$-hop coreference neighborhood of $f$ in $G_{\mathbf{Y}_{t}}$ (BFS expansion), $d(f,f')$ is the hop distance, and $\gamma\in(0,1)$ is a decay factor. The geometric decay $\gamma^{d}$ ensures that distant neighbors contribute progressively less, so that indirect support remains local to the entity neighborhood. In the example above, “wearing red shirt” is linked to “holding coffee” through the shared entity “man” and therefore receives indirect support $\gamma\cdot s_{0}(\text{holding coffee})$. We set $K=3$ and $\gamma=0.7$ throughout all experiments. When $K=0$, the propagation is disabled and $s(f)$ reduces to $s_{0}(f)$. Since $s_{0}(f')\in[0,1]$ and $\gamma^{d(f,f')}\in[0,1]$, the propagated support also satisfies $s(f)\in[0,1]$.
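The propagation in Eq. (14) amounts to a bounded breadth-first search from each claim. The sketch below assumes the local supports $s_0$ and the coreference edges are already given as dictionaries; the function name, the dictionary layout, and the example values are hypothetical.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, Mapping


def propagated_support(s0: Mapping[Hashable, float],
                       edges: Mapping[Hashable, Iterable[Hashable]],
                       K: int = 3, gamma: float = 0.7) -> Dict[Hashable, float]:
    """Eq. (14): s(f) = max over f' within K hops of gamma**d(f, f') * s0(f')."""
    out: Dict[Hashable, float] = {}
    for f in s0:
        best = s0[f]                  # the d = 0 term, since gamma**0 = 1
        seen = {f}
        frontier = deque([(f, 0)])
        while frontier:
            node, d = frontier.popleft()
            if d == K:
                continue              # do not expand beyond K hops
            for nb in edges.get(node, ()):
                if nb in seen:
                    continue
                seen.add(nb)          # BFS: the first visit gives the hop distance
                best = max(best, gamma ** (d + 1) * s0[nb])
                frontier.append((nb, d + 1))
        out[f] = best
    return out


# Made-up local supports for the running example: the extractor missed the shirt.
s0 = {"wearing red shirt": 0.2, "holding coffee": 0.9}
edges = {"wearing red shirt": ["holding coffee"],
         "holding coffee": ["wearing red shirt"]}
s = propagated_support(s0, edges, K=3, gamma=0.7)
```

With these illustrative values the weakly supported claim is lifted to $0.7\cdot 0.9=0.63$, while the well-supported claim keeps its own score; setting `K=0` returns the local supports unchanged.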

Graph propagation is what distinguishes TIGER from plain triple\-set comparison: without graph structure every fact is evaluated independently; with propagation, facts about the same entity reinforce each other and reduce false positives caused by extraction omissions\.

#### Conflict $c(f)$.

Conflict measures the strongest contradiction between a claim and the evidence in the observation graph\. A conflict arises when two facts discuss the same topic \(matching subject and predicate\) but reach different conclusions \(different object\), corresponding to the definition of*contradiction*in natural language inference: a contradiction requires the same premise but an opposing conclusion, as opposed to*neutral*\(unrelated topics\)\.

Given $f$ and $g$, conflict is defined as the product of topic consistency and conclusion divergence:

$$
\mathrm{conflict}(f,g)=\underbrace{\tfrac{1}{2}(\sigma_{s}+\sigma_{p})}_{\text{topic consistency}}\cdot\underbrace{(1-\sigma_{o})}_{\text{conclusion divergence}}. \tag{15}
$$

The multiplicative structure implements a natural soft gate. When the subjects or predicates do not match, the left factor is close to 0 and the conflict is suppressed regardless of how different the objects are, because the two facts do not discuss the same topic. When the subjects and predicates match but the objects also match, the right factor is close to 0 and there is no conflict because the conclusions agree. Only when the same entity and relation are paired with different conclusions do both factors take high values and produce a significant conflict. This soft gate introduces no threshold hyperparameters.

The conflict score of a claim $f$ is the maximum conflict over all facts in the observation graph:

$$
c(f)=\max_{g\in G_{\mathbf{X}}}\mathrm{conflict}(f,g). \tag{16}
$$

Unlike support, conflict is *not* propagated along coreference edges. Conflict is specific to an individual claim and does not transfer to neighbors: if $(\text{man},\text{is},\text{tall})$ conflicts with the observation $(\text{man},\text{is},\text{short})$, the neighboring fact $(\text{man},\text{wearing},\text{hat})$ is unaffected because hat and height are unrelated. Propagating conflict would let one erroneous fact contaminate all facts about the same entity.

Since $\sigma_{s},\sigma_{p},\sigma_{o}\in[0,1]$, topic consistency $\tfrac{1}{2}(\sigma_{s}+\sigma_{p})\in[0,1]$ and conclusion divergence $(1-\sigma_{o})\in[0,1]$. Therefore, $\mathrm{conflict}(f,g)\in[0,1]$, and the maximum over $G_{\mathbf{X}}$ guarantees $c(f)\in[0,1]$.
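The soft gate in Eq. (15) is easy to check numerically. The sketch below takes the per-field similarities as inputs rather than computing them from an encoder; the similarity values in the examples are made up, and returning 0 for an empty observation graph is an assumption, since that case is not specified here.

```python
from typing import Iterable, Tuple

FieldSims = Tuple[float, float, float]  # (sigma_s, sigma_p, sigma_o) for one pair (f, g)


def pair_conflict(sigma_s: float, sigma_p: float, sigma_o: float) -> float:
    """Eq. (15): topic consistency times conclusion divergence."""
    return 0.5 * (sigma_s + sigma_p) * (1.0 - sigma_o)


def claim_conflict(sims_to_observations: Iterable[FieldSims]) -> float:
    """Eq. (16): maximum pairwise conflict; 0.0 for an empty graph (assumption)."""
    return max((pair_conflict(*s) for s in sims_to_observations), default=0.0)


same_topic_different_object = pair_conflict(1.0, 1.0, 0.2)  # e.g. (man, is, tall) vs (man, is, short)
different_topic = pair_conflict(0.1, 0.1, 0.0)              # unrelated facts: gate stays nearly closed
same_topic_same_object = pair_conflict(1.0, 1.0, 1.0)       # agreement: no conflict

c_f = claim_conflict([(1.0, 1.0, 0.2), (0.1, 0.1, 0.0), (1.0, 1.0, 1.0)])
```

Only the first pair, which matches on subject and predicate but diverges on the object, produces a large value, and it determines $c(f)$ through the maximum.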

#### Risk $r(f)$.

Combining support and conflict, the per-fact risk is defined as

$$
r(f)=(1-s(f))+\lambda\cdot c(f),\qquad \lambda>0, \tag{17}
$$

where $(1-s(f))$ measures lack of support and $\lambda\cdot c(f)$ measures active conflict. Because $s(f)$ and $c(f)$ are computed by deterministic scoring operations after the graphs are fixed, $r(\cdot)$ always returns the same value for a fixed $(G_{\mathbf{X}},G_{\mathbf{Y}_{t}})$ input, satisfying the deterministic scoring requirement of the operator $\Psi_{\alpha}$ (Section [3.2](https://arxiv.org/html/2606.00232#S3.SS2)).
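A direct transcription of Eq. (17) makes the role of $\lambda$ concrete. In the sketch below the default `lam=1.0` is a placeholder chosen for illustration, not a value taken from the paper, and the input checks simply enforce the stated domains of $s$, $c$, and $\lambda$.

```python
def risk(s: float, c: float, lam: float = 1.0) -> float:
    """Eq. (17): r = (1 - s) + lam * c, with s, c in [0, 1] and lam > 0."""
    if not (0.0 <= s <= 1.0 and 0.0 <= c <= 1.0):
        raise ValueError("support and conflict must lie in [0, 1]")
    if lam <= 0:
        raise ValueError("lam must be positive")
    return (1.0 - s) + lam * c


fully_supported = risk(1.0, 0.0)             # perfect match and no contradiction
unsupported_only = risk(0.63, 0.0)           # risk comes from missing support alone
contradicted = risk(0.9, 0.8, lam=1.0)       # well matched on topic but contradicted
contradicted_mild = risk(0.9, 0.8, lam=0.5)  # a smaller lam penalizes conflict less
```

The last two calls differ only in `lam`, which shows how the weight trades off contradiction against lack of support for the same claim.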

### C.2 Property Verification

We verify that the risk function $r(s,c)=(1-s)+\lambda c$ ($\lambda>0$, $s,c\in[0,1]$) satisfies three natural properties.

###### Proposition C.1.

The risk function $r(s,c)=(1-s)+\lambda c$ with $\lambda>0$ satisfies the following properties:

(P1) Non-negativity: $r(s,c)\geq 0$ for all $(s,c)\in[0,1]^{2}$.

(P2) Boundary condition: $r(s,c)=0$ if and only if $s=1$ and $c=0$.

(P3) Monotonicity: $r$ is strictly decreasing in $s$ and strictly increasing in $c$.

###### Proof.

*(P1).* From $s\in[0,1]$ we have $1-s\geq 0$; from $c\in[0,1]$ and $\lambda>0$ we have $\lambda c\geq 0$. The sum of two non-negative terms is non-negative: $r(s,c)=(1-s)+\lambda c\geq 0$.

*(P2).* Sufficiency: substituting $s=1$, $c=0$ gives $r(1,0)=0+0=0$. Necessity: suppose $r(s,c)=0$, i.e., $(1-s)+\lambda c=0$. By (P1) both terms are non-negative; the sum of two non-negative numbers is zero if and only if both are zero. $1-s=0$ gives $s=1$; $\lambda c=0$ with $\lambda>0$ gives $c=0$. Hence zero risk is equivalent to full support and no conflict: a fact is risk-free only when it has a perfect match in the observation graph and contradicts no observed fact.

*(P3).* $\partial r/\partial s=-1<0$, so $r$ is strictly decreasing in $s$: higher support lowers risk. $\partial r/\partial c=\lambda>0$, so $r$ is strictly increasing in $c$: higher conflict raises risk. ∎

#### Interpretation\.

P1 ensures that risk scores can be used for ranking and selection. P2 gives a precise semantics to zero risk: full support and no conflict is the only condition under which a fact is marked as safe. P3 ensures that risk ranking is consistent with intuition: more support lowers risk, more conflict raises it. $\lambda$ is the sole hyperparameter of $r(s,c)$ and controls the relative penalty of contradiction versus lack of support.

## Appendix D Convergence Proof for Iterative Risk Reduction

This appendix proves Theorem [3.1](https://arxiv.org/html/2606.00232#S3.Thmtheorem1), establishing the geometric convergence of the expected total risk under Algorithm [1](https://arxiv.org/html/2606.00232#alg1).

### D.1 Setup and Notation

At round $t$, the claim graph $G_{\mathbf{Y}_{t}}$ contains $N_{t} := |G_{\mathbf{Y}_{t}}|$ facts. The total measured risk is

$$R^{(t)} \;=\; \sum_{f \in G_{\mathbf{Y}_{t}}} r^{(t)}(f),$$

where $r^{(t)}(f)$ is the per-fact risk defined in Eq. ([17](https://arxiv.org/html/2606.00232#A3.E17)). The high-risk set $\mathcal{F}_{t}$ selected at line 10 of Algorithm [1](https://arxiv.org/html/2606.00232#alg1) contains the $\lceil \alpha N_{t} \rceil$ facts with the largest $r(\cdot)$. The total risk decomposes into the selected and retained portions:

$$R^{(t)} \;=\; \underbrace{\sum_{f \in \mathcal{F}_{t}} r^{(t)}(f)}_{\text{to repair}} \;+\; \underbrace{\sum_{f \in G_{\mathbf{Y}_{t}} \setminus \mathcal{F}_{t}} r^{(t)}(f)}_{\text{retained}}.$$

The repair step changes the risk of facts in $\mathcal{F}_{t}$ from $r^{(t)}(f)$ to $\widetilde{r}^{(t+1)}(f)$ but may also affect facts outside $\mathcal{F}_{t}$. We define the *ideal* post-repair total risk (assuming perfect extraction) as

$$\widetilde{R}^{(t+1)} = \sum_{f \in \mathcal{F}_{t}} \widetilde{r}^{(t+1)}(f) + \sum_{f \in G_{\mathbf{Y}_{t}} \setminus \mathcal{F}_{t}} r^{(t)}(f) \;+\; \Delta_{t}^{+},$$

where $\Delta_{t}^{+} \geq 0$ is the incremental risk that repairing $\mathcal{F}_{t}$ introduces on the remaining facts. The gap between the *measured* risk on the actually extracted graph $G_{\mathbf{Y}_{t+1}}$ and the ideal risk is

$$\Delta_{t}^{\mathrm{ext}} \;:=\; R^{(t+1)} - \widetilde{R}^{(t+1)}.$$

Combining the two definitions gives the one-step decomposition:

$$R^{(t+1)} \;=\; \sum_{f \in \mathcal{F}_{t}} \widetilde{r}^{(t+1)}(f) \;+\; \sum_{f \in G_{\mathbf{Y}_{t}} \setminus \mathcal{F}_{t}} r^{(t)}(f) \;+\; \Delta_{t}^{+} \;+\; \Delta_{t}^{\mathrm{ext}}. \tag{18}$$
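The selection of $\mathcal{F}_{t}$ and the split of $R^{(t)}$ into its two portions can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the function name `select_high_risk`, the fact identifiers, and the risk values are hypothetical and do not come from the paper.

```python
import math


def select_high_risk(risks: dict[str, float], alpha: float) -> list[str]:
    """Return the ceil(alpha * N) fact ids with the largest risk."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    k = math.ceil(alpha * len(risks))
    ranked = sorted(risks, key=risks.get, reverse=True)
    return ranked[:k]


# Hypothetical per-fact risks r^(t)(f) for a claim graph with N_t = 6 facts.
risks = {"f1": 0.10, "f2": 1.20, "f3": 0.45, "f4": 0.00, "f5": 0.90, "f6": 0.30}
alpha = 0.2

selected = select_high_risk(risks, alpha)  # ceil(0.2 * 6) = 2 facts
to_repair = sum(risks[f] for f in selected)
retained = sum(r for f, r in risks.items() if f not in selected)
total = sum(risks.values())
```

Because the selected facts are the highest-risk ones, their summed risk is at least an $\alpha$ fraction of the total (here $2.1 \geq 0.2 \times 2.95$), which is the property the proof in Section D.3 relies on.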

### D.2 Assumptions

###### Assumption D.1 (Bounded graph size).

There exists a constant $N_{\max}$ such that $N_{t} \leq N_{\max}$ for all rounds $t$.

###### Assumption D.2 (Per-fact repair progress).

There exists a constant $\varepsilon \in (0,1]$ such that for every $f \in \mathcal{F}_{t}$,

$$\mathbb{E}\!\left[\widetilde{r}^{(t+1)}(f) \mid R^{(t)}\right] \;\leq\; (1-\varepsilon)\, r^{(t)}(f).$$

###### Assumption D.3 (Bounded side effects).

There exists a constant $\beta \geq 0$ such that $\mathbb{E}[\Delta_{t}^{+} \mid R^{(t)}] \leq \beta$.

###### Assumption D.4 (Bounded extraction loss).

There exists a constant $\xi \geq 0$ such that $\mathbb{E}[\Delta_{t}^{\mathrm{ext}} \mid R^{(t)}] \leq \xi$.

### D.3 Proof of Theorem [3.1](https://arxiv.org/html/2606.00232#S3.Thmtheorem1)

###### Proof.

Taking the conditional expectation of the one-step decomposition ([D.1](https://arxiv.org/html/2606.00232#A4.Ex20)) given $R^{(t)}$:

$$
\begin{aligned}
\mathbb{E}\!\left[R^{(t+1)} \mid R^{(t)}\right]
&= \sum_{f \in \mathcal{F}_{t}} \mathbb{E}\!\left[\widetilde{r}^{(t+1)}(f) \mid R^{(t)}\right] + \sum_{f \in G_{\mathbf{Y}_{t}} \setminus \mathcal{F}_{t}} r^{(t)}(f) + \mathbb{E}\!\left[\Delta_{t}^{+} \mid R^{(t)}\right] + \mathbb{E}\!\left[\Delta_{t}^{\mathrm{ext}} \mid R^{(t)}\right] \\
&\overset{(a)}{\leq} (1-\varepsilon) \sum_{f \in \mathcal{F}_{t}} r^{(t)}(f) \;+\; \sum_{f \in G_{\mathbf{Y}_{t}} \setminus \mathcal{F}_{t}} r^{(t)}(f) \;+\; \beta + \xi \\
&= R^{(t)} - \varepsilon \sum_{f \in \mathcal{F}_{t}} r^{(t)}(f) + \beta + \xi \\
&\overset{(b)}{\leq} (1-\alpha\varepsilon)\, R^{(t)} + \beta + \xi,
\end{aligned}
\tag{19}
$$

where (a) applies Assumption [D.2](https://arxiv.org/html/2606.00232#A4.Thmtheorem2) to each $f \in \mathcal{F}_{t}$ and sums, and applies Assumptions [D.3](https://arxiv.org/html/2606.00232#A4.Thmtheorem3) and [D.4](https://arxiv.org/html/2606.00232#A4.Thmtheorem4) to the side-effect and extraction terms; (b) holds because $\mathcal{F}_{t}$ contains the $\lceil \alpha N_{t} \rceil$ highest-risk facts, so their average risk is at least the overall average $R^{(t)}/N_{t}$, giving $\sum_{f \in \mathcal{F}_{t}} r^{(t)}(f) \geq \lceil \alpha N_{t} \rceil \cdot R^{(t)}/N_{t} \geq \alpha\, R^{(t)}$.

Taking the unconditional expectation of ([19](https://arxiv.org/html/2606.00232#A4.E19)) yields the recurrence $\mathbb{E}[R^{(t+1)}] \leq (1-\alpha\varepsilon)\, \mathbb{E}[R^{(t)}] + \beta + \xi$. We unroll it by induction on the claim that $\mathbb{E}[R^{(t)}] \leq (1-\alpha\varepsilon)^{t} R^{(0)} + (\beta+\xi) \sum_{j=0}^{t-1} (1-\alpha\varepsilon)^{j}$. Suppose the bound holds at round $t$; then

$$
\begin{aligned}
\mathbb{E}\!\left[R^{(t+1)}\right]
&\overset{(c)}{\leq} (1-\alpha\varepsilon)\, \mathbb{E}[R^{(t)}] + \beta + \xi \\
&\overset{(d)}{\leq} (1-\alpha\varepsilon) \left[ (1-\alpha\varepsilon)^{t} R^{(0)} + (\beta+\xi) \sum_{j=0}^{t-1} (1-\alpha\varepsilon)^{j} \right] + \beta + \xi \\
&= (1-\alpha\varepsilon)^{t+1} R^{(0)} + (\beta+\xi) \sum_{j=0}^{t} (1-\alpha\varepsilon)^{j},
\end{aligned}
$$

where (c) is the drift inequality ([19](https://arxiv.org/html/2606.00232#A4.E19)) and (d) is the inductive hypothesis, which may be multiplied through because $1-\alpha\varepsilon \geq 0$. The base case $t=0$ holds with equality. Since $\alpha\varepsilon \in (0,1]$, the geometric sum satisfies $\sum_{j=0}^{T-1} (1-\alpha\varepsilon)^{j} \leq 1/(\alpha\varepsilon)$. Evaluating at $t=T$ gives

$$\mathbb{E}\!\left[R^{(T)}\right] \;\leq\; (1-\alpha\varepsilon)^{T}\, R^{(0)} \;+\; \frac{\beta+\xi}{\alpha\varepsilon}. \qquad ∎$$

#### Asymptotic behavior\.

For $\alpha\varepsilon \in (0,1)$, $(1-\alpha\varepsilon)^{T} \to 0$ as $T \to \infty$, and the bound converges to the residual $(\beta+\xi)/(\alpha\varepsilon)$, which characterizes the capability boundary of the framework.
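To make the geometric decay and the residual floor concrete, the sketch below iterates the worst case of the recurrence $x_{t+1} = (1-\alpha\varepsilon)\,x_t + \beta + \xi$ and compares it with the closed-form bound. It is a numerical illustration only: the function names and all constants are hypothetical, and none of the values are reported in the paper.

```python
def risk_bound(r0: float, alpha: float, eps: float, beta: float, xi: float, T: int) -> float:
    """Closed-form bound (1 - alpha*eps)^T * r0 + (beta + xi) / (alpha*eps)."""
    rate = alpha * eps
    return (1.0 - rate) ** T * r0 + (beta + xi) / rate


def iterate_recurrence(r0: float, alpha: float, eps: float, beta: float, xi: float, T: int) -> list[float]:
    """Iterate x_{t+1} = (1 - alpha*eps) * x_t + beta + xi for T rounds."""
    x = r0
    trajectory = [x]
    for _ in range(T):
        x = (1.0 - alpha * eps) * x + beta + xi
        trajectory.append(x)
    return trajectory


# Hypothetical constants chosen so that alpha * eps = 0.1.
R0, ALPHA, EPS, BETA, XI = 10.0, 0.2, 0.5, 0.05, 0.05
traj = iterate_recurrence(R0, ALPHA, EPS, BETA, XI, T=50)
residual = (BETA + XI) / (ALPHA * EPS)  # asymptotic floor, equal to 1.0 here
```

With these constants the trajectory starts at 10, drops to 9.1 after one round, and approaches the floor of 1.0 without crossing below it; a smaller $\beta + \xi$ or a larger $\alpha\varepsilon$ lowers the floor.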

## Appendix E Experimental Details

### E.1 Datasets and Preprocessing

We evaluate on five benchmarks that together cover the four cross-modal generation paths reported in the main paper: image→text (COCO, AMBER), image+text→text (MMHal-Bench), audio→text (Clotho), and video→text (VideoHallucer). We also use one curated probe set (SCS-1000) for the spurious-correlation analysis in Section [2](https://arxiv.org/html/2606.00232#S2) and the feedback-mention-rate experiment in Section [4.5](https://arxiv.org/html/2606.00232#S4.SS5). Splits are the official splits as redistributed by their upstream sources; no custom random splits are introduced. SCS-1000 is a curated *probe* set, not a train / dev / test split.

Table 6: Datasets and splits. “Full size” is the upstream benchmark size; “# used” is the sample count reported in the main paper. For COCO and SCS-1000, CHAIR-style object presence uses COCO `instances_val2014.json` augmented with a synonym table.

No further preprocessing is applied: images are loaded at native resolution by the chosen backbone’s vision tower, audio at the dataset’s native sample rate. License information for each dataset follows the upstream repository.

#### SCS\-1000 cue pairs\.

SCS-1000 is the curated probe set used in Section [2](https://arxiv.org/html/2606.00232#S2) (co-occurrence hallucination rate, Figure [2](https://arxiv.org/html/2606.00232#S2.F2)) and in the feedback-mention-rate analysis in Section [4.5](https://arxiv.org/html/2606.00232#S4.SS5). We select nine $(a,b)$ pairs of COCO object categories that co-occur frequently in COCO captions, then for each pair sample roughly $110$ val2014 images that contain the cue object $a$ but *not* the absent object $b$ ($1{,}000$ images in total). The cue is verified present and the absent object verified absent against COCO `instances_val2014.json`. Table [7](https://arxiv.org/html/2606.00232#A5.T7) lists the nine pairs.

Table 7: The nine SCS-1000 cue pairs. Each row holds the scene cue $a$ that is present in the image and the COCO category $b$ that is verified absent from the image but frequently co-occurs with $a$ in COCO captions. The co-occurrence hallucination rate (CHR) in Figure [2](https://arxiv.org/html/2606.00232#S2.F2) is the fraction of generations that mention $b$ when conditioned on an SCS-1000 image containing $a$.

### E.2 Models and Baselines

#### Primary backbone.

The primary backbone $\Phi$ is Qwen2.5-Omni-7B (`Qwen/Qwen2.5-Omni-7B`) (Xu et al., [2025](https://arxiv.org/html/2606.00232#bib.bib31)), which accepts text, image, audio, and video inputs and produces text and audio outputs in a single architecture. The same model serves as both the generator of $\mathbf{Y}_{t}$ and the extractor that produces $G_{\mathbf{X}}$ and $G_{\mathbf{Y}_{t}}$ via Eq. ([4](https://arxiv.org/html/2606.00232#S3.E4)).

Table 8: Primary backbone configuration. The backbone is frozen throughout; no parameter updates occur at any stage.

#### Secondary backbones (API).

Used for the cross-backbone generalization experiment in Appendix [F.1](https://arxiv.org/html/2606.00232#A6.SS1) only. No API call sets a temperature; each relies on the endpoint default (none of these models exposes a deterministic decoding flag at the time of writing).

- GPT-5.5 (Azure OpenAI, model `gpt-5.5`).
- Gemini 3.5 Flash (Google AI Studio, model `gemini-3.5-flash`) (Comanici et al., [2025](https://arxiv.org/html/2606.00232#bib.bib84)).
- Claude Haiku 4.5 (Anthropic API, model `claude-haiku-4-5`).

We additionally report results with LLaVA-1.5-7B (Liu et al., [2024a](https://arxiv.org/html/2606.00232#bib.bib82)) on the same image-path benchmarks as the Qwen primary backbone, also in Appendix [F.1](https://arxiv.org/html/2606.00232#A6.SS1). API access windows are listed in Appendix [E.7](https://arxiv.org/html/2606.00232#A5.SS7).

#### Baselines compared\.

All baselines are reimplementations on the same Qwen2.5-Omni-7B backbone (and on LLaVA-1.5-7B where applicable), called through the same dispatcher with a different `--mode` flag, so the data path, sampling hyperparameters, and evaluation protocol are matched across methods. The per-method inference budgets differ by design (e.g., best-of-$N$ methods draw $N$ candidates per sample, iterative methods run $T$ repair rounds, contrastive decoders run two forward passes per generated token) and are reported separately in Table [9](https://arxiv.org/html/2606.00232#A5.T9). Per-baseline prompt templates are listed in Appendix [B.2](https://arxiv.org/html/2606.00232#A2.SS2). We compare against ten baselines: nine grouped into three families, namely resampling (BoN+CLIP, BoN+VisualPRM, BoN+CycleReward), modality-specific contrastive decoding (VCD on image, AAD on audio, TCD on video), and iterative refinement (Volcano, Woodpecker, DeGF), plus Frozen, the no-repair lower bound.

Table 9: Baselines used in the main experiments. “Paths” lists the modalities each baseline is evaluated on: image = COCO/AMBER/MMHal-Bench, audio = Clotho, video = VideoHallucer, all = every benchmark. Budget is in backbone forward passes; contrastive methods (VCD, AAD, TCD) run two passes per generated *token*.

### E.3 Inference and Evaluation Protocol

#### Per-dataset hyperparameters.

Table [10](https://arxiv.org/html/2606.00232#A5.T10) lists the operating point of TIGER on each dataset, including $T$, $\alpha$, and $\lambda$. The sensitivity study in Figure [4](https://arxiv.org/html/2606.00232#S4.F4) is conducted on COCO to assess robustness around the default setting. We fix the graph-propagation parameters to $K=3$ and $\gamma=0.7$ across all experiments.

Table 10: Per-dataset hyperparameters of TIGER. All five benchmarks share the same Qwen2.5-Omni-7B backbone and the same sampling decoder (temperature $=0.7$, top-$p$ $=0.9$), so the std reported across three seeds in Appendix [E.5](https://arxiv.org/html/2606.00232#A5.SS5) reflects only the per-dataset operating-point differences. The two question-answering datasets (MMHal-Bench, VideoHallucer) use a single-fact repair regime (one flagged claim per round); the free-form datasets (COCO, AMBER, Clotho) use the batch-repair regime ($\alpha=0.2$).

#### Chain-of-thought and answer extraction.

No chain-of-thought prompting is used. Outputs are taken verbatim from the backbone. Object-presence judgements for CHAIR are taken from the COCO `instances_val2014.json` annotations and the AMBER object-set annotations, not from the model. For MMHal-Bench we report sentence-transformer BERTScore against the reference answer rather than the original GPT-judge protocol, because BERTScore is reproducible without an external LLM call and correlates with the GPT-judge score on this benchmark. For VideoHallucer we apply a robust yes/no parser to the free-form output (regex-based, treating hedged responses such as “hard to tell” as “no”); the parser is shared across all methods in Table [4](https://arxiv.org/html/2606.00232#S4.T4) so the comparison is fair.
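A yes/no parser of the kind described for VideoHallucer might look like the sketch below. The paper does not list its regular expressions, so every pattern here is an assumption, as are the function name `parse_yes_no`, the rule that the earlier of "yes" and "no" wins, and the choice to map unparseable text to "no".

```python
import re

# Hypothetical patterns; the paper does not list the exact expressions it uses.
HEDGE_RE = re.compile(
    r"\b(hard to tell|cannot tell|can't tell|not sure|unclear|unsure)\b",
    re.IGNORECASE,
)
YES_RE = re.compile(r"\byes\b", re.IGNORECASE)
NO_RE = re.compile(r"\bno\b", re.IGNORECASE)


def parse_yes_no(answer: str) -> str:
    """Map a free-form answer to 'yes' or 'no'; hedged or unparseable text becomes 'no'."""
    if HEDGE_RE.search(answer):
        return "no"  # hedged responses count as "no", as stated in the text
    yes = YES_RE.search(answer)
    no = NO_RE.search(answer)
    # Assumption: when both words occur, the one that appears first decides.
    if yes and (not no or yes.start() < no.start()):
        return "yes"
    return "no"
```

For example, `parse_yes_no("It is hard to tell, but yes maybe.")` returns `"no"` because the hedge check runs first. Sharing one such function across all methods keeps the comparison independent of how each method phrases its answer.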

#### Evaluation metrics\.

All headline metrics reported in the main paper are computed independently of TIGER’s evidence graph, so the reported gains cannot be an artefact of TIGER optimizing its own internal quantities\. Table[11](https://arxiv.org/html/2606.00232#A5.T11)lists every metric used in the main paper alongside the dataset it scores and the external tool that produces it\.

Table 11: Independent (non-TIGER) evaluation metrics used in the main paper. Every metric is computed by an external tool against ground-truth annotations distributed with the benchmark, so no TIGER-internal score (support $s$, conflict $c$, risk $r$, or the $G_{\mathbf{X}}$ / $G_{\mathbf{Y}_{t}}$ graphs) enters the headline numbers.

| Metric | Dataset(s) | Tool / reference |
| --- | --- | --- |
| CHAIR$_s$, CHAIR$_i$ ↓ | COCO | COCO `instances_val2014.json` + synonym table |
| BERTScore ↑ | COCO, MMHal-Bench | `bert_score` pkg vs. COCO captions / MMHal refs |
| Disc. Acc ↑ | AMBER | official discriminative split, exact-match yes/no |
| CHAIR$_g$ ↓ | AMBER (generative) | AMBER object set per image |
| RougeL ↑, BLEU ↑ | Clotho | `rouge_score` / `nltk` vs. 5 ref captions |
| CLAP ↑ | Clotho | `laion/larger_clap_general` audio-text cosine |
| AEHR ↓ | Clotho | PANNs CNN14 top-10 events vs. predicted mentions |
| HallucRate ↓ | VideoHallucer | rate of “yes” on hallucinated questions |
| Acc$_{\text{b}}$, Acc$_{\text{h}}$, Paired ↑ | VideoHallucer | robust yes/no parser, paired protocol |

For the spurious-correlation probe (SCS-1000, Figure [2](https://arxiv.org/html/2606.00232#S2.F2) in Section [2](https://arxiv.org/html/2606.00232#S2)) we additionally report the Co-occurrence Hallucination Rate (CHR), defined as the fraction of generations that mention an absent object $b$ when the cue object $a$ is present; this follows the standard CHAIR-style fraction over the curated cue pairs.

### E.4 Mechanism Analysis Methodology

This subsection documents how the three panels of Figure [5](https://arxiv.org/html/2606.00232#S4.F5) in Section [4.5](https://arxiv.org/html/2606.00232#S4.SS5) are produced. All three probes use the SCS-1000 image set (Table [7](https://arxiv.org/html/2606.00232#A5.T7)) and the Qwen2.5-Omni-7B backbone at the COCO operating point in Table [10](https://arxiv.org/html/2606.00232#A5.T10) ($T=5$, $\alpha=0.2$, $\lambda=0.5$, $\gamma=0.7$, $K=3$).

#### Panel \(a\): Feedback mention rate\.

For each image in SCS-1000 the cue object $a$ is verified present and the absent object $b$ is verified absent against COCO `instances_val2014.json`. We run the same $T=5$ repair loop with three different feedback channels: L1 (naive joint feedback) instantiates $\mathcal{F}_{t}$ via the joint-conditioning prompt of Eq. ([3](https://arxiv.org/html/2606.00232#S3.E3)), i.e., $\mathcal{F}_{t} = \Phi(\mathcal{P}_{\text{fb}}, \mathbf{X}, \mathbf{Y}_{t})$; L2 (text feedback) first extracts $G_{\mathbf{X}}$ and $G_{\mathbf{Y}_{t}}$ independently via Eq. ([4](https://arxiv.org/html/2606.00232#S3.E4)) and then verbalizes both graphs into the feedback prompt as natural-language text, but does not perform the deterministic risk computation $\Psi_{\alpha}$; L3 (TIGER) uses $\mathcal{F}_{t} = \Psi_{\alpha}(G_{\mathbf{X}}, G_{\mathbf{Y}_{t}})$ of Eq. ([5](https://arxiv.org/html/2606.00232#S3.E5)) and verbalizes only the top-$\lceil \alpha N \rceil$ risk-ranked atomic claims. The feedback mention rate is the fraction of samples whose feedback text contains a case-insensitive whole-word match for the absent object $b$ (plus the synonym table used by the CHAIR scorer). Error bars are 95% Wilson confidence intervals over the 1,000 binary outcomes. A lower rate means the feedback channel is less prone to re-introducing the co-occurrence prior identified in Section [2](https://arxiv.org/html/2606.00232#S2).
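The mention-rate statistic and its Wilson interval can be sketched as follows. This is a self-contained illustration rather than the authors' scorer: the function names, the synonym list, and the feedback strings are hypothetical, and `z = 1.96` is the usual normal quantile for an approximately 95% interval.

```python
import math
import re


def mentions(text: str, terms: list[str]) -> bool:
    """True if any term occurs in text as a case-insensitive whole word."""
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical feedback texts; the absent object b is "fork" with one synonym.
feedback = [
    "The caption mentions a fork that is not visible.",
    "Remove the claim about the table color.",
    "FORKS are not in the image.",
    "The pitchfork claim is unsupported.",  # not a whole-word match
]
hits = sum(mentions(t, ["fork", "forks"]) for t in feedback)
rate = hits / len(feedback)
low, high = wilson_interval(hits, len(feedback))
```

The whole-word boundary keeps "pitchfork" from counting as a mention of "fork". The Wilson interval is preferred over the plain normal approximation when the rate is close to 0 or 1, since it stays inside $[0,1]$.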

#### Panel (b): Fact composition of $G_{\mathbf{Y}}$.

For each image we (i) extract $G_{\mathbf{X}}$ once from the image via Eq. ([4](https://arxiv.org/html/2606.00232#S3.E4)); (ii) extract a reference graph $G_{\mathbf{GT}}$ from the five COCO ground-truth captions per image using the text form of $\mathcal{P}_{\text{ext}}$; (iii) extract $G_{\mathbf{Y}}$ from the model output (Frozen $\mathbf{Y}_{0}$ or TIGER $\mathbf{Y}_{T}$). Each fact $f \in G_{\mathbf{Y}}$ is classified into one of three disjoint bins by the sentence-transformer similarity defined in Eq. ([12](https://arxiv.org/html/2606.00232#A3.E12)):

- correct $\in G_{\mathbf{X}}$: $\max_{g \in G_{\mathbf{GT}}} \mathrm{sim}(f,g) \geq \tau$ *and* $\max_{g \in G_{\mathbf{X}}} \mathrm{sim}(f,g) \geq \tau$;
- correct $\notin G_{\mathbf{X}}$: $\max_{g \in G_{\mathbf{GT}}} \mathrm{sim}(f,g) \geq \tau$ *and* $\max_{g \in G_{\mathbf{X}}} \mathrm{sim}(f,g) < \tau$;
- wrong: $\max_{g \in G_{\mathbf{GT}}} \mathrm{sim}(f,g) < \tau$.

We use the same support threshold $\tau = 0.55$ as the repair loop’s risk computation, so the classification rule is consistent with how the rest of the framework labels facts. The bars in panel (b) are per-sample means of the bin counts, averaged across the 1,000 SCS-1000 images. The *correct $\notin G_{\mathbf{X}}$* slice (light green) is the key diagnostic: it counts facts that the extractor missed but the refine step recovered by reading the raw input $\mathbf{X}$ directly, which is the empirical signature of the asymptotic floor analysis in Appendix [D](https://arxiv.org/html/2606.00232#A4).
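The three-bin rule can be written as a short function. The sketch below is illustrative only: `jaccard_sim` is a toy token-overlap stand-in for the sentence-transformer similarity of Eq. (12), the bin labels and example facts are hypothetical, and treating an empty graph as similarity 0 is an assumption.

```python
from collections import Counter
from typing import Callable, Iterable

TAU = 0.55  # support threshold quoted in the text


def jaccard_sim(f: str, g: str) -> float:
    """Toy stand-in for the sentence-transformer similarity of Eq. (12)."""
    a, b = set(f.lower().split()), set(g.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def best_match(f: str, graph: Iterable[str], sim: Callable[[str, str], float]) -> float:
    """max over g in graph of sim(f, g); 0.0 for an empty graph (assumption)."""
    return max((sim(f, g) for g in graph), default=0.0)


def classify_fact(f, g_x, g_gt, sim=jaccard_sim, tau=TAU) -> str:
    """Assign f to one of the three disjoint bins used in panel (b)."""
    if best_match(f, g_gt, sim) < tau:
        return "wrong"
    if best_match(f, g_x, sim) >= tau:
        return "correct_in_gx"
    return "correct_not_in_gx"


# Hypothetical facts, written as plain strings for brevity.
g_x = ["man rides skateboard", "man wears red shirt"]
g_gt = ["man rides skateboard", "man wears red shirt", "dog sits on grass"]
g_y = ["man rides skateboard", "dog sits on grass", "cat sleeps on sofa"]
bins = Counter(classify_fact(f, g_x, g_gt) for f in g_y)
```

In this toy case each bin receives one fact: the skateboard fact is supported by both graphs, the dog fact matches the reference but not $G_{\mathbf{X}}$, and the cat fact matches neither.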

#### Panel (c): Sample-level similarity to $G_{\mathbf{GT}}$.

For each sample $i$ and source graph $G \in \{G_{\mathbf{X}}, G_{\mathbf{Y}_{T}}\}$, we compute the mean per-fact best-match similarity to $G_{\mathbf{GT}}$:

$$\bar{\sigma}_{i}(G) = \frac{1}{|G|} \sum_{f \in G} \, \max_{g \in G_{\mathbf{GT}}} \, \mathrm{sim}(f,g),$$

where $\mathrm{sim}(\cdot,\cdot)$ is the per-field cosine mean of Eq. ([12](https://arxiv.org/html/2606.00232#A3.E12)). The two density curves in panel (c) are kernel density estimates of $\{\bar{\sigma}_{i}(G_{\mathbf{X}})\}_{i=1}^{1000}$ (red) and $\{\bar{\sigma}_{i}(G_{\mathbf{Y}_{T}})\}_{i=1}^{1000}$ (green). The bandwidth is set by Scott’s rule. The rightward shift of the green curve indicates that, on average, the repaired output’s claim graph covers more of the reference content than the extracted input graph alone.
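The per-sample statistic and the density estimate can be sketched in plain Python. The assumptions are as follows: `exact_sim` is a toy replacement for the similarity of Eq. (12), the sample values are made up, and the bandwidth follows the convention used by SciPy's `gaussian_kde`, where Scott's factor for one-dimensional data is $n^{-1/5}$ times the sample standard deviation (some references add a constant near 1.06). The paper does not say which library or convention it uses.

```python
import math
import statistics


def exact_sim(f: str, g: str) -> float:
    """Toy similarity: 1.0 for identical strings, else 0.0."""
    return 1.0 if f == g else 0.0


def mean_best_match(graph, reference, sim) -> float:
    """sigma_bar(G): mean over f in G of the best similarity to any g in reference."""
    if not graph:
        raise ValueError("graph must contain at least one fact")
    best = (max((sim(f, g) for g in reference), default=0.0) for f in graph)
    return sum(best) / len(graph)


def scott_bandwidth(samples) -> float:
    """Scott's rule for 1-D data: sample std times n^(-1/5)."""
    return statistics.stdev(samples) * len(samples) ** (-1 / 5)


def kde_at(samples, x: float, bandwidth: float) -> float:
    """Gaussian kernel density estimate evaluated at x."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm


sigma = mean_best_match(["a", "b"], ["a"], exact_sim)  # one of two facts matches
samples = [0.2, 0.4, 0.5, 0.7]  # hypothetical sigma_bar values
h = scott_bandwidth(samples)
density = kde_at(samples, 0.45, h)
```

Evaluating `kde_at` on a grid of similarity values for each of the two sample sets would give the two curves that panel (c) compares.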

### E.5 Statistical Reporting

All standard deviations reported in the main paper and the appendix are computed across three decoding seeds $\{42, 43, 44\}$. We rerun each method end-to-end under each seed and report the per-method mean $\pm$ unbiased standard deviation over the three runs. The seeds are propagated via the standard `set_seed` helper to Python’s `random`, NumPy, and PyTorch. Because the backbone is frozen, the only stochastic operations are the sampling decisions made during decoding of $\mathcal{P}_{\text{gen}}$, $\mathcal{P}_{\text{ext}}$, and $\mathcal{P}_{\text{refine}}$.
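The reporting convention (mean $\pm$ unbiased standard deviation over three seeded runs) can be illustrated with the standard library. In this sketch `run_once` is a hypothetical stand-in for a full end-to-end run and returns a made-up metric, not a result from the paper; it seeds only a local `random.Random`, whereas the pipeline described above also seeds NumPy and PyTorch.

```python
import random
import statistics

SEEDS = (42, 43, 44)  # decoding seeds quoted in the text


def run_once(seed: int) -> float:
    """Hypothetical stand-in for one end-to-end run that returns a metric value."""
    rng = random.Random(seed)  # seeded generator, so each run is repeatable
    return 0.05 + 0.01 * rng.random()  # made-up metric, not a paper result


def summarize(values) -> tuple[float, float]:
    """Mean and unbiased (n - 1 denominator) standard deviation."""
    return statistics.mean(values), statistics.stdev(values)


scores = [run_once(s) for s in SEEDS]
mean, std = summarize(scores)
report = f"{mean:.3f} ± {std:.3f}"
```

`statistics.stdev` divides by $n-1$, which matches the "unbiased" qualifier; `statistics.pstdev` would divide by $n$ and give a smaller value for three runs.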

### E.6 Computing Infrastructure and Software

#### Hardware.

The actual GPU depends on the SLURM partition the job lands on: NVIDIA A100, H100, or B200. Single-GPU inference is used throughout; no distributed-training framework (DDP, FSDP, DeepSpeed, Accelerate) is used because no parameter updates occur. The submission node is a 2-socket Intel Xeon Gold 6230 (80 logical cores), 752 GiB RAM, RHEL 9.6, NVIDIA driver 570.195.03, CUDA toolkit 12.8.

#### Software.

The main software components are listed in Table [12](https://arxiv.org/html/2606.00232#A5.T12). FlashAttention is not installed; SDPA is used instead via `attn_implementation="sdpa"`. vLLM, DeepSpeed, PEFT, and TRL are not used. Quantization is not applied; weights are loaded in bf16 (or fp16 on non-Ampere hardware) via `dtype: auto`.

Table 12: Main software versions. Full `pip freeze` in supplementary.

### E.7 Reproducibility Assets

The complete pipeline source, evaluation configs, SLURM submission scripts, independent-metric scripts, SCS-1000 cue-pair manifest, raw per-sample outputs, and plotting scripts will be released under a permissive license at the camera-ready stage. The entry point for every experiment is

> `python -m tiger.eval --config <yaml> --mode <mode>`

where `<mode>` selects between TIGER, the ten baselines listed in Table [9](https://arxiv.org/html/2606.00232#A5.T9) (Frozen, three BoN variants, three contrastive-decoding variants, three iterative-refinement variants), and the five internal ablations (`flat_baseline`, `tiger_no_graph`, `tiger_no_gy`, `tiger_no_lambda`, `tiger_no_nu`) used in Section [4.3](https://arxiv.org/html/2606.00232#S4.SS3) and Appendix [F.4](https://arxiv.org/html/2606.00232#A6.SS4). Per-dataset YAMLs under `configs/experiment/` carry the values listed in Table [10](https://arxiv.org/html/2606.00232#A5.T10).

## Appendix F Additional Experimental Results

### F.1 Cross-Backbone Generalization

Table [13](https://arxiv.org/html/2606.00232#A6.T13) reports additional COCO image→text results on three proprietary API backbones. Together with the open-source LLaVA-1.5-7B in the main results and the primary Qwen2.5-Omni-7B, this gives five backbones in total. The results show that TIGER is not tied to a specific model family. On GPT-5.5, TIGER achieves the best score on all three metrics, reducing CHAIR$_s$ from .120 to .050 and improving BERTScore from .668 to .686. On Gemini 3.5 Flash, TIGER obtains the lowest CHAIR$_s$ and the highest BERTScore, which suggests that the method reduces hallucination without sacrificing semantic coverage. On Claude Haiku 4.5, TIGER again achieves the lowest CHAIR$_s$ and the highest BERTScore, while its CHAIR$_i$ remains close to the best text-feedback variant. These trends are consistent with the main-table results on Qwen2.5-Omni-7B and LLaVA-1.5-7B.

Table [14](https://arxiv.org/html/2606.00232#A6.T14) summarizes the Frozen-to-TIGER change across all five backbones. TIGER reduces CHAIR$_s$ for every model, including both open-source and proprietary backbones. The relative reduction is 29% on Qwen2.5-Omni-7B, 50% on LLaVA-1.5-7B, 58% on GPT-5.5, 67% on Gemini 3.5 Flash, and 38% on Claude Haiku 4.5. BERTScore also improves on all five models. This pattern is important because CHAIR measures unsupported object mentions, while BERTScore measures semantic coverage relative to human captions. The joint improvement indicates that TIGER does not lower hallucination by deleting useful content. Instead, the independent graph extraction and deterministic risk ranking provide a model-agnostic feedback signal that transfers across both open-source and proprietary backbones.
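The relative reductions quoted above follow the usual formula (before minus after, divided by before). As a quick check using only the GPT-5.5 CHAIR$_s$ values stated in the text (.120 for Frozen, .050 for TIGER), with a hypothetical helper name:

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from before to after, e.g. 0.5 means a 50% reduction."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# CHAIR_s for GPT-5.5, Frozen vs. TIGER, as quoted in the text.
gpt_reduction = relative_reduction(0.120, 0.050)
print(f"{gpt_reduction:.0%}")  # prints 58%
```

This reproduces the 58% figure reported for GPT-5.5; the other backbones' Frozen and TIGER values are given in Table 14 rather than in the prose.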

Table 13: COCO image→text results on three proprietary API backbones. Bold: best per column within each backbone block. Blue: TIGER.

Table 14: Frozen-to-TIGER comparison across open-source and proprietary backbones on COCO image→text.

### F.2 Comparison with Deterministic Modality-Specific Extractors

TIGER uses the same frozen backbone $\Phi$ as both the generator and the fact extractor (Eq. [4](https://arxiv.org/html/2606.00232#S3.E4)). A natural alternative is to replace $\Phi$ in the extractor role with a deterministic modality-specific tool that does not autoregress, e.g., Grounding DINO for object detection, PANNs CNN14 for audio event tagging, or a generic dependency parser for text. To test whether this design choice matters, we re-ran COCO image→text with $G_{\mathbf{X}}$ produced by Grounding DINO instead of Qwen2.5-Omni-7B, keeping the rest of the pipeline (risk function, $\Psi_{\alpha}$ selection, refine step) fixed.

Table 15: Effect of replacing the multimodal-backbone extractor with a deterministic detector (Grounding DINO) on COCO with Qwen2.5-Omni-7B. The detector reduces extraction noise on common COCO classes but discards attributes, spatial relations, and counts that the backbone-extractor captures.

Grounding DINO reduces CHAIR$_s$ to $0.055$, close to the default $\Phi$-extractor’s $0.050$, because it is highly reliable on the 80 COCO object classes. However, BERTScore drops from $0.643$ to $0.615$: a detection-only $G_{\mathbf{X}}$ contains no attributes (“red shirt”), no spatial relations (“man riding skateboard”), and no counts, so the refine step has nothing to anchor when it checks non-object claims. The same trade-off appears for the audio and video paths: PANNs covers the AudioSet ontology but discards acoustic-event ordering, and a generic dependency parser captures syntactic structure but not entity coreference across sentences. For the multimodal paths considered in this paper, using $\Phi$ itself as the extractor yields a richer $G_{\mathbf{X}}$ at the cost of a per-sample extraction call, which we judge worthwhile.

### F.3 Computational Cost

Figure [7](https://arxiv.org/html/2606.00232#A6.F7) reports wall-clock per sample on COCO val2014 for all image-path methods under the same Qwen2.5-Omni-7B backbone on a single GPU. TIGER at $T=5$ takes $199$ s per sample, about $5\times$ Frozen ($38$ s); the cost can be lowered by reducing $T$ (the sensitivity curve in Figure [4](https://arxiv.org/html/2606.00232#S4.F4) flattens for $T \geq 3$). Iterative methods (Volcano, Woodpecker) are in the same range as TIGER, and contrastive baselines (VCD, DeGF) are cheaper because they run only a single decoding pass.

![Refer to caption](https://arxiv.org/html/2606.00232v1/x9.png)

Figure 7: Wall-clock seconds per COCO sample on a single GPU (Qwen2.5-Omni-7B backbone). Bars ordered by ascending cost.

### F.4 Graph Structure and Risk-Function Ablation

The L0–L3 ablation in Section [4.3](https://arxiv.org/html/2606.00232#S4.SS3) varies the feedback *paradigm* (no repair → joint feedback → atomic projection → graph-conditioned risk ranking). Here we hold the paradigm fixed at L3 (full TIGER) and instead remove one *internal* component at a time, to localize which piece of the graph machinery contributes to the final score. Table [16](https://arxiv.org/html/2606.00232#A6.T16) reports CHAIR$_s$, CHAIR$_i$, and BERTScore on COCO val2014 under Qwen2.5-Omni-7B for three ablations of full TIGER: $-G_{\mathbf{Y}}$ (direct rewrite) replaces fact-level risk-driven repair with a single evidence-conditioned rewrite: the backbone receives all $G_{\mathbf{X}}$ facts and rewrites the entire output without constructing $G_{\mathbf{Y}_{t}}$; $-$graph (flat+repair) sets $K=0$ in Eq. ([14](https://arxiv.org/html/2606.00232#A3.E14)), so $s(f) = s_{0}(f)$ with no coreference propagation; $-\lambda$ ($\lambda=0$) zeros out the conflict term in the risk function so $r(f) = 1 - s(f)$. The Frozen row (the L0 baseline in Figure [3](https://arxiv.org/html/2606.00232#S4.F3)) is included for reference.

Table 16: Internal-component ablation on COCO. Each row removes one component from full TIGER.

Three findings stand out. First, every ablation lies between full TIGER and Frozen on all three metrics, which confirms that the three components are jointly necessary: removing any single one gives up a fraction of the TIGER-over-Frozen gap but does not collapse the repair loop. Second, the ablations order cleanly by the component they remove. The $-\lambda$ variant ($\lambda=0$, $r(f) = 1 - s(f)$) is closest to full TIGER; without the conflict term the selector can no longer separate weakly-supported claims from actively contradicted ones, but most flagged facts are still correctly low-support and the repair loop remains useful. The $-$graph variant ($K=0$, no coreference propagation) sits slightly further from TIGER; without propagation the support score uses only direct matches, so facts that are supported only indirectly through a shared entity get over-flagged and revised away. Removing the claim graph $G_{\mathbf{Y}}$ entirely is the strongest single ablation: without $G_{\mathbf{Y}}$ the refine step has no per-claim risk to act on and falls back to a single evidence-conditioned rewrite that cannot localize edits. Third, none of the three single-component ablations matches the joint $L_{0} \to L_{1}$ regression in Figure [3](https://arxiv.org/html/2606.00232#S4.F3), where the full feedback paradigm is changed from atomic-with-risk to joint-conditioning. This indicates that the paradigm choice (atomic projection + risk ranking, taken as a whole) is the dominant source of TIGER’s gain over self-correction baselines, while each internal component within that paradigm contributes a smaller improvement.