MODE-RAG: Manifold Outlier Diagnosis and Energy-based Retrieval-Augmented Generation Evaluation
Summary
Introduces MODE-RAG, a multi-agent system using Variational Free Energy and Monte Carlo Tree Search to dynamically gate interventions for mitigating hallucinations in Multimodal Retrieval-Augmented Generation systems, along with the ModeVent evaluation dataset.
Cached at: 06/17/26, 05:40 AM
# MODE-RAG: Manifold Outlier Diagnosis and Energy-based Retrieval-Augmented Generation Evaluation
Source: [https://arxiv.org/html/2606.17449](https://arxiv.org/html/2606.17449)
Zehang Wei2∗, Jiaxin Dai2∗, Jiamin Yan2, Xiang Xiang1∗ 1School of Computer Science & Tech, Huazhong University of Science and Technology 2School of AI and Automation, Huazhong University of Science and Technology, China xex@hust\.edu\.cn
###### Abstract
While Multimodal Retrieval\-Augmented Generation \(M\-RAG\) enhances Large Vision\-Language Models, it remains highly susceptible to cross\-modal hallucinations, causal fabrications, and sycophancy\. Furthermore, existing mitigation pipelines often face an intervention paradox: static rules tend to unnecessarily disrupt accurate generations, whereas leaving the multi\-modal reasoning completely unguided allows existing mismatches to cascade into severe logical fabrications\. To quantify and mitigate these hallucinations, we propose a Multi\-Agent system, MODE\-RAG, driven by Variational Free Energy \(VFE\) and internal attention states to dynamically gate interventions\. High\-risk queries are routed to five stage\-specific agents, integrating Monte Carlo Tree Search \(MCTS\) for rigorous causal derivation and logit perturbations to penalize sycophancy\. Dedicated Correction and Overseer agents ensure formatting stability and perform post\-hoc factual verification\. To objectively evaluate our approach, we introduce ModeVent, a challenging subset derived from the MultiVent dataset\. Extensive experiments indicate that our system effectively reduces hallucination rates and logical fabrication, significantly improving the robustness of M\-RAG systems\.
∗Equal contribution, co\-first author\.
## 1 Introduction
Using large language models \(LLMs\) as their kernel, Multimodal Retrieval\-Augmented Generation \(M\-RAG\) systems can now tackle complex visual question\-answering tasks by retrieving external visual knowledge\. However, they frequently hallucinate, generating fabricated interpretations of the given visual content\. Evaluating and mitigating these hallucinations is crucial for the deployment of reliable M\-RAG systems\.
Figure 1: Architectural overview of the MODE\-RAG framework\. The system resolves the intervention paradox through a VFE\-driven FE\-Router that dynamically routes queries based on hallucination risk \($\bar{\mathcal{F}}$\)\. Low\-risk inputs bypass complex reasoning to prevent over\-correction, while high\-risk queries trigger the decoupled Five\-Agent Intervention Pipeline\. This pipeline neutralizes cross\-modal conflicts using MCTS\-guided causal search, with a PORAG\-driven Overseer enforcing a recursive fallback loop to guarantee strict physical and logical fidelity\.

Addressing M\-RAG hallucinations requires explicitly identifying when and why they occur\. Following the data flow of answering a multimodal query, we systematically categorize M\-RAG hallucinations into nine types across four lifecycle stages:
1\. Perception\-level \(entity feature, physical common sense, and information omission\);
2\. Retrieval\-level \(retrieval misalignment and modality conflict\);
3\. Reasoning\-level \(temporal inversion and imposed causality\);
4\. Generation\-level \(information fabrication and subjective bias\)\.
Analyzing the typical M\-RAG architecture reveals critical flaws that trigger these hallucinations\. Traditional RAG relies heavily on static pipelines and cosine similarity, which inherently fail to disentangle complex visual\-textual conflicts\. Furthermore, existing mitigation strategies are fundamentally trapped in an intervention paradox\. On the one hand, enforcing blind, rule\-based constraints across all queries frequently leads to over\-correction, degrading inherently accurate outputs\. On the other hand, relying entirely on lightweight LLMs for unguided multi\-step reasoning introduces formatting instability, which ultimately triggers cascading structural failures and exacerbates multimodal conflicts\. Additionally, when faced with aggressive user queries, the LLM kernel tends to overrule visual evidence and cater to the user, a phenomenon known as sycophancy\.
Motivated by these mechanistic causes, we propose MODE\-RAG \(Causal\-Energy RAG\), a mechanistically grounded Multi\-Agent framework designed to quantify and dynamically mitigate misinformation\. Instead of static pipelines, our system operates through a highly decoupled architecture:
Central Hub \(FE\-Router\): An adaptive routing gate driven by Variational Free Energy \(VFE\) and internal attention states \(ATLAS\)\. It evaluates multimodal uncertainty upfront\. Low\-risk queries bypass the pipeline to prevent over\-correction, while high\-risk queries trigger the specialized agents\. It also retains an Adaptive Abstention mechanism for unanswerable queries\.
Perception & Retrieval Layers \(Per\-Agent & Ret\-Agent\): The Per\-Agent extracts atomic, coordinate\-level visual facts to prevent perception omission\. Subsequently, the Ret\-Agent enforces a strict "visual\-first" cross\-alignment, pruning pseudo\-relevant external texts that carry modality conflicts\.
Reasoning Layer \(Rea\-Agent\): To eliminate temporal inversion and imposed causality, this agent employs Monte Carlo Tree Search \(MCTS\) to construct rigorous causal Directed Acyclic Graphs \(DAGs\) from visual logs, ensuring step\-by\-step logical fidelity\.
To evaluate our approach, we construct ModeVent, a subset sourced from the MultiVent dataset \(MAGMaR\)\. We leverage VFE to identify the polar extremes of the uncertainty distribution, selecting the 500 highest\-risk boundary cases \(manifold outliers\) and the 500 lowest\-risk stable samples\. While the latter serve as a reliable baseline, the former act as adversarial queries that severely test M\-RAG models under visual\-textual conflicts\. Consequently, ModeVent provides a rigorous environment to assess a system’s robustness against the nine aforementioned hallucination types\.
To sum up, our major contributions include:
- We propose MODE\-RAG, a mechanistically grounded Multi\-Agent framework for multimodal hallucination mitigation\. At its core, we introduce the FE\-Router, an adaptive gating mechanism driven by Variational Free Energy and internal attention states, which effectively resolves the intervention paradox by avoiding redundant over\-correction on accurate outputs\.
- We design decoupled, stage\-specific algorithmic interventions to address complex cross\-modal mismatches\. Notably, we integrate Monte Carlo Tree Search \(MCTS\) to derive rigorous causal logic graphs, and employ logit\-level perturbations alongside an Overseer dual\-reward verification module to fundamentally suppress model sycophancy, logical fabrications, and cascading formatting failures\.
- We construct and release ModeVent, a targeted evaluation benchmark derived from the MultiVent dataset\. Extensive experiments demonstrate the viability of our architecture in significantly reducing hallucinations and enhancing complex multi\-step reasoning robustness\.
## 2 Related Work
Retrieval\-Augmented Generation \(RAG\) was initially developed to mitigate the knowledge deficits of Large Language Models \(LLMs\) by integrating external evidence Lewis et al\. \([2020](https://arxiv.org/html/2606.17449#bib.bib11)\); Gao et al\. \([2023](https://arxiv.org/html/2606.17449#bib.bib6)\)\. With the advancement of multimodal kernels such as Qwen\-VL Bai et al\. \([2023](https://arxiv.org/html/2606.17449#bib.bib2)\), M\-RAG has been extended to complex visual question\-answering tasks Chen et al\. \([2022](https://arxiv.org/html/2606.17449#bib.bib3)\); Yasunaga et al\. \([2022](https://arxiv.org/html/2606.17449#bib.bib25)\)\. However, the performance of these systems is inherently limited by the quality of retrieved content; irrelevant or noisy context can significantly degrade model fidelity Yoran et al\. \([2024](https://arxiv.org/html/2606.17449#bib.bib27)\); Cuconasu et al\. \([2024](https://arxiv.org/html/2606.17449#bib.bib4)\)\. In multimodal scenarios, this often manifests as cross\-modal hallucinations, where the model generates interpretations that contradict the given visual evidence Ji et al\. \([2023](https://arxiv.org/html/2606.17449#bib.bib9)\); Li et al\. \([2023](https://arxiv.org/html/2606.17449#bib.bib12)\)\. While some approaches attempt self\-checking mechanisms Asai et al\. \([2024](https://arxiv.org/html/2606.17449#bib.bib1)\), they struggle to appropriately balance the correction boundaries\. These methods either impose overly strict constraints that penalize faithful visual interpretations, or provide insufficient intervention, thereby failing to prevent the model’s inherent sycophancy and logical drift during complex query processing\. Consequently, this intervention paradox remains unresolved in current static pipelines\. To mitigate the inefficiencies of fixed\-interval retrieval, recent research has shifted towards dynamic retrieval mechanisms\. For instance, DRAGIN Su et al\. \([2024](https://arxiv.org/html/2606.17449#bib.bib20)\) detects real\-time information needs based on model uncertainty, while Speculative RAG Wang et al\. \([2024](https://arxiv.org/html/2606.17449#bib.bib22)\) and MemoRAG Qian et al\. \([2024](https://arxiv.org/html/2606.17449#bib.bib15)\) utilize drafting and cognitive memory systems to improve consistency\.
Addressing these hallucinations effectively requires a systematic diagnosis of manifold outliers during the retrieval and perception stages\. When processing feature vectors from encoders like CLIP Radford et al\. \([2021](https://arxiv.org/html/2606.17449#bib.bib16)\) or SigLIP Zhai et al\. \([2023](https://arxiv.org/html/2606.17449#bib.bib28)\), traditional distance metrics often fail due to feature dimension anisotropy\. Unsupervised geometric methods such as K\-Nearest Neighbors \(KNN\) have been explored to evaluate sample sparsity in the latent space Sun et al\. \([2022](https://arxiv.org/html/2606.17449#bib.bib21)\), while global whitening transformations can ensure an isotropic manifold for better semantic matching Su et al\. \([2021](https://arxiv.org/html/2606.17449#bib.bib19)\)\. Unlike static pipelines, a more robust approach necessitates a dynamic gating mechanism that can assess the risk of retrieved content and determine the necessity of intervention upfront\.
From a mechanistic perspective, the model’s susceptibility to misinformation can be quantified by monitoring its internal states\. Building on Energy\-Based Models \(EBMs\) and the Helmholtz Free Energy \(HFE\) principle Liu et al\. \([2020](https://arxiv.org/html/2606.17449#bib.bib13)\); Friston \([2010](https://arxiv.org/html/2606.17449#bib.bib5)\), recent work Sakhinana et al\. \([2025](https://arxiv.org/html/2606.17449#bib.bib17)\) introduced the Attention\-based Transparent Latent Assessment System \(ATLAS\) and proposed the use of Monte Carlo Tree Search \(MCTS\) for verifying reasoning trajectories\. ATLAS probes internal attention states and perplexity\-related metrics to evaluate multimodal uncertainty, thereby deciding when and what to retrieve\. Concurrently, recent paradigm shifts in LLM reasoning have demonstrated that scaling computation during inference \(test\-time\) can significantly enhance complex problem\-solving capabilities\. Techniques such as Test\-Time Computing \(TTC\) Ji et al\. \([2025](https://arxiv.org/html/2606.17449#bib.bib8)\) and recurrent depth scaling Geiping et al\. \([2025](https://arxiv.org/html/2606.17449#bib.bib7)\) adapt reasoning depth dynamically\. To navigate complex logical spaces, structured search algorithms like MCTS have been integrated into LLM decoding, as seen in Marco\-o1 Zhao et al\. \([2024](https://arxiv.org/html/2606.17449#bib.bib29)\) and STILL\-1 Jiang et al\. \([2024](https://arxiv.org/html/2606.17449#bib.bib10)\), with AStar Wu et al\. \([2025](https://arxiv.org/html/2606.17449#bib.bib23)\) extending these structured reasoning methods to multimodal tasks\. In this work, we integrate these advanced diagnostic and reasoning tools into a decoupled multi\-agent framework\. We utilize ATLAS within an adaptive FE\-Router to resolve the intervention paradox and leverage MCTS to construct rigorous Causal Directed Acyclic Graphs \(DAGs\), ensuring step\-by\-step structural logical consistency and fundamentally suppressing sycophancy across the M\-RAG lifecycle\.
## 3 Dataset
To evaluate the robustness of multimodal retrieval\-augmented generation \(M\-RAG\) systems against cross\-modal conflicts and mechanistic failures, we introduce ModeVent, a diagnostic benchmark\.
### 3\.1 Construction Methodology
The construction of ModeVent involves a systematic diagnosis of the latent space across the entire MultiVent dataset\. The selection process is executed in three stages:
First, we perform a full\-scale evaluation of all samples in the MultiVent population\. Feature vectors are extracted using SigLIP and CLIP encoders, followed by a global whitening transformation to ensure an isotropic manifold where Euclidean distances faithfully represent semantic dissimilarity\.
Second, for every evaluated sample, we compute its mean VFE\. This metric serves as a mechanistic proxy for the model’s epistemic uncertainty, capturing the degree of conflict between the visual scene and the user claim\.
Third, rather than utilizing arbitrary hard thresholds, we rank the entire population based on the calculated VFE scores\. We then select the 500 samples with the highest VFE values to constitute the manifold outliers and the 500 samples with the lowest VFE values to serve as stable inliers\. This results in a final benchmark of 1,000 samples that represent the polar extremes of the uncertainty distribution\.
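The ranking step in the third stage can be sketched as follows. This is a minimal illustration only: it assumes the per-sample mean VFE has already been computed, and the function name `select_extremes` and the toy scores are hypothetical, not taken from the released benchmark code.

```python
from typing import Dict, List, Tuple


def select_extremes(vfe_by_id: Dict[str, float], k: int) -> Tuple[List[str], List[str]]:
    """Return (outliers, inliers): the k highest- and k lowest-VFE sample ids."""
    if k <= 0:
        raise ValueError("k must be positive")
    if 2 * k > len(vfe_by_id):
        raise ValueError("k is too large: the two subsets would overlap")
    # Rank the whole population by VFE instead of applying a hard threshold.
    ranked = sorted(vfe_by_id, key=vfe_by_id.get)  # ascending VFE
    inliers = ranked[:k]            # lowest-VFE, stable samples
    outliers = ranked[-k:][::-1]    # highest-VFE samples, highest first
    return outliers, inliers


# Toy population of six samples (hypothetical scores); ModeVent uses k = 500.
scores = {"a": 0.2, "b": 3.1, "c": 1.0, "d": 2.4, "e": 0.5, "f": 1.7}
outliers, inliers = select_extremes(scores, k=2)
```

Ranking rather than thresholding keeps both subsets at a fixed size regardless of how the VFE values are scaled.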
### 3\.2 Dataset Characteristics
The bimodal composition of ModeVent allows for a rigorous assessment of the intervention paradox\. The high\-VFE subset represents adversarial\-like boundary cases where the model is most susceptible to sycophancy or causal imposition\. In these cases, the semantic stability is significantly lower, and the noise ratio is elevated, as shown in our quantitative analysis in [Fig\. 2](https://arxiv.org/html/2606.17449#S4.F2)\.
Conversely, the low\-VFE subset provides a stable baseline of well\-aligned multimodal queries\. This ensures that the gating mechanisms of MODE\-RAG can be tested for their ability to bypass unnecessary interventions, thereby maintaining the inherent accuracy of the underlying LLM kernel when no significant conflict is detected\. By targeting these extremes, ModeVent provides a more challenging and informative evaluation environment than standard multimodal datasets\.
## 4 Methodology: The MODE\-RAG Framework
We propose MODE\-RAG \(Multimodal Objective Diagnostic Energy\-RAG\), a Multi\-Agent framework designed to resolve the intervention paradox in multimodal reasoning\. The architecture is structured as a hierarchical, energy\-gated system that selectively triggers high\-fidelity reasoning only when epistemic uncertainty is detected\. As illustrated in the system diagram, the framework comprises a diagnostic data pipeline, two gating mechanisms, and a decoupled five\-agent pipeline\.
### 4\.1 Thermodynamic Gating: The FE\-Router
The entry point of the MODE\-RAG system is the FE\-Router, which serves as a “Thermodynamic Gate\.” Utilizing the ATLAS Probe, the router performs real\-time Energy Detection by calculating the Variational Free Energy \(VFE\) of the predictive distribution Friston \([2010](https://arxiv.org/html/2606.17449#bib.bib5)\)\. For a model with vocabulary $V$ and logit output $f(x)$, given a variational distribution $q(j)$ over the tokens, the VFE \($\mathcal{F}$\) at temperature $\tau$ is defined as

$$\mathcal{F}(q,x;\tau)=\sum_{j=1}^{|V|}q(j)\left[-f_{j}(x)+\tau\log q(j)\right] \tag{1}$$

where $-f_{j}(x)$ represents the internal energy of the $j$\-th state and $\tau\log q(j)$ contributes to the entropic regularization\. This formulation captures the discrepancy between the model’s internal beliefs and the categorical evidence provided by the input\.
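As a numerical illustration of Eq. (1), the sketch below evaluates the free energy for an arbitrary distribution q and compares it with the standard closed form obtained when q is the tempered softmax of the logits, namely −τ·log Σ_j exp(f_j/τ). The choice q = softmax(f/τ) is an assumption made here for illustration; the text does not state which q the FE-Router uses, and all function names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float], tau: float) -> List[float]:
    """Tempered softmax, computed stably by subtracting the maximum."""
    m = max(l / tau for l in logits)
    exps = [math.exp(l / tau - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def free_energy(q: List[float], logits: List[float], tau: float) -> float:
    """Eq. (1): F(q, x; tau) = sum_j q(j) * (-f_j(x) + tau * log q(j))."""
    return sum(
        qj * (-fj + tau * math.log(qj))
        for qj, fj in zip(q, logits)
        if qj > 0.0  # the term q*log(q) tends to 0 as q tends to 0
    )


def min_free_energy(logits: List[float], tau: float) -> float:
    """Closed form at q = softmax(f / tau): -tau * logsumexp(f / tau)."""
    m = max(l / tau for l in logits)
    return -tau * (m + math.log(sum(math.exp(l / tau - m) for l in logits)))


logits = [2.0, 0.5, -1.0]  # toy logits over a three-token vocabulary
tau = 0.7
f_softmax = free_energy(softmax(logits, tau), logits, tau)
f_uniform = free_energy([1 / 3, 1 / 3, 1 / 3], logits, tau)
f_closed = min_free_energy(logits, tau)
```

Here `f_softmax` matches `f_closed` up to floating-point error, and any other distribution, such as the uniform one, yields a free energy that is at least as large.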
When the input presents a Causal Imposition Conflict, where a user’s “Claim” \(e\.g\., a flawless backflip\) contradicts the “Scene” \(e\.g\., standing still\), the VFE typically spikes, signaling high epistemic uncertainty and a breakdown in predictive coding\. If the mean variational free energy $\bar{\mathcal{F}}>\gamma$, the FE\-Router intercepts the standard generation and activates the specialized Agentic Pipeline\.
Figure 2: Thermodynamic empirical evidence: \(a\) VFE distribution across subsets used for $\gamma$ calibration; \(b\) Correlation between Energy and Stability\.
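The gating rule can be sketched as below. This is an illustrative reading, not the authors' implementation: it assumes the mean is taken over per-step free energies in closed form (q equal to the tempered softmax), and the threshold value used in the example is hypothetical rather than the calibrated γ.

```python
import math
from typing import List, Sequence


def step_free_energy(logits: Sequence[float], tau: float = 1.0) -> float:
    """Free energy of one decoding step: -tau * logsumexp(f / tau)."""
    m = max(l / tau for l in logits)
    return -tau * (m + math.log(sum(math.exp(l / tau - m) for l in logits)))


def mean_free_energy(per_step_logits: List[Sequence[float]], tau: float = 1.0) -> float:
    """Average the per-step free energies over the whole sequence."""
    if not per_step_logits:
        raise ValueError("need at least one decoding step")
    return sum(step_free_energy(l, tau) for l in per_step_logits) / len(per_step_logits)


def route(per_step_logits: List[Sequence[float]], gamma: float, tau: float = 1.0) -> str:
    """Return 'pipeline' when the mean free energy exceeds gamma, else 'bypass'."""
    return "pipeline" if mean_free_energy(per_step_logits, tau) > gamma else "bypass"


confident = [[8.0, 0.0, 0.0], [7.5, 0.0, 0.0]]  # one token clearly dominates
uncertain = [[0.0, 0.0, 0.0], [0.2, 0.1, 0.0]]  # flat, low-magnitude logits
gamma = -3.0  # hypothetical threshold, not the value calibrated in the paper

decision_confident = route(confident, gamma)
decision_uncertain = route(uncertain, gamma)
```

With these toy logits the confident sequence bypasses the agents while the flat one is diverted into the pipeline.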
### 4\.2 The MODE\-RAG Five\-Agent Decoupled Intervention Pipeline
Upon activation by the FE\-Router, the query is diverted into a specialized multi\-agent ecosystem Wu et al\. \([2024](https://arxiv.org/html/2606.17449#bib.bib24)\)\. This pipeline is designed to decouple the monolithic reasoning process into five granular, verifiable stages, ensuring that each potential source of hallucination, from perception errors to sycophantic synthesis, is systematically neutralized\.
#### Per\-Agent: Atomic Facts Objective Scan\.
The Per\-Agent serves as the framework’s sensory foundation, performing an Atomic Facts Objective Scan\. It extracts symbolic triplets $\mathcal{V}=\{\langle s,p,o\rangle\}$ from the visual stream \(e\.g\., $\langle\text{subject},\text{is},\text{stationary}\rangle$\)\. By utilizing high\-resolution spatial\-temporal grounding, the Per\-Agent fixates on physical invariants, creating a “Grounded Truth Anchor\.” This ensures that subsequent reasoning agents cannot bypass the physical reality of the scene in favor of the user’s potentially biased “Claim\.”
#### Cor\-Agent: Facade Pattern Algorithmic Wrap\.
The Cor\-Agent acts as the structural architect by implementing a Facade Pattern Algorithmic Wrap\. Its primary role is to maintain the integrity of the cross\-agent data flow\. By encapsulating raw multimodal features and the Per\-Agent’s triplets into a strictly validated programmatic schema \(e\.g\., JSON\-Schema\), the Cor\-Agent prevents “semantic noise leakage\.” This wrapper ensures that the complex reasoning in later stages is performed on structured, high\-fidelity data rather than ambiguous natural language strings\.
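A minimal sketch of the idea behind the Per-Agent triplets and the Cor-Agent wrapper is given below. It is a hand-rolled stand-in for a real JSON-Schema validator, and the payload layout (a `triplets` key holding three-element string lists) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Triplet:
    """One symbolic fact <s, p, o> extracted from the visual stream."""
    subject: str
    predicate: str
    obj: str


def validate_payload(payload: Dict[str, Any]) -> List[Triplet]:
    """Reject anything that is not a non-empty list of three non-empty strings."""
    triplets = payload.get("triplets")
    if not isinstance(triplets, list) or not triplets:
        raise ValueError("payload must contain a non-empty 'triplets' list")
    validated = []
    for i, item in enumerate(triplets):
        well_formed = (
            isinstance(item, list)
            and len(item) == 3
            and all(isinstance(x, str) and x.strip() for x in item)
        )
        if not well_formed:
            raise ValueError(f"triplet {i} must be three non-empty strings")
        validated.append(Triplet(*(x.strip() for x in item)))
    return validated


facts = validate_payload({"triplets": [["subject", "is", "stationary"]]})
```

Downstream agents then consume `Triplet` objects rather than free-form strings, so a malformed upstream output fails loudly at the boundary.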
#### Ret\-Agent: RAG Alignment and Conflict Discard\.
The Ret\-Agent manages the external knowledge interface to mitigate Sycophancy, where the model over\-relies on biased retrieved documents\. Beyond simple semantic similarity, the agent evaluates the Manifold Fidelity of each document $d_{i}$ by measuring its alignment with the grounded triplets $\mathcal{V}$ in the whitened latent space Su et al\. \([2021](https://arxiv.org/html/2606.17449#bib.bib19)\)\. The filtering mechanism is set as
$$\text{Score}(d_{i})=\text{Sim}_{i}\cdot\mathcal{S}_{i}\cdot\mathbb{I}_{i} \tag{2}$$

where the exponential term penalizes documents that fall into the high\-energy "Log\-Outlier" regions identified in Fig\. [2](https://arxiv.org/html/2606.17449#S4.F2)b\.
By calculating the distance between the retrieved context and the physical invariants $\mathcal{V}$, the Ret\-Agent proactively identifies contexts that trigger Energy Collapse\. If a retrieved document $d_{i}$ promotes a causal fabrication that contradicts the physical evidence \(e\.g\., describing a backflip during a stationary state\), its stability score drops toward the outlier cluster, triggering a Conflict Discard operation to prune the biased context before it reaches the reasoning layer\.

Figure 3: When a multimodal query is accompanied by a potentially adversarial retrieved context, the FE\-Router dynamically evaluates the epistemic risk via Variational Free Energy \(VFE\) and ATLAS attention states\. High\-risk queries exceeding the threshold trigger a decoupled five\-agent pipeline: the Per\-Agent extracts objective multimodal facts, the Cor\-Agent enforces schema validation to block semantic noise, the Ret\-Agent evaluates manifold distance to discard conflicting context, the Rea\-Agent constructs a causal DAG via MCTS, and the Gen\-Agent synthesizes the output using logit perturbation\. Finally, a PORAG\-driven Overseer conducts a triple\-consistency check, activating a Recursive Fallback Mechanism for unresolved conflicts to ensure a hallucination\-free factual report\.
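One possible reading of the filtering score in Eq. (2) is sketched below. The text does not define the factors precisely, so everything here is an assumption: Sim is taken as cosine similarity between already-whitened vectors, the stability term as exp(−β·energy), and the indicator as 0 when a document contradicts the grounded triplets. The class `Doc`, the parameter `beta`, and the threshold are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Doc:
    name: str
    vec: Sequence[float]   # embedding, assumed to be whitened already
    energy: float          # assumed per-document energy (higher = more outlying)
    contradicts: bool      # assumed flag: conflicts with the grounded triplets


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def score(doc: Doc, anchor: Sequence[float], beta: float = 1.0) -> float:
    """Score = Sim * S * I under the assumptions stated above."""
    sim = cosine(doc.vec, anchor)
    stability = math.exp(-beta * doc.energy)   # assumed exponential penalty
    indicator = 0.0 if doc.contradicts else 1.0
    return sim * stability * indicator


def conflict_discard(docs: List[Doc], anchor: Sequence[float], threshold: float) -> List[str]:
    """Keep only the documents whose score reaches the threshold."""
    return [d.name for d in docs if score(d, anchor) >= threshold]


anchor = [1.0, 0.0]
docs = [
    Doc("aligned", [1.0, 0.0], energy=0.0, contradicts=False),
    Doc("conflict", [1.0, 0.0], energy=0.0, contradicts=True),
    Doc("outlier", [1.0, 0.0], energy=5.0, contradicts=False),
]
kept = conflict_discard(docs, anchor, threshold=0.5)
```

The multiplicative form means a document is dropped if any single factor collapses, even when its raw similarity is high.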
#### Rea\-Agent: Test\-Time Scaling via MCTS\.
The Rea\-Agent is the cognitive engine of MODE\-RAG, implementing Monte Carlo Tree Search \(MCTS\) for test\-time reasoning scaling Silver et al\. \([2016](https://arxiv.org/html/2606.17449#bib.bib18)\)\. Drawing on policy optimization principles, the Rea\-Agent explores the logical space by constructing a Causal Directed Acyclic Graph \(DAG\)\.
The MCTS process follows a four\-phase cycle to identify the most plausible causal trajectory:
- Selection: Starting from the root \(observed scene\), the agent traverses the tree using the Upper Confidence Bound for Trees \(UCT\) formula:

  $$\text{UCT}(s,a)=Q(s,a)+c_{\text{puct}}\cdot P(a|s)\cdot\frac{\sqrt{\sum N}}{1+N(s,a)} \tag{3}$$

  This balances the exploitation of high\-fidelity paths with the exploration of alternative causal interpretations\.
- Expansion & Simulation: For each leaf node, the agent generates $k$ candidate reasoning steps and performs a Rollout to simulate logical consequences \(“If the state is stationary, is the claimed action physically reachable?”\)\.
- Evaluation & Backpropagation: Each path is assigned a reward $R(s)$ based on its alignment with ATLAS \(Adaptive Token\-Layer Attention Scoring\) feedback and physical constraints\. These values are propagated back to the root to update the reasoning policy\.
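The selection and backpropagation steps above can be sketched for a single tree node as follows. The sketch assumes that the sum under the square root in Eq. (3) runs over the visit counts of all actions at the node, and it uses hand-picked priors and rewards; the `Edge` class and the action names are illustrative only.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass
class Edge:
    """Statistics for one action a at a state s."""
    prior: float            # P(a|s)
    visits: int = 0         # N(s, a)
    value_sum: float = 0.0  # accumulated reward

    @property
    def q(self) -> float:
        """Mean value Q(s, a); zero for unvisited edges."""
        return self.value_sum / self.visits if self.visits else 0.0


def puct(edge: Edge, total_visits: int, c_puct: float) -> float:
    """Eq. (3), with the sum taken over all sibling visit counts (assumption)."""
    return edge.q + c_puct * edge.prior * math.sqrt(total_visits) / (1 + edge.visits)


def select(edges: Dict[str, Edge], c_puct: float = 1.5) -> str:
    total = sum(e.visits for e in edges.values())
    return max(edges, key=lambda a: puct(edges[a], total, c_puct))


def backup(edge: Edge, reward: float) -> None:
    """Propagate one rollout reward into the edge statistics."""
    edge.visits += 1
    edge.value_sum += reward


edges = {
    "stays_stationary": Edge(prior=0.7, visits=10, value_sum=8.0),
    "performs_backflip": Edge(prior=0.3),
}
first = select(edges)                       # the unvisited edge is explored
backup(edges["performs_backflip"], 0.0)     # its rollout contradicts the scene
second = select(edges)                      # search returns to the grounded path
```

The unvisited action is tried once because of the exploration bonus, and after a zero reward the search returns to the well-supported path.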
#### Gen\-Agent: Objective Synthesis and Logit Perturbation\.
The final stage is managed by the Gen\-Agent, which serves as an Information Bottleneck\. It synthesizes the MCTS findings into a coherent response\. To combat prompt\-induced bias, the Gen\-Agent applies Logit Perturbation during decoding, penalizing tokens that align with the user’s hallucination keywords while boosting tokens that align with the Rea\-Agent’s causal DAG\.
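A minimal sketch of logit perturbation at one decoding step is shown below. The additive form and the magnitudes `penalty` and `boost` are assumptions, as the text does not specify how the perturbation is parameterized; token ids are toy values.

```python
from typing import List, Set


def perturb_logits(
    logits: List[float],
    penalized_ids: Set[int],
    boosted_ids: Set[int],
    penalty: float = 2.0,
    boost: float = 1.0,
) -> List[float]:
    """Lower logits of claim-aligned tokens and raise DAG-aligned ones."""
    out = list(logits)  # leave the caller's logits untouched
    for i in penalized_ids:
        if 0 <= i < len(out):
            out[i] -= penalty
    for i in boosted_ids:
        if 0 <= i < len(out):
            out[i] += boost
    return out


def argmax(values: List[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


# Toy vocabulary: id 0 = "stationary", id 1 = "backflip", id 2 = "the".
raw = [1.0, 2.5, 0.5]
adjusted = perturb_logits(raw, penalized_ids={1}, boosted_ids={0})
```

In this toy case greedy decoding would pick the claim-aligned token from the raw logits and the evidence-aligned token after perturbation.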
### 4\.3 Quality Oversight: PORAG\-driven Overseer and Fallback Loop
The final synthesis stage is governed by the Overseer, a specialized secondary gate that implements the Policy\-Oriented RAG \(PORAG\) protocol\.
#### PORAG Fidelity Cross\-Check
The PORAG\-driven Overseer evaluates the report based on a Policy\-Grounded Fidelity Metric\. It performs a triple\-consistency check between: \(1\) the Per\-Agent’s symbolic triplets $\mathcal{V}$, \(2\) the Rea\-Agent’s causal DAG, and \(3\) the Gen\-Agent’s synthesized natural language\. By treating the response generation as a policy optimization problem, the Overseer assigns a penalty to any output that restores "hallucinatory maneuvers" previously pruned by MCTS\.
#### The Recursive Fallback Mechanism\.
A critical innovation of MODE\-RAG is its non\-linear Fallback Loop\. If the Overseer detects that the fidelity score falls below a safety threshold $\epsilon$, the system triggers a Test\-Time Reasoning Extension:
- Search Depth Scaling: The query is returned to the Rea\-Agent, which re\-initiates MCTS with a significantly increased simulation budget $N$ and a broader expansion factor $k$\.
- Epistemic Refusal: If after $M$ recursive attempts the causal conflict remains unresolved, the Overseer forces the system into a state of Epistemic Refusal, outputting a "Corrected Factual Report" that explicitly identifies the contradiction between visual evidence and user claim\.
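The control flow of the fallback loop can be sketched as below. The reasoning and scoring functions are stubs, and the scaling schedule (doubling the budget and widening k by one per attempt) is an assumption chosen for illustration; the text only states that both are increased.

```python
from typing import Callable, Tuple

REFUSAL = "Corrected Factual Report: the visual evidence contradicts the user claim."


def fallback_loop(
    reason: Callable[[int, int], str],   # runs the search with (budget, k)
    fidelity: Callable[[str], float],    # Overseer score for a draft
    epsilon: float,
    max_attempts: int,
    budget: int = 64,
    k: int = 3,
) -> Tuple[str, str]:
    """Retry with a larger search until the draft passes or attempts run out."""
    for _ in range(max_attempts):
        draft = reason(budget, k)
        if fidelity(draft) >= epsilon:
            return "accepted", draft
        budget *= 2  # assumed schedule: double the simulation budget
        k += 1       # assumed schedule: widen the expansion factor
    return "epistemic_refusal", REFUSAL


# Stubs: the draft only becomes faithful once the budget reaches 256.
def stub_reason(budget: int, k: int) -> str:
    return f"draft@{budget}"


def stub_fidelity(draft: str) -> float:
    return 0.9 if draft.endswith("@256") else 0.1


resolved = fallback_loop(stub_reason, stub_fidelity, epsilon=0.5, max_attempts=4)
refused = fallback_loop(stub_reason, stub_fidelity, epsilon=0.5, max_attempts=2)
```

Bounding the number of attempts keeps the worst-case latency finite, at the cost of an explicit refusal when the conflict cannot be resolved.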
## 5 Experiments
To rigorously evaluate the effectiveness of MODE\-RAG, we conduct comprehensive experiments on our ModeVent benchmark\. Unlike traditional hallucination evaluations that rely on static datasets, our experimental design explicitly targets the dynamic nature of Retrieval\-Augmented Generation \(RAG\) failures\.
### 5\.1 Experimental Setup
#### RAG Errors vs\. Hallucination Typology\.
It is crucial to clarify the relationship between the experimental categories and the hallucination typology defined in Section 1\. In standard M\-RAG pipelines, a single type of retrieval error can cascade into multiple downstream generation hallucinations\. Therefore, our benchmark generates adversarial contexts across 7 distinct RAG Error Categories \(e\.g\., Attribute Hijacking, Metadata Redundancy, Information Sparsity\)\. These 7 input\-side retrieval errors act as the mechanistic triggers that induce the 9 output\-side hallucination types \(e\.g\., temporal inversion, causal fabrication\) observed in the wild\.
#### Adversarial Benchmark Generation\.
To construct a highly controlled adversarial environment, we employ an automated generation pipeline using DeepSeek\-V3\.2\. First, we establish an Objective Ground Truth \(GT\) for each video by fusing global semantic summaries generated by Qwen3\-Omni\-30B with dense, frame\-level captions extracted via Florence\-2\. Guided by these GT facts, we prompt DeepSeek to synthesize challenging user queries alongside noisy or adversarial retrieved text chunks \(mock contexts\)\. These contexts are deliberately injected with the 7 RAG errors and stratified into two difficulty levels: Inliers \(In\-Domain texts containing subtle factual discrepancies\) and Outliers \(Out\-of\-Domain texts that are entirely irrelevant or contain aggressive metadata noise\)\.
#### Baselines and Implementation Details\.
For both the Baseline and MODE\-RAG, we utilize Qwen\-2\.5\-VL\-7B, a representative 7B\-parameter instruction\-tuned Vision\-Language Model \(VLM\), as the foundational kernel\. To ensure a comprehensive evaluation, we also evaluate our framework against three established alternative mitigation paradigms: Self\-RAG, SelfCheckGPT, and Woodpecker\. Due to space constraints, the complete comparative results across all five configurations are detailed in Appendix [B](https://arxiv.org/html/2606.17449#A2)\. All experiments, including the MCTS expansion and Multi\-Agent inference, are deployed on a hardware cluster comprising 4× NVIDIA RTX 4090 GPUs\. To ensure generation stability and suppress auto\-regressive stuttering, we apply a repetition penalty of 1\.15 during decoding\.
#### LLM\-as\-a\-Judge Evaluation Mechanism\.
Due to the limitations of traditional string\-matching metrics in evaluating complex multimodal reasoning, we implement a robust LLM\-as\-a\-Judge protocol using DeepSeek\-V3\.2\. The judge is provided with the Objective GT and evaluates the model outputs across two orthogonal dimensions:
- Fidelity \(F\) \[0\-5\]: Measures the strict adherence to visual facts\. Penalizes the model for fabricating entities, imposing fake causality, or suffering from mechanistic mode collapse\.
- Resilience \(R\) \[0\-5\]: Measures the completeness of information extraction\. Penalizes the model for being hijacked by adversarial text, omitting crucial visual details, or triggering unjustified epistemic refusal\.

Table 1: Comprehensive Evaluation on the ModeVent Benchmark\. We report Fidelity \(F\), Resilience \(R\), and Total Scores across 7 major hallucination categories\. The results are further stratified by semantic distance: Inliers \(In\-Domain interference\) and Outliers \(Out\-of\-Domain irrelevance\)\. The best results in each comparison are highlighted in bold\.
### 5\.2 Main Results and Quantitative Analysis
As shown in Table [1](https://arxiv.org/html/2606.17449#S5.T1), MODE\-RAG significantly and consistently outperforms the Baseline across all 7 RAG error categories, achieving a global Average Total Score improvement of \+1\.04 \(from 4\.40 to 5\.45\)\. The dual\-dimension analysis reveals that our system successfully resolves the intervention paradox by boosting Fidelity \(ΔF = \+0\.89\) without sacrificing information extraction \(ΔR = \+0\.16\)\.
#### Conquering Outliers Hijacking\.
In Outliers scenarios, traditional RAG models suffer from severe "Attention Hijacking," where the LLM abandons visual evidence to blindly follow irrelevant or malicious text\. Our results show that MODE\-RAG excels in these extreme conditions, yielding a ΔTotal improvement of \+1\.48\. The most striking gains are observed in Majority Text Bias \(ΔTotal = \+2\.31\) and Out\-of\-Domain Irrelevance \(ΔTotal = \+1\.68\)\. This validates the efficacy of our Ret\-Agent\. By explicitly calculating the manifold distance between the text and the Visual Logic Graph, the system accurately detects epistemic uncertainty and triggers the \[EMPTY CONTEXT FALLBACK\], forcing the model to anchor its generation purely on the physical visual evidence rather than fabricated text\.
#### Refining Inliers Extraction\.
Inliers scenarios present a highly nuanced challenge: the retrieved text is semantically relevant but contains redundant metadata or slightly conflicting attributes\. A naive filtering approach often leads to unjustified refusal, resulting in low Resilience\. However, MODE\-RAG achieves a \+0\.60 ΔTotal improvement in Inliers cases\. Notably, in the Information Sparsity category, our model achieves a significant ΔTotal of \+0\.93\. This demonstrates the success of the Smart Synthesis protocol within the Gen\-Agent, which safely fuses domain\-specific nouns from the text \(e\.g\., specific names or medical terms\) with the MCTS\-verified visual actions, thereby preserving rich background context without hallucinating actions\.
Figure 4:Split violin plots of Total Scores \(Fidelity \+ Resilience\) across seven RAG error categories and overall performance\. The left \(blue\) and right \(orange\) distributions represent the Baseline and MODE\-RAG, respectively\. Dotted lines indicate the median and interquartile ranges\. MODE\-RAG significantly suppresses zero\-score catastrophic failures and shifts the performance mass towards high\-fidelity regions\.
#### Performance Stability and Failure Suppression\.
While Table [1](https://arxiv.org/html/2606.17449#S5.T1) demonstrates mean improvements, Figure [4](https://arxiv.org/html/2606.17449#S5.F4) provides a deeper look into the system’s robustness by visualizing the score distribution\. A critical observation is the suppression of “catastrophic failures” in the 0–2 score range\. In categories like Majority Text Bias and Metadata Redundancy, the Baseline distribution exhibits a significant density bulge at the bottom, corresponding to cases where the model suffered from severe mode collapse \(e\.g\., stuttering loops\) or total attention hijacking\. In contrast, MODE\-RAG’s distribution is markedly narrower at the base, effectively establishing a “safety floor” through the Dead Man’s Switch and MCTS pruning mechanisms\.
Furthermore, the Overall density for MODE\-RAG shows a decisive upward shift, with the median score and interquartile ranges positioned substantially higher than the Baseline\. This shift is most prominent in Out\-of\-Domain Irrelevance, where MODE\-RAG transforms a low\-fidelity bimodal distribution into a concentrated high\-score peak\. This indicates that the FE\-Router correctly identifies high\-uncertainty scenarios, allowing the multi\-agent pipeline to neutralize adversarial noise and anchor the final generation to the physically\-grounded visual logic\.
While the results above confirm that MODE\-RAG consistently outperforms the vanilla foundational kernel, we further evaluate our framework against alternative mitigation paradigms to ensure a thorough assessment\. The full benchmarking results across all five methods \(Vanilla Baseline, Self\-RAG, SelfCheckGPT, Woodpecker, and MODE\-RAG\) on the ModeVent dataset are detailed in Appendix[B](https://arxiv.org/html/2606.17449#A2)\.
### 5\.3 Ablation on Mechanistic Failures
Beyond semantic conflicts, our error logs revealed that lightweight LLM kernels frequently suffer from mechanistic failures under adversarial stress\. We observed two primary collapse patterns in the Baseline: Mode Collapse \(e\.g\., severe stuttering loops like "even even even"\) and Prompt Bleed\-through \(leaking internal system tags or metadata like "addCriterion"\)\. These failures historically resulted in 0\-point scores for Fidelity\. By incorporating an internal, rule\-based Dead Man’s Switch within the Gen\-Agent \(a deterministic regular\-expression interceptor\), MODE\-RAG effectively establishes a safety floor\. This mechanism neutralizes catastrophic formatting failures, downgrading to a safe textual reading\-comprehension state when the VLM’s predictive coding collapses\.
To explicitly demonstrate how our decoupled architecture resolves the intervention paradox in practice, we provide a detailed comparative case study of four adversarial testing scenarios in Appendix [A](https://arxiv.org/html/2606.17449#A1.SSx1)\.
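A regular-expression interceptor of this kind can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: the two patterns, the `SAFE_FALLBACK` string, and the function name `dead_mans_switch` are all hypothetical, since the paper does not list its actual expressions or fallback behavior.

```python
import re

# Hypothetical pattern: the same word repeated three or more times in a row,
# e.g. "even even even". The threshold of three is an assumption.
STUTTER = re.compile(r"\b(\w+)(?:\s+\1\b){2,}", re.IGNORECASE)

# Hypothetical pattern: leaked internal tags. "addCriterion" comes from the
# text above; the <|...|> form is an illustrative guess at chat-template tags.
LEAKED_TAGS = re.compile(r"addCriterion|<\|[^|>]*\|>")

# Placeholder for the "safe textual reading-comprehension state".
SAFE_FALLBACK = "Insufficient information to answer reliably."


def dead_mans_switch(generation: str, fallback: str = SAFE_FALLBACK) -> str:
    """Return the generation unchanged unless a collapse pattern is found."""
    if STUTTER.search(generation) or LEAKED_TAGS.search(generation):
        return fallback
    return generation


print(dead_mans_switch("The slope is even even even even"))
print(dead_mans_switch("The person is standing on a snowboard."))
```

A deterministic check like this is cheap and predictable, but it can also fire on legitimate repetition (for example "very very very"), so the repetition threshold would need tuning against real outputs.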
### 5\.4 Computational Efficiency Analysis
To evaluate the practical deployability of MODE\-RAG, we analyze its computational overhead against the Vanilla M\-RAG baseline across the 1,000 video queries in the ModeVent benchmark\. On average, the baseline foundational kernel requires 18\.5 seconds to process a single multimodal query\. In comparison, due to the multi\-agent orchestration and MCTS\-guided test\-time reasoning scaling, MODE\-RAG increases the average processing time to 26\.2 seconds per query\. This represents a moderate 1\.42× increase in time consumption, translating to approximately 7\.3 hours of execution time when evaluating the entire benchmark sequentially on a single\-threaded pipeline\. It is worth noting that because the stage\-specific agent interventions and evaluation queries are inherently decoupled, this computational overhead can be significantly mitigated through standard multi\-threading, asynchronous scheduling, and parallel execution techniques in production environments\.
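The overhead figures above follow directly from the reported per-query latencies. The short check below reproduces them. The `parallel_hours` helper is an added assumption: it models an idealized case in which queries are spread evenly over a hypothetical number of independent workers with no scheduling cost, which the paper does not measure.

```python
import math

BASELINE_SECONDS = 18.5   # reported average per query, Vanilla M-RAG
MODE_RAG_SECONDS = 26.2   # reported average per query, MODE-RAG
NUM_QUERIES = 1000        # size of the ModeVent benchmark


def overhead_ratio(baseline: float, system: float) -> float:
    """Relative slowdown of the system compared with the baseline."""
    return system / baseline


def sequential_hours(seconds_per_query: float, num_queries: int) -> float:
    """Wall-clock hours when queries run one after another."""
    return seconds_per_query * num_queries / 3600.0


def parallel_hours(seconds_per_query: float, num_queries: int, workers: int) -> float:
    """Idealized wall-clock hours with perfectly balanced workers (assumption)."""
    return math.ceil(num_queries / workers) * seconds_per_query / 3600.0


print(round(overhead_ratio(BASELINE_SECONDS, MODE_RAG_SECONDS), 2))  # 1.42
print(round(sequential_hours(MODE_RAG_SECONDS, NUM_QUERIES), 1))     # 7.3
print(round(parallel_hours(MODE_RAG_SECONDS, NUM_QUERIES, 8), 2))    # 0.91
```

The last line is only an upper bound on what parallelism could deliver. Real deployments share GPU memory and model weights across workers, so the achievable speedup is likely smaller.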
## 6 Conclusions
In this paper, we proposed MODE\-RAG, a mechanistically grounded multi\-agent framework that addresses the intervention paradox in multimodal RAG systems by dynamically gating interventions through a router driven by Variational Free Energy \(VFE\) and internal attention states \(ATLAS\)\. By categorizing hallucinations into nine distinct types across the system’s lifecycle, we developed specialized agents—integrating Monte Carlo Tree Search \(MCTS\) for causal derivation and logit perturbations for sycophancy suppression—to ensure factual grounding and logical consistency\. Furthermore, we introduced ModeVent, a targeted benchmark designed to evaluate system susceptibility to manifold outliers and complex visual\-textual conflicts\. Experimental results demonstrate that MODE\-RAG effectively reduces hallucination rates and enhances the structural stability of M\-RAG systems, providing a robust and scalable solution for reliable multimodal reasoning\.
## Acknowledgment
This work was supported by the Ministry of Science and Technology of China under Grant No\. 2025ZD0123800, the HUST Interdisciplinary Research Program under Grant No\. 2025JCYJ077, and the KingSoft 2026 University\-Industry Project\.
## References
- Asai et al\. \(2024\)Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi\. 2024\.Self\-rag: Learning to retrieve, generate, and critique through self\-reflection\.In*The Twelfth International Conference on Learning Representations \(ICLR\)*\.
- Bai et al\. \(2023\)Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jing Jing\. 2023\.Qwen\-vl: A versatile vision\-language model for understanding, localization, text reading, and beyond\.*arXiv preprint arXiv:2308\.12966*\.
- Chen et al\. \(2022\)Wenhu Chen, Hexiang Hu, Xi Chen, Pat Verga, and William W Cohen\. 2022\.Murag: Multimodal retrieval\-augmented generator for open question answering\.In*Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\)*, pages 10597–10607\.
- Cuconasu et al\. \(2024\)Florin Cuconasu, Giovanni Trappolini, Federico Siciliani, Simone Filice, Cesare Campagnano, Yoelle Maarek, Nicola Motta, and Fabrizio Silvestri\. 2024\.The power of noise: Redefining retrieval for rag systems\.*Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval*\.
- Friston \(2010\)Karl Friston\. 2010\.The free\-energy principle: a unified brain theory?*Nature reviews neuroscience*, 11\(2\):127–138\.
- Gao et al\. \(2023\)Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang\. 2023\.Retrieval\-augmented generation for large language models: A survey\.*arXiv preprint arXiv:2312\.10997*\.
- Geiping et al\. \(2025\)Jonas Geiping, Sean McLeish, Naman Jain, John Kirchenbauer, Siddharth Singh, Brian R Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein\. 2025\.Scaling up test\-time compute with latent reasoning: A recurrent depth approach\.*arXiv preprint arXiv:2502\.05171*\.
- Ji et al\. \(2025\)Yunjie Ji, Jiawei Li, Haiyan Ye, Kehai Wu, Jun Xu, Lin Mo, and Min Zhang\. 2025\.Test\-time computing: from system\-1 thinking to system\-2 thinking\.*arXiv preprint arXiv:2501\.02497*\.
- Ji et al\. \(2023\)Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, and Pascale Fung\. 2023\.Survey of hallucination in natural language generation\.*ACM Computing Surveys*, 55\(12\):1–38\.
- Jiang et al\. \(2024\)Jiarui Jiang, Zhiyu Chen, Yifei Min, Jian Chen, Xiaoyu Cheng, Jian Wang, Yuxin Tang, Hao Sun, Jia Deng, Wayne Xin Zhao, and 1 others\. 2024\.Technical report: Enhancing llm reasoning with reward\-guided tree search\.*arXiv preprint arXiv:2411\.11694*\.
- Lewis et al\. \(2020\)Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen\-tau Yih, Tim Rocktäschel, and 1 others\. 2020\.Retrieval\-augmented generation for knowledge\-intensive nlp tasks\.In*Advances in Neural Information Processing Systems \(NeurIPS\)*, volume 33, pages 9459–9474\.
- Li et al\. \(2023\)Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji\-Rong Wen\. 2023\.Evaluating object hallucination in large vision\-language models\.In*Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*, pages 292–305\.
- Liu et al\. \(2020\)Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li\. 2020\.Energy\-based out\-of\-distribution detection\.In*Advances in Neural Information Processing Systems \(NeurIPS\)*, volume 33, pages 21464–21475\.
- Manakul et al\. \(2023\)Potsawee Manakul, Adian Liusie, and Mark Gales\. 2023\.[SelfCheckGPT: Zero\-resource black\-box hallucination detection for generative large language models](https://aclanthology.org/2023.emnlp-main.557)\.In*Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 9004–9017, Singapore\. Association for Computational Linguistics\.
- Qian et al\. \(2024\)Hongjin Qian, Peitian Zhang, Zheng Liu, Kelong Mao, and Zhicheng Dou\. 2024\.Memorag: Moving towards next\-gen rag via memory\-inspired knowledge discovery\.*arXiv preprint arXiv:2409\.05591*\.
- Radford et al\. \(2021\)Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, and 1 others\. 2021\.Learning transferable visual models from natural language supervision\.In*International Conference on Machine Learning \(ICML\)*, pages 8748–8763\. PMLR\.
- Sakhinana et al\. \(2025\)Sagar Srinivas Sakhinana, Shivam Gupta, Akash Das, and Venkataramana Runkana\. 2025\.[Scaling test\-time inference with policy\-optimized, dynamic retrieval\-augmented generation via KV caching and decoding](https://openreview.net/forum?id=CXKwty83ji)\.In*KDD 2025 Workshop on Inference Optimization for Generative AI*\.
- Silver et al\. \(2016\)David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, and 1 others\. 2016\.Mastering the game of go with deep neural networks and tree search\.*nature*, 529\(7587\):484–489\.
- Su et al\. \(2021\)Jianlin Su, Jiarun Cao, Weijie Liu, and Yangyi Ou\. 2021\.Whitening sentence representations for better semantics and faster retrieval\.*arXiv preprint arXiv:2103\.15316*\.
- Su et al\. \(2024\)Weijia Su, Yubai Tang, Qingyao Ai, Zhijing Wu, and Yiqun Liu\. 2024\.Dragin: Dynamic retrieval augmented generation based on the real\-time information needs of large language models\.*arXiv preprint arXiv:2403\.10081*\.
- Sun et al\. \(2022\)Yiyou Sun, Yifei Ming, Xiaojin Zhu, and Yixuan Li\. 2022\.Out\-of\-distribution detection with deep nearest neighbors\.In*International Conference on Machine Learning \(ICML\)*, pages 20827–20840\. PMLR\.
- Wang et al\. \(2024\)Zijie Wang, Zihan Wang, Linyi Le, Hao Shen Zheng, Swaroop Mishra, Vincent Perot, Yashan Zhang, Ankit Mattapalli, Ankur Taly, Jingbo Shang, and 1 others\. 2024\.Speculative rag: Enhancing retrieval augmented generation through drafting\.*arXiv preprint arXiv:2407\.08223*\.
- Wu et al\. \(2025\)Jianing Wu, Mengwei Feng, Shiwei Zhang, Ren Jin, Fan Che, Zhi Wen, and Jianhua Tao\. 2025\.Boosting multimodal reasoning with mcts\-automated structured thinking\.*arXiv preprint arXiv:2502\.02339*\.
- Wu et al\. \(2024\)Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, and 1 others\. 2024\.Autogen: Enabling next\-gen llm applications via multi\-agent conversations\.In*First conference on language modeling*\.
- Yasunaga et al\. \(2022\)Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Richard Lewis, Luke Zettlemoyer, Percy Liang, Luke Zettlemoyer, and 1 others\. 2022\.Retrieval\-augmented multimodal language modeling\.In*International Conference on Machine Learning \(ICML\)*, pages 25439–25460\. PMLR\.
- Yin et al\. \(2024\)Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, and Enhong Chen\. 2024\.Woodpecker: Hallucination correction for multimodal large language models\.*Science China Information Sciences*, 67\(12\):220105\.
- Yoran et al\. \(2024\)Ori Yoran, Tomer Wolfson, Ori Ram, and Jonathan Berant\. 2024\.Making retrieval\-augmented language models robust to irrelevant context\.In*The Twelfth International Conference on Learning Representations \(ICLR\)*\.
- Zhai et al\. \(2023\)Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer\. 2023\.Sigmoid loss for language image pre\-training\.In*Proceedings of the IEEE/CVF International Conference on Computer Vision \(ICCV\)*, pages 11975–11986\.
- Zhao et al\. \(2024\)Yu Zhao, Huajian Yin, Bo Zeng, Hao Wang, Teng Shi, Chen Lyu, Longyue Wang, Weihua Luo, and Kaizhu Zhang\. 2024\.Marco\-o1: Towards open reasoning models for open\-ended solutions\.*arXiv preprint arXiv:2411\.14405*\.
## Appendix
## Appendix A Case Studies
### Mechanistic Analysis of System Interventions
To provide additional qualitative evidence of how the decoupled architecture resolves the intervention paradox, we present a comparative analysis of four adversarial queries from the ModeVent benchmark test logs\.

Table A1: Comparative analysis of system outputs across four adversarial testing scenarios\. Key hallucinations and mode collapses in the Baseline are highlighted in bold\. Case separation is enforced using heavy rules to maintain visual hierarchy\.

| Scenario | Query | Retrieved Text | Baseline \(Vanilla M\-RAG\) | MODE\-RAG \(Ours\) |
| --- | --- | --- | --- | --- |
| 1: Physical Scene Replacement | What is the person in the video doing? What is the environment like? | \[4 adversarial chunks describing intense water sports: surfing, floods, etc\.\] | The person in the video appears to be engaging in snowshoeing or cross\-country skiing in a snowy forested area… | The person in the video is standing on a snowboard, preparing to descend a snowy slope\. The environment appears to be a snowy landscape… |
| 2: Micro\-Detail Tampering | What equipment is the person using in the snow? | \[3 chunks describing "twin\-tip skis," "bright red skis," and "carbon fiber poles\."\] | The person in the video is using twin\-tip skis and carbon fiber poles\. | The person is using snowshoes in the snow\. |
| 3: High\-Order Causal Imposition | What equipment is the person using in the snow? | \[Chunks describing competitive winter sports, Olympic trials, and avalanche events\.\] | The person is performing a backflip… practicing for the Winter Olympics, attempting a triple cork 1440 spin… actually outrunning an avalanche\. | The specific action appears to be preparing for winter sports… Insufficient information to determine the background of this event\. |
| 4: Majority Text Bias | What is the person doing and where is this filmed? | \[Mixed noise: 3 water\-sports chunks vs\. 1 relevant "Swiss Alps" snowboarding chunk\.\] | The person in the video is backcountry snowboarding… The location is the Swiss Alps, which is known for its its \[Followed by stuttering mode collapse\]\. | The person is preparing to descend a snowy slope on a snowboard\. This activity is filmed in the Swiss Alps, specifically in a pristine snowy forest\. |
#### Objective Visual Ground Truth\.
Across all four test scenarios detailed in Table [A1](https://arxiv.org/html/2606.17449#A1.T1), the underlying visual evidence remains constant: the video depicts a person in a serene, snowy forest, either wearing snowshoes or standing still on a snowboard preparing to descend\. There are no extreme stunts, competitive sporting events, or water\-related elements present in the actual footage\.
Table B1: Comprehensive Evaluation on the ModeVent Benchmark\. We report Fidelity \(F\), Resilience \(R\), and Total Scores across 7 major hallucination categories, further stratified by semantic distance: Inliers and Outliers\. Notably, we introduce the Video\-adapted Woodpecker, the text\-based SelfCheckGPT, and Self\-RAG as competitive baselines\. Despite their strong performance, our proposed MODE\-RAG maintains a clear advantage across the majority of metrics and scenarios\. The best results in each comparison are highlighted in bold\.

| Error Category | Method | Inliers F | Inliers R | Inliers Total | Outliers F | Outliers R | Outliers Total | Overall F | Overall R | Overall Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Attribute Hijacking | Baseline | 1.45 | 0.85 | 2.30 | 1.91 | 1.34 | 3.25 | 1.66 | 1.08 | 2.74 |
| | SelfCheckGPT | 1.10 | 0.22 | 1.32 | 1.29 | 0.35 | 1.64 | 1.19 | 0.28 | 1.47 |
| | Self-RAG | 2.23 | 1.29 | 3.52 | 2.70 | 1.89 | 4.59 | 2.44 | 1.56 | 4.01 |
| | Woodpecker | 1.90 | 1.36 | 3.26 | 2.56 | 2.15 | 4.71 | 2.20 | 1.72 | 3.93 |
| | Ours | 2.40 | 1.26 | 3.66 | 2.38 | 1.64 | 4.02 | 2.39 | 1.44 | 3.82 |
| Causal Imposition | Baseline | 1.82 | 1.78 | 3.60 | 1.86 | 1.93 | 3.79 | 1.84 | 1.86 | 3.70 |
| | SelfCheckGPT | 1.07 | 0.45 | 1.52 | 1.00 | 0.40 | 1.40 | 1.03 | 0.42 | 1.46 |
| | Self-RAG | 2.76 | 1.63 | 4.39 | 3.05 | 2.17 | 5.23 | 2.91 | 1.91 | 4.82 |
| | Woodpecker | 2.25 | 1.76 | 4.01 | 3.30 | 3.07 | 6.37 | 2.80 | 2.44 | 5.24 |
| | Ours | 2.64 | 1.39 | 4.03 | 2.78 | 2.19 | 4.97 | 2.71 | 1.81 | 4.52 |
| Information Sparsity | Baseline | 2.32 | 1.61 | 3.92 | 2.05 | 1.88 | 3.93 | 2.20 | 1.72 | 3.93 |
| | SelfCheckGPT | 4.80 | 1.16 | 5.96 | 4.35 | 1.44 | 5.79 | 4.60 | 1.28 | 5.89 |
| | Self-RAG | 2.95 | 1.10 | 4.05 | 2.63 | 1.48 | 4.11 | 2.81 | 1.27 | 4.08 |
| | Woodpecker | 2.20 | 1.30 | 3.51 | 2.75 | 2.15 | 4.90 | 2.44 | 1.67 | 4.12 |
| | Ours | 3.14 | 1.71 | 4.86 | 3.60 | 2.10 | 5.71 | 3.34 | 1.88 | 5.22 |
| Majority Text Bias | Baseline | 2.83 | 2.98 | 5.82 | 2.43 | 2.61 | 5.04 | 2.62 | 2.79 | 5.41 |
| | SelfCheckGPT | 1.61 | 1.20 | 2.80 | 1.25 | 1.01 | 2.26 | 1.41 | 1.09 | 2.50 |
| | Self-RAG | 3.21 | 2.38 | 5.59 | 3.78 | 3.18 | 6.96 | 3.53 | 2.83 | 6.35 |
| | Woodpecker | 3.02 | 2.35 | 5.37 | 3.11 | 2.76 | 5.87 | 3.07 | 2.58 | 5.65 |
| | Ours | 3.40 | 2.83 | 6.23 | 3.91 | 3.45 | 7.36 | 3.67 | 3.16 | 6.83 |
| Metadata Redundancy | Baseline | 2.94 | 2.32 | 5.26 | 2.53 | 2.41 | 4.94 | 2.73 | 2.37 | 5.10 |
| | SelfCheckGPT | 3.99 | 2.31 | 6.29 | 3.23 | 2.21 | 5.44 | 3.61 | 2.26 | 5.86 |
| | Self-RAG | 3.01 | 2.03 | 5.04 | 3.51 | 2.74 | 6.25 | 3.26 | 2.39 | 5.65 |
| | Woodpecker | 2.35 | 1.71 | 4.06 | 2.75 | 2.63 | 5.38 | 2.55 | 2.18 | 4.73 |
| | Ours | 3.80 | 2.02 | 5.82 | 3.71 | 2.77 | 6.49 | 3.76 | 2.40 | 6.16 |
| Out-of-Domain Irrelevance | Baseline | 2.86 | 2.19 | 5.05 | 2.75 | 2.72 | 5.46 | 2.80 | 2.47 | 5.27 |
| | SelfCheckGPT | 1.34 | 0.23 | 1.56 | 2.14 | 0.41 | 2.55 | 1.75 | 0.32 | 2.06 |
| | Self-RAG | 3.55 | 2.21 | 5.76 | 3.55 | 2.74 | 6.29 | 3.55 | 2.48 | 6.03 |
| | Woodpecker | 3.01 | 2.06 | 5.07 | 3.24 | 2.97 | 6.21 | 3.13 | 2.52 | 5.64 |
| | Ours | 3.31 | 1.94 | 5.25 | 4.00 | 3.14 | 7.14 | 3.67 | 2.57 | 6.24 |
| Scene Misalignment | Baseline | 2.56 | 2.13 | 4.69 | 2.61 | 2.29 | 4.90 | 2.59 | 2.21 | 4.80 |
| | SelfCheckGPT | 1.05 | 0.02 | 1.06 | 1.03 | 0.07 | 1.10 | 1.04 | 0.05 | 1.08 |
| | Self-RAG | 3.29 | 2.26 | 5.55 | 3.27 | 2.42 | 5.70 | 3.28 | 2.34 | 5.63 |
| | Woodpecker | 2.62 | 1.94 | 4.56 | 3.03 | 2.89 | 5.91 | 2.84 | 2.44 | 5.27 |
| | Ours | 2.89 | 1.90 | 4.79 | 3.30 | 2.74 | 6.04 | 3.11 | 2.34 | 5.45 |
| Average | Baseline | 2.37 | 1.94 | 4.31 | 2.31 | 2.18 | 4.50 | 2.34 | 2.06 | 4.40 |
| | SelfCheckGPT | 2.20 | 0.80 | 3.00 | 1.99 | 0.84 | 2.83 | 2.10 | 0.82 | 2.92 |
| | Self-RAG | 2.98 | 1.80 | 4.78 | 3.24 | 2.41 | 5.64 | 3.11 | 2.11 | 5.21 |
| | Woodpecker | 2.45 | 1.75 | 4.21 | 2.97 | 2.68 | 5.65 | 2.71 | 2.22 | 4.93 |
| | Ours | 3.07 | 1.84 | 4.91 | 3.39 | 2.59 | 5.98 | 3.23 | 2.22 | 5.45 |
#### Combating Attribute Hijacking and Perception Omission\.
Scenarios 1 and 2 highlight the Baseline’s vulnerability to semantic coercion\. Despite the visual evidence clearly showing a snowboard or snowshoes, the injection of text describing “cross\-country skiing” or “twin\-tip skis” hijacked the Baseline’s attention, causing it to blindly hallucinate equipment not present in the video\. In contrast, MODE\-RAG’s Per\-Agent enforces a strict “visual\-first” extraction\. By isolating atomic visual facts before textual integration, the system successfully overrides the adversarial text, accurately maintaining the physical reality of the scene\.
#### Suppressing Sycophancy and Causal Fabrication\.
Scenario 3 demonstrates a severe case of Causal Imposition\. Confronted with text describing competitive winter sports, the Baseline model exhibits extreme sycophancy, inventing a massive, Hollywood\-style narrative involving a “triple cork 1440 spin” and “outrunning an avalanche\.” This exposes the danger of unguided LLM reasoning, where the model prioritizes narrative alignment with the text over physical constraints\. MODE\-RAG neutralizes this through the Rea\-Agent’s MCTS DAG\. Since an avalanche or a backflip cannot be topologically derived from the Per\-Agent’s root node \(standing still\), the MCTS prunes these branches entirely, allowing the Gen\-Agent to safely output a justified epistemic refusal regarding the background context\.
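The pruning criterion described here, that a claim survives only if it can be derived from the visual root node, can be illustrated with plain graph reachability. The sketch below is a deliberately simplified stand-in, not MCTS: it has no rollouts, value estimates, or selection policy, and the graph, node labels, and function names are hypothetical.

```python
from collections import deque


def reachable(graph: dict, root: str) -> set:
    """Breadth-first search: all nodes derivable from the root in a DAG."""
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def prune_claims(graph: dict, root: str, claims: list) -> tuple:
    """Split claims into those derivable from the root and those pruned."""
    ok = reachable(graph, root)
    supported = [c for c in claims if c in ok]
    pruned = [c for c in claims if c not in ok]
    return supported, pruned


# Toy derivation graph rooted at the Per-Agent's visual fact (illustrative).
graph = {
    "standing still on snowboard": ["preparing to descend"],
    "preparing to descend": ["descending slope"],
}
claims = ["preparing to descend", "backflip", "outrunning avalanche"]
supported, pruned = prune_claims(graph, "standing still on snowboard", claims)
print(supported)  # ['preparing to descend']
print(pruned)     # ['backflip', 'outrunning avalanche']
```

In this toy setting the pruned claims are exactly the ones for which the Gen-Agent would issue an epistemic refusal instead of a fabricated narrative.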
#### Preventing Mechanistic Mode Collapse\.
Scenario 4 exposes a critical physical limitation of lightweight LLM kernels\. When subjected to Majority Text Bias \(a 3:1 ratio of water\-sports noise to relevant snow text\), the Baseline model’s attention mechanism collapses under the conflicting semantic density, resulting in a stuttering loop and system paralysis\. MODE\-RAG bypasses this failure mode completely\. Prior to generation, the Ret\-Agent actively computes the manifold distance between the visual log and the candidate texts, discarding the three contradictory water\-sports chunks upfront\. This listwise cross\-check purifies the context window, feeding the generator a clean, aligned prompt that supports formatting stability and accurate factual synthesis\.
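The filtering step in Scenario 4 can be approximated with a simple distance threshold. The sketch below is an assumption-laden simplification: it substitutes cosine distance for the paper's manifold distance, scores each chunk independently (pointwise, not listwise), and uses made-up two-dimensional embeddings and a hypothetical threshold of 0.5.

```python
import math


def cosine_distance(u: list, v: list) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def filter_chunks(visual_vec: list, chunks: list, max_distance: float = 0.5) -> list:
    """Keep only chunks whose embedding lies close to the visual embedding."""
    return [text for text, vec in chunks
            if cosine_distance(visual_vec, vec) <= max_distance]


# Toy embeddings on two axes (snow-related, water-related); purely illustrative.
visual_log = [0.9, 0.1]
candidates = [
    ("surfing chunk", [0.05, 0.95]),
    ("flood chunk", [0.10, 0.90]),
    ("kayaking chunk", [0.00, 1.00]),
    ("Swiss Alps snowboarding chunk", [0.95, 0.05]),
]
print(filter_chunks(visual_log, candidates))
# ['Swiss Alps snowboarding chunk']
```

With a 3:1 ratio of noise to relevant text, a majority vote over the retrieved chunks would favor water sports, whereas anchoring the comparison to the visual embedding keeps the single aligned chunk.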
## Appendix B Results on Additional Backbones
To comprehensively verify the effectiveness of the MODE\-RAG framework, we conduct an extensive comparative analysis against multiple established mitigation paradigms in recent literature\. Specifically, our benchmark encompasses a total of five distinct methodological configurations:
1. Vanilla M\-RAG \(Baseline\): The foundational unguided VLM \(Qwen\-2\.5\-VL\-7B\) executing direct multimodal generation\.
2. Self\-RAG Asai et al\. \([2024](https://arxiv.org/html/2606.17449#bib.bib1)\): An end\-to\-end framework that trains the model to self\-reflect on retrieved passages and generations via reflection tokens\.
3. SelfCheckGPT Manakul et al\. \([2023](https://arxiv.org/html/2606.17449#bib.bib14)\): A zero\-resource sampling\-based approach that detects hallucinations via stochastic consistency checks\.
4. Woodpecker Yin et al\. \([2024](https://arxiv.org/html/2606.17449#bib.bib26)\): A training\-free, post\-hoc correction pipeline designed to rectify multi\-modal fabrications through diagnostic querying\.
5. MODE\-RAG \(Ours\): Our proposed hierarchical, variational free energy\-gated multi\-agent intervention framework\.

Table [B1](https://arxiv.org/html/2606.17449#A1.T1a) presents the full quantitative comparison across these five methods on the polar extremes of the ModeVent dataset\.
### B\.1 Implementation of Additional Baselines
While Vanilla M\-RAG requires no architectural modification and MODE\-RAG is detailed in Section 4, the remaining three baselines \(Woodpecker, SelfCheckGPT, and Self\-RAG\) were originally developed for static images or pure text\. Below, we outline the specific multimodal adaptations and pipeline configurations required to deploy them within our adversarial video RAG setting\.
#### Video\-Adapted Woodpecker\.
We adapt the Woodpecker framework Yin et al\. \([2024](https://arxiv.org/html/2606.17449#bib.bib26)\), initially an image\-centric hallucination corrector, to the video domain by shifting the focus from spatial object misidentification to temporal dynamics \(e\.g\., fabricated actions or incorrect event sequences\)\. The adapted pipeline operates in four stages: \(1\) Drafting: A standard multimodal RAG setup generates an initial answer\. \(2\) Question Generation: An LLM extracts action\-centric claims and temporal events from the draft, formulating targeted verification questions\. \(3\) Visual Verification: A Video\-LLM acts as an independent visual expert\. Crucially, external texts are masked to ensure the model relies solely on raw video frames for objective fidelity\. \(4\) Correction: The verified answers form a Visual Fact\-Sheet, guiding the LLM to revise the initial draft and prune spatiotemporal hallucinations\.
#### Multimodal SelfCheckGPT\.
To complement the visual\-centric verification, we implement an alternative, uncertainty\-based baseline by adapting SelfCheckGPT Manakul et al\. \([2023](https://arxiv.org/html/2606.17449#bib.bib14)\) from black\-box text evaluation to the multimodal RAG domain\. This zero\-shot pipeline addresses adversarial textual noise through generation consistency, executing in three stages: \(1\) Multi\-Sample Generation: Multiple independent candidate answers are generated using high\-temperature sampling\. \(2\) Consistency Voting: Instead of standard token\-level probability checks, a semantic overlap metric identifies the most frequent consensus among the candidates\. \(3\) Refinement: The LLM acts as a strict validator, cross\-referencing the candidate consensus against the raw retrieved texts to synthesize a final factual response\. Additionally, we integrate a dynamic memory\-recovery mechanism with progressive token\-throttling to handle potential Out\-Of\-Memory errors during large\-scale evaluation\.
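The consistency-voting stage can be sketched as picking the candidate that agrees most with the others. The snippet below is an illustrative assumption: it uses lexical Jaccard overlap as a stand-in for the unspecified semantic overlap metric, and the function names and sample answers are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two answers (lexical stand-in for semantics)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def consensus_candidate(candidates: list) -> str:
    """Return the candidate with the highest mean overlap with all others."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    if len(candidates) == 1:
        return candidates[0]

    def mean_overlap(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(jaccard(candidates[i], c) for c in others) / len(others)

    best = max(range(len(candidates)), key=mean_overlap)
    return candidates[best]


samples = [
    "a person stands on a snowboard",
    "a person stands on a snowboard in snow",
    "a surfer rides a wave",
]
print(consensus_candidate(samples))  # a person stands on a snowboard
```

Because the vote rewards agreement among samples and not agreement with the video, a hallucination that appears consistently across samples would still win. This may help explain the low Resilience scores reported for this baseline in Table B1.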
#### Multimodal Self\-RAG\.
Given that the original Self\-RAG Asai et al\. \([2024](https://arxiv.org/html/2606.17449#bib.bib1)\) is a text\-to\-text framework designed to critique retrieved textual passages, we adapt it for video reasoning via a two\-stage cascaded pipeline\. This approach bridges the modality gap while preserving the model’s reflective capabilities: \(1\) Visual Translation: A Vision\-Language Model first processes the raw video frames alongside the adversarial retrieved contexts to generate a comprehensive text\-based description of the visual scenes, actions, and objects\. \(2\) Reflective Generation: This visual description is subsequently injected into the Self\-RAG model using its native retrieval syntax\. Treating this textual translation as the primary retrieved evidence, the Self\-RAG model leverages its intrinsic reflection tokens to evaluate the fidelity of the provided information and synthesize the final answer to the user’s query\.
### B\.2 Result Analysis and Discussion
The comprehensive empirical results presented in Table [B1](https://arxiv.org/html/2606.17449#A1.T1a) demonstrate the performance trade\-offs, highlighting both the global strengths and the localized limitations of the proposed MODE\-RAG framework\.
#### Overall Strengths and Outlier Robustness\.
MODE\-RAG achieves the highest global performance with an Overall Average Total Score of 5\.45, outperforming all four competitive baselines \(Baseline: 4\.40, SelfCheckGPT: 2\.92, Self\-RAG: 5\.21, Woodpecker: 4\.93\)\. The primary architectural advantage of our framework lies in its exceptional robustness against Outliers \(Hard\-OOD\) scenarios, where it reaches an average total score of 5\.98\. Specifically, in categories heavily plagued by aggressive external text noise, such as Majority Text Bias \(7\.36\) and Out\-of\-Domain Irrelevance \(7\.14\), MODE\-RAG delivers a substantial performance leap\. This supports the efficacy of our thermodynamic gating via the FE\-Router and the manifold filtering via the Ret\-Agent\. By proactively evaluating the epistemic uncertainty and discarding highly mismatched text chunks upfront, our system effectively prevents the LLM kernel from experiencing attention hijacking, thereby securing a strong safety floor for factual cross\-modal synthesis\.
#### Vulnerability to Information Sparsity\.
Despite its global superiority, the multi\-agent execution within MODE\-RAG exhibits localized deficits under specific error contexts\. In the Information Sparsity category, MODE\-RAG \(Overall Total: 5\.22\) is noticeably outperformed by the text\-based SelfCheckGPT, which achieves a dominant score of 5\.89\. This deficit occurs because when the retrieved context is extremely sparse, SelfCheckGPT’s high\-temperature multi\-sample consistency voting natively excels at consensus\-driven extraction\. In contrast, our rigid multi\-agent validation schema can occasionally become overly restrictive, leading to redundant processing steps without gaining an additional informative edge\.
#### Conservative Pruning in Complex Reasoning\.
Another limitation is observed in the Causal Imposition category, where Woodpecker outperforms our method in both Outliers \(6\.37 vs\. 4\.97\) and Overall \(5\.24 vs\. 4\.52\) metrics\. A granular examination reveals that this is primarily driven by a drop in our Resilience \(R\) scores \(1\.81 vs\. Woodpecker’s 2\.44\)\. Because Woodpecker leverages an aggressive post\-hoc prompt rewriting strategy based on direct question\-answering, it forces the model to actively correct claims\. MODE\-RAG, conversely, relies on a strict MCTS causal DAG; when a claim cannot be topologically derived from the visual invariants, the system tends to trigger a conservative Epistemic Refusal \(i\.e\., acknowledging insufficient information\)\. While this strictness preserves visual Fidelity, it inherently sacrifices descriptive completeness \(Resilience\) when facing high\-order causal fabrications\.
## Appendix C Data Construction Examples
To automate the construction of the ModeVent benchmark, we leveraged DeepSeek\-V3\.2 to synthesize adversarial test scenarios from MultiVent’s ground truth\. These misleading queries are strategically designed to reflect the hallucination taxonomy introduced in [Section 1](https://arxiv.org/html/2606.17449#S1), ensuring a comprehensive evaluation of model vulnerabilities\. In this section, we present representative examples of the challenging queries generated through this pipeline\.
Example 1: Causal Imposition

- Ground Truth: This is a news report from TVBS News about a medical condition called "cytokine storm," which can be fatal\. The report features interviews with doctors from Taipei Veterans General Hospital and a nutritionist, who discuss how this immune overreaction can damage organs like the lungs, as shown by X\-ray images\. Experts advise that during the pandemic, people should manage stress, get enough sleep, and maintain a balanced diet to strengthen their immune systems\. The segment also shows scenes of public health measures, including disinfection at a train station and people receiving vaccinations\.
- Error Type: Causal Imposition
- Generated Query: What is being discussed in the news report, and what is the purported origin of the medical condition according to the background context?
Example 2: Attribute Hijacking

- Ground Truth: The video features an interview with Professor Chen Jian from Beijing Hospital’s Hepatobiliary Surgery Department\. He discusses the treatment of liver abscesses, explaining that early\-stage cases are often managed conservatively with antibiotics, while larger abscesses may require drainage or surgical removal\. The conversation takes place in an office setting with bookshelves in the background, and animated graphics illustrate medical procedures like needle aspiration\. A female host and a nurse are present to conduct the interview\.
- Error Type: Attribute Hijacking
- Generated Query: Who is the expert being interviewed in the video, what medical condition is discussed, and what are the backgrounds or settings shown during the interview?
Example 3: Out\-of\-Domain Irrelevance

- Ground Truth: A news anchor presents a story about a political controversy involving a leaked audio recording\. The broadcast displays images of politicians, including President Yoon Suk\-yeol, and shows text from a social media post by Lee Jun\-seok, who denies being the source of the leak\. A press conference is shown where a masked man speaks at a podium in front of the National Assembly seal, addressing the allegations\. The report includes an animated graphic depicting two silhouetted figures representing lawmakers from the People Power Party, discussing the situation\.
- Error Type: Out\-of\-Domain Irrelevance
- Generated Query: What is the main topic of the news report in the video?
Example 4: Information Sparsity

- Ground Truth: The video is a news report from YTN about a political controversy involving the People Power Party\. It features a female anchor introducing the story, followed by on\-screen text messages allegedly exchanged between party members discussing the possibility of a candidate’s withdrawal\. The report includes footage of a press conference with Kim Dong\-cheol, the party’s floor leader, who denies wrongdoing and claims the matter was handled internally\. Other party figures, including Lee Yong\-joo and Lee Sang\-tae, are shown speaking at events, while opposition leaders like Park Hee\-ryeon and Ahn Cheol\-soo are also featured\. The segment concludes with a reporter providing an update on the situation outside a government building\.
- Error Type: Information Sparsity
- Generated Query: What are the specific details and sequence of events reported in this news segment about the political controversy?