From Scenes to Elements: Multi-Granularity Evidence Retrieval for Verifiable Multimodal RAG

arXiv cs.CL Papers

Summary

This paper introduces GranuVistaVQA, a multimodal benchmark with element-level annotations, and GranuRAG, a framework that treats visual elements as first-class retrieval units for verifiable multimodal RAG, achieving up to 29.2% improvement over baselines.

arXiv:2605.15019v1 Announce Type: new

Cached at: 05/15/26, 06:25 AM

# From Scenes to Elements: Multi-Granularity Evidence Retrieval for Verifiable Multimodal RAG
Source: [https://arxiv.org/html/2605.15019](https://arxiv.org/html/2605.15019)
Guanhua Chen, Chuyue Huang, Yutong Yao, Shudong Liu, Xueqing Song, Lidia S. Chao, Derek F. Wong
NLP2CT Lab, Department of Computer and Information Science, University of Macau
{nlp2ct.guanhua, nlp2ct.chuyue, nlp2ct.yutong, nlp2ct.shudong, xqsongangie}@gmail.com, {derekfw, lidiasc}@um.edu.mo

###### Abstract

Multimodal Retrieval-Augmented Generation (RAG) systems retrieve evidence at coarse granularities (entire images or scenes), creating a mismatch with fine-grained user queries and making failures unverifiable. We introduce GranuVistaVQA, a multimodal benchmark featuring real-world landmarks with element-level annotations across multiple viewpoints, capturing the partial observation challenge where individual images contain only subsets of entities. We further propose GranuRAG, a multi-granularity framework that treats visual elements as first-class retrieval units through three stages: element-level detection and classification, multi-granularity cross-modal alignment for evidence retrieval, and attribution-constrained generation. By grounding retrieval at the element level rather than relying on implicit attention, our approach enables transparent error diagnosis. Experiments demonstrate that GranuRAG achieves up to 29.2% improvement over six strong baselines for this task.

Guanhua Chen\*, Chuyue Huang\*, Yutong Yao, Shudong Liu, Xueqing Song, Lidia S. Chao, Derek F. Wong† (\* Equal contribution. † Corresponding authors.)

## 1 Introduction

Multimodal Large Language Models (MLLMs) have substantially advanced visual understanding Alayrac et al. ([2022](https://arxiv.org/html/2605.15019#bib.bib77)); Li et al. ([2023](https://arxiv.org/html/2605.15019#bib.bib78)). However, their reasoning remains opaque and prone to hallucination Bai et al. ([2024](https://arxiv.org/html/2605.15019#bib.bib17)). While Retrieval-Augmented Generation (RAG) mitigates this by conditioning generation on external evidence Lewis et al. ([2020](https://arxiv.org/html/2605.15019#bib.bib6)); Chen et al. ([2025a](https://arxiv.org/html/2605.15019#bib.bib52)), current multimodal extensions operate at coarse granularities, retrieving entire images, scenes, or pages Fang et al. ([2024](https://arxiv.org/html/2605.15019#bib.bib15)); Yu et al. ([2025a](https://arxiv.org/html/2605.15019#bib.bib5)). This creates a fundamental attribution gap. A user asking about a Baroque pediment may receive a building photograph, yet neither party can verify whether the pediment appears in the image, whether the retrieved knowledge is relevant, or whether the answer faithfully reflects these inputs. Detection failures, retrieval errors, and generation hallucinations collapse into an opaque black box.

In this work, we argue that verifiable multimodal RAG requires treating visual elements as first-class retrieval targets, not merely as regions to be implicitly attended within retrieved scenes. Although recent grounding models can localize objects Peng et al. ([2023](https://arxiv.org/html/2605.15019#bib.bib2)); Guo et al. ([2024](https://arxiv.org/html/2605.15019#bib.bib16)), they rely on parametric knowledge and do not retrieve external evidence. Conversely, existing multimodal RAG systems retrieve evidence but lack explicit element-level grounding; even recent fine-grained approaches Liu et al. ([2025](https://arxiv.org/html/2605.15019#bib.bib3)) prioritize representational expressiveness over transparent attribution. We bridge this gap with a detect-then-retrieve approach: we first detect candidate visual elements via open-vocabulary detection, then assemble hierarchical evidence spanning element-level descriptions and global context, and finally generate answers attributable to specific visual spans and retrieved passages. This design transforms evaluation from black-box answer assessment into transparent evidence auditing: we can separately diagnose whether the correct elements were detected, whether relevant knowledge was retrieved, and whether generation remained faithful to both.

Table 1: Comparison of our dataset with five similar multimodal datasets.

However, existing benchmarks inadequately assess multi-granularity alignment, as shown in Table [1](https://arxiv.org/html/2605.15019#S1.T1). Critically, all overlook the partial observation challenge inherent to real-world imagery: photographs capture scenes from varying distances and angles, such that a single image depicts only a subset of elements present at a location. This challenge pervades domains from architectural photography to medical imaging and satellite sensing, yet no existing benchmark provides the supervision needed to diagnose element-level detection and retrieval under partial visibility. To address this, we introduce GranuVistaVQA, a benchmark centered on architectural heritage landmarks, a domain where elements have well-defined visual semantics, authoritative knowledge sources exist, and multi-view partial observation naturally arises from real-world photography. The dataset comprises 1,422 images across 71 landmarks, where each view covers only 34% of annotated elements on average. Crucially, we provide human-verified element visibility labels that enable fine-grained error diagnosis unavailable in prior benchmarks. We further propose GranuRAG, a detect-then-retrieve framework that grounds visible elements via open-vocabulary detection, retrieves hierarchical evidence, and generates attribution-constrained answers.

Experiments show GranuRAG outperforms strong baselines. Moreover, LLMs fine-tuned on our pipeline’s reasoning traces surpass both direct fine-tuning and self-generated chain-of-thought (CoT) Wei et al. ([2022](https://arxiv.org/html/2605.15019#bib.bib24)), demonstrating that explicit multi-granularity alignment provides a more effective supervision signal.

## 2 Related Work

#### Benchmarks for Multimodal RAG

Knowledge-intensive multimodal QA has evolved from answer-only evaluation toward attribution-aware assessment requiring verifiable evidence and localized failure analysis. Early benchmarks (Marino et al., [2019](https://arxiv.org/html/2605.15019#bib.bib32); Schwenk et al., [2022](https://arxiv.org/html/2605.15019#bib.bib73)) established the need for external knowledge without explicit grounding. Follow-up datasets added structured supervision: ViQuAE (Lerner et al., [2022](https://arxiv.org/html/2605.15019#bib.bib53)) identifies retrieval as the primary bottleneck; InfoSeek (Chen et al., [2023](https://arxiv.org/html/2605.15019#bib.bib37)) and Encyclopedic-VQA (Mensink et al., [2023](https://arxiv.org/html/2605.15019#bib.bib74)) provide section-level evidence, revealing gaps due to unreliable entity-section linking. Document-centric benchmarks (Yu et al., [2025b](https://arxiv.org/html/2605.15019#bib.bib72); Xu et al., [2025b](https://arxiv.org/html/2605.15019#bib.bib40)) evaluate at page, region, and document-level granularities. Recent work incorporates explicit spatial supervision: BBox-DocVQA (Yu et al., [2025c](https://arxiv.org/html/2605.15019#bib.bib93)) grounds answers to semantically coherent regions; Toloka VQA (Ustalov et al., [2023](https://arxiv.org/html/2605.15019#bib.bib89)) requires bounding boxes for answer-supporting objects; VISA (Ma et al., [2024a](https://arxiv.org/html/2605.15019#bib.bib88)) mandates visual source attribution during generation. However, existing benchmarks lack explicit mappings between visual elements and knowledge entries. GranuVistaVQA addresses this by treating visual elements as core evidence units with element-level knowledge alignment for fine-grained, verifiable attribution.

#### LLM Methods for Multimodal RAG

Multimodal RAG has advanced through richer retrieval representations, stronger reranking, and controllable attribution. Unified dense retrievers (Liu et al., [2023](https://arxiv.org/html/2605.15019#bib.bib68); Zhou et al., [2024b](https://arxiv.org/html/2605.15019#bib.bib69), [a](https://arxiv.org/html/2605.15019#bib.bib70)) use joint text-image embeddings but lack explicit local visual-textual connections. Finer-grained methods (Lin et al., [2024](https://arxiv.org/html/2605.15019#bib.bib71); Yang et al., [2025](https://arxiv.org/html/2605.15019#bib.bib29)) operate at paragraph/section levels, leaving element-level grounding underexplored. End-to-end pipelines (Chen et al., [2022](https://arxiv.org/html/2605.15019#bib.bib66); Zhang et al., [2024b](https://arxiv.org/html/2605.15019#bib.bib33)) improve recall through iterative retrieval but amplify noise. Robustness-focused approaches (Cui et al., [2024](https://arxiv.org/html/2605.15019#bib.bib67); Yan and Xie, [2024](https://arxiv.org/html/2605.15019#bib.bib31); Tian et al., [2025](https://arxiv.org/html/2605.15019#bib.bib1)) address cross-source reconciliation but reason over coarse units.
Recent spatial control work targets specific evidence chain components: Locate-Then-Generate (Zhu et al., [2023](https://arxiv.org/html/2605.15019#bib.bib90)) separates localization from generation in scene-text VQA; HuLiRAG (Xi et al., [2025](https://arxiv.org/html/2605.15019#bib.bib92)) decouples retrieval from attention via segmentation but lacks grounding-knowledge connection; GROUNDHOG (Zhang et al., [2024c](https://arxiv.org/html/2605.15019#bib.bib94)) achieves pixel-level alignment without retrieval integration; VisRAG 2.0 (Sun et al., [2025](https://arxiv.org/html/2605.15019#bib.bib91)) improves multi-image reasoning but treats evidence as disconnected region sets; Ferret-v2 (Zhang et al., [2024a](https://arxiv.org/html/2605.15019#bib.bib63)) enables fine-grained region-language alignment for in-image comprehension without external knowledge integration. In contrast, GranuRAG treats individual elements as verifiable units, grounding each factual claim to both a detected region and a retrieved snippet, which enables fine-grained alignment and supports systematic error diagnosis across detection, retrieval, and generation stages.

## 3 GranuVistaVQA Benchmark

To enable verifiable multimodal RAG with fine\-grained attribution, we construct a knowledge\-intensive benchmark centered on urban architectural heritage\. Unlike prior datasets that treat images as atomic units, our design establishes visual elements as first\-class retrieval targets and explicitly models the partial observation challenge: real\-world photographs capture landmarks from varying viewpoints, each depicting only a subset of architecturally significant elements\.

### 3.1 Task Formulation

Given a query image $I$ depicting a landmark from an arbitrary viewpoint, the task is to generate a comprehensive description covering all visible architectural elements while avoiding hallucination about occluded or absent components.

We associate each landmark with three components: metadata (name, summary, and style) that provides high-level context; an element inventory $E=\{e_1,\ldots,e_k\}$ listing architecturally significant components; and element descriptions $\mathrm{ED}: E \to \mathit{Paragraphs}$ that map each element to expert-written text. For each image $I$, the ground-truth visible set $E^{\mathrm{gt}}(I) \subseteq E$ contains elements visually identifiable in that view. This formulation enables modular evaluation: systems must (i) predict visible elements $\hat{E}(I) \approx E^{\mathrm{gt}}(I)$, (ii) retrieve relevant descriptions from $\mathrm{ED}$, and (iii) generate outputs faithful to both visual and textual evidence. This decomposition allows us to isolate failures at each stage, distinguishing detection errors from retrieval mistakes and generation hallucinations.
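Stage (i) of this decomposition can be scored by comparing the predicted and ground-truth visible sets. The sketch below is a minimal illustration that assumes set-level precision, recall, and F1 as the comparison; this section does not name a detection metric, and the function name `visibility_prf` and the element names are hypothetical.

```python
def visibility_prf(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Set-level precision, recall, and F1 between predicted and gold visible elements."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example: one hallucinated element ("bell tower"), two missed ones.
predicted = {"pediment", "column", "bell tower"}
gold = {"pediment", "column", "niche", "scroll"}
precision, recall, f1 = visibility_prf(predicted, gold)
```

Under this assumed metric, low precision points to elements reported without visual evidence, while low recall points to visible elements the system missed.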

### 3.2 Data Collection and Annotation

#### Domain Selection

We focus on architectural heritage for three methodological reasons: \(1\) elements have well\-defined visual semantics amenable to detection, \(2\) authoritative knowledge sources enable reliable ground truth, and \(3\) tourist photography naturally exhibits multi\-view partial observation\. We curate 71 landmarks from official cultural heritage databases, spanning religious buildings, temples, fortifications, and cultural institutions across diverse architectural traditions\.

#### Knowledge Corpus Construction

For each landmark, we compile a structured JSON document following a two-level schema (full specification in Appendix [B.1](https://arxiv.org/html/2605.15019#A2.SS1)):

$$x_{\mathrm{landmark}} = (\mathrm{meta}, E, \mathrm{ED}) \tag{1}$$

Textual content is sourced from official tourism portals and encyclopedic references, then structured through: (i) element phrase extraction from authoritative descriptions, (ii) cross-landmark normalization to ensure consistent terminology (e.g., “bell tower” $\equiv$ “campanile”), and (iii) LLM-assisted description generation with human validation (details in Appendix [B.2](https://arxiv.org/html/2605.15019#A2.SS2)). Specifically, we focus on heritage sites in Macau, with all description content in Chinese.
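To make the two-level schema concrete, here is a minimal sketch of one landmark record and the terminology normalization step. The field names, the alias table, and the placeholder landmark are assumptions for illustration only; the actual specification is the one in the appendix, and the real descriptions are in Chinese.

```python
import json
from dataclasses import asdict, dataclass

# Hypothetical alias table illustrating cross-landmark normalization.
CANONICAL = {"campanile": "bell tower"}


def normalize(term: str) -> str:
    """Map a surface form to its canonical element name."""
    key = term.strip().lower()
    return CANONICAL.get(key, key)


@dataclass
class Landmark:
    meta: dict[str, str]                  # name, summary, style
    elements: list[str]                   # element inventory E
    element_descriptions: dict[str, str]  # ED: element -> paragraph

    def validate(self) -> None:
        """Check that ED is defined for every element in E."""
        missing = [e for e in self.elements if e not in self.element_descriptions]
        if missing:
            raise ValueError(f"elements without descriptions: {missing}")


# Placeholder content, not taken from the dataset.
landmark = Landmark(
    meta={"name": "Example Church", "summary": "placeholder", "style": "Baroque"},
    elements=[normalize("Campanile"), normalize("pediment")],
    element_descriptions={
        "bell tower": "placeholder paragraph",
        "pediment": "placeholder paragraph",
    },
)
landmark.validate()
document = json.dumps(asdict(landmark), ensure_ascii=False, indent=2)
```

Passing `ensure_ascii=False` keeps non-ASCII text such as Chinese descriptions readable in the serialized JSON.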

![Refer to caption](https://arxiv.org/html/2605.15019v1/x1.png) (a) Panoramic
![Refer to caption](https://arxiv.org/html/2605.15019v1/x2.png) (b) Close-up
![Refer to caption](https://arxiv.org/html/2605.15019v1/x3.png) (c) Partial

Figure 1: Examples of multi-perspective images.

#### Image Collection

We collect 1,422 photographs ensuring viewpoint diversity: panoramic shots capturing overall structure, close-ups revealing fine ornamentation, and oblique partial views (Figure [1](https://arxiv.org/html/2605.15019#S3.F1)). After collection, we perform comprehensive data sanitization to remove privacy-sensitive content, including watermarks, visible human faces, and personally identifiable information. We also apply quality filtering to retain only images with resolution $\geq$ 512 px and no visible artifacts. The full screening protocol is described in Appendix [B.2](https://arxiv.org/html/2605.15019#A2.SS2).

#### Visibility Annotation

For each image $I$, annotators identify $E^{\mathrm{gt}}(I)$ following strict visibility criteria:

- Visual identifiability: Elements qualify only if recognizable from pixels alone, without relying on prior landmark knowledge
- Partial occlusion: Included only when discriminative visual cues remain (e.g., a half-visible scroll counts if its characteristic shape is apparent)
- Ambiguity handling: Uncertain cases are excluded to avoid false positives

We employ a human-in-the-loop workflow: an LLM proposes candidate elements, which annotators refine by adding missed elements, removing hallucinated ones, and resolving synonyms to canonical forms (protocol in Appendix [B.6](https://arxiv.org/html/2605.15019#A2.SS6)).

| Metric | Value |
| --- | --- |
| #Landmark ($L$) | 71 |
| #Img ($N$) | 1422 |
| Avg img/landmark ($N/L$) | 20.03 |
| #Unique elements ($U_E$) | 221 |
| Avg elements per landmark | 3.59 |

Table 2: Statistics of the proposed GranuVistaVQA.

![Refer to caption](https://arxiv.org/html/2605.15019v1/x4.png)

Figure 2: Results of evaluating MLLMs on GranuVistaVQA. † denotes the fine-tuned LLM.

![Refer to caption](https://arxiv.org/html/2605.15019v1/x5.png)

Figure 3: Overview of the proposed GranuRAG framework.

### 3.3 Data Statistics and Evaluation

#### Statistics

Table [2](https://arxiv.org/html/2605.15019#S3.T2) summarizes the dataset statistics. The multi-view design creates natural partial observations: close-ups capture fine details but miss broader structures, while panoramic shots show overall layouts but lose granular information (Figure [1](https://arxiv.org/html/2605.15019#S3.F1)). On average, individual images cover only 34% of their landmark’s element inventory. This validates our core premise that answering element-level questions requires aggregating evidence across complementary viewpoints. The image distribution statistics are given in Appendix [B.7](https://arxiv.org/html/2605.15019#A2.SS7).
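The coverage figure can be read as the mean, over images, of the fraction of a landmark's inventory that is visible in each view. The following sketch assumes that definition (the exact formula is not stated in this section) and uses invented file names and element sets.

```python
from statistics import mean


def view_coverage(visible: set[str], inventory: set[str]) -> float:
    """Fraction of the landmark's element inventory visible in one view."""
    return len(visible & inventory) / len(inventory)


# Invented toy landmark with three views.
inventory = {"pediment", "column", "niche", "bell tower"}
views = {
    "panoramic.jpg": {"pediment", "bell tower"},
    "close_up.jpg": {"niche"},
    "partial.jpg": {"column"},
}

coverages = {name: view_coverage(visible, inventory) for name, visible in views.items()}
average_coverage = mean(coverages.values())

# Pooling the views recovers the elements that any single view misses.
union_coverage = view_coverage(set().union(*views.values()), inventory)
```

In this toy case each view covers a quarter to a half of the inventory, while the union of the three views covers all of it, which is the aggregation argument made above.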

#### Evaluation Metrics

To evaluate our framework, we report three evaluation metrics: ROUGE-L Lin ([2004](https://arxiv.org/html/2605.15019#bib.bib85)) and BERT-F1 Zhang et al. ([2020](https://arxiv.org/html/2605.15019#bib.bib86)) computed against the gold-standard reference, and an LLM-as-a-judge score Zheng et al. ([2023](https://arxiv.org/html/2605.15019#bib.bib87)). For the LLM-Score, we use an ensemble of three strong LLMs (GPT-4.1, Gemini-2.5-Pro, claude-haiku-4-5 OpenAI ([2024](https://arxiv.org/html/2605.15019#bib.bib26)); Team and Google ([2025](https://arxiv.org/html/2605.15019#bib.bib7)); Anthropic ([2025](https://arxiv.org/html/2605.15019#bib.bib10))) to rate each explanation on a 0-100 scale under a weighted rubric covering Coverage (40%), Faithfulness (40%), and Cohesion (20%); the final LLM Score is the mean across judges to reduce single-model variance Chia et al. ([2024](https://arxiv.org/html/2605.15019#bib.bib9)).
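The aggregation behind the LLM Score can be sketched as a weighted sum per judge followed by a mean over judges. This is an illustration under the assumption that each judge returns three sub-scores on a 0-100 scale; the prompt, the judge API calls, and the sub-score values below are not from the paper.

```python
from statistics import mean

# Rubric weights as stated in the text.
WEIGHTS = {"coverage": 0.4, "faithfulness": 0.4, "cohesion": 0.2}


def rubric_score(sub_scores: dict[str, float]) -> float:
    """Weighted 0-100 score from one judge's sub-scores."""
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)


def llm_score(judgements: list[dict[str, float]]) -> float:
    """Mean of the per-judge rubric scores."""
    return mean(rubric_score(judgement) for judgement in judgements)


# Invented sub-scores for three judges.
judgements = [
    {"coverage": 80, "faithfulness": 70, "cohesion": 90},
    {"coverage": 60, "faithfulness": 50, "cohesion": 70},
    {"coverage": 70, "faithfulness": 90, "cohesion": 50},
]
final_score = llm_score(judgements)
```

Here the three judges score roughly 78, 58, and 74, so the ensemble score is about 70; averaging dampens the influence of any single judge.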

#### Benchmark Evaluation

We evaluate several state-of-the-art (SOTA) MLLMs on our benchmark (see Section [5.1](https://arxiv.org/html/2605.15019#S5.SS1)). Figure [2](https://arxiv.org/html/2605.15019#S3.F2) shows that current models struggle across all metrics. Even the best-performing model achieves limited success on ROUGE-L and BERT-F1, while all models perform poorly on LLM-as-Judge evaluation, which measures factual accuracy and hallucination control. These results highlight the challenge of multi-granularity information alignment, motivating our GranuRAG framework.

## 4 Methodology

Based on the above analysis, we introduce our GranuRAG framework, as shown in Figure [3](https://arxiv.org/html/2605.15019#S3.F3): given a query image $I$ and candidate element set $E$, we (1) localize and match visual regions to elements in $E$, yielding visible subset $\hat{E}(I)$; (2) retrieve hierarchical evidence for $\hat{E}(I)$; and (3) generate attribution-constrained output grounded in the retrieved evidence and global description.

### 4.1 Visual Region Detection and Filtering

The first stage localizes salient architectural regions in image $I$ without requiring prior knowledge of which elements are present. We employ YOLO-World Cheng et al. ([2024](https://arxiv.org/html/2605.15019#bib.bib4)), an open-vocabulary object detector, to identify candidate regions based on generic architectural primitives such as columns, carvings, and decorative motifs. This detection strategy offers broad coverage across diverse heritage landmarks without domain-specific fine-tuning.

Raw detections often contain redundant bounding boxes that capture the same visual content at multiple scales. To address this issue, we apply overlap-based filtering: when two boxes overlap by more than 80%, we retain the smaller one to preserve fine-grained architectural details. A sensitivity analysis of this overlap threshold is given in Appendix [A.2](https://arxiv.org/html/2605.15019#A1.SS2). This denoising step produces a refined set of cropped regions:

$$\mathcal{B}(I) = \{b_1, b_2, \ldots, b_K\} \tag{2}$$

where each $b_k$ represents a distinct visual region likely to contain meaningful architectural content. The filtered crops serve as visual queries for the subsequent matching stage.
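The overlap-based filtering step can be sketched as a greedy pass over boxes sorted by area. The text does not define how "overlap" is measured, so this sketch assumes intersection area divided by the area of the smaller box, which reaches 1.0 when a small box sits entirely inside a larger one; the function names and coordinates are hypothetical.

```python
Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)


def area(box: Box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection area over the smaller box's area (an assumed definition)."""
    width = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    height = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    smaller = min(area(a), area(b))
    return width * height / smaller if smaller > 0 else 0.0


def filter_boxes(boxes: list[Box], threshold: float = 0.8) -> list[Box]:
    """Visit boxes from smallest to largest; drop a box that overlaps a kept one too much."""
    kept: list[Box] = []
    for box in sorted(boxes, key=area):
        if all(overlap_ratio(box, small) <= threshold for small in kept):
            kept.append(box)
    return kept


# Toy detections: a facade-sized box, a detail inside it, and an unrelated region.
boxes = [(0, 0, 100, 100), (10, 10, 50, 50), (200, 200, 260, 260)]
filtered = filter_boxes(boxes)
```

Because smaller boxes are visited first, any conflict above the threshold is resolved in favor of the smaller box, matching the stated preference for fine-grained detail. In the toy input the facade-sized box is dropped and the other two survive.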

### 4.2 Knowledge-Guided Element Matching

Given the detected regions $\mathcal{B}(I)$ and a candidate element set $E$ with associated appearance descriptions $\{a_e\}_{e \in E}$, this stage determines which architectural elements are actually visible in the image. We formulate this as a multimodal matching problem where each cropped region is compared against element descriptions from the knowledge corpus.

For a given image, an MLLM receives each annotated bounding box alongside all appearance descriptions, which specify visual attributes such as shape, material, and stylistic features. For each detected region $b_k$, the model identifies the best-matching element by comparing observed visual characteristics against documented descriptions:

$$e_k = M_{\phi}\bigl(I, b_k, \{(e, a_e)\}_{e \in E}\bigr) \in E \cup \{\varnothing\} \tag{3}$$

where $M_{\phi}$ is the MLLM parameterized by $\phi$, and the output $\varnothing$ indicates that no candidate description is sufficiently consistent with the region, in which case the region is discarded. This design combines the spatial localization capability of the detector with the fine-grained semantic discrimination ability of the MLLM, enabling reliable identification even among visually similar elements.

The matching process yields a grounded element set $\hat{E}(I) = \{e_k \mid e_k \neq \varnothing\}$ containing only elements with confirmed visual evidence. By explicitly filtering out undetected elements, we establish a principled boundary between what the system observes and what it knows, reducing the risk of hallucinating information about absent components.
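The control flow of this stage (match each region, allow a "no match" outcome, collect the survivors) can be sketched independently of the model. In the sketch below a toy keyword matcher stands in for the MLLM, and regions are represented by short text captions instead of image crops; both are assumptions made so the example runs without a model, and every name is hypothetical.

```python
from typing import Callable, Optional

Matcher = Callable[[str, dict[str, str]], Optional[str]]


def keyword_matcher(region_caption: str, appearance: dict[str, str],
                    min_overlap: int = 2) -> Optional[str]:
    """Toy stand-in for the MLLM: pick the description sharing the most words."""
    words = set(region_caption.lower().split())
    best, best_overlap = None, 0
    for element, description in appearance.items():
        overlap = len(words & set(description.lower().split()))
        if overlap > best_overlap:
            best, best_overlap = element, overlap
    # None plays the role of the empty output: the region is discarded.
    return best if best_overlap >= min_overlap else None


def ground_elements(regions: list[str], appearance: dict[str, str],
                    matcher: Matcher) -> list[str]:
    """Collect matched elements, skipping rejected regions and duplicates."""
    grounded: list[str] = []
    for region in regions:
        element = matcher(region, appearance)
        if element is not None and element not in grounded:
            grounded.append(element)
    return grounded


appearance = {
    "pediment": "triangular gable above the entrance",
    "bell tower": "tall square tower with a belfry",
}
regions = ["triangular gable with relief", "tall tower with belfry", "parked car"]
grounded = ground_elements(regions, appearance, keyword_matcher)
```

The third region matches nothing and is dropped, so only elements with some supporting region reach the generation stage.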

### 4.3 Evidence-Grounded Generation

The final stage synthesizes a coherent interpretation by conditioning on both the annotated image and retrieved knowledge for matched elements. For each $e \in \hat{E}(I)$, we retrieve its expert-written description $d_e$ from the knowledge corpus, which provides factual details including historical background, symbolic meaning, and architectural significance. We prepend global metadata $m$ containing landmark name, architectural style, and historical period to contextualize the element-level information:

$$\mathcal{C}(I) = [m] \oplus \bigl[(e_i, d_{e_i})\bigr]_{e_i \in \hat{E}(I)} \tag{4}$$

The generator then produces the final output conditioned on this hierarchical evidence:

$$y = G_{\theta}\bigl(I, \mathcal{C}(I) \mid \Omega(\hat{E}(I))\bigr) \tag{5}$$

where the generation prompt $\Omega$ instructs the model to describe only elements confirmed in $\hat{E}(I)$ and ground all factual claims to retrieved descriptions. This design ensures that each claim in the output traces back to a detected visual region and a retrieved knowledge snippet. This traceability facilitates systematic error diagnosis: missing information indicates detection failure, incorrect facts suggest retrieval error, and unsupported claims reveal generation hallucination.
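Assembling the hierarchical evidence amounts to prepending the metadata to the list of (element, description) pairs for the grounded elements. The sketch below assumes a plain-text prompt with indexed evidence lines; the prompt wording, the helper names, and the placeholder content are illustrative and are not the prompt used in the paper.

```python
def build_context(meta: dict[str, str], grounded: list[str],
                  descriptions: dict[str, str]) -> list[tuple[str, str]]:
    """Metadata first, then one (element, description) pair per grounded element."""
    header = "; ".join(f"{key}: {value}" for key, value in meta.items())
    return [("metadata", header)] + [(e, descriptions[e]) for e in grounded]


def build_prompt(context: list[tuple[str, str]], grounded: list[str]) -> str:
    """Hypothetical attribution-constrained instruction over indexed evidence."""
    evidence = "\n".join(
        f"[{i}] {label}: {text}" for i, (label, text) in enumerate(context)
    )
    return (
        f"Describe only these confirmed elements: {', '.join(grounded)}.\n"
        "Support every factual claim with an evidence index in brackets.\n"
        f"Evidence:\n{evidence}"
    )


# Placeholder knowledge; "niche" is in the corpus but was not grounded.
meta = {"name": "Example Church", "style": "Baroque"}
descriptions = {
    "pediment": "placeholder pediment paragraph",
    "bell tower": "placeholder tower paragraph",
    "niche": "placeholder niche paragraph",
}
grounded = ["pediment", "bell tower"]
context = build_context(meta, grounded, descriptions)
prompt = build_prompt(context, grounded)
```

Descriptions of ungrounded elements never enter the prompt, so a claim about such an element has no evidence index to cite and can be flagged as unsupported.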

## 5 Experiment

### 5.1 Setup

Experiments are conducted on the GranuVistaVQA dataset as described in Section [3](https://arxiv.org/html/2605.15019#S3). We use the official APIs of Qwen3-VL-8B, Qwen-VL-Max, GPT-4o, GPT-4.1-Mini, and claude-3.5-sonnet Bai et al. ([2023](https://arxiv.org/html/2605.15019#bib.bib50)); OpenAI ([2024](https://arxiv.org/html/2605.15019#bib.bib26)); Anthropic ([2024](https://arxiv.org/html/2605.15019#bib.bib25)). For the element detector $D_{\psi}$, we employ the open-vocabulary YOLO-World-XL (https://replicate.com/franz-biz/yolo-world-xl) with a fixed confidence threshold. We use the evaluation metrics defined in Section [3.3](https://arxiv.org/html/2605.15019#S3.SS3). More implementation details are given in Appendix [A.1](https://arxiv.org/html/2605.15019#A1.SS1).

Table 3: Main results of our method on four different LLMs under three settings: (A) Baseline, (B) CoT, and (C) GranuRAG. † denotes the fine-tuned LLM. Bold marks the best result for each LLM.

### 5.2 Main Results

We evaluate GranuRAG across six state-of-the-art MLLMs under three settings: (A) Baseline, where the generator observes the image and the full noisy candidate set $E_{\text{all}}$; (B) CoT, which augments (A) with chain-of-thought prompting Wei et al. ([2022](https://arxiv.org/html/2605.15019#bib.bib24)) to encourage structured reasoning; and (C) GranuRAG, where the generator receives only the grounded element subset $\hat{E}(I)$ from our two-stage pipeline. For both the fine-tuned models (denoted by †) and Setting B, we synthesize the thinking process using a powerful LLM and validate it through manual inspection (see Appendix [A.3](https://arxiv.org/html/2605.15019#A1.SS3) for details).

Table [3](https://arxiv.org/html/2605.15019#S5.T3) presents the results. Setting C outperforms both baselines across almost all LLMs and metrics, demonstrating that explicit element alignment yields robust improvements regardless of backbone capacity. While CoT prompting (Setting B) moderately improves over the noisy baseline, it does not close the gap to our full pipeline, suggesting that reasoning-time prompting alone cannot fully compensate for irrelevant knowledge. Fine-tuning further amplifies these gains: Qwen3-VL-8B† (C) achieves 35.74 ROUGE-L, 46.96 BERT-F1, and 70.24 LLM score, substantially outperforming its zero-shot counterpart. Notably, even closed-source LLMs benefit markedly from grounded evidence, with Setting C improving LLM scores by 22.7, 18.8, and 25.7 points over Setting A, respectively. These results confirm that our GranuRAG framework provides complementary value beyond model scale or reasoning prompts.

![Refer to caption](https://arxiv.org/html/2605.15019v1/x6.png)

Figure 4: Ablation on visual presentation and element filtering. T3 and T4 use grounded subset $\hat{E}(I)$; T1 and T2 use full candidate set $E_{\text{all}}$.

Table 4: Ablation on visual modality and knowledge relevance. Bold marks the best results.

### 5.3 Ablation Studies

We conduct ablation studies to isolate the contributions of visual evidence presentation and knowledge selection. First, we examine four configurations varying visual input and element sets: (T1) raw image with all candidates $E_{\text{all}}$, (T2) box-annotated image with $E_{\text{all}}$, (T3) raw image with grounded subset $\hat{E}(I)$, and (T4) box-annotated image with $\hat{E}(I)$. As shown in Figure [4](https://arxiv.org/html/2605.15019#S5.F4), T3 and T4 consistently outperform T1 and T2, confirming that filtering noisy candidates is crucial. T4 achieves the highest LLM score, indicating that combining localized visual cues and grounded textual evidence yields optimal performance.

Table [4](https://arxiv.org/html/2605.15019#S5.T4) further contrasts three variants: text-only with gold elements $E_{\text{gold}}$, image with all candidates, and image with grounded subset. While text-only provides a strong upper bound, adding the image without filtering (Image + All) paradoxically degrades LLM score, suggesting that irrelevant visual-textual mismatches introduce noise. In contrast, pairing the image with the grounded subset substantially recovers text-only performance and exceeds it on most metrics, demonstrating that selective retrieval enables effective multimodal integration.

We further ablate the detectors to verify that our design choices are well-motivated rather than arbitrary. As shown in Table [5](https://arxiv.org/html/2605.15019#S5.T5), replacing YOLO-World with Grounding DINO Liu et al. ([2024](https://arxiv.org/html/2605.15019#bib.bib56)) still outperforms both the baseline and the approach of directly extracting relevant elements with LLMs, showing that the detect-then-match paradigm is robust across detectors.

Table 5: Ablation on the detector component. Bold marks the best performance.

## 6 Analysis

### 6.1 Generalization Analysis

To assess generalization beyond the fine-tuning distribution, we compare performance on in-domain (ID) and out-of-domain (OOD) test data. ID samples appear in training with different viewpoints, while OOD samples are entirely unseen during training. As shown in Figure [5](https://arxiv.org/html/2605.15019#S6.F5), ID samples consistently achieve higher scores, reflecting expected distributional advantages. Importantly, the absolute improvements from GranuRAG on OOD samples are comparable to, and sometimes exceed, those on ID samples across all three metrics. Moreover, the performance gap between ID and OOD data does not widen as we progress from Base to CoT to GranuRAG. This suggests that our grounding mechanism improves reasoning quality rather than exploiting memorized training instances. The consistent gains on OOD data confirm that GranuRAG generalizes well to novel visual content.

![Refer to caption](https://arxiv.org/html/2605.15019v1/x7.png)Figure 5:Performance comparison on in\-domain and OOD data across three methods\.Table 6:The results of comparing our method with different RAG strategies\.Boldmeans the best result\.
### 6\.2Comparison with Retrieval Baselines

We evaluate the effectiveness of our GranuRAG framework by fixing the generator backbone \(Qwen\-VL\-Max\) and comparing three strategies for constructing the evidence that conditions generation\. The global baseline provides the generator with descriptions of the full candidate setEEwithout visual grounding, requiring it to implicitly filter irrelevant elements\. Embedding Retrieval replaces our second\-stage matching with CLIP\-based dense retrieval222https://huggingface\.co/openai/clip\-vit\-large\-patch14Radfordet al\.\([2021](https://arxiv.org/html/2605.15019#bib.bib8)\): for each detected region, we retrieve the top\-1 most similar element description from a vector database\. Besides, we also compare GranuRAG with two strong multi\-modal RAG frameworks, which is RAVQALinet al\.\([2023](https://arxiv.org/html/2605.15019#bib.bib39)\)333https://github\.com/LinWeizheDragon/Retrieval\-Augmented\-Visual\-Question\-Answeringand VisRAG 2\.0Sunet al\.\([2025](https://arxiv.org/html/2605.15019#bib.bib91)\)444https://github\.com/openbmb/visrag, usingQwen\-VL\-Maxmodel\.

Table [6](https://arxiv.org/html/2605.15019#S6.T6) shows that GranuRAG substantially outperforms both baselines across all metrics\. The global baseline achieves the lowest scores, struggling to identify relevant elements from the full candidate set\. Embedding Retrieval improves performance but still lags behind our approach\. GranuRAG achieves the best results, with particularly large gains in LLM Score \(\+15\.85 over the CLIP\-based Embedding Retrieval\), demonstrating that LLM\-based semantic matching is more effective than embedding similarity for fine\-grained element recognition\.

![Refer to caption](https://arxiv.org/html/2605.15019v1/x8.png) \(a\) GPT\-4o
![Refer to caption](https://arxiv.org/html/2605.15019v1/x9.png) \(b\) Qwen\-VL\-Max

Figure 6: Distribution of answer quality when both LLMs successfully extract relevant elements\.

![Refer to caption](https://arxiv.org/html/2605.15019v1/x10.png) \(a\) GPT\-4o
![Refer to caption](https://arxiv.org/html/2605.15019v1/x11.png) \(b\) Qwen\-VL\-Max

Figure 7: Extraction accuracy comparison across images with different numbers of elements\.
### 6\.3Error Analysis

To understand the source of the improvements, we conduct an error analysis for two LLMs along two dimensions\. First, we compare answer quality when both methods extract the correct elements\. As shown in Figure [6](https://arxiv.org/html/2605.15019#S6.F6), GranuRAG produces better answers in 94\.4% and 90\.2% of cases for the two models respectively, while the baseline wins in fewer than 5% of cases\. This indicates that our fine\-grained element representations provide more accurate information for generation, even when extraction is equally successful\.

Second, we analyze extraction accuracy by examining cases where only one method identifies the correct elements\. Figure [7](https://arxiv.org/html/2605.15019#S6.F7) shows that GranuRAG succeeds where the baseline fails far more often than the reverse, especially for images with fewer elements\. This gap narrows as element count increases, reflecting the growing difficulty of fine\-grained reasoning in complex images\. Overall, these results confirm that GranuRAG improves both extraction accuracy and generation quality\.
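The second analysis amounts to counting, per element\-count bucket, the cases where exactly one method extracts the correct elements\. The sketch below shows one way to tabulate this; the record layout, the function name `disagreement_by_count`, and the toy data are illustrative assumptions rather than the paper's evaluation code\.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Each record: (number of elements in the image, baseline correct?, GranuRAG correct?)
Record = Tuple[int, bool, bool]


def disagreement_by_count(records: Iterable[Record]) -> Dict[int, Dict[str, int]]:
    """Count cases where exactly one method extracts the correct elements."""
    table: Dict[int, Dict[str, int]] = defaultdict(
        lambda: {"granurag_only": 0, "baseline_only": 0}
    )
    for n_elements, baseline_ok, granurag_ok in records:
        if granurag_ok and not baseline_ok:
            table[n_elements]["granurag_only"] += 1
        elif baseline_ok and not granurag_ok:
            table[n_elements]["baseline_only"] += 1
        # Cases where both succeed or both fail are ignored here.
    return dict(table)


# Made-up records for illustration only.
toy_records = [
    (2, False, True),
    (2, False, True),
    (2, True, True),
    (5, True, False),
    (5, False, True),
]
summary = disagreement_by_count(toy_records)
```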

Table 7: The average win and loss rates in human evaluation\. Bold means the best results\.
### 6\.4Human Evaluation

To complement automatic metrics, we conduct a human evaluation comparing our method against the baseline\. We sample 20 questions for each of two models \(GPT\-4o and Qwen\-VL\-Max\), yielding 40 pairwise comparisons\. Three graduate students in computer science independently assess baseline and GranuRAG outputs according to four criteria: coverage of relevant scenic elements, accuracy of factual details, fluency of language, and acceptability as tourist guidance\. Annotators select the superior output in each pair without knowing which system produced it\.

Table [7](https://arxiv.org/html/2605.15019#S6.T7) reports the aggregated win rates\. GranuRAG achieves 82\.22% preference over the baseline for GPT\-4o and 91\.11% for Qwen\-VL\-Max, confirming that grounded retrieval yields explanations that human evaluators judge more comprehensive and trustworthy\. To assess annotation reliability, we compute Fleiss’ Kappa across the three raters, obtaining $\kappa = 0.712$ \(SE = 0\.105, $z = 6.75$, $p < 0.001$\), which indicates substantial agreement \(Landis and Koch, [1977](https://arxiv.org/html/2605.15019#bib.bib23)\) and validates the consistency of our evaluation protocol\.
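For readers who want to reproduce this kind of agreement statistic, the following is a minimal sketch of the standard Fleiss' Kappa formula\. The function name `fleiss_kappa` and the toy rating matrix are illustrative; the matrix is not the paper's annotation data\. Each row is one item, each column one category, and each cell counts how many raters chose that category\.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa; counts[i][j] = number of raters assigning item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")

    # Observed agreement: mean of the per-item agreement P_i.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the marginal category proportions.
    n_categories = len(counts[0])
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_cat)

    # Undefined (division by zero) when all ratings fall in a single category.
    return (p_bar - p_e) / (1 - p_e)


# Toy example: 4 items, 3 raters, 2 categories (first system preferred, second preferred).
toy_counts = [[3, 0], [3, 0], [2, 1], [0, 3]]
toy_kappa = fleiss_kappa(toy_counts)
```

On the Landis and Koch scale cited above, values between 0\.61 and 0\.80 are conventionally read as substantial agreement\. The reported $z$ is consistent, up to rounding, with dividing $\kappa$ by its standard error \($0.712 / 0.105 \approx 6.78$\)\.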

![Refer to caption](https://arxiv.org/html/2605.15019v1/x12.png) \(a\) GranuRAG \- Base
![Refer to caption](https://arxiv.org/html/2605.15019v1/x13.png) \(b\) GranuRAG \- CoT

Figure 8: Top 10 attention difference regions\. Red: higher attention in our method; Blue: higher attention in the base/CoT method\.
### 6\.5Visualization

To examine how GranuRAG influences visual focus, we fine\-tune three adapters \(Base, CoT, and GranuRAG\) on Qwen3\-VL\-8B and compare their attention weights on the same samples\. Specifically, we average attention weights across all generated tokens, compute the pixel\-level difference $\Delta A = A_{\text{GranuRAG}} - A_{\text{baseline}}$, and visualize the top\-10 regions with the largest positive shifts \(red $+$ markers\) and negative shifts \(blue $-$ markers\)\. As shown in Figures [8\(a\)](https://arxiv.org/html/2605.15019#S6.F8.sf1) and [8\(b\)](https://arxiv.org/html/2605.15019#S6.F8.sf2), red markers concentrate on semantically relevant scenic elements, such as the “White Holy Spirit Dove Relief” sculptures and decorative arches, while blue markers scatter across generic background regions like ceilings and walls\. This pattern confirms that our grounding mechanism systematically reallocates attention toward knowledge\-relevant regions and away from distractors\. More case studies are provided in Appendix [A\.5](https://arxiv.org/html/2605.15019#A1.SS5)\.
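The difference map and the selection of the strongest shifts can be sketched with NumPy as follows\. This is an illustration under stated assumptions: attention is assumed to be already extracted as an array of shape \(generated tokens, height, width\), and the function name `top_shift_regions` and the toy 2×2 maps are hypothetical\.

```python
import numpy as np


def top_shift_regions(attn_ours: np.ndarray, attn_base: np.ndarray, k: int = 10):
    """Return (delta, positive_cells, negative_cells) for two attention stacks.

    Each input has shape (num_generated_tokens, height, width).
    """
    # Average over generated tokens, then take the element-wise difference.
    delta = attn_ours.mean(axis=0) - attn_base.mean(axis=0)
    flat = delta.ravel()
    order = np.argsort(flat)  # ascending

    def to_cell(index: int):
        return tuple(int(x) for x in np.unravel_index(index, delta.shape))

    # Largest positive shifts first; keep only strictly positive cells.
    positive = [to_cell(i) for i in order[::-1][:k] if flat[i] > 0]
    # Most negative shifts first; keep only strictly negative cells.
    negative = [to_cell(i) for i in order[:k] if flat[i] < 0]
    return delta, positive, negative


# Toy 2x2 attention maps over two generated tokens (made-up values).
ours = np.array([[[0.4, 0.05], [0.15, 0.4]], [[0.6, 0.05], [0.15, 0.2]]])
base = np.full((2, 2, 2), 0.25)
delta, positive, negative = top_shift_regions(ours, base, k=2)
```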

## 7Conclusion

In this paper, we introduced GranuVistaVQA, a benchmark that reflects real\-world visual reasoning challenges\. Our GranuRAG framework explicitly grounds visual elements before retrieving multi\-granularity evidence, enabling transparent attribution and consistently outperforming strong baselines\. Experiments demonstrate that organizing evidence hierarchically from atomic visual elements, rather than relying on implicit attention within coarse retrievals, is essential for verifiable multimodal reasoning\.

## Limitations

While GranuVistaVQA advances multimodal RAG evaluation, several limitations should be noted\. First, our benchmark focuses on landmark images where visual elements are typically buildings or monuments\. This may not cover all visual reasoning scenarios, such as understanding abstract diagrams or everyday indoor scenes\. Second, regarding efficiency, our method takes approximately 3\.5 seconds per sample compared to 2 seconds for baseline approaches \(about 1\.75 times as long\) due to the additional detection and multi\-granularity retrieval steps, which may limit applicability in time\-sensitive scenarios\. Third, our current implementation for the multi\-granularity RAG scenario relies solely on traditional supervised fine\-tuning \(SFT\)\. Future work will explore a broader range of parameter\-efficient fine\-tuning methods, including diverse LoRA variants Zhang et al\. \([2023](https://arxiv.org/html/2605.15019#bib.bib60)\); Chen et al\. \([2025b](https://arxiv.org/html/2605.15019#bib.bib59)\) as well as recently popular neuron\-level fine\-tuning techniques Xu et al\. \([2025a](https://arxiv.org/html/2605.15019#bib.bib58)\); Chen et al\. \([2026](https://arxiv.org/html/2605.15019#bib.bib57)\)\.

## Acknowledgments

This work was supported in part by the Science and Technology Development Fund of Macau SAR \(Grant Nos\. FDCT/0007/2024/AKP, EF2024\-00185\-FST\), the UM and UMDF \(Grant Nos\. MYRG\-GRG2024\-00165\-FST\-UMDF, MYRG\-GRG2025\-00236\-FST\), the Tencent AI Lab Rhino\-Bird Research Program \(Grant No\. EF2023\-00151\-FST\), the Stanley Ho Medical Development Foundation \(Grant No\. SHMDF\-AI/2026/001\), and the National Natural Science Foundation of China \(Grant No\. 62266013\)\.

## References

- J\. Alayrac, J\. Donahue, P\. Luc, A\. Miech, I\. Barr, Y\. Hasson, K\. Lenc, A\. Mensch, K\. Millican, M\. Reynolds, R\. Ring, E\. Rutherford, S\. Cabi, T\. Han, Z\. Gong, S\. Samangooei, M\. Monteiro, J\. Menick, S\. Borgeaud, A\. Brock, A\. Nematzadeh, S\. Sharifzadeh, M\. Binkowski, R\. Barreira, O\. Vinyals, A\. Zisserman, and K\. Simonyan \(2022\)Flamingo: a visual language model for few\-shot learning\.External Links:2204\.14198,[Link](https://arxiv.org/abs/2204.14198)Cited by:[§1](https://arxiv.org/html/2605.15019#S1.p1.1)\.
- Anthropic \(2024\)Claude 3\.5 sonnet model card addendum\.External Links:[Link](https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf)Cited by:[§5\.1](https://arxiv.org/html/2605.15019#S5.SS1.p1.1)\.
- Anthropic \(2025\)Introducing Claude Haiku 4\.5\.Note:Accessed: 2025\-12\-30External Links:[Link](https://www.anthropic.com/news/claude-haiku-4-5)Cited by:[§3\.3](https://arxiv.org/html/2605.15019#S3.SS3.SSS0.Px2.p1.1)\.
- J\. Bai, S\. Bai, S\. Yang, S\. Wang, S\. Tan, P\. Wang, J\. Lin, C\. Zhou, and J\. Zhou \(2023\)Qwen\-vl: A frontier large vision\-language model with versatile abilities\.CoRRabs/2308\.12966\.External Links:[Link](https://doi.org/10.48550/arXiv.2308.12966),[Document](https://dx.doi.org/10.48550/ARXIV.2308.12966),2308\.12966Cited by:[§5\.1](https://arxiv.org/html/2605.15019#S5.SS1.p1.1)\.
- Z\. Bai, P\. Wang, T\. Xiao, T\. He, Z\. Han, Z\. Zhang, and M\. Z\. Shou \(2024\)Hallucination of multimodal large language models: a survey\.arXiv preprint arXiv:2404\.18930\.Cited by:[§1](https://arxiv.org/html/2605.15019#S1.p1.1)\.
- G\. Chen, Y\. Yao, L\. S\. Chao, X\. Liu, and D\. F\. Wong \(2025a\)SGIC: A self\-guided iterative calibration framework for RAG\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2025, Vienna, Austria, July 27 \- August 1, 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),pp\. 28357–28370\.External Links:[Link](https://aclanthology.org/2025.acl-long.1376/)Cited by:[§1](https://arxiv.org/html/2605.15019#S1.p1.1)\.
- G\. Chen, Y\. Yao, C\. Gao, L\. S\. Chao, F\. Wan, and D\. F\. Wong \(2025b\)Not all lora parameters are essential: insights on inference necessity\.CoRRabs/2503\.23360\.External Links:[Link](https://doi.org/10.48550/arXiv.2503.23360),[Document](https://dx.doi.org/10.48550/ARXIV.2503.23360),2503\.23360Cited by:[Limitations](https://arxiv.org/html/2605.15019#Sx1.p1.1)\.
- W\. Chen, H\. Hu, X\. Chen, P\. Verga, and W\. W\. Cohen \(2022\)MuRAG: multimodal retrieval\-augmented generator for open question answering over images and text\.InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7\-11, 2022,Y\. Goldberg, Z\. Kozareva, and Y\. Zhang \(Eds\.\),pp\. 5558–5570\.External Links:[Link](https://doi.org/10.18653/v1/2022.emnlp-main.375),[Document](https://dx.doi.org/10.18653/V1/2022.EMNLP-MAIN.375)Cited by:[§2](https://arxiv.org/html/2605.15019#S2.SS0.SSS0.Px2.p1.1)\.
- X\. Chen, J\. Wu, S\. Yang, R\. Zhan, Z\. Wu, M\. Yang, S\. Huang, L\. S\. Chao, and D\. F\. Wong \(2026\)Neuron\-aware data selection in instruction tuning for large language models\.CoRRabs/2603\.13201\.External Links:[Link](https://doi.org/10.48550/arXiv.2603.13201),[Document](https://dx.doi.org/10.48550/ARXIV.2603.13201),2603\.13201Cited by:[Limitations](https://arxiv.org/html/2605.15019#Sx1.p1.1)\.
- Y\. Chen, H\. Hu, Y\. Luan, H\. Sun, S\. Changpinyo, A\. Ritter, and M\. Chang \(2023\)Can pre\-trained vision and language models answer visual information\-seeking questions?\.arXiv preprint arXiv:2302\.11713\.Cited by:[§2](https://arxiv.org/html/2605.15019#S2.SS0.SSS0.Px1.p1.1)\.
- T\. Cheng, L\. Song, Y\. Ge, W\. Liu, X\. Wang, and Y\. Shan \(2024\)Yolo\-world: real\-time open\-vocabulary object detection\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 16901–16911\.Cited by:[§4\.1](https://arxiv.org/html/2605.15019#S4.SS1.p1.1)\.
- Y\. K\. Chia, L\. Cheng, H\. P\. Chan, C\. Liu, M\. Song, S\. M\. Aljunied, S\. Poria, and L\. Bing \(2024\)M\-longdoc: a benchmark for multimodal super\-long document understanding and a retrieval\-aware tuning framework\.arXiv preprint arXiv:2411\.06176\.Cited by:[§3\.3](https://arxiv.org/html/2605.15019#S3.SS3.SSS0.Px2.p1.1)\.
- J\. Cho, D\. Mahata, O\. Irsoy, Y\. He, and M\. Bansal \(2024\)M3docrag: multi\-modal retrieval is what you need for multi\-page multi\-document understanding\.arXiv preprint arXiv:2411\.04952\.Cited by:[Table 1](https://arxiv.org/html/2605.15019#S1.T1.1.4.3.1)\.
- W\. Cui, K\. Bi, J\. Guo, and X\. Cheng \(2024\)MORE: multi\-modal retrieval augmented generative commonsense reasoning\.External Links:2402\.13625,[Link](https://arxiv.org/abs/2402.13625)Cited by:[§2](https://arxiv.org/html/2605.15019#S2.SS0.SSS0.Px2.p1.1)\.
- K\. Dong, Y\. Chang, X\. D\. Goh, D\. Li, R\. Tang, and Y\. Liu \(2025\)Mmdocir: benchmarking multi\-modal retrieval for long documents\.arXiv preprint arXiv:2501\.08828\.Cited by:[Table 1](https://arxiv.org/html/2605.15019#S1.T1.1.3.2.1)\.
- J\. Fang, Z\. Bi, R\. Wang, H\. Jiang, Y\. Gao, K\. Wang, A\. Zhang, J\. Shi, X\. Wang, and T\. Chua \(2024\)Towards neuron attributions in multimodal large language models\.InAdvances in Neural Information Processing Systems,Cited by:[§1](https://arxiv.org/html/2605.15019#S1.p1.1)\.
- Q\. Guo, S\. De Mello, H\. Yin, W\. Byeon, K\. C\. Cheung, Y\. Yu, P\. Luo, and S\. Liu \(2024\)RegionGPT: towards region understanding vision language model\.arXiv preprint arXiv:2403\.02330\.Cited by:[§1](https://arxiv.org/html/2605.15019#S1.p2.1)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, W\. Chen,et al\.\(2022\)Lora: low\-rank adaptation of large language models\.ICLR1\(2\),pp\. 3\.Cited by:[§A\.1](https://arxiv.org/html/2605.15019#A1.SS1.p1.1)\.
- J\. R\. Landis and G\. G\. Koch \(1977\)The measurement of observer agreement for categorical data\.Biometrics33\(1\),pp\. 159–174\.Cited by:[§6\.4](https://arxiv.org/html/2605.15019#S6.SS4.p2.3)\.
- P\. Lerner, O\. Ferret, C\. Guinaudeau, H\. Le Borgne, R\. Besançon, J\. G\. Moreno, and J\. Lovón Melgarejo \(2022\)ViQuAE, a dataset for knowledge\-based visual question answering about named entities\.InProceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval,SIGIR ’22,New York, NY, USA,pp\. 3108–3120\.External Links:ISBN 9781450387323,[Link](https://doi.org/10.1145/3477495.3531753),[Document](https://dx.doi.org/10.1145/3477495.3531753)Cited by:[§2](https://arxiv.org/html/2605.15019#S2.SS0.SSS0.Px1.p1.1)\.
- P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel,et al\.\(2020\)Retrieval\-augmented generation for knowledge\-intensive nlp tasks\.Advances in neural information processing systems33,pp\. 9459–9474\.Cited by:[§1](https://arxiv.org/html/2605.15019#S1.p1.1)\.
- J\. Li, D\. Li, S\. Savarese, and S\. Hoi \(2023\)BLIP\-2: bootstrapping language\-image pre\-training with frozen image encoders and large language models\.External Links:2301\.12597,[Link](https://arxiv.org/abs/2301.12597)Cited by:[§1](https://arxiv.org/html/2605.15019#S1.p1.1)\.
- C\. Lin \(2004\)ROUGE: a package for automatic evaluation of summaries\.InText Summarization Branches Out,Barcelona, Spain,pp\. 74–81\.External Links:[Link](https://aclanthology.org/W04-1013/)Cited by:[§3\.3](https://arxiv.org/html/2605.15019#S3.SS3.SSS0.Px2.p1.1)\.
- W\. Lin, J\. Chen, J\. Mei, A\. Coca, and B\. Byrne \(2023\)Fine\-grained late\-interaction multi\-modal retrieval for retrieval augmented visual question answering\.Advances in Neural Information Processing Systems36,pp\. 22820–22840\.Cited by:[§6\.2](https://arxiv.org/html/2605.15019#S6.SS2.p1.1)\.
- W\. Lin, J\. Mei, J\. Chen, and B\. Byrne \(2024\)PreFLMR: scaling up fine\-grained late\-interaction multi\-modal retrievers\.External Links:2402\.08327,[Link](https://arxiv.org/abs/2402.08327)Cited by:[§2](https://arxiv.org/html/2605.15019#S2.SS0.SSS0.Px2.p1.1)\.
- P\. Liu, X\. Liu, R\. Yao, J\. Liu, S\. Meng, D\. Wang, and J\. Ma \(2025\)Hm\-rag: hierarchical multi\-agent multimodal retrieval augmented generation\.InProceedings of the 33rd ACM International Conference on Multimedia,pp\. 2781–2790\.Cited by:[§1](https://arxiv.org/html/2605.15019#S1.p2.1)\.
- S\. Liu, Z\. Zeng, T\. Ren, F\. Li, H\. Zhang, J\. Yang, Q\. Jiang, C\. Li, J\. Yang, H\. Su, J\. Zhu, and L\. Zhang \(2024\)Grounding DINO: marrying DINO with grounded pre\-training for open\-set object detection\.InComputer Vision \- ECCV 2024 \- 18th European Conference, Milan, Italy, September 29\-October 4, 2024, Proceedings, Part XLVII,A\. Leonardis, E\. Ricci, S\. Roth, O\. Russakovsky, T\. Sattler, and G\. Varol \(Eds\.\),Lecture Notes in Computer Science,pp\. 38–55\.External Links:[Link](https://doi.org/10.1007/978-3-031-72970-6_3),[Document](https://dx.doi.org/10.1007/978-3-031-72970-6%5F3)Cited by:[§5\.3](https://arxiv.org/html/2605.15019#S5.SS3.p3.1)\.
- Z\. Liu, C\. Xiong, Y\. Lv, Z\. Liu, and G\. Yu \(2023\)Universal vision\-language dense retrieval: learning a unified representation space for multi\-modal retrieval\.External Links:2209\.00179,[Link](https://arxiv.org/abs/2209.00179)Cited by:[§2](https://arxiv.org/html/2605.15019#S2.SS0.SSS0.Px2.p1.1)\.
- X\. Ma, S\. Zhuang, B\. Koopman, G\. Zuccon, W\. Chen, and J\. Lin \(2024a\)VISA: retrieval augmented generation with visual source attribution\.External Links:2412\.14457,[Link](https://arxiv.org/abs/2412.14457)Cited by:[§2](https://arxiv.org/html/2605.15019#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Ma, Y\. Zang, L\. Chen, M\. Chen, Y\. Jiao, X\. Li, X\. Lu, Z\. Liu, Y\. Ma, X\. Dong,et al\.\(2024b\)Mmlongbench\-doc: benchmarking long\-context document understanding with visualizations\.Advances in Neural Information Processing Systems37,pp\. 95963–96010\.Cited by:[Table 1](https://arxiv.org/html/2605.15019#S1.T1.1.5.4.1)\.
- K\. Marino, M\. Rastegari, A\. Farhadi, and R\. Mottaghi \(2019\)OK\-vqa: a visual question answering benchmark requiring external knowledge\.External Links:1906\.00067,[Link](https://arxiv.org/abs/1906.00067)Cited by:[§2](https://arxiv.org/html/2605.15019#S2.SS0.SSS0.Px1.p1.1)\.
- T\. Mensink, J\. Uijlings, L\. Castrejon, A\. Goel, F\. Cadar, H\. Zhou, F\. Sha, A\. Araujo, and V\. Ferrari \(2023\)Encyclopedic vqa: visual questions about detailed properties of fine\-grained categories\.External Links:2306\.09224,[Link](https://arxiv.org/abs/2306.09224)Cited by:[§2](https://arxiv.org/html/2605.15019#S2.SS0.SSS0.Px1.p1.1)\.
- OpenAI \(2024\)GPT\-4o system card\.External Links:2410\.21276,[Link](https://arxiv.org/abs/2410.21276),[Document](https://dx.doi.org/10.48550/arXiv.2410.21276)Cited by:[§3\.3](https://arxiv.org/html/2605.15019#S3.SS3.SSS0.Px2.p1.1),[§5\.1](https://arxiv.org/html/2605.15019#S5.SS1.p1.1)\.
- Z\. Peng, W\. Wang, L\. Dong, Y\. Hao, S\. Huang, S\. Ma, and F\. Wei \(2023\)Kosmos\-2: grounding multimodal large language models to the world\.arXiv preprint arXiv:2306\.14824\.Cited by:[§1](https://arxiv.org/html/2605.15019#S1.p2.1)\.
- S\. Pramanick, R\. Chellappa, and S\. Venugopalan \(2024\)Spiqa: a dataset for multimodal question answering on scientific papers\.Advances in Neural Information Processing Systems37,pp\. 118807–118833\.Cited by:[Table 1](https://arxiv.org/html/2605.15019#S1.T1.1.6.5.1)\.
- A\. Radford, J\. W\. Kim, C\. Hallacy, A\. Ramesh, G\. Goh, S\. Agarwal, G\. Sastry, A\. Askell, P\. Mishkin, J\. Clark,et al\.\(2021\)Learning transferable visual models from natural language supervision\.InInternational conference on machine learning,pp\. 8748–8763\.Cited by:[§6\.2](https://arxiv.org/html/2605.15019#S6.SS2.p1.1)\.
- D\. Schwenk, A\. Khandelwal, C\. Clark, K\. Marino, and R\. Mottaghi \(2022\)A\-okvqa: a benchmark for visual question answering using world knowledge\.External Links:2206\.01718,[Link](https://arxiv.org/abs/2206.01718)Cited by:[§2](https://arxiv.org/html/2605.15019#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Sun, C\. Peng, Y\. Yan, S\. Yu, Z\. Liu, C\. Chen, Z\. Liu, and M\. Sun \(2025\)VisRAG 2\.0: evidence\-guided multi\-image reasoning in visual retrieval\-augmented generation\.CoRRabs/2510\.09733\.External Links:[Link](https://doi.org/10.48550/arXiv.2510.09733),[Document](https://dx.doi.org/10.48550/ARXIV.2510.09733),2510\.09733Cited by:[§2](https://arxiv.org/html/2605.15019#S2.SS0.SSS0.Px2.p1.1),[§6\.2](https://arxiv.org/html/2605.15019#S6.SS2.p1.1)\.
- G\. Team and Google \(2025\)Gemini 2\.5: pushing the frontier with advanced reasoning, multimodal capabilities, and long context\.arXiv preprint arXiv:2507\.06261\.External Links:[Link](https://arxiv.org/abs/2507.06261)Cited by:[§3\.3](https://arxiv.org/html/2605.15019#S3.SS3.SSS0.Px2.p1.1)\.
- Q\. Team \(2025\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[§A\.3](https://arxiv.org/html/2605.15019#A1.SS3.p1.1),[§B\.2](https://arxiv.org/html/2605.15019#A2.SS2.SSS0.Px3.p1.5)\.
- Y\. Tian, F\. Liu, J\. Zhang, V\. W\., Y\. Hu, and L\. Nie \(2025\)CoRe\-mmrag: cross\-source knowledge reconciliation for multimodal rag\.External Links:2506\.02544,[Link](https://arxiv.org/abs/2506.02544)Cited by:[§2](https://arxiv.org/html/2605.15019#S2.SS0.SSS0.Px2.p1.1)\.
- D\. Ustalov, N\. Pavlichenko, S\. Koshelev, D\. Likhobaba, and A\. Smirnova \(2023\)Toloka visual question answering benchmark\.External Links:2309\.16511,[Link](https://arxiv.org/abs/2309.16511)Cited by:[§2](https://arxiv.org/html/2605.15019#S2.SS0.SSS0.Px1.p1.1)\.
- N\. Wasserman, R\. Pony, O\. Naparstek, A\. R\. Goldfarb, E\. Schwartz, U\. Barzelay, and L\. Karlinsky \(2025\)REAL\-mm\-rag: a real\-world multi\-modal retrieval benchmark\.arXiv preprint arXiv:2502\.12342\.Cited by:[Table 1](https://arxiv.org/html/2605.15019#S1.T1.1.2.1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou,et al\.\(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.Advances in neural information processing systems35,pp\. 24824–24837\.Cited by:[§1](https://arxiv.org/html/2605.15019#S1.p4.1),[§5\.2](https://arxiv.org/html/2605.15019#S5.SS2.p1.3)\.
- S\. Xi, C\. Yang, H\. Ding, Y\. Ni, C\. C\. Liu, Y\. Liu, and C\. Zhang \(2025\)Taming a retrieval framework to read images in humanlike manner for augmenting generation of mllms\.External Links:2510\.10426,[Link](https://arxiv.org/abs/2510.10426)Cited by:[§2](https://arxiv.org/html/2605.15019#S2.SS0.SSS0.Px2.p1.1)\.
- H\. Xu, R\. Zhan, Y\. Ma, D\. F\. Wong, and L\. S\. Chao \(2025a\)Let’s focus on neuron: neuron\-level supervised fine\-tuning for large language model\.InProceedings of the 31st International Conference on Computational Linguistics, COLING 2025, Abu Dhabi, UAE, January 19\-24, 2025,O\. Rambow, L\. Wanner, M\. Apidianaki, H\. Al\-Khalifa, B\. D\. Eugenio, and S\. Schockaert \(Eds\.\),pp\. 9393–9406\.External Links:[Link](https://aclanthology.org/2025.coling-main.630/)Cited by:[Limitations](https://arxiv.org/html/2605.15019#Sx1.p1.1)\.
- M\. Xu, Z\. Wang, H\. Cai, and R\. Zhong \(2025b\)A multi\-granularity retrieval framework for visually\-rich documents\.External Links:2505\.01457,[Link](https://arxiv.org/abs/2505.01457)Cited by:[§2](https://arxiv.org/html/2605.15019#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Yan and W\. Xie \(2024\)EchoSight: advancing visual\-language models with wiki knowledge\.InFindings of the Association for Computational Linguistics: EMNLP 2024,pp\. 1538–1551\.External Links:[Link](http://dx.doi.org/10.18653/v1/2024.findings-emnlp.83),[Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.83)Cited by:[§2](https://arxiv.org/html/2605.15019#S2.SS0.SSS0.Px2.p1.1)\.
- W\. Yang, J\. Fu, R\. Wang, J\. Wang, L\. Song, and J\. Bian \(2025\)OMGM: orchestrate multiple granularities and modalities for efficient multimodal retrieval\.External Links:2505\.07879,[Link](https://arxiv.org/abs/2505.07879)Cited by:[§2](https://arxiv.org/html/2605.15019#S2.SS0.SSS0.Px2.p1.1)\.
- S\. Yu, C\. Tang, B\. Xu, J\. Cui, J\. Ran, Y\. Yan, Z\. Liu, S\. Wang, X\. Han, Z\. Liu, and M\. Sun \(2025a\)VisRAG: vision\-based retrieval\-augmented generation on multi\-modality documents\.InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025,External Links:[Link](https://openreview.net/forum?id=zG459X3Xge)Cited by:[§1](https://arxiv.org/html/2605.15019#S1.p1.1)\.
- S\. Yu, C\. Tang, B\. Xu, J\. Cui, J\. Ran, Y\. Yan, Z\. Liu, S\. Wang, X\. Han, Z\. Liu, and M\. Sun \(2025b\)VisRAG: vision\-based retrieval\-augmented generation on multi\-modality documents\.External Links:2410\.10594,[Link](https://arxiv.org/abs/2410.10594)Cited by:[§2](https://arxiv.org/html/2605.15019#S2.SS0.SSS0.Px1.p1.1)\.
- W\. Yu, W\. Chen, G\. Qi, W\. Li, Y\. Li, L\. Sha, D\. Xia, and J\. Huang \(2025c\)BBox docvqa: a large scale bounding box grounded dataset for enhancing reasoning in document visual question answer\.External Links:2511\.15090,[Link](https://arxiv.org/abs/2511.15090)Cited by:[§2](https://arxiv.org/html/2605.15019#S2.SS0.SSS0.Px1.p1.1)\.
- H\. Zhang, H\. You, P\. Dufter, B\. Zhang, C\. Chen, H\. Chen, T\. Fu, W\. Y\. Wang, S\. Chang, Z\. Gan, and Y\. Yang \(2024a\)Ferret\-v2: an improved baseline for referring and grounding with large language models\.External Links:2404\.07973,[Link](https://arxiv.org/abs/2404.07973)Cited by:[§2](https://arxiv.org/html/2605.15019#S2.SS0.SSS0.Px2.p1.1)\.
- Q\. Zhang, M\. Chen, A\. Bukharin, N\. Karampatziakis, P\. He, Y\. Cheng, W\. Chen, and T\. Zhao \(2023\)Adalora: adaptive budget allocation for parameter\-efficient fine\-tuning\.arXiv preprint arXiv:2303\.10512\.Cited by:[Limitations](https://arxiv.org/html/2605.15019#Sx1.p1.1)\.
- T\. Zhang, Z\. Zhang, Z\. Ma, Y\. Chen, Z\. Qi, C\. Yuan, B\. Li, J\. Pu, Y\. Zhao, Z\. Xie, J\. Ma, Y\. Shan, and W\. Hu \(2024b\)MR2ag: multimodal retrieval\-reflection\-augmented generation for knowledge\-based vqa\.External Links:2411\.15041,[Link](https://arxiv.org/abs/2411.15041)Cited by:[§2](https://arxiv.org/html/2605.15019#S2.SS0.SSS0.Px2.p1.1)\.
- T\. Zhang, V\. Kishore, F\. Wu, K\. Q\. Weinberger, and Y\. Artzi \(2020\)BERTScore: evaluating text generation with bert\.External Links:1904\.09675,[Link](https://arxiv.org/abs/1904.09675)Cited by:[§3\.3](https://arxiv.org/html/2605.15019#S3.SS3.SSS0.Px2.p1.1)\.
- Y\. Zhang, Z\. Ma, X\. Gao, S\. Shakiah, Q\. Gao, and J\. Chai \(2024c\)GROUNDHOG: grounding large language models to holistic segmentation\.External Links:2402\.16846,[Link](https://arxiv.org/abs/2402.16846)Cited by:[§2](https://arxiv.org/html/2605.15019#S2.SS0.SSS0.Px2.p1.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica \(2023\)Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.External Links:2306\.05685,[Link](https://arxiv.org/abs/2306.05685)Cited by:[§3\.3](https://arxiv.org/html/2605.15019#S3.SS3.SSS0.Px2.p1.1)\.
- J\. Zhou, Z\. Liu, S\. Xiao, B\. Zhao, and Y\. Xiong \(2024a\)VISTA: visualized text embedding for universal multi\-modal retrieval\.External Links:2406\.04292,[Link](https://arxiv.org/abs/2406.04292)Cited by:[§2](https://arxiv.org/html/2605.15019#S2.SS0.SSS0.Px2.p1.1)\.
- T\. Zhou, S\. Mei, X\. Li, Z\. Liu, C\. Xiong, Z\. Liu, Y\. Gu, and G\. Yu \(2024b\)MARVEL: unlocking the multi\-modal capability of dense retrieval via visual module plugin\.External Links:2310\.14037,[Link](https://arxiv.org/abs/2310.14037)Cited by:[§2](https://arxiv.org/html/2605.15019#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Zhu, Z\. Liu, Y\. Liang, X\. Li, H\. Liu, C\. Bao, and L\. Xu \(2023\)Locate then generate: bridging vision and language with bounding box for scene\-text vqa\.External Links:2304\.01603,[Link](https://arxiv.org/abs/2304.01603)Cited by:[§2](https://arxiv.org/html/2605.15019#S2.SS0.SSS0.Px2.p1.1)\.

## Appendix AAppendix

### A\.1Implementation

For all evaluated MLLMs, we use a unified decoding setup with temperature set to 0\.1 and maximum generation length limited to 600 tokens\. This low\-temperature setting reduces randomness and ensures stable, reproducible outputs during evaluation\. In addition, we fine\-tune Qwen3\-VL\-8B \(https://huggingface\.co/Qwen/Qwen3\-VL\-8B\-Instruct\) using the LLaMA\-Factory framework \(https://github\.com/hiyouga/LLaMA\-Factory\) on a single H800 GPU, with the LoRA Hu et al\. \([2022](https://arxiv.org/html/2605.15019#bib.bib35)\) rank set to 8\. The sequence length is set to 4096, with a batch size of 4 and gradient accumulation steps of 4\. We train for 3 epochs on the training set with a learning rate of 1e\-4\.
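The reported hyperparameters can be collected in one place, as in the sketch below\. The values come from the paragraph above; the key names are assumptions \(the training keys mirror LLaMA\-Factory argument names as commonly written in its example configs, and the decoding keys are hypothetical\), so they should be checked against the framework documentation before use\. The batch size of 4 is assumed to be per device\.

```python
# Decoding settings shared by all evaluated MLLMs (key names are hypothetical).
generation_config = {
    "temperature": 0.1,
    "max_new_tokens": 600,
}

# Fine-tuning settings for Qwen3-VL-8B (key names assumed, values from the text).
train_config = {
    "model_name_or_path": "Qwen/Qwen3-VL-8B-Instruct",
    "stage": "sft",
    "finetuning_type": "lora",
    "lora_rank": 8,
    "cutoff_len": 4096,
    "per_device_train_batch_size": 4,  # assumed to be per device
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 3,
    "learning_rate": 1e-4,
}

num_gpus = 1  # single H800, as stated above

# Number of samples contributing to each optimizer step.
effective_batch_size = (
    train_config["per_device_train_batch_size"]
    * train_config["gradient_accumulation_steps"]
    * num_gpus
)
```

Under these assumptions, each optimizer step aggregates gradients from 16 samples\.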

### A\.2Sensitivity Analysis of Overlap Threshold

We conduct a sensitivity analysis on the overlap threshold used in our redundancy filtering step\. Table[8](https://arxiv.org/html/2605.15019#A1.T8)reports the results across six threshold values ranging from 70% to 100%\. As shown, performance follows a clear and intuitive trend: lower thresholds \(e\.g\., 70%\) lead to over\-removal of informative elements, which hurts the quality of the generated output, while higher thresholds \(e\.g\., 90% and above\) retain too much redundancy and also degrade performance\. The best results are achieved at 80%, with a ROUGE\-L of 32\.01, BERT\-F1 of 51\.60, and LLM Score of 83\.90\. Importantly, performance remains consistently strong across the 75%–85% range, indicating that our method is not overly sensitive to this hyperparameter\. We select 80% as our default threshold since it sits at the natural midpoint of this effective range and yields the best overall performance\.
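To make the role of the threshold concrete, here is a hedged sketch of threshold\-based redundancy filtering over bounding boxes\. The overlap definition \(intersection area divided by the smaller box's area\), the greedy larger\-box\-first order, and all names are assumptions made for illustration; this appendix does not restate the exact procedure used in the paper\.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def area(box: Box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection area divided by the smaller box's area (assumed definition)."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    smaller = min(area(a), area(b))
    return (inter_w * inter_h) / smaller if smaller > 0 else 0.0


def filter_redundant(boxes: List[Box], threshold: float = 0.8) -> List[Box]:
    """Visit boxes from largest to smallest; drop any box whose overlap with
    an already kept box reaches the threshold."""
    kept: List[Box] = []
    for box in sorted(boxes, key=area, reverse=True):
        if all(overlap_ratio(box, other) < threshold for other in kept):
            kept.append(box)
    return kept


# Toy boxes: the second overlaps the first by 90%, the third only slightly.
toy_boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (8, 8, 14, 14)]
kept_default = filter_redundant(toy_boxes, threshold=0.8)
kept_lenient = filter_redundant(toy_boxes, threshold=0.95)
```

With this toy input, the default threshold removes the heavily overlapping second box, while the more lenient 0\.95 threshold retains it, mirroring the trade\-off between over\-removal and retained redundancy described above\.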

Table 8: Sensitivity analysis of the overlap threshold\. Bold means the best performance\.
### A\.3Synthesize Fine\-tuned Data

For the fine\-tuning task, we construct our test set by randomly selecting 20% of the partial landmark images \(those containing incomplete visual elements\), with the remainder forming the training set\. Critically, we ensure that approximately half of the scenic spots are entirely absent from the training set to evaluate the model’s robustness on unseen locations\. The related analysis is shown in Section [6\.1](https://arxiv.org/html/2605.15019#S6.SS1)\. We then employ Qwen3\-VL\-235B\-A22B\-Thinking Team \([2025](https://arxiv.org/html/2605.15019#bib.bib36)\) to synthesize reasoning data for both sets by populating the prompt with three components: the landmark image, annotated visual elements with descriptions, and the target commentary text\. We prompt the model to generate reasoning processes from input to target output using two approaches: its native chain\-of\-thought style and our proposed pipeline framework\.

To ensure data quality, each synthesized reasoning process undergoes automated scoring by an LLM, and samples that fail to fully meet our criteria are regenerated\. When a sample fails validation three consecutive times, human annotators intervene to manually revise the reasoning process\. Through this rigorous pipeline combining automated synthesis with quality control, we construct a high\-quality dataset for fine\-tuning LLMs to learn both standard CoT reasoning and our pipeline\-based thinking process\.
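The regenerate\-then\-escalate loop described above can be sketched as follows\. The callables `generate`, `passes_check`, and `human_revise` are hypothetical placeholders for the synthesis model, the LLM\-based scorer, and the manual revision step; the paper does not specify their interfaces\.

```python
from typing import Callable, Tuple


def synthesize_with_qc(
    generate: Callable[[], str],
    passes_check: Callable[[str], bool],
    human_revise: Callable[[str], str],
    max_attempts: int = 3,
) -> Tuple[str, str]:
    """Return (reasoning, source), where source is 'auto' or 'human'."""
    candidate = ""
    for _ in range(max_attempts):
        candidate = generate()
        if passes_check(candidate):
            return candidate, "auto"
    # Every attempt failed validation: hand the last draft to a human annotator.
    return human_revise(candidate), "human"


# Toy run: the third draft passes the automated check.
drafts = iter(["bad draft", "bad draft", "good draft"])
result = synthesize_with_qc(
    generate=lambda: next(drafts),
    passes_check=lambda text: text.startswith("good"),
    human_revise=lambda text: text + " (revised)",
)

# Toy run: every draft fails, so the sample is escalated.
escalated = synthesize_with_qc(
    generate=lambda: "bad draft",
    passes_check=lambda text: False,
    human_revise=lambda text: text + " (revised)",
)
```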

### A\.4Attribution Evaluation

Beyond factual correctness, we also evaluate whether LLMs can trace their claims back to specific visual evidence\. A description may sound plausible yet be entirely ungrounded in what the model actually perceives\. We measure this with three metrics: Attribution Precision \(AP\), the fraction of cited elements that exist in the predicted visible set $\hat{E}(I)$; Attribution Recall \(AR\), the fraction of elements in $\hat{E}(I)$ that are actually cited in the output; and Unsupported Claim Rate \(UCR\), the fraction of output sentences not grounded by any element\.
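One way to compute these three metrics is sketched below\. It assumes that citations have already been extracted per output sentence as sets of element names, that AP and AR are computed over unique elements, and that a sentence counts as unsupported when it cites no element from the predicted visible set; these choices and the function name `attribution_metrics` are illustrative assumptions, not the paper's evaluation code\.

```python
from typing import Dict, List, Set


def attribution_metrics(
    sentence_citations: List[Set[str]], visible: Set[str]
) -> Dict[str, float]:
    """Compute AP, AR, and UCR from per-sentence citations and the visible set."""
    cited: Set[str] = set().union(*sentence_citations)
    grounded = cited & visible

    ap = len(grounded) / len(cited) if cited else 0.0
    ar = len(grounded) / len(visible) if visible else 0.0

    # A sentence is unsupported if none of its citations is in the visible set.
    unsupported = sum(1 for cites in sentence_citations if not (cites & visible))
    ucr = unsupported / len(sentence_citations) if sentence_citations else 0.0
    return {"AP": ap, "AR": ar, "UCR": ucr}


# Toy output with three sentences: one grounded, one citing a hallucinated
# element, and one citing nothing.
toy_citations = [{"pediment", "dove relief"}, {"twin towers"}, set()]
toy_visible = {"pediment", "dove relief", "arch", "window"}
scores = attribution_metrics(toy_citations, toy_visible)
```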

Table 9: Attribution evaluation results\. AP = Attribution Precision, AR = Attribution Recall, UCR = Unsupported Claim Rate\. † denotes the fine\-tuned variant\. Bold marks the best score per metric\. Underline means the best results for each LLM\.

As shown in Table [9](https://arxiv.org/html/2605.15019#A1.T9), all models improve consistently from Setting \(A\) to \(C\), confirming that structured evidence grounding helps anchor generated text to perceived visual content\. Fine\-tuning also brings clear gains: Qwen3\-VL\-8B† improves recall from 0\.45 to 0\.58 under Setting \(C\) compared to its base version\. Among proprietary models, GPT\-4o achieves the best overall profile, while Claude\-3\.5\-Sonnet shows high precision but low recall, suggesting it cites carefully but sparingly\. Notably, even the best configuration leaves over half of output sentences unsupported, indicating that faithful visual attribution remains an open challenge\.

### A.5 Case Study

Table [11](https://arxiv.org/html/2605.15019#A2.T11) presents a color-coded case study that directly compares element-level grounding and description quality across methods. Each color family corresponds to one architectural element, with darker shades indicating more accurate descriptions. Specifically, blue highlights track the Christian emblem, green highlights track windows, and pink denotes the pediment. This visualization reveals clear differences: the baseline misses the pediment entirely, CoT hallucinates non-existent twin towers, while GranuRAG stays faithful to the visible evidence and aligns closely with the ground truth.

From an element selection perspective, the baseline fails to identify the prominent pediment, resulting in incomplete structural coverage\. CoT, though more detailed, introduces twin towers that do not exist in the image, a typical hallucination risk in CoT reasoning\. GranuRAG correctly identifies all key elements including the pediment and Christian emblem while avoiding both omissions and hallucinations\. In terms of generation quality, the shade progression reflects meaningful differences: for the Christian emblem \(blue\), GranuRAG provides fine\-grained symbolic details \(e\.g\., three nails, IHS inscription\) that match the ground truth, while baseline and CoT remain superficial\. For windows \(green\), GranuRAG delivers complete structural descriptions versus partial coverage by other methods\. Most notably, the pediment \(pink\) appears only in GranuRAG’s output, demonstrating that our method improves both element recall and description accuracy\.

## Appendix B Dataset Construction Details

### B.1 Data Schema

We organize knowledge at two granularities to support hierarchical retrieval:

Landmark-level representation:

$$x_{\mathrm{attr}} = (\mathrm{meta}, E, \mathrm{ED}, I^{\mathrm{rep}}), \tag{6}$$

where $\mathrm{meta} = \{\text{Landmarks, Summary, Style}\}$ contains high-level context; $E = \{e_1, \ldots, e_k\}$ is the landmark-specific element inventory; $\mathrm{ED}: E \to \text{Paragraphs}$ maps each element to its detailed description; and $I^{\mathrm{rep}}$ is a representative image for visualization.

Image\-level representation:

$$z = (I, a(I), E^{\mathrm{gt}}(I)), \tag{7}$$

where $I$ is an image, $a(I)$ denotes its source landmark, and $E^{\mathrm{gt}}(I) \subseteq E$ specifies the ground-truth visible elements.

The structured JSON schema follows:

```
{
  "Landmarks": "[NAME]",
  "Summary": "[description]",
  "Style": "[style_notes]",
  "Elements": ["[elem_1]", ...,
    "[elem_k]"],
  "ElementDescriptions": {
    "[elem_i]": "[paragraph]",
    ...
  },
  "Image": "[image_path]"
}
```
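To make the two granularities concrete, the following sketch maps the JSON schema and the image-level tuple onto typed Python records. The class names (`LandmarkRecord`, `ImageRecord`), the helper `is_consistent_with`, and all example values are hypothetical placeholders introduced for illustration; only the JSON keys come from the schema above.

```python
import json
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass
class LandmarkRecord:
    """Landmark-level entry x_attr = (meta, E, ED, I_rep)."""
    name: str                             # meta: "Landmarks"
    summary: str                          # meta: "Summary"
    style: str                            # meta: "Style"
    elements: List[str]                   # E
    element_descriptions: Dict[str, str]  # ED: element -> paragraph
    rep_image: str                        # I_rep (path)

    @classmethod
    def from_json(cls, text: str) -> "LandmarkRecord":
        raw = json.loads(text)
        return cls(
            name=raw["Landmarks"],
            summary=raw["Summary"],
            style=raw["Style"],
            elements=list(raw["Elements"]),
            element_descriptions=dict(raw["ElementDescriptions"]),
            rep_image=raw["Image"],
        )


@dataclass
class ImageRecord:
    """Image-level entry z = (I, a(I), E_gt(I))."""
    image_path: str                   # I
    landmark: str                     # a(I)
    visible_elements: FrozenSet[str]  # E_gt(I), a subset of E

    def is_consistent_with(self, landmark: LandmarkRecord) -> bool:
        # E_gt(I) must be a subset of the landmark's element inventory E.
        return (
            self.landmark == landmark.name
            and self.visible_elements <= set(landmark.elements)
        )


# Placeholder values, not taken from the dataset.
raw_json = """
{
  "Landmarks": "Example Church",
  "Summary": "Placeholder summary.",
  "Style": "Baroque",
  "Elements": ["Pediment", "Windows"],
  "ElementDescriptions": {
    "Pediment": "Placeholder paragraph.",
    "Windows": "Placeholder paragraph."
  },
  "Image": "images/example_church/rep.jpg"
}
"""
landmark = LandmarkRecord.from_json(raw_json)
view = ImageRecord(
    image_path="images/example_church/0001.jpg",
    landmark="Example Church",
    visible_elements=frozenset({"Windows"}),
)
print(view.is_consistent_with(landmark))  # True
```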

### B.2 Construction Pipeline

#### Stage 1: Data Collection

We collect textual materials from encyclopedic sources including Wikipedia and Baidu Baike, as well as from official heritage and tourism portals such as the Macao World Heritage website ([https://www.wh.mo/cn/](https://www.wh.mo/cn/)) and the Macao Government Tourism Office website ([https://www.macaotourism.gov.mo/zh-hant/sightseeing](https://www.macaotourism.gov.mo/zh-hant/sightseeing)). Images are sourced from Google Images, Rednote, and Trip.com using landmark names as search queries to ensure viewpoint diversity. All data are collected for non-commercial research purposes only. In addition, we perform data sanitization to remove privacy-sensitive and potentially harmful content, employing a hybrid approach that combines LLM-based filtering with manual spot-checking. All images are collected exclusively for non-commercial academic research under the principle of fair use.

We perform comprehensive data sanitization through a rigorous manual screening process. Trained annotators inspect each image individually following a strict exclusion protocol: any image containing one or more of the following is discarded:

- Visible watermarks, platform logos, or copyright overlays
- Timestamps, date stamps, or camera metadata overlays
- Geolocation tags, map overlays, or address information
- License plates, vehicle identifiers, or transportation tags
- Identity documents, tickets, receipts, or personal contact info
- Recognizable human faces and personal information

This policy prioritizes privacy protection over data retention. After manual screening, we apply quality filtering to retain only images with resolution $\geq$ 512px and no visible compression artifacts. The final dataset contains 1,422 images that have passed both privacy and quality checks, ensuring suitability for academic research purposes.

#### Stage 2: Knowledge Structuring

Raw textual materials are consolidated into the normalized JSON schema through a four-step process. We first extract element phrases from authoritative descriptions, then apply normalization rules (Appendix [B.3](https://arxiv.org/html/2605.15019#A2.SS3)). Next, we generate element descriptions via LLM (Appendix [B.4](https://arxiv.org/html/2605.15019#A2.SS4)) and finally validate all outputs (Appendix [B.5](https://arxiv.org/html/2605.15019#A2.SS5)).

#### Stage 3: Image Annotation

For each image $I$, we provide Qwen3-VL-Max (Team, [2025](https://arxiv.org/html/2605.15019#bib.bib36)) with the element inventory $E$ and descriptions $\mathrm{ED}$ as context. The model proposes candidate visible elements $\hat{E}(I) \subseteq E$, which human annotators then refine into final ground-truth labels $E^{\mathrm{gt}}(I)$ following strict visibility guidelines detailed in Appendix [B.6](https://arxiv.org/html/2605.15019#A2.SS6).

#### Stage 4: Indexing

We embed all images using CLIP, yielding 512-dimensional vectors, and L2-normalize them. A FAISS flat L2 index is built with aligned metadata that maps vector IDs to landmark names, image paths, and ground-truth element sets, enabling source attribution during retrieval.
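A flat L2 index is exhaustive nearest-neighbour search, so the idea of vector IDs aligned with metadata can be illustrated without any external library. The sketch below is a pure-Python stand-in, not the FAISS or CLIP API: the class `FlatL2Index`, the 3-dimensional toy vectors (in place of 512-dimensional CLIP embeddings), and all metadata values are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]


class FlatL2Index:
    """Exhaustive L2 search with metadata aligned to vector IDs (toy stand-in)."""

    def __init__(self) -> None:
        self.vectors: List[List[float]] = []
        self.metadata: List[Dict[str, object]] = []

    def add(self, vec: Sequence[float], meta: Dict[str, object]) -> int:
        self.vectors.append(l2_normalize(vec))
        self.metadata.append(meta)
        return len(self.vectors) - 1  # the vector ID doubles as the metadata index

    def search(self, query: Sequence[float], k: int = 1) -> List[Tuple[float, Dict[str, object]]]:
        q = l2_normalize(query)
        # Squared L2 distance from the query to every stored vector.
        scored = [
            (sum((a - b) ** 2 for a, b in zip(q, v)), i)
            for i, v in enumerate(self.vectors)
        ]
        scored.sort()
        return [(dist, self.metadata[i]) for dist, i in scored[:k]]


# Toy vectors and hypothetical metadata.
index = FlatL2Index()
index.add([1.0, 0.0, 0.0], {
    "landmark": "landmark_a", "image_path": "a/0001.jpg", "elements": {"Pediment"},
})
index.add([0.0, 2.0, 0.0], {
    "landmark": "landmark_b", "image_path": "b/0001.jpg", "elements": {"Windows"},
})
hits = index.search([0.9, 0.1, 0.0], k=1)
print(hits[0][1]["landmark"])  # landmark_a
```

Because every hit carries its metadata record, a retrieved vector can be traced back to its landmark, image path, and ground-truth element set, which is what makes source attribution possible at retrieval time.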

Table 10: Element detection results with normalization applied. Bold means the best result.

### B.3 Normalization Procedures

We apply three normalization strategies to ensure cross-landmark consistency. Language unification consolidates multilingual variants into canonical keys and enforces consistent orthography. Synonym collapsing merges semantically equivalent terms under lightweight taxonomy tags, such as grouping “bell tower” with “campanile,” “granite stone” with “granite,” and “carved motif” with “ornamental carving.” Cross-landmark alignment ensures that shared architectural features like “Baroque façade” receive identical identifiers across different landmarks. These steps are critical for two reasons: they prevent the element matching stage (Section [4.2](https://arxiv.org/html/2605.15019#S4.SS2)) from failing due to superficial lexical mismatches across knowledge sources, and they enable fair evaluation by ensuring the same element is not counted as two different entities, which would artificially deflate Precision and Recall. As shown in Table [10](https://arxiv.org/html/2605.15019#A2.T10), even with normalization applied, element detection remains challenging, confirming that the task difficulty is genuine rather than an artifact of inconsistent naming.
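The effect of synonym collapsing on evaluation can be seen in a small sketch. The synonym table below contains only the three pairs quoted in the text; which term of each pair serves as the canonical key, the lowercase-and-trim rule, and the function names are assumptions made for illustration rather than the authors' actual rules.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical synonym table: variant -> canonical key (direction is assumed).
CANONICAL: Dict[str, str] = {
    "campanile": "bell tower",
    "granite stone": "granite",
    "carved motif": "ornamental carving",
}


def normalize(term: str) -> str:
    """Lowercase, collapse whitespace, then map synonyms to a canonical key."""
    key = " ".join(term.lower().split())
    return CANONICAL.get(key, key)


def precision_recall(
    pred: Iterable[str], gold: Iterable[str], use_norm: bool
) -> Tuple[float, float]:
    """Set-based Precision and Recall, with or without normalization."""
    p = {normalize(t) if use_norm else t for t in pred}
    g = {normalize(t) if use_norm else t for t in gold}
    hits = len(p & g)
    precision = hits / len(p) if p else 0.0
    recall = hits / len(g) if g else 0.0
    return precision, recall


predicted = ["Campanile", "granite stone"]
ground_truth = ["bell tower", "granite"]

raw_scores = precision_recall(predicted, ground_truth, use_norm=False)
norm_scores = precision_recall(predicted, ground_truth, use_norm=True)
print(raw_scores)   # (0.0, 0.0)
print(norm_scores)  # (1.0, 1.0)
```

Without normalization, the prediction and the ground truth share no string even though they name the same two elements, so both scores collapse to zero. This is the lexical-mismatch deflation that the normalization step is meant to prevent.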

| Field | Content |
| --- | --- |
| Input | Please carefully observe the visible details in the image corresponding to these elements, focusing your description on what is actually visible in the image. Below is the complete element description (ED) for the image. ED: Twin Towers: The top floor features symmetrical bell towers on both sides… Windows: Smaller side windows with arched lintels and surrounding relief decorations… Pediment: The central pediment is … Christian Emblem: Below the emblem are three nails commemorating… |
| Original Image | ![Uncaptioned image](https://arxiv.org/html/2605.15019v1/x14.png) |
| GT Text | The facade of this church spans 24.6 meters in width. Its pediment displays quintessential Baroque style, adorned centrally with an exquisite Jesuit coat of arms. This christian emblem, rendered in symmetrical geometric patterns and relief carvings, serves as a common religious symbol in Jesuit churches. Directly below the coat of arms, three nails symbolize the historical events of Jesus’ crucifixion. Above the coat of arms, the letters “IHS” are carved, derived from the initials of Jesus’ Greek name “$\mathrm{IHSOUS}$,” commonly used as an abbreviation for Jesus Christ, signifying “Jesus is the Savior of mankind.” The pilasters flanking the coat of arms are adorned with broken pediments. The central section of the church’s second level features a large rectangular window with louvered panels, flanked by smaller arched windows. The lintels are decorated with scrollwork and complemented by carved details. This layout aligns with the Baroque architectural pursuit of symmetry and opulence. |

| Method | Annotated Image | Generated Text |
| --- | --- | --- |
| Baseline | ![Uncaptioned image](https://arxiv.org/html/2605.15019v1/x15.png) | The church’s main facade presents a massive wall 24.6 meters wide, divided into three vertical and three horizontal sections. The central section features the Christian emblem, with three nails beneath it symbolizing the crucifixion of Jesus. The second level features three windows: the outer ones are smaller, adorned with arched lintels and surrounding relief carvings, while the central window is a rectangular shuttered opening. The overall design is simple yet solemn, showcasing the unique charm of religious art. |
| CoT | ![Uncaptioned image](https://arxiv.org/html/2605.15019v1/x16.png) | The exterior facade of this church displays a majestic Baroque style, spanning a width of 24.6 meters. The symmetrical bell towers at both ends feature glazed tile roofs. Three window openings in the left tower house large bronze bells, while the right tower also holds a bronze bell within its window openings. At the building’s central pediment, the Jesuit emblem is carved, with the ‘IHS’ above symbolizing the name of Jesus. The second level features three windows: the outer ones are smaller, adorned with arched lintels and surrounding relief decorations, while the central window comprises a large shuttered opening. The overall design harmoniously blends religious artistry with functionality, conveying the building’s solemnity and sacredness. |
| GranuRAG | ![Uncaptioned image](https://arxiv.org/html/2605.15019v1/x17.png) | The church’s main facade presents a massive wall 24.6 meters wide, divided into three vertical and three horizontal sections by two multi-tiered cornices and six fluted pilasters with multiple recesses. The pediment section of the building displays a quintessential Baroque style, adorned centrally with an exquisite Christian emblem. This emblem, rendered in symmetrical geometric patterns and relief carvings, serves as a common religious symbol found in Jesuit churches. Directly beneath the emblem, three nails symbolize the historical crucifixion of Jesus, one of Christianity’s most significant emblems. Above the emblem, the letters “IHS” derive from the first three letters of Jesus’ Greek name “$\mathrm{IHSOUS}$,” commonly used as an abbreviation for Jesus Christ, signifying “Jesus is the Savior of Humanity.” The second level features a larger rectangular shuttered window at the center flanked by smaller arched windows. The lintels bear curved ornamentation complemented by sculpted details. This layout harmonizes with Baroque architecture’s pursuit of symmetry and opulence while balancing natural light and aesthetic functionality. |

Table 11: The case study example.

### B.4 Element Description Generation

For each element $e \in E$, we prompt Qwen3-VL-Max to generate a factual paragraph of one to three sentences describing its visual characteristics, cultural significance, and historical context. We enforce three constraints during generation: factuality requires all statements to be source-grounded with no unverifiable claims; scope constrains the content to the specific element rather than general landmark background; and tone maintains a neutral encyclopedic style that avoids promotional language. Generated paragraphs are stored under their corresponding keys in $\mathrm{ED}$.

### B.5 Validation and Quality Control

Quality control combines automatic checks with human review. Automatic validation verifies key consistency ($\mathrm{keys}(\mathrm{ED}) = E$), ensures all fields are non-empty, and confirms orthographic normalization. Human reviewers then perform three additional passes: factuality review spot-checks generated descriptions against source materials and regenerates entries containing unsupported claims; style filtering removes promotional language and unattributed superlatives; and terminological harmonization ensures consistent wording for shared elements across landmarks.
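The automatic checks can be sketched as a function that returns a list of problems for one schema record. This is a minimal illustration under stated assumptions: the function name `validate_record` is hypothetical, and because the paper does not spell out its orthographic rules, the sketch substitutes a simple whitespace check for "orthographic normalization".

```python
from typing import Any, Dict, List

REQUIRED_FIELDS = (
    "Landmarks", "Summary", "Style", "Elements", "ElementDescriptions", "Image",
)


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the record passes."""
    problems: List[str] = []

    # Non-empty fields: every schema field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            problems.append(f"missing or empty field: {field}")

    elements = record.get("Elements") or []
    descriptions = record.get("ElementDescriptions") or {}

    # Key consistency: keys(ED) must equal E.
    if set(descriptions) != set(elements):
        missing = sorted(set(elements) - set(descriptions))
        extra = sorted(set(descriptions) - set(elements))
        problems.append(f"keys(ED) != E: missing={missing}, extra={extra}")

    # Each stored description must itself be non-empty.
    for key, paragraph in descriptions.items():
        if not str(paragraph).strip():
            problems.append(f"empty description for element: {key}")

    # Assumed orthographic rule: no leading, trailing, or repeated whitespace.
    for name in elements:
        if name != " ".join(name.split()):
            problems.append(f"non-normalized element key: {name!r}")

    return problems


# Placeholder record, not taken from the dataset.
good = {
    "Landmarks": "Example Church",
    "Summary": "Placeholder summary.",
    "Style": "Baroque",
    "Elements": ["Pediment", "Windows"],
    "ElementDescriptions": {"Pediment": "Placeholder.", "Windows": "Placeholder."},
    "Image": "images/example_church/rep.jpg",
}
bad = dict(good, ElementDescriptions={"Windows": "Placeholder."})

print(validate_record(good))  # []
print(validate_record(bad))   # ["keys(ED) != E: missing=['Pediment'], extra=[]"]
```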

### B.6 Ground-Truth Annotation Protocol

Human annotators refine LLM-proposed labels $\hat{E}(I)$ to produce $E^{\mathrm{gt}}(I)$ following strict visibility rules. Visual identifiability requires that elements qualify only if they are confidently recognizable from image pixels alone, without relying on prior landmark knowledge; for instance, an external tower invisible in an interior courtyard photo cannot be labeled. For partial occlusion handling, partially occluded elements are included only when discriminative visual cues remain present, such as a half-visible ornamental scroll whose characteristic shape remains recognizable. Synonym resolution requires annotators to select canonical keys when multiple inventory terms refer to near-identical visual components. Uncertain cases undergo ambiguity escalation, where they are marked for expert review or excluded to avoid false positives.

The annotation workflow presents each image $I$ alongside landmark metadata $(\mathrm{meta}, E, \mathrm{ED})$ and LLM-proposed candidates $\hat{E}(I)$. Annotators edit the proposals by adding missed but visible elements, removing hallucinated or non-verifiable elements, and correcting keys to canonical form. This protocol ensures ground-truth labels reflect genuine visual evidence, enabling fair evaluation while accounting for real-world photography ambiguities such as viewpoint limitations and occlusions.
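The three annotator edits (add, remove, canonicalize) amount to set operations on the proposed labels, with the constraint that the result stays inside the element inventory. The following sketch is an assumed formalization, not the authors' annotation tool; `refine_labels`, its parameters, and the example values are hypothetical.

```python
from typing import Dict, Iterable, Optional, Set


def refine_labels(
    proposed: Iterable[str],                    # LLM-proposed candidates E_hat(I)
    inventory: Set[str],                        # element inventory E
    add: Iterable[str] = (),                    # missed but visible elements
    remove: Iterable[str] = (),                 # hallucinated or non-verifiable elements
    canonical: Optional[Dict[str, str]] = None, # variant key -> canonical key
) -> Set[str]:
    """Apply annotator edits and return the ground-truth set E_gt(I)."""
    canon = canonical or {}

    def key(name: str) -> str:
        return canon.get(name, name)

    labels = {key(e) for e in proposed}   # correct keys to canonical form
    labels -= {key(e) for e in remove}    # drop elements that are not visible
    labels |= {key(e) for e in add}       # add elements the model missed

    # E_gt(I) must remain a subset of the inventory E.
    unknown = labels - inventory
    if unknown:
        raise ValueError(f"labels outside the element inventory: {sorted(unknown)}")
    return labels


# Illustrative edit session with placeholder values.
inventory = {"Twin Towers", "Windows", "Pediment", "Christian Emblem"}
ground_truth = refine_labels(
    proposed={"Twin Towers", "Window", "Christian Emblem"},
    inventory=inventory,
    add=["Pediment"],
    remove=["Twin Towers"],
    canonical={"Window": "Windows"},
)
print(sorted(ground_truth))  # ['Christian Emblem', 'Pediment', 'Windows']
```

Raising an error for any label outside the inventory keeps the subset relation between the ground-truth set and the inventory intact, which the coverage ratio and the element-level metrics both rely on.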

![Refer to caption](https://arxiv.org/html/2605.15019v1/x18.png)

Figure 9: Distribution of image counts per landmark. The dataset shows a single dominant mode around 11–30 images (48 of 71 landmarks, 67.6%), with only a few landmarks providing more than 40 views.

### B.7 Detailed coverage distribution

Beyond the aggregate statistics reported in the main text (mean 34%, median 29%), Figure [9](https://arxiv.org/html/2605.15019#A2.F9) shows the distribution of image counts per landmark, which reflects real-world photography patterns: most landmarks contain 11–30 images, while iconic locations exceed 40 images. The element coverage ratio $\text{Coverage}(I) = |E^{\mathrm{gt}}(I)| / |E|$ shows considerable variation. The interquartile range spans 0.18 to 0.47, meaning half of all images show between 18% and 47% of available elements. The distribution exhibits notable tail behavior: 12% of images show fewer than 15% of elements due to extreme close-up framing, while 5% show more than 70% of elements in comprehensive panoramas. Stratifying by view type reveals that close-up views average 22% coverage, mid-range views average 36%, and panoramic views average 58%.
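The coverage ratio and its summary statistics are straightforward to compute from the image-level records. The sketch below uses four toy views of a four-element inventory; the values are illustrative and do not reproduce the dataset statistics, the helper names are hypothetical, and the quartiles follow the default method of Python's `statistics.quantiles`, which may differ from the method used for the reported interquartile range.

```python
from statistics import mean, median, quantiles
from typing import Dict, List, Set


def coverage(visible: Set[str], inventory: Set[str]) -> float:
    """Coverage(I) = |E_gt(I)| / |E| for one image."""
    if not inventory:
        raise ValueError("element inventory E must be non-empty")
    if not visible <= inventory:
        raise ValueError("E_gt(I) must be a subset of E")
    return len(visible) / len(inventory)


def summarize(values: List[float]) -> Dict[str, float]:
    """Mean, median, and quartiles of per-image coverage values."""
    q1, _, q3 = quantiles(values, n=4)  # default 'exclusive' method
    return {"mean": mean(values), "median": median(values), "q1": q1, "q3": q3}


# Toy data: one landmark with four elements, seen from four viewpoints.
inventory = {"Twin Towers", "Windows", "Pediment", "Christian Emblem"}
views = [
    {"Windows"},                                  # close-up
    {"Windows", "Pediment"},                      # mid-range
    {"Windows", "Pediment", "Christian Emblem"},  # mid-range
    set(inventory),                               # panorama
]
values = [coverage(v, inventory) for v in views]
stats = summarize(values)
print(values)           # [0.25, 0.5, 0.75, 1.0]
print(stats["median"])  # 0.625
```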

This granular breakdown confirms that multi\-view aggregation is essential for our task\. No single view type reliably captures complete element inventories, which necessitates cross\-view evidence fusion for comprehensive retrieval\.
