Ground Then Rank: Revisiting Knowledge-Based VQA with Training-Free Entity Identification

arXiv cs.CL 06/24/26, 04:00 AM Papers
Summary
This paper proposes a training-free 'identify-before-answer' (IBA) framework for Knowledge-Based Visual Question Answering (KB-VQA) that decouples entity identification from evidence ranking, outperforming fine-tuned multi-modal retrieval-augmented generation baselines while reducing complexity.
arXiv:2606.23881v1 Announce Type: new Abstract: Knowledge-Based Visual Question Answering (KB-VQA) requires grounding visual queries to external knowledge beyond directly observable content in images. While recent multi modal large language models (MLLMs) show strong perceptual abilities, they struggle on KB-VQA tasks requiring groundings from both fine-grained entity and evidence levels. Most existing multi-modal retrieval augmented generation (MM-RAG) methods tightly couple entity discrimination and section-level evidence ranking into a single re-ranking stage, leading to high cost and limited generalization. In this work, we revisit existing MM-RAG solutions from a workflow perspective and argue both entity-level and fact-level groundings are key bottlenecks. We observe that although MLLMs often fail under open-ended entity naming, they can better identify the correct entity when selecting from a small set of candidate names. Based on this insight, we propose a simple and training-free identify-before-answer IBA framework that decouples entity identification from section-level re-ranking. Our approach prompts an MLLM to select high-confidence entities using only candidate names, followed by an off-the-shelf textual re-ranker for evidence selection. Experiments on Encyclopedic-VQA and InfoSeek show that our method consistently outperforms fine-tuned multi-modal re-ranking baselines while reducing training and inference complexity. Additional analyses reveal that the improvements arise not only from better entity identification, but also from selecting more informative evidence once correct entity is fixed. Our implementation is made public to ease reproducibility.
Original Article
View Cached Full Text
Cached at: 06/24/26, 07:43 AM
# Ground Then Rank: Revisiting Knowledge-Based VQA with Training-Free Entity Identification
Source: [https://arxiv.org/html/2606.23881](https://arxiv.org/html/2606.23881)
Qian Ma1,Qiong Wu2,Zhengyi Zhou2,Yao Ma1 1Rensselaer Polytechnic Institute 2AT&T Chief Data Office \{maq5,may13\}@rpi\.edu, \{qw6547,zz547k\}@att\.comThis work was initiated and done while the first author was an intern at AT&T CDO\. Qiong Wu and Yao Ma are co\-corresponding authors\.

###### Abstract

Knowledge\-Based Visual Question Answering \(KB\-VQA\) requires grounding visual queries to external knowledge beyond directly observable content in images\. While recent multi modal large language models \(MLLMs\) show strong perceptual abilities, they struggle on KB\-VQA tasks requiring groundings from both fine\-grained entity and evidence levels\. Most existing multi\-modal retrieval augmented generation \(MM\-RAG\) methods tightly couple entity discrimination and section\-level evidence ranking into a single re\-ranking stage, leading to high cost and limited generalization\. In this work, we revisit existing MM\-RAG solutions from a workflow perspective and argue both entity\-level and fact\-level groundings are key bottlenecks\. We observe that although MLLMs often fail under open\-ended entity naming, they can better identify the correct entity when selecting from a small set of candidate names\. Based on this insight, we propose a simple and training\-free*identify\-before\-answer IBA*framework that decouples entity identification from section\-level re\-ranking\. Our approach prompts an MLLM to select high\-confidence entities using only candidate names, followed by an off\-the\-shelf textual re\-ranker for evidence selection\. Experiments on Encyclopedic\-VQA and InfoSeek show that our method consistently outperforms fine\-tuned multi\-modal re\-ranking baselines while reducing training and inference complexity\. Additional analyses reveal that the improvements arise not only from better entity identification, but also from selecting more informative evidence once correct entity is fixed\. Our implementation is made public to ease reproducibility[https://github\.com/VAN\-QIAN/ACL26\-IBA/](https://github.com/VAN-QIAN/ACL26-IBA/)\.

Ground Then Rank: Revisiting Knowledge\-Based VQA with Training\-Free Entity Identification

Qian Ma1††thanks:This work was initiated and done while the first author was an intern at AT&T CDO\. Qiong Wu and Yao Ma are co\-corresponding authors\., Qiong Wu2, Zhengyi Zhou2, Yao Ma11Rensselaer Polytechnic Institute2AT&T Chief Data Office\{maq5,may13\}@rpi\.edu, \{qw6547,zz547k\}@att\.com

## 1Introduction

Knowledge\-Based Visual Question Answering \(KB\-VQA\) extends standard VQA by requiring external world knowledge beyond what is directly observable in the imageDenget al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib14)\); Kimet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib13)\)\. While recent multi\-modal large language models \(MLLMs\) achieve strong performance on perception\-driven VQA, KB\-VQA queries often hinge on identifying the correct real\-world entity and grounding fine\-grained factual information that cannot be inferred from pixels aloneMensinket al\.\([2023](https://arxiv.org/html/2606.23881#bib.bib18)\); Chenet al\.\([2023](https://arxiv.org/html/2606.23881#bib.bib19)\)\. This reliance on entity\-level knowledge makes KB\-VQA a challenging benchmark for multi\-modal intelligence\.

Modern MLLMsLiuet al\.\([2023](https://arxiv.org/html/2606.23881#bib.bib32),[2024](https://arxiv.org/html/2606.23881#bib.bib33)\); Baiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib34)\)have demonstrated remarkable progress on general VQA tasks, yet they remain unreliable on KB\-VQA where relevant knowledge is sparse, long\-tailed, and difficult to encode in model parametersKuanget al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib25)\); Denget al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib14)\)\. As a result, recent systems predominantly adopt a multi\-modal retrieval\-augmented generation \(MM\-RAG\) paradigmChenet al\.\([2022](https://arxiv.org/html/2606.23881#bib.bib49)\); Yuet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib50)\), which first retrieves a set of potentially relevant knowledge entries \(e\.g\., Wikipedia articles\) and then re\-ranks textual sections to support answer generation, as illustrated in Figure[1](https://arxiv.org/html/2606.23881#S3.F1)\.

Despite their success, we argue that existing MM\-RAG methods suffer from a fundamental limitation in their workflow\. Producing a correct and verifiable answer requires grounding at two distinct levels: \(i\)*entity\-level grounding*, ensuring that the retrieved context refers to the correct entity depicted in the image, and \(ii\)*section\-level grounding*, locating the passage within that entity’s article that is relevant to the question\. However, most existing approachesYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\); Cocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\); Tianet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib3)\)implicitly couple these two challenging tasks into a single re\-ranking step over all candidate sections\. This entangled formulation forces a single scoring function to simultaneously discriminate between entities and rank textual evidence, often leading to textually relevant but entity\-incorrect contexts, or correct entities paired with irrelevant sections\.

In contrast, humans naturally decouple these two steps when solving KB\-VQA problems\. After an initial retrieval or recall of plausible candidates, people typically first identify or narrow down the entity depicted in the image, and only then examine a small number of relevant articles to locate supporting evidence\. This decomposition reduces distractors and simplifies subsequent reasoning, suggesting a more principled paradigm\.

Motivated by this observation, we revisit the role of MLLMs in entity identification\. While directly naming an entity from an image remains challenging for current modelsCaronet al\.\([2024](https://arxiv.org/html/2606.23881#bib.bib41)\), we make a surprising empirical observation: simply providing candidate entity names enables MLLMs to identify the correct entity with much higher accuracy\. This suggests that MLLMs often possess incomplete yet usable entity knowledge that is difficult to exploit under open\-ended generation but can be effectively activated when the task is reframed as a constrained discrimination problem\. This behavior bears resemblance to a tip\-of\-the\-tongue\-like effectBrown and McNeill \([1966](https://arxiv.org/html/2606.23881#bib.bib48)\), which we use only as an intuitive analogy\. ToT describes a situation that human experts may also encounter that they have the expertise to the entity but can’t directly recall the name from scratch\. But once several plausible names \(*e\.g\.,*“Lapsana communis” and “Crepis tectorum” in Figure[1](https://arxiv.org/html/2606.23881#S3.F1)\) are presented, they can reason from visual cues with their expertise to select the correct one\.

Based on this insight, we propose a simple yet effective*IBA*framework for KB\-VQA\. Our approach explicitly inserts a lightweight entity identification step into the MM\-RAG workflow\. After initial retrieval, the MLLM scores candidate entities using only their names, retains a small subset of high\-confidence entities, and then applies a pre\-trained standard textual re\-ranker to select supporting sections within this narrowed scope\. This training\-free design decouples entity recognition from evidence selection, improving both accuracy and efficiency\.

Our contributions are summarized as follows:

- •To the best of our knowledge, we are the first to report the ‘tip\-of\-the\-tongue’ phenomenon in MLLMs for KB\-VQA, where providing candidate entity names significantly amplifies the model’s reasoning capability to better identify the entity in the query image\.
- •Based on this finding, we propose a simple yet effective framework that integrates an explicit identification step into existing MM\-RAG workflows, enhancing answer accuracy and computational efficiency without additional fine\-tuning or task\-specific training\.
- •We validate our approach on two mainstream KB\-VQA benchmarks, achieving new state\-of\-the\-art results while improving efficiency compared to existing MM\-RAG systems\.

## 2Related Works

### 2\.1MLLMs for KB\-VQA

Multimodal large language models \(MLLMs\) extend text\-only LLMs with visual encoders and cross\-modal alignment mechanisms, enabling joint reasoning over images and text\. Recent models such as LLaVALiuet al\.\([2023](https://arxiv.org/html/2606.23881#bib.bib32),[2024](https://arxiv.org/html/2606.23881#bib.bib33)\)and Qwen\-VLBaiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib34)\)achieve strong performance on perception\-driven VQA benchmarks, where answers can be inferred directly from visual content or broadly learned parametric knowledge\.

However, emerging evaluationsLiet al\.\([2024](https://arxiv.org/html/2606.23881#bib.bib37)\); Tanet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib38)\)consistently show that even state\-of\-the\-art MLLMs underperform on knowledge\-based VQA \(KB\-VQA\) tasks that require fine\-grained, entity\-centric, or long\-tail encyclopedic knowledge\. This limitation motivates augmenting MLLMs with external knowledge sources to support explicit grounding and reasoning beyond their parametric capacity

### 2\.2KB\-VQA and MM\-RAG based solutions

KB\-VQA benchmarks extend conventional VQA by requiring external knowledge not contained in the image alone, such as entity attributes or encyclopedic facts\. Early datasets such as OK\-VQAMarinoet al\.\([2019](https://arxiv.org/html/2606.23881#bib.bib16)\)and A\-OKVQASchwenket al\.\([2022](https://arxiv.org/html/2606.23881#bib.bib17)\)emphasize commonsense or general knowledge, which increasingly falls within the training scope of large\-scale MLLMs\.

More recent benchmarks, including Encyclopedic\-VQA \(E\-VQA\)Mensinket al\.\([2023](https://arxiv.org/html/2606.23881#bib.bib18)\)and InfoSeekChenet al\.\([2023](https://arxiv.org/html/2606.23881#bib.bib19)\), raise the difficulty by requiring explicit grounding to fine\-grained Wikipedia entities and supporting sections\. To address these challenges, many methods adopt multimodal retrieval\-augmented generation \(MM\-RAG\), typically consisting of a retriever, a re\-ranking stage, and an answer generator\. Representative approaches such as EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\), ReflectiVACocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\), and CoRe\-MMRAGTianet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib3)\)differ in how relevance is assessed, ranging from explicitly trained multimodal re\-rankers to relevance implicitly learned through fine\-tuning\. Despite their success, these methods generally couple entity discrimination and section selection into a single re\-ranking process, which can be costly to train and sensitive to data availability\.

### 2\.3Positioning of our work

In contrast to prior MM\-RAG methods, our work revisits the KB\-VQA pipeline from a workflow perspective\. Rather than entangling entity identification and section\-level evidence selection within a single re\-ranking module, we explicitly decouple these two stages\. By introducing a lightweight identification step before section re\-ranking, our approach isolates the entity\-level grounding problem and leverages off\-the\-shelf components without requiring task\-specific re\-ranker training or MLLM fine\-tuning\. This design differs fundamentally from prior approaches that rely on learned multimodal relevance functions, and enables more interpretable and transferable KB\-VQA pipelines across datasets with varying knowledge distributions\.

## 3Methodology

In this section, we present our training\-free IBA \(Identify Before Answer\) framework, which explicitly decouples entity\-level identification from section\-level evidence selection\. We first introduce the overall workflow and then describe each component in detail, including problem formulation, initial retrieval, identify\-before\-re\-rank, and answer generation\.

![Refer to caption](https://arxiv.org/html/2606.23881v1/x1.png)Figure 1:Overall workflow comparison between existing MM\-RAG methods and our proposed IBA\.Upper: Existing MM\-RAG pipelines retrieve top\-KKcandidate articles from a large knowledge base and directly perform section\-level re\-ranking using trained re\-rankers or fine\-tuned VLMs, before generating the answer from the top\-ranked context\.Lower: Our training\-free IBA inserts an explicit entity identification step before re\-ranking\. Given the query image and retrieved candidate names, the VLM selects a small set of high\-confidence entities, which are then used to narrow the scope of section\-level re\-ranking with an off\-the\-shelf textual re\-ranker\.### 3\.1Problem Formulation

Given a query imageIIand questionQQ, a KB\-VQA system aims to generate an answeryyby grounding external knowledge\. In retrieval\-augmented generation, this is typically achieved by selecting a supporting text snippet from an external knowledge base\.

We model the knowledge base asKB=\{\(Pi,Ii\)\}i=1NKB=\\\{\(P\_\{i\},I\_\{i\}\)\\\}\_\{i=1\}^\{N\}, where each pagePiP\_\{i\}consists of multiple textual sectionsPi=\{Si,j\}j=1niP\_\{i\}=\\\{S\_\{i,j\}\\\}\_\{j=1\}^\{n\_\{i\}\}and is associated with a representative imageIiI\_\{i\}\. Due to the large scale of the knowledge base \(NNcan be millions\), practical systems first retrieve a small candidate set and then perform fine\-grained re\-ranking\. The objective of KB\-VQA is to select the most relevant sectionSi,jS\_\{i,j\}that provides sufficient evidence to support generating the correct answeryy\.

### 3\.2Initial Retrieval

The goal of initial retrieval is to obtain a small set of candidate knowledge entries from a massive external knowledge base\. This coarse\-grained step ensures tractability by narrowing the search space from millions of entries to a manageable top\-KKset\.

Following prior workYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\); Cocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\); Tianet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib3)\), we adopt an image\-to\-image retrieval strategy\. Each knowledge base page is indexed using a frozen EVA\-CLIP\-8B vision encoderSunet al\.\([2024](https://arxiv.org/html/2606.23881#bib.bib42)\)\. Image embeddings are pooled from the final layer and indexed using FAISSDouzeet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib44)\)\. Given a query image, cosine similarity is used to retrieve the top\-KKvisually similar candidate pages, which are then passed to subsequent identify\-before\-re\-rank stage\.

### 3\.3Identify\-Before\-Re\-Rank

After initial retrieval, existing MM\-RAG methods directly perform section\-level re\-ranking over allKKcandidate entries\. In contrast, we explicitly introduce an entity\-level identification step to further reduce the re\-ranking scope\.

As we have discussed in Section[1](https://arxiv.org/html/2606.23881#S1), one of the key challenges in KB\-VQA is entity identification to secure grounding at the entity level\. Even for human experts, directly naming the exact real\-world entity depicted in an image can be non\-trivial, especially when visual cues are subtle or the entity belongs to a fine\-grained category\. This difficulty is further amplified for MLLMs under open\-ended generation settings, where the model must produce the correct entity name from a vast output space without explicit constraints\.

At the same time, modern MLLMs have been trained on large\-scale, high\-quality corpora that include extensive encyclopedic knowledge, much of which originates from Wikipedia\-style resources\. Such training endows models with latent expertise that can support entity recognition, but this expertise is often difficult to reliably elicit through unconstrained generation\. We hypothesize that the challenge lies not in the absence of knowledge, but in the form of the task: open\-ended entity naming imposes a high uncertainty burden, whereas selecting the correct entity from a small candidate set is a more tractable discriminative problem\. As empirically validated in Section[4\.4](https://arxiv.org/html/2606.23881#S4.SS4), prompting MLLMs to select from candidate entities yields substantially higher identification accuracy than open\-ended naming\. This behavior is analogous to the human tip\-of\-the\-tongue phenomenonBrown and McNeill \([1966](https://arxiv.org/html/2606.23881#bib.bib48)\), where experts may struggle to recall an exact name spontaneously but can readily identify the correct option when presented with a shortlist of plausible candidates\.

Motivated by this observation, we design the identification step by prompting the MLLM with a list of candidate entity names, rather than asking it to generate the entity name freely\. This formulation allows the model to focus on assessing relative relevance among plausible candidates, effectively activating its latent encyclopedic knowledge while avoiding the brittleness of open\-ended generation\.

Accordingly, given the retrieved candidate set\{\(Pi,Ii\)\}i=1K\\\{\(P\_\{i\},I\_\{i\}\)\\\}\_\{i=1\}^\{K\}, we prompt the MLLM with: \(i\) query imageII, \(ii\) textual names of allKKcandidate entries, and \(iii\) their initial visual retrieval similarity scores\.

The MLLM is asked to assess entity relevance and select the top\-jjcandidates \(j<Kj<K\)\. This produces an identification confidence scoreID\(Pi\)\\mathrm\{ID\}\(P\_\{i\}\)for each candidate entry, reflecting the model’s belief that the entity depicted in the image corresponds toPiP\_\{i\}\.

For each identified entry, we compute textual relevance between the questionQQand each sectionSi,jS\_\{i,j\}using a pre\-trained textual re\-ranker such as BGEChenet al\.\([2024](https://arxiv.org/html/2606.23881#bib.bib52)\)\. The re\-ranker outputs a normalized textual relevance scoreT\(Si,j\)∈\[0,1\]\\mathrm\{T\}\(S\_\{i,j\}\)\\in\[0,1\]\. Unlike multimodal re\-rankers used in prior workYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\), this component is used off\-the\-shelf without task\-specific training\.

We combine identification confidence, visual similarity from initial retrieval, and textual relevance to compute a final score:

score\(Si,j\)=α⋅ID\(Pi\)\+β⋅V\(I,Ii\)\+γ⋅T\(Si,j\),\\mathrm\{score\}\(S\_\{i,j\}\)=\\alpha\\cdot\\mathrm\{ID\}\(P\_\{i\}\)\+\\beta\\cdot\\mathrm\{V\}\(I,I\_\{i\}\)\+\\gamma\\cdot\\mathrm\{T\}\(S\_\{i,j\}\),whereα\\alpha,β\\beta, andγ\\gammacontrol the contribution of each signal\. For E\-VQA, we set\(α,β,γ\)=\(0\.5,0\.5,1\)\(\\alpha,\\beta,\\gamma\)=\(0\.5,0\.5,1\)\. For InfoSeek, we increaseα\\alphato emphasize entity identification due to weaker visual alignment between query images and knowledge base images\. The sensitivity analysis in Section[4\.4](https://arxiv.org/html/2606.23881#S4.SS4)reveals that the final re\-ranking outcome is not sensitive to the hyper\-parameters combination, since setting all to 1 has modest degradation\. The top\-ranked section is selected as supporting context for answer generation\.

Overall, the identify\-before\-re\-rank design provides a simple yet effective alternative to existing MM\-RAG pipelines\. By explicitly decoupling entity\-level identification from section\-level evidence selection, our framework avoids the need for training specialized multimodal re\-rankers or fine\-tuning large vision–language models\. This decoupling also improves interpretability, as the contributions of visual similarity, entity identification, and textual relevance can be examined independently\. Moreover, restricting section\-level scoring to a small set of identified entities substantially reduces computational cost and context length, leading to more efficient inference\. Finally, because our method relies only on off\-the\-shelf components, it generalizes naturally across different knowledge bases and KB\-VQA benchmarks without dataset\-specific adaptation\.

### 3\.4Answer Generation

Once the top\-ranked supporting section is obtained, we use off\-the\-shelf large language models to generate the final answer\. Our framework does not require fine\-tuning the generation model, making it flexible across different backbones\. Compared with prior approaches that rely on fine\-tuned multi\-modal generatorsCocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\); Tianet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib3)\), our method improves answer quality by providing more precise and entity\-grounded context, rather than modifying the generation model itself\.

## 4Experiments

### 4\.1Datasets and External Knowledge Base

We evaluate on two KB\-VQA benchmarks: Encyclopedic VQA \(E\-VQA\)Mensinket al\.\([2023](https://arxiv.org/html/2606.23881#bib.bib18)\)and InfoSeekChenet al\.\([2023](https://arxiv.org/html/2606.23881#bib.bib19)\), where answering requires external knowledge beyond the query image\. E\-VQA contains 221K image\-question pairs \(up to five images per question\) and covers both single\-hop and two\-hop questions\. Following prior workYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\), we focus on the single\-hop setting\. Importantly, E\-VQA provides a controlled knowledge base of 2M Wikipedia articles with associated images, ensuring that each QA pair is answerable when the correct article is retrieved\.

InfoSeek contains 1\.3M QA pairs over 11K visual entities from OVENHuet al\.\([2023](https://arxiv.org/html/2606.23881#bib.bib46)\)\. Following existing MM\-RAG baselinesYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\); Tianet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib3)\); Cocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\), we adopt the 100K\-article knowledge base released byYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\)and report results on the validation split following the same settings\.

### 4\.2Evaluation Metrics

Retrieval\.We report Recall@KK, which measures whether the ground\-truth article appears in the top\-KKretrieved candidates\. Following prior workYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\); Cocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\), a retrieved article is counted as correct only if its URL exactly matches the target page URL\. We report Recall@1 as a proxy for*top\-1 entity selection accuracy*after re\-ranking, reflecting how well a method prioritizes correct entity among retrieved candidates\.

Answer generation\.For E\-VQA, we evaluate open\-ended answers using the BEM scoreZhanget al\.\([2019](https://arxiv.org/html/2606.23881#bib.bib47)\)\. For InfoSeek, we follow prior practiceYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\); Cocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\)and use VQA accuracyGoyalet al\.\([2017](https://arxiv.org/html/2606.23881#bib.bib6)\); Marinoet al\.\([2019](https://arxiv.org/html/2606.23881#bib.bib16)\)for time and numerical questions, and BEM score for string questions\.

### 4\.3Implementation Details

Retriever and candidate set\.Following prior workYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\); Cocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\), we use EVA\-CLIP\-8BSunet al\.\([2024](https://arxiv.org/html/2606.23881#bib.bib42)\)for image\-to\-image retrieval and retrieve the top\-K=20K\{=\}20candidate articles from the knowledge base released byYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\)\.

Identifier and Answer generators\.After initial retrieval, Qwen\-2\.5\-VL\-7B\-Instruct is deployed to implement the explicit identification step\. We instantiate our pipeline with two off\-the\-shelf backbones for answer generation: Llama\-3\.1\-8B\-Instruct and Qwen\-2\.5\-VL\-7B\-Instruct \(also used for identification\)\.

Baselines\.For EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\), we follow the original pipeline and apply its released multimodal re\-ranker to re\-rank sections from the same top\-KKretrieved candidates, using the default weighting between initial retrieval similarity and re\-ranking scores\. For ReflectiVACocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\), we run the officially released model to produce REL tokens and generate answers by conditioning on sections assigned REL tokens within the top\-5 retrieved articles\.

Zero\-shot variants\.Following the two\-stage prompting design in Core\-MMRAGTianet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib3)\), we implement several zero\-shot variants to probe the role of workflow design\. Given the top\-5 retrieved articles,*1Stage*prompts the MLLM to directly answer with all articles as context, while*2Stage*first selects the most relevant article and then generates the answer conditioned on that article only\. We also include*Para*\(no external evidence\) and*Article*\(directly use the top\-5 retrieved articles without explicit re\-ranking\) variants\.

### 4\.4Retrieval Results

The retrieval results on InfoSeekChenet al\.\([2023](https://arxiv.org/html/2606.23881#bib.bib19)\)and E\-VQAMensinket al\.\([2023](https://arxiv.org/html/2606.23881#bib.bib18)\)are reported in Tables[1](https://arxiv.org/html/2606.23881#S4.T1)and[2](https://arxiv.org/html/2606.23881#S4.T2)\. EVA\-CLIP retrieval yields moderate Recall@20 but much lower Recall@1, showing that while the correct entity is usually present among candidates, it is rarely ranked first by visual similarity alone \(e\.g\., E\-VQA Recall@1: 13\.4%, Recall@20: 48\.8%\)\. This highlights the need for a re\-ranking stage to better prioritize the correct entity\.

After applying re\-ranking, both EchoSight and our proposed IBA substantially improve Recall@1, confirming the importance of re\-ranking for entity prioritization\. On InfoSeek, our method outperforms EchoSight by a clear margin, achieving Recall@1 of 58\.4% compared to 53\.1% \(Table[1](https://arxiv.org/html/2606.23881#S4.T1)\)\. This improvement indicates that the explicit identify\-before\-re\-rank design is more effective at prioritizing the correct entity from visually similar candidates\.

Table 1:InfoSeek retrieval results\. EVA\-CLIP denotes the initial image\-to\-image retrieval using EVA\-CLIP\-8B\. EchoSight applies its trained multimodal re\-ranker on top of the same retrieved candidates\. Our method prompts the MLLM to select the top\-3 entities from the 20 retrieved candidate names\.MethodInfoSeek Recall@kk=1k=3k=5k=10k=20EVA\-CLIP45\.663\.168\.674\.677\.9EchoSight53\.169\.473\.977\.477\.9Our IBA58\.472\.4\-\-\-On E\-VQA, EchoSight achieves slightly higher Recall@1 \(36\.5%\) than our method \(35\.5%\), as shown in Table[2](https://arxiv.org/html/2606.23881#S4.T2)\. This outcome is expected, as EchoSight is specifically trained on E\-VQA using curated positive supervision\. In contrast, our approach is entirely training\-free and directly transferable across datasets\. Despite this small gap in Recall@1, the downstream generation results \(Section[4\.5](https://arxiv.org/html/2606.23881#S4.SS5)\) show that higher identification accuracy alone does not guarantee better answer quality\.

Table 2:E\-VQA retrieval results\. EVA\-CLIP denotes the initial image\-to\-image retrieval using EVA\-CLIP\-8B\. EchoSight applies its trained multimodal re\-ranker on top of the retrieved candidates\. Our method prompts the MLLM to select the top\-3 entities from the 20 retrieved candidate names\.MethodE\-VQA Recall@kk=1k=3k=5k=10k=20EVA\-CLIP13\.426\.131\.941\.848\.8EchoSight36\.545\.347\.948\.848\.8Our IBA35\.543\.3\-\-\-Grounded\-subset identification analysis\.To directly probe entity identification \(independent of answer generation\), we evaluate on grounded subsets where each question is guaranteed to be answerable from its ground\-truth entity page\. On E\-VQA, our method correctly identifies the ground\-truth entity for 934/2,322 grounded questions \(40\.2%\), compared to 593/2,322 \(25\.5%\) under open\-ended entity naming\. On InfoSeek, we randomly sample 1,000 validation questions, among which 790 are grounded; our method identifies correctly for 578/790 \(73\.2%\) versus 362/790 \(45\.8%\) under open\-ended naming\. These results further support providing candidate entity names substantially amplifies MLLM\-based identification\. To demonstrate that this phenomenon is not specific to a single model, we further tested an advanced proprietary model, GPT\-5\.2OpenAI \([2025](https://arxiv.org/html/2606.23881#bib.bib66)\)\. On the grounded subset of E\-VQA, GPT\-5\.2’s identification ratio improved from 23\.4% to 58\.0% by providing textual options\.

Sensitivity Analysis\. We conduct a small sensitivity analysis by setting all score\-fusion weights \(ID score, visual similarity, text relevance\) to 1, removing dataset\-specific tuning\. Under this setting, Recall@1 is 34\.9% on E\-VQA \(vs\. 35\.5%\) and 56\.3% on InfoSeek \(vs\. 58\.4%\)\. The drops are modest \(−0\.6%\-0\.6\\%and−2\.1%\-2\.1\\%\), indicating limited sensitivity to weight choices\. Even without tuning, IBA remains competitive with EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\)\(53\.1% on InfoSeek\), whose re\-ranker requires supervised training\. This suggests that the improvement is largely attributable to the workflow design rather than hand\-tuned weight optimization\.

### 4\.5Generation Results

We evaluate answer generation quality on InfoSeek and E\-VQA and compare our training\-free pipeline with finetuned MM\-RAG baselines and zero\-shot variants \(Table[3](https://arxiv.org/html/2606.23881#S4.T3)\)\.

Table 3:Answer generation results on E\-VQA and InfoSeek\.InfoSeekUnseen QuestionUnseen EntityMethodsBackboneOveralltimenumstringtimenumstringE\-VQAOur proposed IBAIBA\-QwenQwen\-2\.5\-VL\\ul37\.234\.57\.847\.540\.16\.645\.243\.6IBA\-LLaVALlama\-3\.1\-8B37\.838\.713\.545\.944\.512\.844\.4\\ul43\.2Retrieve Augmented Models requiring Fine\-tuningReflectiVALlama\-3\.1\-8B36\.429\.710\.445\.636\.512\.143\.238\.6EchoSightLlama\-3\.1\-8B33\.824\.912\.341\.737\.511\.339\.841\.9Zero\-shot base modelsPara\-LlavaLlama\-3\.1\-8B9\.00\.90\.512\.42\.40\.510\.213\.3Para\-QwenQwen\-2\.5\-VL25\.510\.90\.035\.512\.10\.032\.921\.2Zero\-shot with RetrievalArticle\-llavaLlama\-3\.1\-8B18\.40\.00\.026\.40\.00\.024\.123\.0Article\-qwenQwen\-2\.5\-VL34\.514\.30\.246\.812\.10\.545\.935\.6Zero\-shot with Re\-rank1Stage\-llavaLlama\-3\.1\-8B10\.50\.00\.015\.00\.00\.013\.94\.01Stage\-QwenQwen\-2\.5\-VL34\.632\.71\.544\.840\.12\.144\.434\.32Stage\-llavaLlama\-3\.1\-8B27\.26\.90\.938\.55\.00\.236\.423\.12Stage\-QwenQwen\-2\.5\-VL35\.98\.40\.049\.04\.50\.348\.534\.1RAG vs\. purely parametric MLLMs\.Across both backbones, retrieve\-augmented methods substantially outperform purely parametric variants \(*Para\-\**\), highlighting that external evidence is essential for KB\-VQA\. On InfoSeek,*Para\-LLaVA*and*Para\-Qwen*achieve 9\.0 and 25\.5 overall, while retrieve\-augmented variants reach 30–38\. On E\-VQA,*Para\-LLaVA*and*Para\-Qwen*obtain 13\.3 and 21\.2, and our proposed IBA exceed 43\.

Training\-free identify\-before\-answer vs\. finetuned MM\-RAG\.Our training\-free pipeline outperforms finetuned baselines on both datasets\. On InfoSeek, our*IBA\-LLaVA*attains the best overall score \(37\.8\), followed by*IBA\-Qwen*\(37\.2\), both surpassing ReflectiVA \(36\.4\) and EchoSight \(33\.8\)\. On E\-VQA,*IBA\-Qwen*and*IBA\-LLaVA*achieve 43\.6 and 43\.2, outperforming EchoSight \(41\.9\) and ReflectiVA \(38\.6\)\. Notably, our method requires no task\-specific training or additional parameters, while EchoSight and ReflectiVA rely on finetuned components\. A closer look at InfoSeek question types shows that our improvements are consistent on time questions, where correct entity grounding and evidence selection are critical\. For example, on unseen\-question time queries,*IBA\-LLaVA*achieves 38\.7 versus 29\.7 \(ReflectiVA\) and 24\.9 \(EchoSight\), and on unseen\-entity time queries it achieves 44\.5 versus 36\.5 and 37\.5, respectively\. More qualitative results are shown in Appendix[B](https://arxiv.org/html/2606.23881#A2)\.

### 4\.6Ablation Study

Zero\-shot MLLM\.To test whether an off\-the\-shelf MLLM can solve KB\-VQA by prompting alone, we followTianet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib3)\)and compare four zero\-shot variants that differ in how retrieved evidence is used\.*Para\-\**answers using only parametric knowledge \(no external evidence\)\.*Article\-\**answers in one step given the full text of the top\-5 retrieved articles\.*1Stage\-\**further adds the retrieved entry images and explicitly asks the model to use the most relevant reference, effectively requiring implicit multimodal re\-ranking inside a single long prompt\.*2Stage\-\**decomposes this into two prompts: the model first selects the most relevant article and then generates answer conditioned on that article only\.

External evidence helps, but prompting is not a reliable re\-ranker\.Across both datasets, moving from*Para\-\**to*Article\-\**yields large gains, confirming that KB\-VQA cannot be solved reliably from parametric knowledge alone\. For LLaVA, InfoSeek improves from 9\.0 \(*Para\-LLaVA*\) to 18\.4 \(*Article\-LLaVA*\), and E\-VQA improves from 13\.3 to 23\.0\. For Qwen, InfoSeek improves from 25\.5 \(*Para\-Qwen*\) to 34\.5 \(*Article\-Qwen*\), and E\-VQA improves from 21\.2 to 35\.6\. However, asking the model to implicitly re\-rank evidence within a single prompt is unstable:*1Stage\-LLaVA*collapses to 10\.5 on InfoSeek and 4\.0 on E\-VQA, despite using more information than*Article\-LLaVA*\.

Two\-stage prompting helps, but still trails a structured pipeline\.Decomposing the interaction partially mitigates the above issue:*2Stage\-LLaVA*recovers to 27\.2 on InfoSeek and 23\.1 on E\-VQA, yet remains far behind our full pipeline \(37\.8 and 43\.2, respectively\)\. For Qwen, explicit re\-ranking provides limited or inconsistent gains:*1Stage\-Qwen*and*2Stage\-Qwen*reach 34\.6/35\.9 on InfoSeek \(vs\. 34\.5 for*Article\-Qwen*\) and slightly underperform on E\-VQA \(34\.3/34\.1 vs\. 35\.6\)\. Overall, these results suggest that while strong backbones can exploit long retrieved text, prompting alone does not reliably perform evidence selection, motivating an explicit and lightweight identification\-and\-reranking workflow\.

Q: What does this fish eat?

![Refer to caption](https://arxiv.org/html/2606.23881v1/images/context/EVQA-402.png)

EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\):

Walleye is a nocturnal feeder, which means it feeds at night\. It is most easily caught using live minnows or lures that mimic small fish\. This implies that the walleye’sdiet consists of small fish, likely minnows\.✗

\#\# SELECTED Section Title: As food

Walleye is considered to be a palatable freshwater fish, consequently, is fished recreationally and commercially for food\. Because of its nocturnal feeding habits, it is most easily caughtat night using live minnows or lures that mimic small fish…

IBA \(Ours\): Walleyes eat a variety of food sources depending on their age and development\. Here are specific food sources mentioned:\* Invertebrates, such as fly larvæ and zooplankton, when the young walleyes are absorbing their yolk sac\.\* Fish, such as yellow perch or ciscoes, when they become piscivorous after 40\-60 days\.\* Crayfish, minnows, and leeches, in addition to fish, when they become adults\.✓

\#\# SELECTED Section Title: Reproduction

… Once the yolk has been fully absorbed, the young walleyesbegin to feed on invertebrates, such as fly larvæ and zooplankton\.After 40 to 60 days, juvenile walleyes become piscivorous\.Thenceforth, both juvenile and adult walleyes eat fish almost exclusively, frequently yellow perch or ciscoes, moving onto bars and shoals at night to feed\.…

Figure 2:Qualitative result E\-VQA, where we compare the answers provided by IBA with EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\)that more informative context is selected by our IBA\.EchoSight’s multimodal re\-ranker vs\. our proposed re\-rank after identification\.To disentangle the effect of entity identification from section selection, we evaluate several re\-ranking variants on the E\-VQA grounded subset, where the correct entity page is guaranteed to contain the answer\. We compare \(i\) EchoSight’s original pipeline, \(ii\) our full identify\-before\-answer model, and two hybrids that keep our identification module but replace our textual re\-ranker with EchoSight’s released multimodal re\-ranker, either without \(*IBA\- MIS*\) or with \(*IBA \+ MIS*\) incorporating the MLLM identification score \(MIS\) into the final ranking\. Table[4](https://arxiv.org/html/2606.23881#S4.T4)summarizes the results\.

Table 4:Ablation on E\-VQA grounded questions comparing re\-ranking choices after entity identification\. IBA\-/\+ MIS replace our textual re\-ranker with EchoSight’s released multimodal re\-ranker, without /with using the MLLM identification score \(MIS\) in final ranking\.MethodsIdentified RatioScoreEchoSight77\.866\.4IBA−\-MIS w EchoSight reranker73\.663\.2IBA\+\+MIS w EchoSight reranker75\.162\.2IBA full75\.770\.5We draw two observations from Table[4](https://arxiv.org/html/2606.23881#S4.T4)\. First, replacing our textual re\-ranker with EchoSight’s multi\-modal re\-ranker consistently reduces answer quality, even when the identified ratio is comparable \(around 73–78%\)\. For example, EchoSight and the two hybrid variants obtain scores of 66\.4/63\.2/62\.2, all below our method \(70\.5\)\. Second, incorporating MIS does not improve the hybrid:*IBA \+ MIS*slightly increases the identified ratio over*IBA– MIS*\(75\.1 vs\. 73\.6\) but further lowers the final score \(62\.2 vs\. 63\.2\)\. These results suggest that, once the entity is \(mostly\) correct, effective KB\-VQA hinges on selecting*informative*sections rather than visually plausible but weak evidence\. To further analyze this effect, Table[5](https://arxiv.org/html/2606.23881#S4.T5)breaks down performance on grounded questions by whether EchoSight and our method identify the ground\-truth entity\.

Table 5:Breakdown results on E\-VQA grounded questions\. Our IBA achieves better overall results mainly from selecting more informative sections than EchoSight\. Some samples are shown in Figure[2](https://arxiv.org/html/2606.23881#S4.F2)and[5](https://arxiv.org/html/2606.23881#A3.F5)\.Both✓\(59\.3%\)Our✓\(13\.4%\)Echo✓\(15\.5%\)Both✗\(11\.8%\)OverallEcho79\.622\.880\.630\.966\.4IBA88\.283\.623\.927\.670\.5Although EchoSight attains a slightly higher identified ratio \(77\.8% vs\. 75\.7%\), our method achieves a higher overall answer score \(70\.5 vs\. 66\.4\)\. When both methods identify the correct entity \(59\.3% of questions\), our score is substantially higher \(88\.2 vs\. 79\.6\), indicating more informative section selection given the same entity\. Our advantage is even more pronounced when only our method identifies the correct entity \(13\.4%\): we maintain a strong score \(83\.6\) while EchoSight often fails to provide useful evidence \(22\.8\)\. Conversely, EchoSight performs better only when it identifies the correct entity and we do not \(15\.5% of questions; 80\.6 vs\. 23\.9\)\. When both miss the entity \(11\.8%\), both methods perform poorly\.

To reflect the evidence quality in a more straightforward way, we further check the direct evidence hits\. For the automatically generated subset of E\-VQA \(2,750 questions\), where evidence annotations are available, we check Evidence Hit \(if the evidence selected by method exactly matches the annotated evidence\) in addition to Entity Matching as shown in Table[6](https://arxiv.org/html/2606.23881#S4.T6)\.

Table 6:Direct evidence hit results on E\-VQA automatically generated subset in percentage\. While the entity match is lower than EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\), our proposed IBA achieves higher direct evidence hit ratio\.MethodEntity MatchEvidence HitIBA37\.133\.8EchoSight39\.330\.2While entity match is lower, IBA achieves higher Evidence Hit\. This pattern aligns with our decomposition analysis in Table[5](https://arxiv.org/html/2606.23881#S4.T5), where we show that correct entity identification alone does not guarantee correct answers without proper evidence grounding\. Together, these results indicate that the performance gain of IBA does not stem solely from entity recall, but from improved evidence\-level grounding enabled by explicitly decoupling identification from ranking\.

Overall, these ablations indicate that our gains are not solely due to entity identification\. Explicitly decoupling identification from purely textual re\-ranking leads to more reliable selection of supportive evidence\. Even when the same entity is retrieved, our approach tends to choose sections that directly contain the facts required by the question, whereas a trained multimodal re\-ranker can favor visually plausible but less informative sections\. Qualitative examples in Figures[2](https://arxiv.org/html/2606.23881#S4.F2)and[5](https://arxiv.org/html/2606.23881#A3.F5)further illustrate this difference\.

## 5Conclusion

In this work, we revisit KB\-VQA from a workflow perspective and identify entity and evidence level groundings as critical bottlenecks\. While recent MLLMs possess substantial encyclopedic knowledge, we show that this knowledge is difficult to reliably elicit under open\-ended entity naming\. Instead, MLLMs exhibit significantly stronger identification ability when selecting from a small set of candidate entities\. Motivated by this observation, we propose a simple and training\-free identify\-before\-answer framework that explicitly decouples entity identification from section\-level evidence selection\. By prompting MLLMs with candidate entity names and leveraging an off\-the\-shelf textual re\-ranker for evidence selection, our approach avoids the need for specialized multi\-modal re\-ranker training while remaining interpretable and robust\. Extensive experiments on E\-VQA and InfoSeek demonstrate that this decomposition consistently outperforms finetuned MM\-RAG baselines\. Our analyses also reveal that effective KB\-VQA depends not only on identifying the correct entity, but also on selecting informative supporting evidence once the entity is fixed\. We hope this work encourages future research to rethink retrieval\-augmented reasoning pipelines by explicitly separating distinct grounding and selection stages, and to explore more lightweight designs for knowledge\-intensive multi\-modal reasoning\.

## Acknowledgements

This research is supported by the National Science Foundation \(NSF\) under grant numbers NSF\-2406647 and NSF\-2406648\. It is also supported by the National Artificial Intelligence Research Resource \(NAIRR\) Pilot and the Delta advanced computing and data resource, which is supported by the National Science Foundation under award NSF\-OAC\-2005572\.

## Limitations

Although our proposed IBA surpasses existing fine\-tuning baselines and demonstrates impressive performance on Knowledge\-based VQA like Encyclopedic\-VQA and InfoSeek, we note the following limitation that there is a dependence on the external knowledge base\. In real\-world application, it could be possible that the knowledge base is not perfect to include all supporting evidence for answering questions\. However, there are emerging works focused on integrating agentic workflow into the VQA tasksJianget al\.\([2024](https://arxiv.org/html/2606.23881#bib.bib57)\); Wuet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib58)\); Fuet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib56)\)at the retrieval stage\. Specifically, they will invoke external search tools*i\.e\.,*the whole Internet will be considered as the external knowledge base, which could be a promising direction for future works\.

## References

- J\. Alayrac, J\. Donahue, P\. Luc, A\. Miech, I\. Barr, Y\. Hasson, K\. Lenc, A\. Mensch, K\. Millican, M\. Reynolds,et al\.\(2022\)Flamingo: a visual language model for few\-shot learning\.Advances in neural information processing systems35,pp\. 23716–23736\.Cited by:[§A\.1](https://arxiv.org/html/2606.23881#A1.SS1.p1.1)\.
- Vqa: visual question answering\.InProceedings of the IEEE international conference on computer vision,pp\. 2425–2433\.Cited by:[§A\.2](https://arxiv.org/html/2606.23881#A1.SS2.p1.1)\.
- S\. Bai, K\. Chen, X\. Liu, J\. Wang, W\. Ge, S\. Song, K\. Dang, P\. Wang, S\. Wang, J\. Tang,et al\.\(2025\)Qwen2\. 5\-vl technical report\.arXiv preprint arXiv:2502\.13923\.Cited by:[§A\.1](https://arxiv.org/html/2606.23881#A1.SS1.p1.1),[§A\.2](https://arxiv.org/html/2606.23881#A1.SS2.p1.1),[§A\.3](https://arxiv.org/html/2606.23881#A1.SS3.p3.1),[Appendix H](https://arxiv.org/html/2606.23881#A8.SS0.SSS0.Px3.p2.1),[Appendix H](https://arxiv.org/html/2606.23881#A8.SS0.SSS0.Px3.p3.2),[§1](https://arxiv.org/html/2606.23881#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.23881#S2.SS1.p1.1)\.
- R\. Brown and D\. McNeill \(1966\)The “tip of the tongue” phenomenon\.Journal of verbal learning and verbal behavior5\(4\),pp\. 325–337\.Cited by:[§1](https://arxiv.org/html/2606.23881#S1.p5.1),[§3\.3](https://arxiv.org/html/2606.23881#S3.SS3.p3.1)\.
- M\. Caron, A\. Iscen, A\. Fathi, and C\. Schmid \(2024\)A generative approach for wikipedia\-scale visual entity recognition\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 17313–17322\.Cited by:[§1](https://arxiv.org/html/2606.23881#S1.p5.1)\.
- J\. Chen, S\. Xiao, P\. Zhang, K\. Luo, D\. Lian, and Z\. Liu \(2024\)Bge m3\-embedding: multi\-lingual, multi\-functionality, multi\-granularity text embeddings through self\-knowledge distillation\.arXiv preprint arXiv:2402\.03216\.Cited by:[Appendix H](https://arxiv.org/html/2606.23881#A8.SS0.SSS0.Px3.p6.2),[§3\.3](https://arxiv.org/html/2606.23881#S3.SS3.p7.3)\.
- W\. Chen, H\. Hu, X\. Chen, P\. Verga, and W\. Cohen \(2022\)Murag: multimodal retrieval\-augmented generator for open question answering over images and text\.InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,pp\. 5558–5570\.Cited by:[§1](https://arxiv.org/html/2606.23881#S1.p2.1)\.
- Y\. Chen, H\. Hu, Y\. Luan, H\. Sun, S\. Changpinyo, A\. Ritter, and M\. Chang \(2023\)Can pre\-trained vision and language models answer visual information\-seeking questions?\.arXiv preprint arXiv:2302\.11713\.Cited by:[§A\.2](https://arxiv.org/html/2606.23881#A1.SS2.p3.1),[§E\.2](https://arxiv.org/html/2606.23881#A5.SS2.SSS0.Px2.p1.1),[Appendix F](https://arxiv.org/html/2606.23881#A6.p1.1),[§1](https://arxiv.org/html/2606.23881#S1.p1.1),[§2\.2](https://arxiv.org/html/2606.23881#S2.SS2.p2.1),[§4\.1](https://arxiv.org/html/2606.23881#S4.SS1.p1.1),[§4\.4](https://arxiv.org/html/2606.23881#S4.SS4.p1.1)\.
- F\. Cocchi, N\. Moratelli, M\. Cornia, L\. Baraldi, and R\. Cucchiara \(2025\)Augmenting multimodal llms with self\-reflective tokens for knowledge\-based visual question answering\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 9199–9209\.Cited by:[§A\.3](https://arxiv.org/html/2606.23881#A1.SS3.p3.1),[Figure 3](https://arxiv.org/html/2606.23881#A2.F3),[Figure 3](https://arxiv.org/html/2606.23881#A2.F3.1.4.1.1),[Figure 3](https://arxiv.org/html/2606.23881#A2.F3.2.4.1.1),[Figure 3](https://arxiv.org/html/2606.23881#A2.F3.3.4.1.1),[Figure 3](https://arxiv.org/html/2606.23881#A2.F3.4.4.1.1),[Figure 3](https://arxiv.org/html/2606.23881#A2.F3.5.4.1.1),[Figure 3](https://arxiv.org/html/2606.23881#A2.F3.6.4.1.1),[Figure 3](https://arxiv.org/html/2606.23881#A2.F3.7.4.1.1),[Figure 3](https://arxiv.org/html/2606.23881#A2.F3.8.4.1.1),[Figure 3](https://arxiv.org/html/2606.23881#A2.F3.9.4.1.1),[Figure 4](https://arxiv.org/html/2606.23881#A2.F4),[Figure 4](https://arxiv.org/html/2606.23881#A2.F4.1.4.1.1),[Figure 4](https://arxiv.org/html/2606.23881#A2.F4.2.4.1.1),[Figure 4](https://arxiv.org/html/2606.23881#A2.F4.3.4.1.1),[Figure 4](https://arxiv.org/html/2606.23881#A2.F4.4.4.1.1),[Figure 4](https://arxiv.org/html/2606.23881#A2.F4.5.4.1.1),[Figure 4](https://arxiv.org/html/2606.23881#A2.F4.6.4.1.1),[Figure 6](https://arxiv.org/html/2606.23881#A4.F6),[Figure 6](https://arxiv.org/html/2606.23881#A4.F6.1.4.1.1),[Figure 6](https://arxiv.org/html/2606.23881#A4.F6.2.4.1.1),[Figure 6](https://arxiv.org/html/2606.23881#A4.F6.3.4.1.1),[Appendix F](https://arxiv.org/html/2606.23881#A6.p1.1),[§1](https://arxiv.org/html/2606.23881#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.23881#S2.SS2.p2.1),[§3\.2](https://arxiv.org/html/2606.23881#S3.SS2.p2.1),[§3\.4](https://arxiv.org/html/2606.23881#S3.SS4.p1.1),[§4\.1](https://arxiv.org/html/2606.23881#S4.SS1.p2.1),[§4\.2](https://arxiv.org/html/2606.23881#S4.SS2.p1.2),[§4\.2](https://arxiv.org/html/2606.23881#S4.SS2.p2.1),[§4\.3](https://arxiv.org/html/2606.23881#S4.SS3.p1.1),[§4\.3](https://arxiv.org/html/2606.23881#S4.SS3.p3.1)\.
- J\. Deng, Z\. Wu, H\. Huo, and G\. Xu \(2025\)A comprehensive survey of knowledge\-based vision question answering systems: the lifecycle of knowledge in visual reasoning task\.arXiv preprint arXiv:2504\.17547\.Cited by:[§A\.2](https://arxiv.org/html/2606.23881#A1.SS2.p2.1),[§1](https://arxiv.org/html/2606.23881#S1.p1.1),[§1](https://arxiv.org/html/2606.23881#S1.p2.1)\.
- J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova \(2019\)Bert: pre\-training of deep bidirectional transformers for language understanding\.InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 \(long and short papers\),pp\. 4171–4186\.Cited by:[Appendix H](https://arxiv.org/html/2606.23881#A8.SS0.SSS0.Px2.p1.5)\.
- A\. Dosovitskiy, L\. Beyer, A\. Kolesnikov, D\. Weissenborn, X\. Zhai, T\. Unterthiner, M\. Dehghani, M\. Minderer, G\. Heigold, S\. Gelly, J\. Uszkoreit, and N\. Houlsby \(2021\)An image is worth 16x16 words: transformers for image recognition at scale\.ICLR\.Cited by:[Appendix H](https://arxiv.org/html/2606.23881#A8.SS0.SSS0.Px1.p3.6),[Appendix H](https://arxiv.org/html/2606.23881#A8.SS0.SSS0.Px2.p1.5)\.
- M\. Douze, A\. Guzhva, C\. Deng, J\. Johnson, G\. Szilvasy, P\. Mazaré, M\. Lomeli, L\. Hosseini, and H\. Jégou \(2025\)The faiss library\.IEEE Transactions on Big Data\.Cited by:[§3\.2](https://arxiv.org/html/2606.23881#S3.SS2.p2.1)\.
- M\. Fu, Y\. Peng, B\. Liu, Y\. Wan, and D\. Chen \(2025\)LiveVQA: live visual knowledge seeking\.arXiv preprint arXiv:2504\.05288\.Cited by:[Limitations](https://arxiv.org/html/2606.23881#Sx2.p1.1)\.
- Y\. Goyal, T\. Khot, D\. Summers\-Stay, D\. Batra, and D\. Parikh \(2017\)Making the v in vqa matter: elevating the role of image understanding in visual question answering\.InProceedings of the IEEE conference on computer vision and pattern recognition,pp\. 6904–6913\.Cited by:[§A\.2](https://arxiv.org/html/2606.23881#A1.SS2.p1.1),[§4\.2](https://arxiv.org/html/2606.23881#S4.SS2.p2.1)\.
- H\. Hu, Y\. Luan, Y\. Chen, U\. Khandelwal, M\. Joshi, K\. Lee, K\. Toutanova, and M\. Chang \(2023\)Open\-domain visual entity recognition: towards recognizing millions of wikipedia entities\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 12065–12075\.Cited by:[§4\.1](https://arxiv.org/html/2606.23881#S4.SS1.p2.1)\.
- D\. Jiang, R\. Zhang, Z\. Guo, Y\. Wu, J\. Lei, P\. Qiu, P\. Lu, Z\. Chen, C\. Fu, G\. Song,et al\.\(2024\)Mmsearch: benchmarking the potential of large models as multi\-modal search engines\.arXiv preprint arXiv:2409\.12959\.Cited by:[Limitations](https://arxiv.org/html/2606.23881#Sx2.p1.1)\.
- J\. Kaplan, S\. McCandlish, T\. Henighan, T\. B\. Brown, B\. Chess, R\. Child, S\. Gray, A\. Radford, J\. Wu, and D\. Amodei \(2020\)Scaling laws for neural language models\.arXiv preprint arXiv:2001\.08361\.Cited by:[Appendix H](https://arxiv.org/html/2606.23881#A8.SS0.SSS0.Px1.p1.1)\.
- B\. S\. Kim, J\. Kim, D\. Lee, and B\. Jang \(2025\)Visual question answering: a survey of methods, datasets, evaluation, and challenges\.ACM Computing Surveys57\(10\),pp\. 1–35\.Cited by:[§1](https://arxiv.org/html/2606.23881#S1.p1.1)\.
- J\. Kuang, Y\. Shen, J\. Xie, H\. Luo, Z\. Xu, R\. Li, Y\. Li, X\. Cheng, X\. Lin, and Y\. Han \(2025\)Natural language understanding and inference with mllm in visual question answering: a survey\.ACM Computing Surveys57\(8\),pp\. 1–36\.Cited by:[§1](https://arxiv.org/html/2606.23881#S1.p2.1)\.
- J\. Li, D\. Li, S\. Savarese, and S\. Hoi \(2023\)Blip\-2: bootstrapping language\-image pre\-training with frozen image encoders and large language models\.InInternational conference on machine learning,pp\. 19730–19742\.Cited by:[§A\.1](https://arxiv.org/html/2606.23881#A1.SS1.p1.1),[Appendix H](https://arxiv.org/html/2606.23881#A8.SS0.SSS0.Px2.p1.5),[Appendix H](https://arxiv.org/html/2606.23881#A8.SS0.SSS0.Px2.p5.2)\.
- Y\. Li, X\. Chen, B\. Hu, H\. Shi, and M\. Zhang \(2024\)Cognitive visual\-language mapper: advancing multimodal comprehension with enhanced visual knowledge alignment\.arXiv preprint arXiv:2402\.13561\.Cited by:[§A\.1](https://arxiv.org/html/2606.23881#A1.SS1.p2.1),[§2\.1](https://arxiv.org/html/2606.23881#S2.SS1.p2.1)\.
- H\. Liu, C\. Li, Y\. Li, and Y\. J\. Lee \(2024\)Improved baselines with visual instruction tuning\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 26296–26306\.Cited by:[§A\.1](https://arxiv.org/html/2606.23881#A1.SS1.p1.1),[§A\.3](https://arxiv.org/html/2606.23881#A1.SS3.p3.1),[§1](https://arxiv.org/html/2606.23881#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.23881#S2.SS1.p1.1)\.
- H\. Liu, C\. Li, Q\. Wu, and Y\. J\. Lee \(2023\)Visual instruction tuning\.Advances in neural information processing systems36,pp\. 34892–34916\.Cited by:[§A\.1](https://arxiv.org/html/2606.23881#A1.SS1.p1.1),[§1](https://arxiv.org/html/2606.23881#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.23881#S2.SS1.p1.1)\.
- Y\. Liu, M\. Ott, N\. Goyal, J\. Du, M\. Joshi, D\. Chen, O\. Levy, M\. Lewis, L\. Zettlemoyer, and V\. Stoyanov \(2019\)Roberta: a robustly optimized bert pretraining approach\.arXiv preprint arXiv:1907\.11692\.Cited by:[Appendix H](https://arxiv.org/html/2606.23881#A8.SS0.SSS0.Px3.p6.2)\.
- D\. Marin, J\. R\. Chang, A\. Ranjan, A\. Prabhu, M\. Rastegari, and O\. Tuzel \(2023\)Token pooling in vision transformers for image classification\.InProceedings of the IEEE/CVF winter conference on applications of computer vision,pp\. 12–21\.Cited by:[Appendix H](https://arxiv.org/html/2606.23881#A8.SS0.SSS0.Px1.p3.6)\.
- K\. Marino, M\. Rastegari, A\. Farhadi, and R\. Mottaghi \(2019\)Ok\-vqa: a visual question answering benchmark requiring external knowledge\.InProceedings of the IEEE/cvf conference on computer vision and pattern recognition,pp\. 3195–3204\.Cited by:[§A\.2](https://arxiv.org/html/2606.23881#A1.SS2.p2.1),[§2\.2](https://arxiv.org/html/2606.23881#S2.SS2.p1.1),[§4\.2](https://arxiv.org/html/2606.23881#S4.SS2.p2.1)\.
- T\. Mensink, J\. Uijlings, L\. Castrejon, A\. Goel, F\. Cadar, H\. Zhou, F\. Sha, A\. Araujo, and V\. Ferrari \(2023\)Encyclopedic vqa: visual questions about detailed properties of fine\-grained categories\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 3113–3124\.Cited by:[§A\.2](https://arxiv.org/html/2606.23881#A1.SS2.p3.1),[§E\.2](https://arxiv.org/html/2606.23881#A5.SS2.SSS0.Px1.p1.1),[Appendix F](https://arxiv.org/html/2606.23881#A6.p1.1),[§1](https://arxiv.org/html/2606.23881#S1.p1.1),[§2\.2](https://arxiv.org/html/2606.23881#S2.SS2.p2.1),[§4\.1](https://arxiv.org/html/2606.23881#S4.SS1.p1.1),[§4\.4](https://arxiv.org/html/2606.23881#S4.SS4.p1.1)\.
- OpenAI \(2025\)Update to GPT\-5 system card: GPT\-5\.2\.Technical reportOpenAI\.Note:System card \(PDF\)\.External Links:[Link](https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944f8d/oai_5_2_system-card.pdf)Cited by:[§4\.4](https://arxiv.org/html/2606.23881#S4.SS4.p4.1)\.
- D\. Schwenk, A\. Khandelwal, C\. Clark, K\. Marino, and R\. Mottaghi \(2022\)A\-okvqa: a benchmark for visual question answering using world knowledge\.InEuropean conference on computer vision,pp\. 146–162\.Cited by:[§A\.2](https://arxiv.org/html/2606.23881#A1.SS2.p2.1),[§2\.2](https://arxiv.org/html/2606.23881#S2.SS2.p1.1)\.
- Q\. Sun, J\. Wang, Q\. Yu, Y\. Cui, F\. Zhang, X\. Zhang, and X\. Wang \(2024\)Eva\-clip\-18b: scaling clip to 18 billion parameters\.arXiv preprint arXiv:2402\.04252\.Cited by:[§3\.2](https://arxiv.org/html/2606.23881#S3.SS2.p2.1),[§4\.3](https://arxiv.org/html/2606.23881#S4.SS3.p1.1)\.
- Y\. Tan, Y\. Qing, and B\. Gong \(2025\)Vision llms are bad at hierarchical visual understanding, and llms are the bottleneck\.arXiv preprint arXiv:2505\.24840\.Cited by:[§A\.1](https://arxiv.org/html/2606.23881#A1.SS1.p2.1),[§2\.1](https://arxiv.org/html/2606.23881#S2.SS1.p2.1)\.
- Y\. Tian, F\. Liu, J\. Zhang, Y\. Hu, L\. Nie,et al\.\(2025\)CoRe\-mmrag: cross\-source knowledge reconciliation for multimodal rag\.arXiv preprint arXiv:2506\.02544\.Cited by:[§A\.3](https://arxiv.org/html/2606.23881#A1.SS3.p3.1),[Appendix F](https://arxiv.org/html/2606.23881#A6.p1.1),[§1](https://arxiv.org/html/2606.23881#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.23881#S2.SS2.p2.1),[§3\.2](https://arxiv.org/html/2606.23881#S3.SS2.p2.1),[§3\.4](https://arxiv.org/html/2606.23881#S3.SS4.p1.1),[§4\.1](https://arxiv.org/html/2606.23881#S4.SS1.p2.1),[§4\.3](https://arxiv.org/html/2606.23881#S4.SS3.p4.1),[§4\.6](https://arxiv.org/html/2606.23881#S4.SS6.p1.1)\.
- H\. Touvron, M\. Cord, A\. El\-Nouby, J\. Verbeek, and H\. Jégou \(2022\)Three things everyone should know about vision transformers\.InEuropean Conference on Computer Vision,pp\. 497–515\.Cited by:[Appendix H](https://arxiv.org/html/2606.23881#A8.SS0.SSS0.Px1.p3.6)\.
- P\. Wang, Q\. Wu, C\. Shen, A\. Dick, and A\. Van Den Hengel \(2017\)Fvqa: fact\-based visual question answering\.IEEE transactions on pattern analysis and machine intelligence40\(10\),pp\. 2413–2427\.Cited by:[§A\.2](https://arxiv.org/html/2606.23881#A1.SS2.p1.1)\.
- J\. Wu, Z\. Deng, W\. Li, Y\. Liu, B\. You, B\. Li, Z\. Ma, and Z\. Liu \(2025\)MMSearch\-r1: incentivizing lmms to search\.arXiv preprint arXiv:2506\.20670\.Cited by:[Limitations](https://arxiv.org/html/2606.23881#Sx2.p1.1)\.
- Y\. Yan and W\. Xie \(2024\)EchoSight: advancing visual\-language models with wiki knowledge\.arXiv preprint arXiv:2407\.12735\.Cited by:[§A\.3](https://arxiv.org/html/2606.23881#A1.SS3.p2.1),[Figure 3](https://arxiv.org/html/2606.23881#A2.F3),[Figure 3](https://arxiv.org/html/2606.23881#A2.F3.1.4.3.1),[Figure 3](https://arxiv.org/html/2606.23881#A2.F3.2.4.3.1),[Figure 3](https://arxiv.org/html/2606.23881#A2.F3.3.4.3.1),[Figure 3](https://arxiv.org/html/2606.23881#A2.F3.4.4.3.1),[Figure 3](https://arxiv.org/html/2606.23881#A2.F3.5.4.3.1),[Figure 3](https://arxiv.org/html/2606.23881#A2.F3.6.4.3.1),[Figure 3](https://arxiv.org/html/2606.23881#A2.F3.7.4.3.1),[Figure 3](https://arxiv.org/html/2606.23881#A2.F3.8.4.3.1),[Figure 3](https://arxiv.org/html/2606.23881#A2.F3.9.4.3.1),[Figure 4](https://arxiv.org/html/2606.23881#A2.F4),[Figure 4](https://arxiv.org/html/2606.23881#A2.F4.1.4.3.1),[Figure 4](https://arxiv.org/html/2606.23881#A2.F4.2.4.3.1),[Figure 4](https://arxiv.org/html/2606.23881#A2.F4.3.4.3.1),[Figure 4](https://arxiv.org/html/2606.23881#A2.F4.4.4.3.1),[Figure 4](https://arxiv.org/html/2606.23881#A2.F4.5.4.3.1),[Figure 4](https://arxiv.org/html/2606.23881#A2.F4.6.4.3.1),[Figure 5](https://arxiv.org/html/2606.23881#A3.F5),[Figure 5](https://arxiv.org/html/2606.23881#A3.F5.1.5.1.1),[Figure 5](https://arxiv.org/html/2606.23881#A3.F5.2.4.1.1),[Figure 5](https://arxiv.org/html/2606.23881#A3.F5.3.4.1.1),[Appendix C](https://arxiv.org/html/2606.23881#A3.p1.1),[Figure 6](https://arxiv.org/html/2606.23881#A4.F6),[Figure 6](https://arxiv.org/html/2606.23881#A4.F6.1.4.3.1),[Figure 6](https://arxiv.org/html/2606.23881#A4.F6.2.4.3.1),[Figure 6](https://arxiv.org/html/2606.23881#A4.F6.3.4.3.1),[§E\.2](https://arxiv.org/html/2606.23881#A5.SS2.SSS0.Px2.p1.1),[§E\.2](https://arxiv.org/html/2606.23881#A5.SS2.p1.1),[Appendix F](https://arxiv.org/html/2606.23881#A6.p1.1),[§1](https://arxiv.org/html/2606.23881#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.23881#S2.SS2.p2.1),[§3\.2](https://arxiv.org/html/2606.23881#S3.SS2.p2.1),[§3\.3](https://arxiv.org/html/2606.23881#S3.SS3.p7.3),[Figure 2](https://arxiv.org/html/2606.23881#S4.F2),[Figure 2](https://arxiv.org/html/2606.23881#S4.F2.1.3.1.1),[§4\.1](https://arxiv.org/html/2606.23881#S4.SS1.p1.1),[§4\.1](https://arxiv.org/html/2606.23881#S4.SS1.p2.1),[§4\.2](https://arxiv.org/html/2606.23881#S4.SS2.p1.2),[§4\.2](https://arxiv.org/html/2606.23881#S4.SS2.p2.1),[§4\.3](https://arxiv.org/html/2606.23881#S4.SS3.p1.1),[§4\.3](https://arxiv.org/html/2606.23881#S4.SS3.p3.1),[§4\.4](https://arxiv.org/html/2606.23881#S4.SS4.p5.2),[Table 6](https://arxiv.org/html/2606.23881#S4.T6)\.
- S\. Yin, C\. Fu, S\. Zhao, K\. Li, X\. Sun, T\. Xu, and E\. Chen \(2024\)A survey on multimodal large language models\.National Science Review11\(12\),pp\. nwae403\.Cited by:[§A\.2](https://arxiv.org/html/2606.23881#A1.SS2.p1.1)\.
- Q\. Yu, Z\. Xiao, B\. Li, Z\. Wang, C\. Chen, and W\. Zhang \(2025\)MRAMG\-bench: a comprehensive benchmark for advancing multimodal retrieval\-augmented multimodal generation\.InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval,pp\. 3616–3626\.Cited by:[§1](https://arxiv.org/html/2606.23881#S1.p2.1)\.
- T\. Zhang, V\. Kishore, F\. Wu, K\. Q\. Weinberger, and Y\. Artzi \(2019\)Bertscore: evaluating text generation with bert\.arXiv preprint arXiv:1904\.09675\.Cited by:[§4\.2](https://arxiv.org/html/2606.23881#S4.SS2.p2.1)\.
- Y\. Zhang, Y\. Liu, D\. Miao, Q\. Zhang, Y\. Shi, and L\. Hu \(2023\)MG\-vit: a multi\-granularity method for compact and efficient vision transformers\.Advances in Neural Information Processing Systems36,pp\. 69328–69347\.Cited by:[Appendix H](https://arxiv.org/html/2606.23881#A8.SS0.SSS0.Px1.p3.6)\.

## Appendix AAdditional Related Work

### A\.1Multimodal Large Language Models

Multimodal large language models \(MLLMs\) extend text\-only LLMs by integrating modality encoders and alignment mechanisms, enabling joint reasoning over images and text\. Early models such as FlamingoAlayracet al\.\([2022](https://arxiv.org/html/2606.23881#bib.bib35)\)and BLIP\-2Liet al\.\([2023](https://arxiv.org/html/2606.23881#bib.bib36)\)demonstrated that coupling a pretrained vision encoder with a frozen language model can already yield strong few\-shot multimodal capabilities\. Subsequent models, including the LLaVA familyLiuet al\.\([2023](https://arxiv.org/html/2606.23881#bib.bib32),[2024](https://arxiv.org/html/2606.23881#bib.bib33)\)and Qwen\-VLBaiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib34)\), further advanced this paradigm through large\-scale instruction tuning and improved cross\-modal fusion architectures, achieving strong performance across a wide range of multimodal benchmarks\.

Despite these advances, recent empirical studiesLiet al\.\([2024](https://arxiv.org/html/2606.23881#bib.bib37)\); Tanet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib38)\)consistently show that even state\-of\-the\-art MLLMs underperform on knowledge\-based VQA \(KB\-VQA\) benchmarks that require fine\-grained, entity\-centric, or long\-tail factual knowledge\. This limitation motivates augmenting MLLMs with external knowledge sources to support explicit grounding and reasoning beyond parametric knowledge alone\.

### A\.2Knowledge\-Based Visual Question Answering

Conventional visual question answering \(VQA\) benchmarksAntolet al\.\([2015](https://arxiv.org/html/2606.23881#bib.bib5)\); Goyalet al\.\([2017](https://arxiv.org/html/2606.23881#bib.bib6)\); Wanget al\.\([2017](https://arxiv.org/html/2606.23881#bib.bib15)\)focus on questions that can be answered using visual content and common\-sense knowledge\. With large\-scale pretraining and instruction tuning, modern MLLMs perform competitively on these tasks, as much of the required information is either directly observable or implicitly encoded during trainingYinet al\.\([2024](https://arxiv.org/html/2606.23881#bib.bib22)\); Baiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib34)\)\.

Knowledge\-based VQA \(KB\-VQA\) fundamentally extends this setting by requiring external knowledge not contained in the image alone, such as entity attributes or encyclopedic factsDenget al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib14)\)\. Early benchmarks, including OK\-VQAMarinoet al\.\([2019](https://arxiv.org/html/2606.23881#bib.bib16)\)and A\-OKVQASchwenket al\.\([2022](https://arxiv.org/html/2606.23881#bib.bib17)\), introduced questions that depend on commonsense or general knowledge, which increasingly falls within the training scope of large\-scale MLLMs\.

More recent benchmarks, such as Encyclopedic\-VQAMensinket al\.\([2023](https://arxiv.org/html/2606.23881#bib.bib18)\)and InfoSeekChenet al\.\([2023](https://arxiv.org/html/2606.23881#bib.bib19)\), further raise the difficulty by requiring fine\-grained, entity\-level grounding in Wikipedia\. These datasets demand explicit identification of the correct entity and selection of supporting sections from retrieved articles\. Despite progress in MLLMs, models continue to struggle in this setting due to challenges in discriminating visually similar entities and selecting the most informative evidence, motivating retrieval\-augmented approaches\.

### A\.3MM\-RAG\-Based Solutions

To address the limitations of parametric knowledge in KB\-VQA, many recent methods adopt multimodal retrieval\-augmented generation \(MM\-RAG\)\. As illustrated in Figure[1](https://arxiv.org/html/2606.23881#S3.F1), MM\-RAG pipelines typically consist of three stages: \(i\) a retriever that performs coarse retrieval from a large\-scale knowledge base, \(ii\) a re\-ranking stage that selects the most relevant context among retrieved candidates, and \(iii\) an answer generator that produces the final response conditioned on the selected evidence\.

Representative methods differ primarily in how relevance is modeled\. EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\)trains a dedicated multimodal re\-ranker to select the most relevant sections given an image–question pair, leveraging contrastive learning with curated supervision\. However, obtaining high\-quality positive supervision for section\-level relevance is challenging, and among major KB\-VQA benchmarks, only E\-VQA provides explicit supporting section annotations\.

Other approaches, such as ReflectiVACocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\)and CoRe\-MMRAGTianet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib3)\), rely on fine\-tuning large MLLMs to implicitly assess relevance through internal representations\. While effective, these methods require constructing task\-specific training data and incur substantial computational cost when fine\-tuning large\-scale models such as LLaVALiuet al\.\([2024](https://arxiv.org/html/2606.23881#bib.bib33)\)or Qwen\-VLBaiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib34)\)\. These limitations motivate alternative designs that reduce training overhead while improving robustness and transferability\.

## Appendix BQualitative results

In this section we will provide more qualitative results on both InfoSeek and E\-VQA datasets as shown in Figure[3](https://arxiv.org/html/2606.23881#A2.F3)and[4](https://arxiv.org/html/2606.23881#A2.F4)\.

infoseek\-00016131Q: In which year was this equipment retired from operational service?

Query Entity: Bren light machine gun

![Refer to caption](https://arxiv.org/html/2606.23881v1/images/qualiatives/InfoSeek/infoseek_val_00016131.png)

ReflectiVACocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\):

1963✗

EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\):

Not mentioned in the text\.✗

IBA \(Ours\): 2006✓

00073156Q:What is the sea level in metre of this mountain?

Query Entity:Corcovado

![Refer to caption](https://arxiv.org/html/2606.23881v1/images/qualiatives/InfoSeek/infoseek_val_00073156.png)

ReflectiVACocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\):

396✗

EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\):

396✗

IBA \(Ours\): 710✓

00073009Q: what was the date this aircraft enter into service?

Query Entity:Tupolev Tu\-154

![Refer to caption](https://arxiv.org/html/2606.23881v1/images/qualiatives/InfoSeek/infoseek_val_00073009.png)

ReflectiVACocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\):

1970 2 21✗

EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\):

Not mentioned\.✗

IBA \(Ours\): 9 February 1972✓

00072497Q: In which year did this building come into service?

Query Entity:Niechorze Lighthouse 1866

![Refer to caption](https://arxiv.org/html/2606.23881v1/images/qualiatives/InfoSeek/infoseek_val_00072497.png)

ReflectiVACocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\):

1814✗

EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\):

1814✗

IBA \(Ours\): 1866✓

00073353Q: What is the length of this lake in kilometre?

Query Entity:Windermere

![Refer to caption](https://arxiv.org/html/2606.23881v1/images/qualiatives/InfoSeek/infoseek_val_00073353.png)

ReflectiVACocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\):

11\.23✗

EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\):

11✗

IBA \(Ours\): 18✓

00072995Q:What is the weight of a male of this bird in gram?

Query Entity:Least grebe 129

![Refer to caption](https://arxiv.org/html/2606.23881v1/images/qualiatives/InfoSeek/infoseek_val_00072995.png)

ReflectiVACocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\):

692✗

EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\):

692\-925✗

IBA \(Ours\): 129✓

00018660Q: What is the closest upper taxonomy of this bird?

Query Entity:Eurasian collared dove

![Refer to caption](https://arxiv.org/html/2606.23881v1/images/qualiatives/InfoSeek/infoseek_val_00018660.png)

ReflectiVACocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\):

Zenaida✗

EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\):

Genus✗

IBA \(Ours\): Streptopelia✓

00018687Q: What is this plant named after?

Query Entity:Allamanda

![Refer to caption](https://arxiv.org/html/2606.23881v1/images/qualiatives/InfoSeek/infoseek_val_00018687.png)

ReflectiVACocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\):

Charles Plumier✗

EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\):

Thevetia✗

IBA \(Ours\): Frédéric\-Louis Allamand✓

00000129Q:What country does this building belong to?

Query Entity:Polish Baltic Philharmonic

![Refer to caption](https://arxiv.org/html/2606.23881v1/images/qualiatives/InfoSeek/infoseek_val_00000129.png)

ReflectiVACocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\):

Netherlands✗

EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\):

Denmark✗

IBA \(Ours\): Poland✓

Figure 3:Sample qualitative results on image\-question pairs from InfoSeek, where we compare the answers provided by IBA with those from ReflectiVACocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\)and EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\)\.Q: How long is this bird in centimeters?

Query Entity: Sanderling

![Refer to caption](https://arxiv.org/html/2606.23881v1/images/qualiatives/EVQA/EVQA-1204.png)

ReflectiVACocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\):

30–40✗

EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\):

The breeding adult bird is 14\-16 cm in length\.✗

IBA \(Ours\): The sanderling is 18–20 cm \(7\.1–7\.9 in\) in length\.✓

Q: Who designed this hotel?

Query Entity:The Fullerton Hotel Singapore

![Refer to caption](https://arxiv.org/html/2606.23881v1/images/qualiatives/EVQA/EVQA-632.png)

ReflectiVACocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\):

The hotel rooms were designed by Hirsch Bedner Associates\.✗

EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\):

Hirsch Bedner Associates designed the hotel rooms\.✗

IBA \(Ours\): The Fullerton Building was designed by Major P\.H\. Keys of Keys & Dowdeswell, a Shanghai firm of architects\.✓

Q: What settlements comprised the this abbey estate?

Query Entity: Bolton Abbey

![Refer to caption](https://arxiv.org/html/2606.23881v1/images/qualiatives/EVQA/EVQA-4721.png)

ReflectiVACocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\):

villages✗

EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\):

The Fountains Abbeyestate comprised the following settlements:…✗

IBA \(Ours\): The Bolton Abbey estate comprised the following settlements: Addingham,Holme,Beamsley and Gargrave✓

Q: What tool is associated with the traditional seaside atmosphere of this pier?

Query Entity:Brighton Palace Pier

![Refer to caption](https://arxiv.org/html/2606.23881v1/images/qualiatives/EVQA/EVQA-4744.png)

ReflectiVACocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\):

whelk✗

EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\):

The answer is: Carousel\.✗

IBA \(Ours\): The traditional seaside atmosphere of the Brighton Palace Pier is associated with ""bucket and spade""\.✓

Q: What is the architectural style of this mosque?

Query Entity: Great Mosque of Banten

![Refer to caption](https://arxiv.org/html/2606.23881v1/images/qualiatives/EVQA/EVQA-725.png)

ReflectiVACocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\):

Islamic architecture✗

EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\):

The architectural style of the Great Mosque of Banten is a mix of Indian Mughal pattern and ancient candi decoration\.✗

IBA \(Ours\): The architectural style of the Great Mosque of Banten is described as follows joglo style, a traditional Javanese style✓

Q:When was this park created?

Query Entity: Princes Street Gardens

![Refer to caption](https://arxiv.org/html/2606.23881v1/images/qualiatives/EVQA/EVQA-841.png)

ReflectiVACocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\):

1843✗

EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\):

The park was created in 1876, when the town council reacquired the ground and laid it out as a public park\.✗

IBA \(Ours\): Princes Street Gardens were created in the 1820s\.✓

Figure 4:Sample qualitative results on image\-question pairs from E\-VQA, where we compare the answers provided by IBA with those from ReflectiVACocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\)and EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\)\.
## Appendix CMore examples for informative context

We have presented more examples on Figure[5](https://arxiv.org/html/2606.23881#A3.F5)where our proposed IBA selects more informative context compared with EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\)\.

Q: How many tiny reddish yellow warts

does each segment of this moth have?

Query Entity: Atolmis rubricollis

![Refer to caption](https://arxiv.org/html/2606.23881v1/images/context/EVQA-1173.png)

EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\):

According to the text,there is no mentionof tiny reddish\-yellow warts on the segments of the red\-necked footman moth\. The description only mentions the coloration of the thorax, head, abdomen, antennae, legs, and wings, but does not mention any warts\. Therefore, the answer is None\.✗

\#\# Section Title: Description

The red\-necked footman is a small moth that is mostly charcoal grey or deep dark brown \(fresh specimens almost black\), but has a conspicuous orange thorax, part of which is visible behind the black head as an orange\-red collar\. The hindwings are brownish grey\. The antennae and legs are black and the end of the abdomen is yellowish orange or golden yellow\. The wings are tightly folded together around the body and have pleated, squared\-off ends\. The wingspan is 25 to 35 mm \(1\.0 to 1\.4 in\) and the length of the forewings is 15 to 18 mm \(0\.6 to 0\.7 in\)\. Wilhelm Neuburger named Romanian specimens as the synonym Atolmis flavicollis; these were less dark brown and with a light yellow instead of a red collar\.

IBA \(Ours\): According to the text,each segment of the caterpillar bears six tiny reddish yellow warts\.✓

\#\# Section Title: Life cycle

The white eggs of the red\-necked footman are laid in small groups in crevices in the branches of trees, especially those of old firs\. They grow to a length of about 27 mm \(1\.1 in\)\. Their head is black with a bold diagonal white stripe on either side\. Their main colour is dark greenish grey marbled with cream\.Each segment bears six tiny reddish yellow warts which bear black hairs\.The caterpillars feed on lichens growing on the trunks and branches of trees, and can be found between August and October…

Q: In which year\(s\) was this castle rebuilt?

Query Entity: Dudley Castle

![Refer to caption](https://arxiv.org/html/2606.23881v1/images/context/EVQA-577.png)

EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\):

The answer is: None\. According to the text, the bulk of the remaining habitable parts of the castle wasdestroyed by fire in 1750, and there isno mention of the castle being rebuilt\.✗

SELECTED Section Title: Final years and ruin

stable block was constructed on the site at some point before 1700\. This was the final building to be constructed in the castle\. The bulk of the remaining habitable parts of the castle wasdestroyed by fire in 1750\. However, in the nineteenth century, the site found a new use as a ’Romantic Ruin’ and a certain amount of tidying up of the site was carried out by the Earls of Dudley\. Battlements on one of the remaining towers were reconstructed and two cannon captured during the Crimean Wars were installed\. In the nineteenth and early twentieth century the site was used for fêtes and pageants\. In 1937, when the Dudley Zoo was established, the castle grounds were incorporated into the zoo\. …

IBA \(Ours\): According to the evidence, the castle keep was rebuilt starting from 1262\.✓

SELECTED Section Title: The keep

The castle keep datesfrom the rebuilding that started in 1262\. It rests on the motte, constructed in the Norman period but somewhat reduced in height afterwards\. The original building was slightly rectangular in plan with approximate dimensions 15 metres north to south and 22 metres east to west\. The four drum towers on each corner are 9\.8 metres in diameter\. After the slighting at the end of the civil war, only the north side of the castle and parts of two of the drum towers remain\.

Q:What kind of scabious is this plant commonly known as?

Query Entity: Knautia Arvensis

![Refer to caption](https://arxiv.org/html/2606.23881v1/images/context/EVQA-1740.png)

EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\):

According to the article, the plant is commonly known asGipsy Rose\.✗

SELECTED Section Title: Name

Species of scabious were used to treat scabies, and many other afflictions of the skin including sores caused by the bubonic plague\. The word scabies comes from the Latin word scabere ’scratch’\.Another name for this plant is gipsy rose\.The genus Knautia is named after a 17th\-century German botanist, Christian Knaut\.

IBA \(Ours\): Based on the evidence text, the answer is: field scabious\.✓

SELECTED Section Title: Knautia arvensis

Knautia arvensis, commonly known asfield scabious, is a herbaceous perennial species of flowering plant in the honeysuckle family Caprifoliaceae\.

Figure 5:Sample qualitative results on image\-question pairs from E\-VQA, where we compare the answers provided by IBA with EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\)where we selected more informative context\.
## Appendix DError Case Study

In this section, we provide more examples of error cases\. As shown in Figure[6](https://arxiv.org/html/2606.23881#A4.F6), for most cases, the error is caused by compromised entity grounding as shown in first two cases\. However, it could be possible that the data annotation is not concise enough or evaluation is not comprehensive enough\. As shown in the last case, both our proposed IBA and EchoSight have secured the correct entity grounding and provided reasonable answer*i\.e\.,*the exact name of the law signed ‘ Oklahoma City National Memorial Act ’\. However, it is judged as wrong via existing evaluation protocols, which could be caused due to the too coarse\-grained or vague ground truth answerlaw\.

Q: In which part of the world does this animal live?

Query Entity: Echinometra Mathaei rubricollis

![Refer to caption](https://arxiv.org/html/2606.23881v1/images/errors/EVQA-21.png)

ReflectiVACocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\):

World Ocean

EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\):

Thered sea urchin\(Mesocentrotus franciscanus\) is found in the northeastern Pacific Ocean, specifically in the areas of Alaska to Baja California\.✗

IBA \(Ours\): Thepurple sea urchin, Strongylocentrotus purpuratus, lives along the eastern edge of the Pacific Ocean, extending from Ensenada, Mexico, to British Columbia, Canada\.✗

Q: How big is an adult of this species typically?

Query Entity: Argiope bruennichi

![Refer to caption](https://arxiv.org/html/2606.23881v1/images/errors/EVQA-108.png)

ReflectiVACocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\):

insects

EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\):

The answer is not explicitly stated in the provided text\. It does not provide information about the size of the adultArgiope spider\.✗

IBA \(Ours\): I did not find any information about the size of an adultArgiope species\.✓

Q:What did president bill clinton sign to create this memorial?

Query Entity: Oklahoma City National Memorial\. Ground Truth Answer: Law

![Refer to caption](https://arxiv.org/html/2606.23881v1/images/errors/EVQA-3271.png)

ReflectiVACocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\):

Bill✗

EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\):

The answer is: the Oklahoma City National Memorial Act of 1997\.✗

IBA \(Ours\): President Bill Clinton signed the ""Oklahoma City National Memorial Act of 1997"" to create the Oklahoma City National Memorial\.✗

Figure 6:Error cases on image\-question pairs from E\-VQA, where we compare the answers provided by our proposed IBA with ReflectiVACocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\)and EchoSightYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\)\.
## Appendix EPrompt Template

### E\.1Identification

We explicitly prompt the MLLM to perform entity identification as a constrained selection task, rather than open\-ended entity naming\. Given a query image and a small set of retrieved candidate entity names, the model is asked to select the most likely entity \(or top\-kkentities\) depicted in the image\. We also utilize the initial visual retrieval similarity score in the prompts\. The identification prompt is formatted as follows:

SYSTEM:Youareanexpertvisualentityrecognizer\.Lookattheimageandherearesomepotentiallyrelevantoptions\.

Options:

A\.<ENTITY\_NAME\_1\>\(imagesimilarity:<SIM\_1\>\)

B\.<ENTITY\_NAME\_2\>\(imagesimilarity:<SIM\_2\>\)

C\.<ENTITY\_NAME\_3\>\(imagesimilarity:<SIM\_3\>\)

\.\.\.

Replywith’Answer:<label1\>,<label2\>,\.\.\.’listingthetop<K\>optionlettersfrommosttoleastlikelybasedontheimage\.

Here, each option corresponds to a candidate entity retrieved from the external knowledge base, optionally augmented with its initial image\-to\-image retrieval similarity score\. The model is required to respond strictly in the prescribed format, enabling us to directly interpret the output as entity\-level confidence for subsequent re\-ranking\.

### E\.2Answer Generation

For the answer generation stage, we follow existing worksYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\)to apply answer generation templates depending on the dataset\.

#### E\-VQA\.

The prompt we use for LLMs when testing Encyclopedic\-VQA \(E\-VQA\)Mensinket al\.\([2023](https://arxiv.org/html/2606.23881#bib.bib18)\)is shown as follows:

USER:Context:<CONTEXT\>

Question:<QUESTION\>

Theansweris:

#### InfoSeek\.

Due to the strict exact\-match evaluation used by InfoSeekChenet al\.\([2023](https://arxiv.org/html/2606.23881#bib.bib19)\), following existing worksYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\), we adopt a one\-shot prompting strategy and add instructions to ensure the generated answer strictly matches the required format\. The prompt used for InfoSeek is:

SYSTEM:Youalwaysanswerthequestiontheuserasks\.Donotansweranythingelse\.

USER:Context:ThesouthernsideoftheAlpsisnexttoLakeComo\.

Question:Whichbodyofwateristhismountainlocatedinornextto?

Justanswerthequestions,noexplanationsneeded\.

Shortansweris:LakeComo

Context:<CONTEXT\>

Question:<QUESTION\>

Justanswerthequestions,noexplanationsneeded\.

Shortansweris:

## Appendix FDataset Details

Following existing worksYan and Xie \([2024](https://arxiv.org/html/2606.23881#bib.bib1)\); Tianet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib3)\); Cocchiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib2)\), we use the same test set of InfoSeekChenet al\.\([2023](https://arxiv.org/html/2606.23881#bib.bib19)\)and E\-VQAMensinket al\.\([2023](https://arxiv.org/html/2606.23881#bib.bib18)\), consists of 71,335 and 4,750 question pairs respectively,

## Appendix GToken Budget and Re\-ranking Efficiency

In this section we will introduce more details to compare the cost of our proposed method with existing methods in terms of token and FLOPs\.

### G\.1Our proposed IBA with BGE Reranker

The Qwen top\-kkpipeline operates as follows:

1. 1\.Retrieve top 20 entities per query; Qwen identification keeps the top 3 entities\.
2. 2\.For the retained entities, obtain all wiki sections and send them to the BGE section reranker\.
3. 3\.Obtain the section scores by comparing all sections with the query text to select the best section for downstream answer generation\.

Empirically, our prepared metadata for E\-VQA shows:

avg\. sections/example≈24\.7,\\displaystyle\\approx 24\.7,avg\. tokens/section≈172\.\\displaystyle\\approx 172\.Assuming questions contribute≈20\\approx 20tokens, the total BGE token budget per example is

24\.7×\(172\+20\)≈4\.8k tokens,24\.7\\times\(172\+20\)\\approx 4\.8\\text\{k tokens\},\.

### G\.2EchoSight Re\-ranker

The EchoSight flow differs from our proposed IBA with larger re\-ranking space\.

1. 1\.Retrieve top 20 entities, which is same as ours\.
2. 2\.Expand all entities to sections, encode the image once, and encode*all*sections with BLIP\-2 Q\-Former\.
3. 3\.Score by query\(both image and text\) and candidate sections relevance\(similarity\) and rerank\.

Using the same section statistics as a proxy, the section pool grows to

20×24\.73≈165sections,20\\times\\tfrac\{24\.7\}\{3\}\\approx 165\\text\{ sections\},yielding a text load of≈165×172≈28k tokens\\approx 165\\times 172\\approx 28\\text\{k tokens\}, i\.e\., about6−7×6\\\!\-\\\!7\\timesmore text than our proposed IBA\.

### G\.3Runtime Implications \(full pipeline\)

- •Ours - –Identification \(Qwen2\.5\-VL\-7B\):≈\\approx256 visual tokens \(one image encode\) \+7070–100100text tokens input \(question \+ 20 candidate titles\)⇒\\Rightarrow400400tokens\. - –Section rerank \(BGE, text\-only\):≈4\.8k\\approx 4\.8\\text\{k\}text tokens \(24\.7 sections×\\times\(172 per section \+ 20 for the question\)\)\. Overall cost is dominated by the BGE text forward; the image is encoded once in identification\.
- •EchoSight reranker \(no identification\): - –One image encode \(≈\\approx256 visual tokens\)\. - –Text side encodes around28k28\\text\{k\}text tokens \(≈165\\approx 165sections×\\times172 tokens per section\) and fuses with the image\. The model\(*i\.e\.,*EchoSight Re\-ranker based on BLIP\-2\) is heavier than the off\-the\-shelf textual re\-ranker such as bge\-v2\-m3\.

Thus, end\-to\-end, our identification \+ text re\-rank load is far smaller than EchoSight’s multi\-modal re\-rank; the gains come from entity pruning \(section count reduced6−7×6\\\!\-\\\!7\\times\) and using a lightweight text re\-ranker\.

## Appendix HComputational Cost Analysis via FLOPs

We analyze computational cost using floating\-point operations \(FLOPs\) rather than wall\-clock latency\. Wall\-clock time is highly sensitive to implementation details, hardware configuration, batching strategy, and system\-level optimizations, making it difficult to fairly compare methods with different architectures\. In contrast, FLOPs provide a hardware\-agnostic and reproducible proxy for inference complexity that reflects the intrinsic computational demand of a model\. This metric has been widely adopted in prior work for comparing model efficiency across architectures and modalities\.

#### FLOPs estimation protocol\.

To compare inference cost across different model architectures in a hardware\-agnostic and reproducible manner, we estimate computational complexity using floating\-point operations \(FLOPs\)\. Following prior work that analyzes the scaling and efficiency of Transformer modelsKaplanet al\.\([2020](https://arxiv.org/html/2606.23881#bib.bib59)\), we adopt a standard proxy for Transformer\-based modules:

For a Transformer encoder with hidden dimensionddandLLlayers, the forward FLOPs per token can be approximated as

FLOPs per token≈24×L×d2,\\text\{FLOPs per token\}\\approx 24\\times L\\times d^\{2\},\(1\)which captures the dominant contributions of multi\-head self\-attention and feed\-forward network operations across layers\. The total compute is then obtained by multiplying this per\-token cost by the number of tokens processed by the model\. Using this unified protocol for all Transformer\-only components \(e\.g\., transformer layers in text or cross\-modal encoders\) ensures a fair basis for comparison across methods\.

Vision Transformers \(ViT\)Dosovitskiyet al\.\([2021](https://arxiv.org/html/2606.23881#bib.bib61)\), however, require a different consideration because their self\-attention operations compute pairwise interactions among image patch tokens, resulting in computation that scales quadratically with sequence length\. Specifically, self\-attention involves operations onS×SS\\times Saffinity matrices, whereSSis the number of visual tokens, dominating compute whenSSis large\. For these vision encoders, we therefore estimate FLOPs by accounting explicitly for both the attention term \(O\(S2⋅d\)O\(S^\{2\}\\cdot d\)\) and the feed\-forward term \(O\(S⋅d⋅dff\)O\(S\\cdot d\\cdot d\_\{ff\}\)\), rather than relying solely on thed2d^\{2\}proxy\. This exception reflects the inherent quadratic complexity of self\-attention in visual processing and is consistent with practice in ViT analysisTouvronet al\.\([2022](https://arxiv.org/html/2606.23881#bib.bib63)\); Marinet al\.\([2023](https://arxiv.org/html/2606.23881#bib.bib64)\); Zhanget al\.\([2023](https://arxiv.org/html/2606.23881#bib.bib65)\)\.

#### EchoSight\.

EchoSight performs multi\-modal re\-ranking using a BLIP\-2\-basedLiet al\.\([2023](https://arxiv.org/html/2606.23881#bib.bib36)\)architecture\. During re\-ranking, EchoSight does not invoke a large language model\. Instead, it employs the BLIP\-2 Querying Transformer \(Q\-Former\), a BERTDevlinet al\.\([2019](https://arxiv.org/html/2606.23881#bib.bib60)\)style Transformer \(12 layers, hidden size 768\), to encode both the multi\-modal query \(image \+ question\) and all candidate wiki sections, where vision information will be processed with its Vision TransformerDosovitskiyet al\.\([2021](https://arxiv.org/html/2606.23881#bib.bib61)\)vision encoder111[https://github\.com/Go2Heart/EchoSight/blob/main/lavis/models/blip2\_models/blip2\_qformer\_reranker\.py](https://github.com/Go2Heart/EchoSight/blob/main/lavis/models/blip2_models/blip2_qformer_reranker.py)\. Each candidate section is encoded independently using the same Q\-Former in a text\-only mode, and relevance scores are computed via embedding similarity\. Given that EchoSight expands all retrieved entities into sections, the re\-ranking stage processes a large number of text tokens, leading to substantial computational cost\. We estimate the FLOPs of EchoSight re\-ranking by accounting for: \(i\) the frozen vision encoder, \(ii\) the Q\-Former multi\-modal fusion, and \(iii\) the Q\-Former\-based text encoding of all candidate sections\. For the vision encoder, EchoSight uses an EVA\-CLIP Vision Transformer variant with a patch size of 14, hidden dimensiondv=1408d\_\{v\}=1408, and depthLv=39L\_\{v\}=39transformer layers\. The input image of size224×224224\\times 224is tokenized into162=25616^\{2\}=256patches plus one classification token, yieldingS=257S=257tokens\.

To estimate FLOPs for Vision Transformers \(ViT\), we decompose the dominant contributions as:

- •Self\-attention:computing the attention score matrixQKTQK^\{T\}involvesO\(S2⋅d\)O\(S^\{2\}\\cdot d\)operations due to pairwise interactions among tokens\.
- •Feed\-forward network \(FFN\):each token is transformed via two linear layers with intermediate dimensiondff≈4dd\_\{ff\}\\approx 4d, contributingO\(S⋅d⋅dff\)O\(S\\cdot d\\cdot d\_\{ff\}\)operations\.

This yields the per\-layer FLOPs approximation:

FLOPs per layerViT≈2dvS2\+4dvdffS,\\text\{FLOPs per layer\}\_\{\\text\{ViT\}\}\\approx 2\\,d\_\{v\}\\,S^\{2\}\\;\+\\;4\\,d\_\{v\}\\,d\_\{ff\}\\,S,where the first term corresponds to self\-attention and the second to FFN\. Such decomposition reflects the quadratic dependence on token count intrinsic to self\-attention in vision models\.

SubstitutingS=257S=257,dv=1408d\_\{v\}=1408,Lv=39L\_\{v\}=39anddff≈4dvd\_\{ff\}\\approx 4d\_\{v\}, the FLOPs for a single forward pass through the vision encoder can be approximated as:

FLOPsvision\\displaystyle\\text\{FLOPs\}\_\{vision\}≈Lv\(2dvS2\+4dvdffS\)\\displaystyle\\approx L\_\{v\}\\\!\\left\(2\\,d\_\{v\}\\,S^\{2\}\+4\\,d\_\{v\}\\,d\_\{ff\}\\,S\\right\)≈3\.3×1011\.\\displaystyle\\approx 3\.3\\times 10^\{11\}\.\(2\)
This indicates that a single forward pass through EVA\-CLIP’s Vision Transformer backbone incurs on the order of101110^\{11\}FLOPs, consistent with the understanding that self\-attention costs grow quadratically with the number of tokens processed\.

For the multi\-modal fusion and text encoding, EchoSight employs the BLIP\-2 Querying Transformer \(Q\-Former\)Liet al\.\([2023](https://arxiv.org/html/2606.23881#bib.bib36)\), a BERT\-base style Transformer with hidden sizedq=768d\_\{q\}=768andLq=12L\_\{q\}=12layers\. Using the same transformer FLOPs proxy yields the following cost per token\.,

FLOPsQFormer≈24×Lq×dq2\\displaystyle\\text\{FLOPs\}\_\{QFormer\}\\approx 24\\times L\_\{q\}\\times d\_\{q\}^\{2\}≈24×12×7682≈1\.7×1010\.\\displaystyle\\approx 24\\times 12\\times 768^\{2\}\\approx 1\.7\\times 10^\{10\}\.\(3\)
For the multi\-modal fusion step, this cost is incurred once for the query and question tokens\. For section text encoding, each candidate section is encoded with the same Q\-Former in text\-only mode\. Given an approximate rerank pool of∼28,000\\sim 28\{,\}000tokens, the total FLOPs for text encoding becomes

28,000×1\.7×1010≈4\.8×1014\.28\{,\}000\\times 1\.7\\times 10^\{10\}\\;\\approx\\;4\.8\\times 10^\{14\}\.\(4\)
Overall, while the vision encoder and fusion contribute on the order of101010^\{10\}–101110^\{11\}FLOPs, the text encoding FLOPs dominate the re\-ranking cost at∼4\.8×1014\\sim 4\.8\\times 10^\{14\}\. These estimates use consistent transformer FLOPs proxies and model configuration data from the EchoSight implementation, supporting a concrete and fair comparison of computational demand across methods\.

#### Our proposed IBA FLOPs\.

For our pipeline, the computational cost comprises two stages: \(i\) multi\-modal identification using Qwen2\.5\-VL\-7B Instruct and \(ii\) text\-only re\-ranking using the BAAI/bge\-reranker\-v2\-m3 model\.

Identification \(Qwen2\.5\-VL\-7B Instruct\)\.Qwen2\.5\-VL\-7B Instruct adopts a re\-engineered vision–language architecture rather than a standard ViT encoder followed by a text\-only transformerBaiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib34)\)\. In particular, the vision encoder is redesigned with windowed attention and multi\-stage token merging, which significantly reduces the effective sequence length compared to a naive patch\-based Vision Transformer\. As a result, the quadraticS2S^\{2\}self\-attention cost characteristic of vanilla ViT is largely mitigated in practice\.

For FLOPs estimation, we therefore treat the identification stage as a unified transformer\-style forward pass and apply the standard transformer FLOPs proxy to the shared multi\-modal backbone\. According to the official configuration, Qwen2\.5\-VL\-7B uses hidden sizedQwen=3584d\_\{\\text\{Qwen\}\}=3584andLQwen=28L\_\{\\text\{Qwen\}\}=28transformer layersBaiet al\.\([2025](https://arxiv.org/html/2606.23881#bib.bib34)\)\. The per\-token FLOPs is estimated as

FLOPsQwen2\.5\-VL≈24×LQwen×dQwen2\\displaystyle\\text\{FLOPs\}\_\{\\text\{Qwen2\.5\-VL\}\}\\approx 24\\times L\_\{\\text\{Qwen\}\}\\times d\_\{\\text\{Qwen\}\}^\{2\}≈24×28×35842≈8\.62×109\.\\displaystyle\\approx 24\\times 28\\times 3584^\{2\}\\approx 8\.62\\times 10^\{9\}\.\(5\)
During identification, the model processes one image together with the question and candidate entity names\. Due to the internal vision token compression in Qwen2\.5\-VL, the effective number of tokens participating in the transformer layers is substantially smaller than the raw patch count\. We conservatively approximate the overall forward pass as involving∼400\\sim 400effective tokens, yielding

FLOPsidentification≈400×8\.62×109≈3\.45×1012\.\\text\{FLOPs\}\_\{\\text\{identification\}\}\\approx 400\\times 8\.62\\times 10^\{9\}\\approx 3\.45\\times 10^\{12\}\.\(6\)
This estimate intentionally over\-approximates the identification cost and provides a conservative upper bound for comparison\.

Section re\-ranking \(bge\-reranker\-v2\-m3\)\.The off\-the\-shelf textual re\-ranker\(bge\-reranker\-v2\-m3\)Chenet al\.\([2024](https://arxiv.org/html/2606.23881#bib.bib52)\)uses a RoBERTa\-basedLiuet al\.\([2019](https://arxiv.org/html/2606.23881#bib.bib62)\)cross\-encoder with hidden sizedBGE=1024d\_\{\\text\{BGE\}\}=1024andLBGE=24L\_\{\\text\{BGE\}\}=24layers\. Again applying the transformer FLOPs proxy yields the following per token cost,

FLOPsBGE≈24×LBGE×dBGE2\\displaystyle\\text\{FLOPs\}\_\{\\text\{BGE\}\}\\approx 24\\times L\_\{\\text\{BGE\}\}\\times d\_\{\\text\{BGE\}\}^\{2\}≈24×24×10242≈6\.04×108\.\\displaystyle\\approx 24\\times 24\\times 1024^\{2\}\\approx 6\.04\\times 10^\{8\}\.\(7\)Given a rerank pool of∼4,800\\sim 4\{,\}800text tokens per example, the total re\-ranking FLOPs becomes

FLOPsBGE rerank≈4800×6\.04×108≈2\.90×1012\.\\text\{FLOPs\}\_\{\\text\{BGE rerank\}\}\\approx 4800\\times 6\.04\\times 10^\{8\}\\approx 2\.90\\times 10^\{12\}\.\(8\)
Combining both stages gives

FLOPsours≈3\.45×1012\+2\.90×1012≈6\.35×1012,\\text\{FLOPs\}\_\{\\text\{ours\}\}\\approx 3\.45\\times 10^\{12\}\+2\.90\\times 10^\{12\}\\approx 6\.35\\times 10^\{12\},indicating that our pipeline’s compute proxy is dominated by identification and re\-ranking FLOPs, and is orders of magnitude lower than the comparable EchoSight re\-ranking cost\.

#### Key efficiency advantage\.

A key source of efficiency in our pipeline stems from architectural decoupling, which reduces compute by*nearly two orders of magnitude*compared to EchoSight multi\-modal re\-ranker\.

Under the same FLOPs proxy, EchoSight’s re\-ranking cost is dominated by the cost of encoding the entire candidate section pool with a high\-capacity multi\-modal encoder\. Specifically, using an EVA\-CLIP Vision Transformer backbone \(4×10104\\times 10^\{10\}FLOPs per image\) and a BLIP\-2 Q\-Former \(1\.7×10101\.7\\times 10^\{10\}FLOPs per token\), the re\-ranking stage on a pool of∼28,000\\sim 28\{,\}000text tokens incurs∼4\.8×1014\\sim 4\.8\\times 10^\{14\}FLOPs\. By contrast, our method incurs only∼3\.45×1012\\sim 3\.45\\times 10^\{12\}FLOPs for the multi\-modal identification stage with Qwen2\.5\-VL\-7B and∼2\.90×1012\\sim 2\.90\\times 10^\{12\}FLOPs for the BGE reranker text stage, totaling6\.35×10126\.35\\times 10^\{12\}FLOPs\.

This means that even without considering the one\-off vision encoding cost, our rerank\-centric compute is*nearly two orders of magnitude lower*than the text encoding cost alone in EchoSight’s re\-ranker\. The bulk of EchoSight’s compute arises from repeatedly processing all candidate sections with a multi\-modal model, whereas our approach confines expensive multi\-modal processing to a single identification pass and relies on a lightweight text re\-ranker for large\-scale comparison\. This structural decoupling yieldssignificantly lower computational demandwhile preserving re\-ranking effectiveness, demonstrating that effective knowledge\-based VQA does not require heavy multi\-modal encoding over the entire candidate space\.
Ground Then Rank: Revisiting Knowledge-Based VQA with Training-Free Entity Identification

Similar Articles

Brain-IT-VQA: From Brain Signals to Answers

Self-Evolving Visual Questioner

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

IPQA: A Benchmark for Core Intent Identification in Personalized Question Answering

SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory

Submit Feedback

Similar Articles

Brain-IT-VQA: From Brain Signals to Answers
Self-Evolving Visual Questioner
CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
IPQA: A Benchmark for Core Intent Identification in Personalized Question Answering
SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory