Identifying and Resolving Pitfalls of Knowledge-Based VQA Benchmarks: Auditing, Repairing, and Augmenting
Summary
This paper audits knowledge-based VQA benchmarks, revealing systematic violations of assumptions that make accuracy a misleading metric. It introduces a repair protocol and multi-entity augmentation to restore answer derivability and question clarity, showing that corrected settings yield markedly different model rankings.
# Identifying and Resolving Pitfalls of Knowledge-Based VQA Benchmarks: Auditing, Repairing, and Augmenting
Source: [https://arxiv.org/html/2607.00159](https://arxiv.org/html/2607.00159)
1 Rensselaer Polytechnic Institute, Troy NY 12047, USA
2 AT&T Chief Data Office, Bedminster NJ 07921, USA
Email: \{maq5,rayees,stewart,may13\}@rpi\.edu, qw6547@att\.com
###### Abstract
Knowledge\-Based Visual Question Answering \(KB\-VQA\) aims to evaluate whether Visual Language Models \(VLMs\) can retrieve, ground, and reason over external structured knowledge beyond visual evidence\. In practice, answer accuracy is widely adopted as the primary evaluation metric, implicitly treating correctness as a proxy for knowledge\-grounded reasoning\. However, for existing KB\-VQA benchmarks, this proxy relies on critical assumptions that are often overlooked and rendered unreliable by benchmark issues: the annotated answer must be derivable from the associated knowledge base, the question must be well\-posed with sufficient constraints, and the visual setting must meaningfully require grounded disambiguation\. In this work, we show that these assumptions are systematically violated in existing KB\-VQA benchmarks\. Our audit reveals substantial instances with missing or contradicted answers and underspecified questions that render accuracy a misleading metric\. Furthermore, we find that existing datasets rely on visually trivial, single\-entity scenes that bypass the need for sophisticated visual\-to\-knowledge mapping\. We demonstrate that even with controlled architectures, these flaws lead to distorted model rankings and overestimations of reasoning capabilities\. To address this, we introduce \(1\) a principled audit\-and\-repair protocol that restores answer derivability and question clarity, and \(2\) a controlled multi\-entity augmentation protocol that introduces visual ambiguity to challenge initial retrieval and grounded reasoning\. Re\-evaluation under corrected and augmented settings yields markedly different performance trends\. Our findings call for rethinking evaluation protocols and designing more interaction\-aware KB\-VQA benchmarks that prioritize verifiable reasoning over simple matching\.
The datasets and code are available at [https://github\.com/VAN\-QIAN/ECCV26\-ARA](https://github.com/VAN-QIAN/ECCV26-ARA)\.
Work was initiated when Qian was an intern at AT&T CDO\. Qiong Wu and Yao Ma are co\-corresponding authors\.
## 1 Introduction
Recent Vision\-Language Models \(VLMs\) \[achiam2023gpt,bai2025qwen2,wang2025internvl3,zhang2024vision\] have demonstrated strong performance on a variety of visual reasoning tasks, especially Visual Question Answering \(VQA\) settings that require aligning image content with language understanding \[antol2015vqa,goyal2017making,kim2025visual,kuang2025natural\]\. Despite this effectiveness, they still struggle with tasks whose answers lie beyond the visual content, such as Knowledge\-Based Visual Question Answering \(KB\-VQA\)\. KB\-VQA \[InfoSeek,EVQA\] is designed to evaluate whether a model can answer image\-grounded questions correctly by retrieving and reasoning over an external, controlled knowledge base, rather than relying only on parametric memory \[InfoSeek,deng2025comprehensive\]; accuracy is used to assess this knowledge\-grounded capability\.
Emerging efforts focus on improving answer accuracy by developing re\-ranker modules that select the most relevant and informative evidence segment for answer generation \[IBA,Echosight\], improving the model’s inherent capability to use all retrieved knowledge \[CoMeM,ReflectiVA,CCVQA,reag\], or empowering VLMs to invoke external tools to retrieve more relevant evidence \[Wiki\-PRF\]\. Growing effort and resources are being invested in pushing accuracy scores higher\.
However, in representative KB\-VQA benchmarks such as InfoSeek and E\-VQA \[EVQA,InfoSeek\], we observe recurring dataset issues \(Section [3](https://arxiv.org/html/2607.00159#S3)\) that can make answer accuracy an unreliable indicator of the intended knowledge\-grounded reasoning capability\.
1\. Answer\-evidence misalignment\. A non\-trivial portion of questions have annotated answers that are not supported by the provided knowledge resource \(*e\.g*\. missing or contradicted\)\. As shown in Figure [1](https://arxiv.org/html/2607.00159#S4.F1), when the question asks about the sea level of the memorial park, the desired answer is ‘10 foot’, yet this value is missing from the benchmark\-provided knowledge base\.
2\. Underspecified questions and answer scope\. Many questions do not specify the intended answer granularity, so multiple answers can be reasonable\. For example, for the question “When is the mating period of this bird?”, the evidence may mention a season \(Spring\), specific months \(March\-April\), or a life\-stage \(around 1\-year old\)\. When several evidence\-supported answers are plausible, accuracy becomes sensitive to the single particular annotated answer, and can therefore understate a model’s knowledge\-grounded capability\.
3\. Visually simplified scenes vs\. real\-world multi\-modal queries\. Existing KB\-VQA benchmarks \[InfoSeek,EVQA\] present visually simplified scenes dominated by a single salient entity, so questions can often be answered using coarse global cues and retrieval may succeed without explicit grounding to the intended target \[sun2024eva,radford2021learning\]\. In realistic scenarios, however, users often issue more complex multi\-modal queries that require text\-image interaction, *i\.e*\. the query text specifies which entity is relevant \(*e\.g*\., via spatial relations, relational descriptions, *etc*\.\), and the system should localize candidate entities and resolve the intended entity before KB retrieval and reasoning\. For example, a query may refer to “the fish on the right” where multiple plausible entities coexist and global image similarity alone is insufficient\.
These issues matter because they decouple benchmark scores from the capability KB\-VQA aims to measure\. Under such conditions, high answer accuracy can be driven by annotation artifacts, dataset biases, or shortcut strategies, rather than faithful knowledge\-grounded reasoning\.
To address these limitations, we propose a unified fixing protocol and augmentation framework\. The fixing protocol enforces answer derivability and question clarity via evidence verification, answer\-source consistency checks, and targeted question repair \(Section[4](https://arxiv.org/html/2607.00159#S4)\)\. Building on the repaired benchmark, we further introduce a controlled augmentation pipeline that injects visual distractors while preserving core QA semantics, creating more challenging multi\-entity settings for grounded retrieval and reasoning \(Section[5](https://arxiv.org/html/2607.00159#S5)\)\.
Re\-evaluation with representative KB\-VQA baselines\[IBA,Echosight,Wiki\-PRF,ReflectiVA,CoMeM\]reveals that correcting dataset validity alone can materially change measured performance and method ranking, while augmentation causes substantial drops in retrieval recall and end\-to\-end QA accuracy \(Section[6](https://arxiv.org/html/2607.00159#S6)\)\. These findings suggest that current KB\-VQA evaluation can be overly optimistic, and that robust progress requires both annotation\-level validity and explicit grounding difficulty\.
## 2 Related Works
### 2\.1 Existing KB\-VQA Datasets
The transition from general VQA \[antol2015vqa,goyal2017making\] to KB\-VQA \[EVQA,InfoSeek\] marks a shift from internal visual perception to the integration of open\-world knowledge\. While early benchmarks like OK\-VQA \[marino2019ok\] and FVQA \[FVQA\] focused on general common\-sense reasoning, modern datasets have scaled toward massive, external knowledge sources to simulate more complex grounding\. Concurrent works such as InfoSeek \[InfoSeek\] and E\-VQA \[EVQA\] are two primary examples of this large\-scale shift\. InfoSeek introduces over 1\.3 million samples to simulate real\-world intent; however, it suffers from a paradigm flaw where the knowledge base often lacks the specific facts used during human annotation, leading to misalignment and evaluation distortion\. E\-VQA attempts to address this by providing a controlled corpus of 2 million Wikipedia pages with section\-level evidence\. Despite these improvements, both datasets rely heavily on template\-based question generation, which often lacks the linguistic variety found in natural queries\. Beyond these two datasets, the follow\-up work SK\-VQA \[su2025skvqa\] argues that context comprehension is important for the KB\-VQA task and therefore constructs a synthetic dataset of ‘pseudo’ wiki articles to explicitly train models to seek information within them\. Beyond task formulation, prior KB\-VQA benchmarks \[InfoSeek,EVQA\] also differ in how their knowledge sources, annotations, and question templates are constructed\. These design choices can introduce recurring issues such as mismatches between annotated answers and retrievable evidence, underspecified templates that admit multiple plausible answers, and visually simplified scenes that weaken the need for grounded disambiguation\. Our work builds on these benchmarks and provides a structured characterization of such issues and their implications for evaluation\.
### 2\.2 MM\-RAG Solutions for KB\-VQA
Since the answers of KB\-VQA questions are beyond the visual part and external knowledge is required, most emerging solutions for KB\-VQA tasks can be categorized into Multi\-modal Retrieval\-Augmented Generation \(MM\-RAG\) frameworks to effectively acquire the external knowledge base according to the multi\-modal queries\. These solutions generally focus on how to selectively utilize retrieved content to respond to queries and can be categorized into two streams\.
1\. Re\-ranking based methods mainly focus on picking the most relevant and informative part of the initially retrieved information, *e\.g*\. the section of the retrieved Wikipedia articles that corresponds to the target question\. IBA \[IBA\] addresses this by explicitly decoupling the workflow: it grounds the target entity through identification before reasoning toward an answer\. It leverages the zero\-shot recognition capabilities of MLLMs to identify candidate entities before employing a lightweight text re\-ranker for precise evidence selection\. This modular workflow redesign contrasts with EchoSight \[Echosight\], which trains a customized multi\-modal re\-ranker based on Q\-Former \[li2023blip\]\.
2\. Aggregation/Filtering\-based methods focus on digesting all retrieved information to implicitly provide the evidence for downstream answer generation, instead of explicitly selecting the evidence\. ReflectiVA \[ReflectiVA\] introduces specialized self\-reflective tokens that allow the re\-trained MLLM to autonomously determine the necessity of each retrieved section\. All sections judged relevant are aggregated to support answer generation\. CoMeM \[CoMeM\] condenses all retrieved articles into embeddings and concatenates them with the query embeddings, serving as model memory to support answer generation\. In Wiki\-PRF \[Wiki\-PRF\], reinforcement learning is introduced to empower MLLMs to conduct multiple retrievals with different tools and to filter all retrieved results to support the final answer generation\.
## 3 Issues in Existing KB\-VQA Benchmarks
Answer accuracy is widely reported on KB\-VQA benchmarks, but it can be unreliable when benchmark instances violate basic validity requirements\. Across E\-VQA \[EVQA\] and InfoSeek \[InfoSeek\], we observe three recurring dataset issues that may directly distort evaluation: \(i\) answer\-evidence misalignment that creates unavoidable false negatives, \(ii\) underspecified questions where multiple answers are defensible and scores depend on the single desired annotated answer, and \(iii\) visually simplified scenes where shortcut retrieval can succeed without grounded disambiguation\. Together, these issues can decouple benchmark scores from the knowledge\-grounded retrieval and reasoning capability KB\-VQA is intended to measure\. We use three assumptions, *\(A\) Answer Derivability*, *\(B\) Question Well\-Posedness*, and *\(C\) Grounded Disambiguation Requirement*, as an organizing lens, and document representative violations under this framework\.
### 3\.1 Answer\-Evidence Misalignment
Assumption \(A\) requires that each annotated answer be derivable from the associated KB evidence\. We observe two recurring violations: *\(a\) Unsupported answers*, where the answer is absent from the KB \(*e\.g*\. InfoSeek QID 23994 annotates “10 foot” for the sea level of Wright Brothers National Memorial, while the corresponding evidence article does not contain any relevant information\), and *\(b\) Contradictory answers*, where the annotation conflicts with the evidence \(*e\.g*\. InfoSeek QID 5411 annotates “1,302” kilograms as the mass of a McLaren car, but the evidence states “It weighs 1,301 kg”, so this single evidence value will not match the desired range\)\.
When derivability fails, evaluation produces unavoidable false negatives under the benchmark\-provided KB: even perfect retrieval and correct reasoning can be scored as wrong because the target annotation is not grounded in the retrievable evidence \(Figure [1](https://arxiv.org/html/2607.00159#S4.F1)\)\. Using an initial verification pass with Qwen3\-30B\-A3B \[yang2025qwen3\] to scan each target entity page and judge support and derivability, we find unsupported annotations in about 1% of E\-VQA and 22% of InfoSeek\. This implies that a non\-trivial fraction of InfoSeek is unanswerable under its own provided KB, capping accuracy by construction\. This failure is largely structural in InfoSeek, where QA pairs come from WikiData \[vrandevcic2014wikidata\] knowledge graph triplets but evaluation uses Wikipedia text, creating a systematic cross\-source mismatch\.
### 3\.2 Underspecified Questions
Assumption \(B\) requires each question to be sufficiently constrained so that the annotated answer is unique under the given KB\. We identify three frequent ambiguity patterns:
*\(1\) Missing attribute constraint*, *e\.g*\. the question “How big can this plant become?” does not specify height or diameter\.
*\(2\) Missing temporal scope*, *e\.g*\. “When is the mating period?” does not specify phase or time granularity, so the answer could be a season \(spring\) or month \(September\) within the year, or a point in the species’ life cycle \(such as ‘1\-year old’\)\.
*\(3\) Missing spatial reference or granularity*, *e\.g*\. for the question “In which country or region does this animal live?”, the desired habitat answer could be region\-level \(North America\) or country\-level \(United States\)\.
Using these tags, our audit shows that 59% of E\-VQA and 47% of InfoSeek questions are ambiguous: E\-VQA has 21\.5% attribute, 27\.5% spatial, and 10% temporal omissions; InfoSeek shows a similar pattern \(17\.3%, 30\.3%, 2%\)\. Under such under\-specification, models may produce semantically valid alternatives but still be penalized as wrong\. Therefore, accuracy becomes sensitive to annotation preference rather than reasoning correctness\. These ambiguities mainly stem from template\-based generation\[EVQA\]and automatic KG\-to\-question mapping\[InfoSeek\], where key qualifiers are dropped\.
### 3\.3 Single\-Entity Shortcut and Missing Grounded Disambiguation
Assumption \(C\) requires that success depends on grounded disambiguation to respond to complex multi\-modal user queries under realistic scenarios\. Ideally, KB\-VQA involves four stages: entity localization, text\-image grounding, entity disambiguation, and KB retrieval/reasoning\. In realistic queries \(*e\.g*\.“the fish on the left”\), the model should first identify which entity is being referenced\. However, existing KB\-VQA benchmarks usually contain one dominant entity per image and evaluate against a known target identifier\. Because the target is effectively fixed per instance, global retrieval can succeed without verifying which entity the question refers to\. With this design, global image embeddings can often retrieve the target directly, collapsing localization, grounding and disambiguation into a shortcut\. As a result, high answer accuracy does not necessarily imply robust grounding ability\. The metric can overestimate capability because shortcut retrieval is sufficient in many samples\. This issue is driven by entity\-centric curation, target\-anchored evaluation protocols, and reliance on global embedding retrieval\[EVQA,InfoSeek\]\. To restore this missing requirement, Section[5](https://arxiv.org/html/2607.00159#S5)introduces controlled multi\-entity augmentation so that localization, grounding, and disambiguation become necessary for success\.
## 4 Proposed Fixing Protocol
To address recurring benchmark issues such as answer\-evidence mismatch and underspecified questions \(Section [3](https://arxiv.org/html/2607.00159#S3)\), we propose a dataset\-level audit\-and\-repair protocol applicable to KB\-VQA benchmarks\. It explicitly enforces the two assumptions identified in Section [3](https://arxiv.org/html/2607.00159#S3): *Assumption A \(Answer Derivability\):* the annotated answer must be supported by and derivable from the external knowledge base; *Assumption B \(Question Well\-Posedness\):* the question must provide sufficient constraints to uniquely determine the annotated answer\. Given a KB\-VQA instance \(image, QA pair, target entity, and KB evidence snapshot\), the pipeline outputs either a repaired instance or a filtered\-out instance\. It consists of four cascaded stages: evidence verification, answer\-derivability auditing, question\-constraint repair, and final consistency validation\.
### 4\.1 General Four\-Stage Protocol
Step 1: Evidence Verification\. We first verify whether the evidence referenced by each QA pair exists in the KB snapshot and whether it contains support for the annotated answer under the target entity context\. Instances are categorized as Supported & Matched, Supported but Mismatched, and Unsupported\. Since InfoSeek constructs QA pairs from Wikidata while using Wikipedia articles as the evaluation KB, we scan the target\-entity page section by section and apply two independent verifiers \(Qwen3\-30B\-A3B \[yang2025qwen3\] and DeepSeek\-v3\.2 \[liu2025deepseek\]\) to mitigate erroneous filtering due to cross\-source mismatch\. Only instances that are consistently classified as Unsupported by both verifiers are removed \(Figure [1](https://arxiv.org/html/2607.00159#S4.F1)\)\. The verification is constrained to checking whether the target\-entity evidence supports, contradicts, or lacks the annotated answer, rather than making an open\-ended judgment\. The evidence context is localized to sections, with mean/95th\-percentile lengths of 868/2360 tokens for E\-VQA and 877/2545 tokens for InfoSeek\. For E\-VQA, where evidence segments are provided during dataset construction, we directly examine the referenced segment and revisit cases with answer–evidence mismatch\.
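The removal rule in Step 1 can be illustrated with a minimal Python sketch\. It assumes each verifier is a callable that returns one of the three labels; the `Support` enum, the dictionary fields, and the two toy string\-matching verifiers are hypothetical stand\-ins for the LLM judges \(Qwen3\-30B\-A3B and DeepSeek\-v3\.2\), not the authors’ implementation\.

```python
from enum import Enum


class Support(Enum):
    SUPPORTED_MATCHED = "supported_matched"
    SUPPORTED_MISMATCHED = "supported_mismatched"
    UNSUPPORTED = "unsupported"


def should_remove(verdict_a, verdict_b):
    """Remove an instance only when BOTH verifiers label it unsupported."""
    return verdict_a is Support.UNSUPPORTED and verdict_b is Support.UNSUPPORTED


def filter_instances(instances, verifier_a, verifier_b):
    """Split instances into (kept, removed) using the two-verifier consensus rule."""
    kept, removed = [], []
    for inst in instances:
        if should_remove(verifier_a(inst), verifier_b(inst)):
            removed.append(inst)
        else:
            kept.append(inst)
    return kept, removed


# Toy verifiers (assumptions for illustration only; real verifiers are LLM judges).
def exact_verifier(inst):
    hit = inst["answer"].lower() in inst["evidence"].lower()
    return Support.SUPPORTED_MATCHED if hit else Support.UNSUPPORTED


def token_verifier(inst):
    evidence = inst["evidence"].lower()
    hit = all(tok in evidence for tok in inst["answer"].lower().split())
    return Support.SUPPORTED_MATCHED if hit else Support.UNSUPPORTED


instances = [
    {"id": 1, "answer": "1,301 kg", "evidence": "It weighs 1,301 kg."},
    {"id": 2, "answer": "10 foot", "evidence": "The memorial opened in 1932."},
    {"id": 3, "answer": "spring season", "evidence": "Mating occurs in the season of spring."},
]
kept, removed = filter_instances(instances, exact_verifier, token_verifier)
kept_ids = [inst["id"] for inst in kept]
removed_ids = [inst["id"] for inst in removed]
```

In this toy run the two verifiers disagree on instance 3, so it is kept; only instance 2, which both label unsupported, is removed\. This mirrors the stated goal of limiting erroneous filtering\.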
Figure 1: Qualitative example from InfoSeek \(QID:149\)\. InfoSeek \[InfoSeek\] selects answers from Wikidata \[vrandevcic2014wikidata\] triples and converts them into QA pairs, while the evaluation KB consists of Wikipedia articles\. This cross\-source construction can yield cases where the provided KB does not support the annotated answer, confounding score interpretation even under perfect retrieval\.
Step 2: Answer\-Derivability Auditing and Calibration\. Given verified evidence and the original question, we test whether the annotated answer is logically derivable\. For any instances tagged as Supported but Mismatched, if the answer contradicts evidence, we apply evidence\-grounded answer correction\. DeepSeek\-v3\.2 is not used as an answer source\. It only rewrites a mismatched annotation to the value explicitly supported by verified evidence\. If no evidence\-supported value can be derived, the instance is removed\. For InfoSeek, this step frequently corrects answer\-side mismatch introduced by cross\-source construction\. For E\-VQA \[EVQA\], answer\-evidence contradictions are much less frequent, consistent with its construction \[changpinyo2022all\] where the answer is selected from the provided evidence article\.
Step 3: Question\-Constraint Repair\. We identify whether the question is sufficiently constrained for unique answering\. We repair ambiguous questions with minimal edits using three tags: missing attribute constraints, missing temporal scope, and missing spatial reference\. Edits preserve the original intent and evidence dependency\. For InfoSeek, the qualifiers dropped during KG\-to\-question mapping are restored by adding the missing facets\. For E\-VQA, the original questions, which were template\-generated to fit a super\-category, are revised mainly when templates omit qualifiers needed to disambiguate the fine\-grained entity\.
Step 4: Leakage and Global Consistency Validation\.After repair, we run a final pass to ensure that: \(i\) the question does not leak the answer, \(ii\) the answer remains evidence\-supported, and \(iii\) the revised question yields a unique answer\. Any leakage\-inducing edits are rolled back and rewritten conservatively\.
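Check \(i\) of Step 4 can be sketched as a simple lexical test\. This is a minimal illustration under the assumption that leakage means the normalized answer string appears verbatim in the repaired question; the function names are hypothetical, and the fallback here only rolls back to the original question, whereas the protocol additionally rewrites it conservatively\.

```python
import re


def normalize(text):
    """Lowercase and collapse every non-alphanumeric run into a single space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def leaks_answer(question, answer):
    """True if the normalized answer occurs as a whole phrase in the question."""
    norm_answer = normalize(answer)
    if not norm_answer:
        return False
    return f" {norm_answer} " in f" {normalize(question)} "


def validate_repair(original_question, repaired_question, answer):
    """Keep the repaired question unless it leaks the answer; otherwise roll back."""
    if leaks_answer(repaired_question, answer):
        return original_question
    return repaired_question


original = "When is the mating period of this bird?"
safe_repair = "In which season does this bird mate?"
leaky_repair = "In which season, such as spring, does this bird mate?"

safe_result = validate_repair(original, safe_repair, "Spring")
leaky_result = validate_repair(original, leaky_repair, "Spring")
```

A purely lexical check like this misses paraphrased leaks, so it should be read as a lower bound on what a full validation pass would catch\.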
### 4\.2 Fixing Outcomes and Human Evaluation
Applying the same four\-stage protocol to both datasets yields different failure profiles but a unified corrected setting\. For InfoSeek, we retain 58,285 out of 71,335 instances \(81\.7%\) after filtering and repair\. We construct an entity\-unique subset of 1,924 questions \(consisting of 1,604 String, 223 Numerical, and 97 Time questions\) for controlled evaluation, *i\.e*\. among instances that target the same entity with the same query text but different query images, we keep only one\. For E\-VQA, answer\-side conflicts are rarer and most edits happen in question constraints\. Hence the fixed evaluation split remains 4,750 questions\. For human evaluation, we expand the review to 10% of the fixed evaluation splits, covering 475 E\-VQA and 200 InfoSeek instances\. Following existing work \[su2025skvqa\], two PhD\-student\-level annotators answer each fixed query using only the oracle evidence of the target entity\. Human accuracy remains high: 92\.9±1\.1 for fixed E\-VQA and 91\.5±2\.1 for fixed InfoSeek\. More qualitative examples of fixing outcomes and human\-evaluation cases are provided in Appendix [0\.A](https://arxiv.org/html/2607.00159#Pt0.A1) and Appendix [0\.F](https://arxiv.org/html/2607.00159#Pt0.A6)\. These repaired datasets are used as the *fixed* versions and original datasets with the same index/identifier are used as the *unfixed* versions in Section [6](https://arxiv.org/html/2607.00159#S6)\.
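The entity\-unique filtering and the reported counts can be sanity\-checked with a short sketch\. The field names `entity_id`, `question`, and `image` are hypothetical, and keeping the first occurrence is an assumption; the text only states that one instance is kept per entity and query text\.

```python
def entity_unique(instances):
    """Keep one instance per (target entity, query text), ignoring the query image."""
    seen = set()
    unique = []
    for inst in instances:
        key = (inst["entity_id"], inst["question"].strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append(inst)
    return unique


toy = [
    {"entity_id": "Q1", "question": "What is the mass of this car?", "image": "a.jpg"},
    {"entity_id": "Q1", "question": "What is the mass of this car?", "image": "b.jpg"},
    {"entity_id": "Q2", "question": "What is the mass of this car?", "image": "c.jpg"},
]
unique_images = [inst["image"] for inst in entity_unique(toy)]

# Arithmetic checks on the counts quoted in the text.
retention_pct = round(100 * 58285 / 71335, 1)  # retained InfoSeek instances
subset_total = 1604 + 223 + 97                 # String + Numerical + Time
```

The two arithmetic lines reproduce the quoted 81\.7% retention rate and the 1,924\-question subset size from the per\-type counts\.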
## 5 Proposed Augmentation Protocol
As discussed in Section[3](https://arxiv.org/html/2607.00159#S3), a gap remains between current KB\-VQA benchmark settings and real\-world multi\-modal queries, where users often must specify the target entity through text alone under multi\-entity ambiguity \(*e\.g*\., “the fish on the right”\)\. In existing benchmarks, however, images are frequently dominated by a single salient entity, so retrieval can succeed via coarse global image similarity overlooking*\(C\) Grounded Disambiguation Requirement*\.
### 5\.1 General augmentation protocols
To close the evaluation gap in Section[3](https://arxiv.org/html/2607.00159#S3), we augment each instance by adding a single distractor entity alongside the original target \(anchor\)\. We preserve the annotated answer by construction and keep the external knowledge source unchanged, so performance changes reflect increased visual ambiguity rather than knowledge or annotation shifts\. Each augmented image contains exactly one anchor and one distractor, where the distractor is drawn from either the same semantic category \(intra\-category\) or a different one \(inter\-category\), forcing retrieval to rely on grounded disambiguation rather than coarse global similarity\.
Intra\-Category Distraction\. We spatially concatenate the anchor image with a single distractor from the same semantic category, placed in a different region \(*e\.g*\., left vs\. right\)\. We minimally revise the question to reference the anchor region or a distinguishing attribute \(*e\.g*\., “the fish on the left”\), while keeping the annotated answer unchanged\. This setting isolates fine\-grained intra\-category ambiguity and tests whether models ground textual constraints in localized visual evidence rather than relying on coarse global embeddings\.
Inter\-Category Distraction\. We pair the anchor entity with a single distractor from a different semantic category \(*e\.g*\., a landmark with an animal, or a plant with a man\-made object\)\. Although visually distinct, the distractor can disrupt global image representations, requiring models to integrate the question with localized evidence to infer the intended referent\. The question and answer remain unchanged, allowing us to test whether retrieval is overly sensitive to irrelevant visual content instead of grounding the target entity under textual constraints\.
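The two augmentation modes can be sketched as follows\. This is an illustration under several assumptions: images are represented as plain lists of pixel rows rather than through an imaging library, inter\-category pairing is assumed to use the same side\-by\-side layout as intra\-category pairing, and the `referent` field and the “this X” rewrite rule are hypothetical simplifications of the question revision\.

```python
def hconcat(left, right):
    """Place two images (lists of pixel rows) side by side; heights must match."""
    if len(left) != len(right):
        raise ValueError("images must have the same height")
    return [l_row + r_row for l_row, r_row in zip(left, right)]


def augment(instance, distractor_image, mode, anchor_side="left"):
    """Return an augmented copy of the instance; the answer is never modified."""
    if mode not in ("intra", "inter"):
        raise ValueError("mode must be 'intra' or 'inter'")
    anchor = instance["image"]
    if anchor_side == "left":
        image = hconcat(anchor, distractor_image)
    else:
        image = hconcat(distractor_image, anchor)
    question = instance["question"]
    if mode == "intra":
        # Add a minimal spatial cue so the anchor stays uniquely identifiable.
        ref = instance["referent"]
        question = question.replace(f"this {ref}", f"the {ref} on the {anchor_side}")
    return {**instance, "image": image, "question": question, "mode": mode}


anchor_instance = {
    "image": [[1, 1], [1, 1]],
    "question": "Where does this fish live?",
    "referent": "fish",
    "answer": "North America",
}
distractor = [[0, 0], [0, 0]]

intra = augment(anchor_instance, distractor, "intra", anchor_side="right")
inter = augment(anchor_instance, distractor, "inter", anchor_side="left")
```

Because the answer field is copied untouched in both modes, any score change on the augmented instance is attributable to the added visual ambiguity rather than to a changed target\.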
### 5\.2 Augmentation Outcomes and Human Evaluation
To enable direct comparisons, we exclude questions that would remain ambiguous after intra\-category augmentation \(*e\.g*\., those referring to “the object”\) from *both* splits\. As shown in Figure [2](https://arxiv.org/html/2607.00159#S5.F2), intra\-category augmentation adds a minimal cue \(typically spatial\) to specify the anchor entity, whereas inter\-category augmentation keeps the question unchanged\. With answers preserved by construction, performance changes reflect increased grounded disambiguation demands rather than knowledge or annotation shifts\. We generate 1,604 \(InfoSeek\) and 3,871 \(E\-VQA\) augmented variants, with additional examples in Appendix [0\.B](https://arxiv.org/html/2607.00159#Pt0.A2)\. For quality assurance, we sample 100 intra\- and 100 inter\-category instances per dataset\. Two annotators answer each augmented query given the anchor evidence and optionally report evident flaws \[su2025skvqa\]\. Human accuracy is 87\.5±2\.1% / 96\.0±1\.4% \(E\-VQA intra/inter\) and 86\.0±1\.4% / 89\.5±3\.5% \(InfoSeek intra/inter\)\. More details are in Appendix [0\.G](https://arxiv.org/html/2607.00159#Pt0.A7)\.
Figure 2:Qualitative example of our augmentation protocols \(Section[5](https://arxiv.org/html/2607.00159#S5)\) on*Ełk Lake*\. All augmented variants keep the anchor answer in \(a\)\. \(b\) Intra\-augmentation adds a minimal spatial cue \(“on the right”\) to preserve the original intent while inserting an additional lake image \(*Lake Turgoyak*\) on the left\. \(c\) Inter\-augmentation inserts a visually distinct distractor \(*Sceloporus jarrovii*\) while keeping the question unchanged\.
## 6 Experiments
### 6\.1 Experimental Setup
Datasets\. We evaluate KB\-VQA baselines on the following aligned dataset variants: \(1\) the original *unfixed* benchmark, \(2\) a *fixed* version that enforces Assumption A/B via the pipeline in Section [4](https://arxiv.org/html/2607.00159#S4), and \(3\) the *Intra* and *Inter* augmented versions that enforce the grounded\-disambiguation requirement via the pipeline in Section [5](https://arxiv.org/html/2607.00159#S5)\. Overall, these controlled comparisons tie each analysis directly to the assumptions in Section [3](https://arxiv.org/html/2607.00159#S3) and validate whether fixing and augmentation restore the intended evaluation conditions\.
External Knowledge Base\. E\-VQA provides a controlled knowledge base of 2 million Wikipedia articles with associated images\. Following prior work \[Echosight\], we focus on the single\-hop setting\. For InfoSeek, following existing baselines \[Wiki\-PRF,CoRe\-MMRAG,ReflectiVA\], we adopt the 100K\-article knowledge base released by Yan and Xie \[Echosight\] and follow the same initial retrieval settings using EVA\-CLIP\-8B \[sun2024eva\]\.
Baselines\. We benchmark five representative KB\-VQA methods, grouped by how retrieved evidence is utilized\. Re\-ranking\-based methods \(IBA \[IBA\], EchoSight \[Echosight\]\) explicitly re\-rank retrieved evidence and generate the final answer conditioned on the top\-ranked section, using either a tailored workflow paradigm or a trained multimodal re\-ranker\. Aggregation/Filtering\-based methods \(ReflectiVA \[ReflectiVA\], CoMeM \[CoMeM\], Wiki\-PRF \[Wiki\-PRF\]\) instead aggregate evidence across multiple sections or retrieval rounds and filter irrelevant content before answer generation\. ReflectiVA assigns special section\-level relevance tokens and aggregates the selected evidence\. CoMeM encodes retrieved articles into memory representations that are consumed by an MLLM\. Wiki\-PRF performs iterative retrieval with auxiliary tools \(*e\.g*\. grounding and captioning\) and filters retrieved content across rounds to serve as evidence for answer generation\.
Evaluation Metrics\. We report both end\-to\-end QA accuracy and retrieval\-related metrics\. For answer accuracy, we evaluate open\-ended outputs using BEM \[zhang2019bertscore\] on E\-VQA\. For InfoSeek, following prior practice \[Echosight,ReflectiVA,Wiki\-PRF\], we use VQA accuracy \[goyal2017making,marino2019ok,Wiki\-PRF\]\. For retrieval, we report Recall@K, indicating whether the correct target\-entity article appears among the top\-K retrieved results\. For methods that include an explicit re\-ranking stage \[Echosight\] or multi\-round retrieval \[Wiki\-PRF\], we additionally report *post\-retrieval* performance, measuring whether the final evidence used for answer generation \(*e\.g*\. the top\-1 re\-ranked section or the aggregated evidence pool\) contains the correct target entity\.
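Recall@K as described above can be computed with a few lines of Python\. This is a minimal sketch that assumes one target\-entity article per query; the argument names `ranked_ids` and `target_ids` are hypothetical\.

```python
def recall_at_k(ranked_ids, target_ids, k):
    """Fraction of queries whose target article is among the top-k retrieved ids."""
    if len(ranked_ids) != len(target_ids):
        raise ValueError("one ranked list is required per target")
    if not target_ids:
        raise ValueError("at least one query is required")
    hits = sum(
        1 for ranked, target in zip(ranked_ids, target_ids) if target in ranked[:k]
    )
    return hits / len(target_ids)


# Three toy queries: the target is ranked 1st, ranked 3rd, and missing entirely.
ranked = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
targets = ["a", "f", "z"]

r_at_1 = recall_at_k(ranked, targets, k=1)
r_at_3 = recall_at_k(ranked, targets, k=3)
```

The same function covers the post\-retrieval variant if `ranked_ids` holds the entities of the final evidence used for generation instead of the initial retrieval list\.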
Implementation Details\. For all baselines \[Echosight,ReflectiVA,Wiki\-PRF,CoMeM\], we use officially released checkpoints and prompts with unified KB and retrieval settings\. For IBA \[IBA\] component selection, we use Qwen2\.5\-VL\-7B\-Instruct \[bai2025qwen2\] for identification, bge\-reranker\-v2\-m3 \[chen2024bge\] for re\-ranking, and Llama\-3\.1\-8B\-Instruct \[grattafiori2024llama\] for answer generation\. Hyper\-parameters follow the default configurations provided by the authors and are summarized in Appendix [0\.H](https://arxiv.org/html/2607.00159#Pt0.A8)\. For initial retrieval, we use a frozen EVA\-CLIP\-8B \[sun2024eva\] encoder to extract image features, following existing works \[Echosight,ReflectiVA,Wiki\-PRF\]\. All image features are indexed and retrieved with cosine similarity using the Faiss\-GPU library \[douze2025faiss\], consistent with prior practices \[Echosight,ReflectiVA,Wiki\-PRF\]\.
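The cosine\-similarity retrieval step can be illustrated without any external dependency\. The sketch below is a pure\-Python stand\-in for the Faiss index, assuming tiny hand\-written vectors in place of EVA\-CLIP\-8B features; it relies on the fact that the inner product of L2\-normalized vectors equals their cosine similarity\. The index contents and function names are hypothetical\.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so that inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k_cosine(query, index, k):
    """Return the ids of the k index entries most similar to the query."""
    q = l2_normalize(query)
    scored = []
    for doc_id, vec in index.items():
        v = l2_normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, v)), doc_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy 2-d "image features" keyed by article id (illustrative values only).
toy_index = {"lake": [1.0, 0.0], "lizard": [0.0, 1.0], "fish": [1.0, 1.0]}
top2 = top_k_cosine([2.0, 0.1], toy_index, k=2)
```

A global embedding of a two\-entity image would sit between the embeddings of its parts, which is one way to picture why the augmented settings in Section 5 stress this retrieval step\.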
### 6\.2 Results on fixed datasets
We first evaluate all methods on both the original and fixed versions of InfoSeek and E\-VQA\. Importantly, we ensure the evaluated samples from both original and fixed datasets preserve exactly the same sample IDs\. We only correct flawed question formulations and annotated answers\. Therefore, any performance difference reflects violations of*Answer Derivability*and*Question Well\-Posedness*\(Section[3](https://arxiv.org/html/2607.00159#S3)\) rather than model changes\.
Table 1: Performance comparison on InfoSeek and E\-VQA before and after dataset fixing\. For InfoSeek, the performance of most methods improves, especially on Time and Numerical\. Bold marks the best performance and underline marks the runner\-up\.
By repairing QA annotations to enforce answer–evidence alignment and resolving underspecified questions, fixing yields substantial score changes across methods on InfoSeek\. Because the fixed and unfixed evaluations are aligned on the same sample IDs, these shifts reflect the impact of removing annotation\- and template\-induced confounders rather than changes in evaluation coverage\. Some methods improve markedly \(*e\.g*\. IBA 34\.5→42\.4, EchoSight 30\.9→37\.2\), with larger gains on the stricter Time and Num subsets \(*e\.g*\. Wiki\-PRF 35\.1→43\.3, 45\.7→52\.9\)\. Because images, retrieval settings, checkpoints, and metrics are held constant, these shifts isolate the impact of benchmark validity \(Assumption A/B\) rather than changes in model design\. Crucially, we also observe ranking reversals after fixing, indicating that unfixed splits can distort comparative conclusions about which system components matter most\.
Key observations from fixing results\. \(i\) Relative gaps shrink after fixing\. On unfixed InfoSeek, aggregation/filtering pipelines appear much stronger than re\-ranking based methods like IBA \(*e\.g*\., Wiki\-PRF vs\. IBA: \+10\.4; ReflectiVA vs\. IBA: \+2\.8\)\. After fixing, the first gap shrinks to \+1\.2 and the second changes sign to −4\.3\.
\(ii\) Method rankings can reverse\. On unfixed InfoSeek, ReflectiVA outperforms IBA, but the ordering reverses after fixing \(IBA 42\.4 vs\. ReflectiVA 38\.1\)\. On the Time subset, explicit evidence selection becomes more favorable after fixing \(IBA 47\.4 and Wiki\-PRF 43\.3 vs\. ReflectiVA 34\.0\)\. These reversals matter in practice because they can change system\-design conclusions\. For example, after fixing, the community may conclude that selecting the most relevant and informative evidence section is more effective than relying on the model’s inherent capability to assess the relevance of each evidence section to the query\. If the community instead keeps the conclusion drawn from the unfixed data, substantial effort could be invested in expensive training without a commensurate gain\. Similarly, the advantage of Wiki\-PRF over IBA is large on unfixed InfoSeek \(44\.9 vs\. 34\.5, gap 10\.4\) but becomes much smaller after fixing \(43\.6 vs\. 42\.4, gap 1\.2\)\. In this case, the community may reconsider the trade\-off between developing new models and refining workflow orchestration\. Overall, violations of answer derivability and question well\-posedness can distort comparative conclusions and misallocate optimization effort\.
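The gap and reversal arithmetic above can be checked with a short sketch\. The scores are the overall InfoSeek numbers quoted in this section; ReflectiVA’s unfixed score is not quoted directly, so the value 37\.3 is an assumption derived from the stated \+2\.8 gap over IBA\. Methods whose overall scores are not quoted in the text are omitted, and the helper names are illustrative\.

```python
from itertools import combinations

# Overall InfoSeek accuracy as (unfixed, fixed), taken from the text above.
# ReflectiVA's unfixed score is an assumption: 34.5 (IBA) + 2.8 (stated gap).
scores = {
    "Wiki-PRF": (44.9, 43.6),
    "ReflectiVA": (37.3, 38.1),
    "IBA": (34.5, 42.4),
    "EchoSight": (30.9, 37.2),
}


def ranking(split):
    """Method names from best to worst on split 0 (unfixed) or split 1 (fixed)."""
    return sorted(scores, key=lambda m: scores[m][split], reverse=True)


def reversals():
    """Pairs of methods whose relative order differs between the two splits."""
    flipped = []
    for a, b in combinations(scores, 2):
        gap_unfixed = scores[a][0] - scores[b][0]
        gap_fixed = scores[a][1] - scores[b][1]
        if gap_unfixed * gap_fixed < 0:  # opposite signs mean the order flipped
            flipped.append((a, b, round(gap_unfixed, 1), round(gap_fixed, 1)))
    return flipped


print(ranking(0))
print(ranking(1))
print(reversals())
```

Under these numbers, the only pairwise reversal is between ReflectiVA and IBA, while the Wiki\-PRF advantage over IBA narrows without changing sign\.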
Due to space limits, we defer detailed fine\-grained and qualitative analyses to Appendix [0\.D](https://arxiv.org/html/2607.00159#Pt0.A4), which shows how fixing improves performance by correcting the annotated answer and by refining the question so that it points to the desired answer\. On the shared grounded InfoSeek subset where EchoSight’s top\-1 reranked section belongs to the ground\-truth entity \(664 samples\), question\-only fixes improve QA accuracy from 60\.1 to 75\.1, joint question\-answer fixes improve it from 34\.0 to 45\.2, and answer\-only fixes leave it nearly unchanged\. Detailed breakdowns are in Appendix [0\.D](https://arxiv.org/html/2607.00159#Pt0.A4)\. For re\-ranking methods, post\-retrieval results also provide supporting evidence that improving the question contributes to better re\-ranking\. On InfoSeek, post\-retrieval performance increases after fixing for EchoSight \(46\.8→48\.9\) and IBA \(46\.7→47\.8\)\. This suggests that part of the QA improvement is associated with better evidence selection after fixing, while question and answer repairs likely contribute jointly\. Table [6](https://arxiv.org/html/2607.00159#Pt0.A4.T6) shows that InfoSeek has high modification rates for both questions and answers, whereas E\-VQA mainly changes questions\. This aligns with the observed pattern: E\-VQA shifts are modest, while InfoSeek shows larger changes in Table [1](https://arxiv.org/html/2607.00159#S6.T1)\.
Table 2: Initial retrieval recall\. Since initial retrieval follows the same image\-to\-image protocol for all baselines, the fixed and unfixed versions share the same retrieved information\. For the Intra\- and Inter\-augmented versions, augmented images are used\.
Retrieval performance affects cross\-dataset rankings\. We further report retrieval recall on the fixed full datasets and the anchor/augmented variants to analyze retrieval difficulty and to contextualize augmentation effects\. Table [2](https://arxiv.org/html/2607.00159#S6.T2) shows that InfoSeek has much higher initial R@1 than E\-VQA \(43\.5% vs 13\.4% on the full fixed set\)\. Correspondingly, Table [3](https://arxiv.org/html/2607.00159#S6.T3) shows that Wiki\-PRF remains competitive on InfoSeek \(48\.6%\) but drops sharply on E\-VQA \(18\.2%\), while EchoSight and IBA are more stable\. This pattern indicates that ranking differences across datasets are driven largely by retrieval difficulty rather than by downstream reasoning, which further strengthens the need to challenge retrieval, as discussed in Section [5](https://arxiv.org/html/2607.00159#S5)\.
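For readers unfamiliar with the metric, recall@k as used for initial retrieval can be sketched as follows\. This is a minimal illustration under the usual definition \(a query counts as a hit if its target entity appears among the top\-k retrieved entities\); the function name and the toy ranked lists are hypothetical and are not drawn from either benchmark\.

```python
def recall_at_k(retrieved, targets, k):
    """Fraction of queries whose target entity is among the top-k retrieved entities."""
    if len(retrieved) != len(targets):
        raise ValueError("retrieved and targets must have the same length")
    if not targets:
        raise ValueError("need at least one query")
    hits = sum(1 for ranked, target in zip(retrieved, targets) if target in ranked[:k])
    return hits / len(targets)


# Hypothetical ranked entity lists for four queries (illustrative only).
retrieved = [
    ["e1", "e7", "e3"],
    ["e4", "e2", "e9"],
    ["e5", "e6", "e8"],
    ["e3", "e1", "e2"],
]
targets = ["e1", "e2", "e0", "e2"]

print(recall_at_k(retrieved, targets, k=1))
print(recall_at_k(retrieved, targets, k=3))
```

Recall@k can never decrease as k grows, which is why methods that aggregate more retrieved articles operate under a higher retrieval ceiling than methods that rely on the top\-1 result\.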
Table 3: Post\-retrieval performance using the retrieved top\-20 articles on all dataset variants, including the unfixed and fixed datasets, the subset of anchor samples, and the augmented versions\. Note that ReflectiVA \[ReflectiVA\] and CoMEM \[CoMeM\] aggregate the top\-5 and top\-10 articles, respectively, *i\.e*\., their results align with initial retrieval recall@5 and recall@10 in Table [2](https://arxiv.org/html/2607.00159#S6.T2)\.
### 6\.3Results on augmented datasets
To directly test the grounded\-disambiguation requirement \(Section[5](https://arxiv.org/html/2607.00159#S5)\), we evaluate on the augmented variants that introduce a single distractor while preserving the answer and KB\. We report three representative methods \(EchoSight, IBA, Wiki\-PRF\) to cover re\-ranking and aggregation paradigms\.
Assumption C \(Grounded Disambiguation\): augmentation removes the single\-entity shortcut\. Table [2](https://arxiv.org/html/2607.00159#S6.T2) shows that adding a single distractor sharply reduces initial retrieval recall\. On InfoSeek, R@1 drops from 43\.5 \(Full\) to 14\.7/20\.4 \(Intra/Inter\); on E\-VQA, it falls from 13\.4 to 3\.5/2\.9\. Since the KB and questions are unchanged, this collapse is attributable to the added entity ambiguity, confirming that the augmentation effectively enforces grounded disambiguation\. To separate layout effects from semantic distractors, we additionally evaluate two layout\-only controls: Blank replaces the distractor with an empty panel and Double duplicates the anchor image\. Initial retrieval is layout\-sensitive, but Blank/Double remain easier than Intra/Inter \(E\-VQA R@1: 6\.3/9\.9 vs\. 3\.5/2\.9 and InfoSeek R@1: 26\.2/36\.9 vs\. 14\.7/20\.4\)\. Full initial\- and post\-retrieval results are reported in Appendix [0\.E](https://arxiv.org/html/2607.00159#Pt0.A5), showing that semantic distractors introduce difficulty beyond layout shift\.
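To put the R@1 numbers above on a common scale, the following sketch computes the relative drop from the full set to each augmented variant and checks that the layout\-only controls stay above the semantic distractors\. The R@1 values are those quoted in the paragraph above; the dictionary layout and the helper name are assumptions made for illustration\.

```python
# Initial retrieval R@1 (%) as quoted in the paragraph above.
r_at_1 = {
    "InfoSeek": {"Full": 43.5, "Intra": 14.7, "Inter": 20.4, "Blank": 26.2, "Double": 36.9},
    "E-VQA": {"Full": 13.4, "Intra": 3.5, "Inter": 2.9, "Blank": 6.3, "Double": 9.9},
}


def relative_drop(full, variant):
    """Percentage of the full-set R@1 that is lost on the given variant."""
    return (full - variant) / full * 100.0


for dataset, values in r_at_1.items():
    for variant in ("Intra", "Inter"):
        drop = relative_drop(values["Full"], values[variant])
        print(f"{dataset} {variant}: {drop:.1f}% relative drop in R@1")
    # Layout-only controls should remain easier than both semantic distractors.
    easier = min(values["Blank"], values["Double"]) > max(values["Intra"], values["Inter"])
    print(f"{dataset} layout controls easier than distractors: {easier}")
```

Expressed this way, the E\-VQA drops are proportionally larger than the InfoSeek drops even though E\-VQA starts from a much lower R@1\.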
Post\-retrieval evidence selection does not recover the loss\. Table [3](https://arxiv.org/html/2607.00159#S6.T3) shows that post\-retrieval recall also degrades sharply on augmented images \(*e\.g*\., IBA on InfoSeek 45\.1→21\.7/21\.6, EchoSight on E\-VQA 46\.6→11\.8/11\.7\)\. This suggests that once initial grounding fails, later evidence aggregation cannot compensate, aligning with our diagnosis that grounded disambiguation is the critical bottleneck\. Surprisingly, Wiki\-PRF did not actively invoke its grounding tool\. We defer a more detailed analysis to Appendix [0\.E](https://arxiv.org/html/2607.00159#Pt0.A5)\.
QA accuracy drops sharply, validating the lack of grounded disambiguation\. Table [4](https://arxiv.org/html/2607.00159#S6.T4) reports QA accuracy on the anchor subset and its two augmented variants\. All methods suffer substantial performance degradation once ambiguity is introduced\. For example, on InfoSeek, IBA drops from 40\.1 \(Anchor\) to 21\.4/21\.6 \(Intra/Inter\), and EchoSight drops from 38\.7 to 15\.9/17\.8\. Even Wiki\-PRF declines from 43\.9 to 23\.6/25\.9\. Similar trends hold on E\-VQA\.
Table 4: QA performance on Intra\-category \(Intra\) and Inter\-category \(Inter\) augmented datasets and the corresponding anchor samples\. Following Table [1](https://arxiv.org/html/2607.00159#S6.T1), we report the performance on E\-VQA \(E\-V\) \[EVQA\] and InfoSeek \(IS\) \[InfoSeek\] with breakdowns on Time \(T\), Numerical \(N\) and String \(S\) questions\.
These results indicate that, even after repairing QA validity \(Section [4](https://arxiv.org/html/2607.00159#S4)\), existing KB\-VQA evaluations remain optimistic since their visual setups allow retrieval pipelines to exploit single\-entity shortcuts\. Our controlled augmentation thus exposes a remaining bottleneck: how to reliably ground the query to the correct entity and then select truly relevant evidence under clutter, which suggests that future KB\-VQA benchmarks should incorporate richer multi\-entity interactions and stronger grounding requirements\.
## 7Conclusion
We show that current KB\-VQA benchmarks can exhibit recurring dataset issues that complicate evaluation by answer accuracy alone\. Across InfoSeek and E\-VQA, we identify three prominent confounders: \(i\) answer–evidence misalignment where annotated answers are not derivable from the benchmark\-provided knowledge snapshot, \(ii\) underspecified questions that admit multiple plausible answers, and \(iii\) visually simplified settings that weaken the need for grounded disambiguation\. These issues can decouple scores from the knowledge\-grounded retrieval and reasoning capability KB\-VQA is intended to measure\.
To address them, we propose two protocols\. First, a fixing protocol that enforces answer derivability and question well\-posedness via evidence\-aware auditing and targeted repair\. Second, a controlled augmentation strategy that introduces visual distractors while preserving the annotated answer, thereby increasing the need for grounded entity disambiguation under multi\-entity conditions\. Together, these protocols transform existing benchmarks into a more diagnostic testbed for retrieval\-and\-reasoning\.
Experiments on the repaired and augmented splits reveal trends that differ from standard evaluation\. After fixing, most methods improve substantially, and we observe that relative rankings can change, indicating that unfixed benchmarks may distort comparative conclusions and the perceived impact of system components\. Under augmentation, performance drops consistently for both retrieval recall and end\-to\-end accuracy, highlighting that robust grounding and intent\-aligned retrieval remain challenging under visual ambiguity\.
More broadly, our findings suggest that KB\-VQA evaluation should explicitly account for benchmark validity\. We encourage the community to \(i\) prioritize evidence\-derivable annotations and well\-specified questions during dataset construction, \(ii\) report metrics that reflect evidence support and grounding robustness in addition to answer correctness, and \(iii\) develop retrieval and reasoning methods that remain reliable when multiple plausible entities and distractors are present\. We hope our protocols and revised benchmarks facilitate more faithful, diagnostic, and reproducible KB\-VQA evaluation\.
## Acknowledgements
This research was supported by the National Science Foundation \(NSF\) under grant numbers NSF2406647 and NSF2406648\. It was also supported by the National Artificial Intelligence Research Resource \(NAIRR\) Pilot and the Delta advanced computing and data resource, which is supported by the National Science Foundation under award NSF\-OAC\-2005572\. S\. M\. R\. and C\. V\. S\. are supported by the Imageomics Institute, funded by the US National Science Foundation’s Harnessing the Data Revolution \(HDR\) program under Award No\. 2118240 \(Imageomics: A New Frontier of Biological Information Powered by Knowledge\-Guided Machine Learning\)\.
## References
## Appendix 0\.AQualitative Examples of Fixing Outcome
In this section, we present more qualitative samples of our fixing outcomes\. Besides the example in Figure [1](https://arxiv.org/html/2607.00159#S4.F1), which is removed due to lack of supporting evidence, we show how instances are corrected to align with the supporting evidence in Figure [3](https://arxiv.org/html/2607.00159#Pt0.A1.F3) or improved with constraints in Figure [4](https://arxiv.org/html/2607.00159#Pt0.A1.F4) to ensure answer derivability and question well\-posedness\.
Figure 3: Qualitative fixing example from InfoSeek \(QID:9\)\. InfoSeek \[InfoSeek\] selects answers from Wikidata \[vrandevcic2014wikidata\] triples and converts them into QA pairs, while the evaluation KB consists of Wikipedia articles\. This cross\-source construction can yield cases where the provided KB contains evidence that contradicts the annotated answer, *e\.g*\., the desired answer is a range ‘112 \- 158’, derived from 135±23, while the article in the provided KB only contains ‘131’ days\.
Figure 4: Qualitative fixing example from E\-VQA \(QID:79\), where the desired answer ‘Spring’ exists in the evidence\. However, there could be multiple plausible answers, *e\.g*\., ‘April \- June’, since the following sentence indicates ‘are most often found between late April and June’\. Therefore, the question is changed to “Which season during the year is the mating period”\.
## Appendix 0\.BQualitative Examples of Augmentation Outcome
In this section, we provide more qualitative examples of our augmentation outcome, which is similar to Figure[2](https://arxiv.org/html/2607.00159#S5.F2)\. As shown in Figure[5](https://arxiv.org/html/2607.00159#Pt0.A2.F5)and Figure[6](https://arxiv.org/html/2607.00159#Pt0.A2.F6), anchor images are augmented with either intra\-category or inter\-category entities with the augmentation protocol we propose in Section[5](https://arxiv.org/html/2607.00159#S5)\.
Figure 5: Qualitative example of our augmentation protocols \(Section [5](https://arxiv.org/html/2607.00159#S5)\) on *Monastery of Saint Naum*\. All augmented variants keep the anchor answer in \(a\)\. \(b\) Intra\-augmentation adds a minimal spatial cue \(“on the left”\) to preserve the original intent while inserting an additional monastery image \(*Mar Saba*\) on the left\. \(c\) Inter\-augmentation inserts a visually distinct distractor \(*Arabidopsis lyrata*\) while keeping the question unchanged\.
Figure 6: Qualitative example of our augmentation protocols \(Section [5](https://arxiv.org/html/2607.00159#S5)\) on *Nationals Park*\. All augmented variants keep the anchor answer in \(a\)\. \(b\) Intra\-augmentation adds a minimal spatial cue \(“on the left”\) to preserve the original intent while inserting an additional park image \(*Fort Macon State Park*\) on the left\. \(c\) Inter\-augmentation inserts a visually distinct distractor \(*Okenia rosacea*\) while keeping the question unchanged\.
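The two augmentation modes illustrated in these figures can be summarized with a small sketch of how an augmented sample record might be assembled\. This is not the released pipeline: the `Sample` class, the `augment` function, the file names, and in particular the wording template for the spatial cue are assumptions made for illustration\. The sketch also assumes that the cue names the panel holding the anchor entity, and it leaves the answer untouched in both modes\.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Sample:
    question: str
    answer: str
    panels: tuple  # image identifiers, ordered left to right


def augment(anchor, distractor_image, mode, anchor_side="left"):
    """Place one distractor panel next to the anchor image.

    mode="intra": same-category distractor, so a spatial cue is added to the question.
    mode="inter": different-category distractor, so the question stays unchanged.
    The annotated answer is never modified.
    """
    if mode not in ("intra", "inter"):
        raise ValueError("mode must be 'intra' or 'inter'")
    if anchor_side not in ("left", "right"):
        raise ValueError("anchor_side must be 'left' or 'right'")
    if len(anchor.panels) != 1:
        raise ValueError("anchor sample must contain exactly one image")
    anchor_image = anchor.panels[0]
    if anchor_side == "left":
        panels = (anchor_image, distractor_image)
    else:
        panels = (distractor_image, anchor_image)
    question = anchor.question
    if mode == "intra":
        # Hypothetical cue template; the exact wording used by the authors is not specified.
        question = f"{question.rstrip('?')} on the {anchor_side}?"
    return replace(anchor, question=question, panels=panels)


anchor = Sample("Who occupies this building?", "ANNOTATED_ANSWER", ("anchor.jpg",))
intra = augment(anchor, "same_category_distractor.jpg", mode="intra", anchor_side="right")
inter = augment(anchor, "other_category_distractor.jpg", mode="inter")
print(intra.question, intra.panels)
print(inter.question, inter.panels)
```

Keeping the anchor record immutable and deriving each variant from it makes it straightforward to evaluate the anchor, Intra, and Inter versions on exactly the same sample IDs\.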
## Appendix 0\.CQualitative Examples of Error Cases
In this section, we provide some typical failure cases of the benchmarked methods\. The failures result either from errors in the initial retrieval or from the re\-ranking and filtering steps during the post\-retrieval stage\.
When retrieval fails, no method can ground its answer in correct evidence\. Hence, we provide qualitative examples on E\-VQA where initial retrieval succeeds but the methods still fail, during re\-ranking or filtering, to focus on the correct entity or section\. As shown in Figure [7](https://arxiv.org/html/2607.00159#Pt0.A3.F7), IBA \[IBA\] picks a section from the wrong entity while EchoSight \[Echosight\] selects an irrelevant section from the target entity\.
Figure 7: Qualitative example of failure cases on E\-VQA \[EVQA\] for *Veterans Stadium*\. The target entity is included in the initial retrieval results\. However, none of the evaluated methods obtains the desired answer\. EchoSight \[Echosight\] selects a section from the correct entity, but it is irrelevant to the answer\. IBA \[IBA\] selects a section from the wrong entity\. The aggregation/filtering methods, Wiki\-PRF \[Wiki\-PRF\], CoMeM \[CoMeM\] and ReflectiVA \[ReflectiVA\], fail to derive the correct answer\.
## Appendix 0\.DFine\-grained analysis to fixed datasets
In this section, we present some qualitative examples in which fixing the question and answer contributes to better re\-ranking or answer generation\. As shown in Figure [8](https://arxiv.org/html/2607.00159#Pt0.A4.F8), revising only the question under our proposed fixing protocol in Section [4](https://arxiv.org/html/2607.00159#S4) can better guide EchoSight \[Echosight\] to prioritize the correct section\.
Figure 8: Qualitative fixing example from InfoSeek \(QID:40316\) on *Donauturm*\. After fixing, the target section is correctly prioritized by the EchoSight \[Echosight\] re\-ranker\.
In Figure [9](https://arxiv.org/html/2607.00159#Pt0.A4.F9), with our proposed fixing protocol in Section [4](https://arxiv.org/html/2607.00159#S4), we verify the absence of the desired answer *‘1302’* in the provided knowledge base and revise the answer to *‘1301’*\. The question is also improved with the proper constraint ‘dry weights’\. After fixing, EchoSight \[Echosight\] can correctly prioritize the Design section and derive the desired answer *‘1301’*\.
Figure 9: Qualitative fixing example from InfoSeek \(QID:5441\) on *McLaren 12C*\. After fixing, the target section is correctly prioritized by the EchoSight \[Echosight\] re\-ranker\.
We further analyze the shared grounded InfoSeek subset where EchoSight’s top\-1 reranked section belongs to the ground\-truth entity\. As shown in Table [5](https://arxiv.org/html/2607.00159#Pt0.A4.T5), question\-only fixes are the main driver of improvement, while joint question\-answer fixes also help and answer\-only fixes are mostly unchanged\. For filtering, we manually check 100 removed InfoSeek samples and find that model accuracy remains 0 because the supporting evidence is absent from the benchmark\-provided KB\.
Table 5: Fine\-grained attribution on the shared grounded InfoSeek subset where EchoSight’s top\-1 reranked section belongs to the ground\-truth entity\.
Table 6: Fixing statistics on the entity\-unique subsets of InfoSeek \(1924 samples\) and E\-VQA \(4750 samples\)\.
## Appendix 0\.EFine\-grained Analysis for augmentation
In this section, we present more details on the performance of Wiki\-PRF\[Wiki\-PRF\]on the Intra\-category and Inter\-category augmented InfoSeek\[InfoSeek\]and E\-VQA\[EVQA\]constructed with our augmentation protocols in Section[5](https://arxiv.org/html/2607.00159#S5)\.
As shown in Table [7](https://arxiv.org/html/2607.00159#Pt0.A5.T7), compared with the anchor samples, both the number of tool calls and the hit ratios of the target entity for all tools drop significantly on the Intra\-category and Inter\-category augmented InfoSeek datasets\. Ideally, with more tool options, the retrieval performance of Wiki\-PRF \[Wiki\-PRF\] should improve\. Specifically, with the grounding tool, the distractors injected by our augmentation protocols in Section [5](https://arxiv.org/html/2607.00159#S5) are expected to be excluded\. However, it does not achieve better retrieval results\. We attribute this phenomenon to sparse rewards during its training process, since training only focuses on final answer accuracy without considering a potential retrieval reward \(*i\.e*\., when retrieval hits the target entity\)\. There are more calls on the Intra\-category augmented dataset than on the Inter\-category one, but the hit ratio is lower\. We hypothesize that this is because VLMs still struggle to comprehend spatial relations\.
Table 7: Tool calling status of Wiki\-PRF on the anchor, Intra\-category and Inter\-category augmented InfoSeek \[InfoSeek\] with 1,604 instances constructed with our augmentation protocols in Section [5](https://arxiv.org/html/2607.00159#S5)\.
Table [8](https://arxiv.org/html/2607.00159#Pt0.A5.T8) shows a similar trend\. Both the number of tool calls and the hit ratios of the target entity for all tools drop significantly on the Intra\-category and Inter\-category augmented E\-VQA datasets\.
Table 8: Tool calling status of Wiki\-PRF on the anchor, Intra\-category and Inter\-category augmented E\-VQA \[EVQA\] with 3,871 instances constructed with our augmentation protocols in Section [5](https://arxiv.org/html/2607.00159#S5)\.
We also include two layout\-only controls to separate layout shift from semantic distractors: Blank replaces the distractor with an empty panel, while Double duplicates the anchor image\. Table [9](https://arxiv.org/html/2607.00159#Pt0.A5.T9) shows that initial retrieval is layout\-sensitive, but Blank/Double remain easier than Intra/Inter, and Wiki\-PRF post\-retrieval recovers much more under these layout\-only controls\. This supports the conclusion that semantic distractors introduce additional grounded\-disambiguation difficulty beyond layout shift\.
Table 9:Blank/Double layout\-control results\. Initial R@1 uses the same image\-to\-image retrieval protocol as Table[2](https://arxiv.org/html/2607.00159#S6.T2); Wiki\-PRF post\-retrieval reports whether the final evidence pool contains the target entity\.
## Appendix 0\.FHuman Evaluation for Fixing Outcomes
Following prior dataset\-quality evaluation practice\[su2025skvqa\], we conduct human evaluation to further assess the quality of the fixed datasets\. Specifically, we expand the review to 10% of the fixed evaluation splits, including 475 E\-VQA and 200 InfoSeek instances\. We hire two PhD student\-level annotators to answer the VQA instances using the provided oracle evidence sections\. For each instance, annotators are instructed to answer the question solely based on the associated evidence section, and their responses are compared against the annotated ground\-truth answers\.
Overall, human evaluation shows strong agreement with the annotated answers, providing supporting evidence that the fixed datasets are of substantially improved quality under evaluation protocols adopted in prior work \[su2025skvqa\]\. The mean accuracy ± standard deviation is 92\.9±1\.1 for E\-VQA and 91\.5±2\.1 for InfoSeek\.
However, despite this strong overall performance, a small number of cases still exhibit misalignment between human responses and the annotated answers\. A closer examination shows that these failures are not merely random annotation noise\. Instead, they reveal a harder category of residual issues that remain difficult to fix, particularly when the annotated answer is textually present in the evidence, yet the question itself remains ambiguous or semantically misaligned with the evidence as shown in Figure[10](https://arxiv.org/html/2607.00159#Pt0.A6.F10)and Figure[11](https://arxiv.org/html/2607.00159#Pt0.A6.F11)\.
Figure[10](https://arxiv.org/html/2607.00159#Pt0.A6.F10)shows a case where the intended answer, ‘Wales and England’, is supported by the evidence, yet the question is still ambiguous from a human perspective\. The question, “In what country did people consider this castle to be the equal of any other castle?”, can be interpreted in two ways\. One interpretation asks in which country the relevant people were located, while another asks which countries’ castles the target castle was said to equal\. However, the supporting sentence only states that the castle “was considered by contemporaries to be the equal of any other in England or Wales,” which more naturally describes the comparison set of castles rather than the location of the people making the judgment\. This example suggests that answer presence alone does not guarantee question well\-posedness, and that some instances may still require additional rewriting to remove semantic ambiguity\.
Figure 10: Qualitative example of human evaluation on *Raglan Castle*\. The human annotator has a different understanding of the subject of the question\. According to the evidence, the target entity is being compared with other castles located in England or Wales, rather than people in Wales or England considering this castle comparable to other ones\.
Figure [11](https://arxiv.org/html/2607.00159#Pt0.A6.F11) illustrates a different type of mismatch\. The evidence states that the building was constructed from 1909 to 1916, while the question asks for the year in which it officially opened, with the annotated answer given as 1916\. Human annotators noted that completion of construction does not necessarily imply the official opening date\. Because the passage does not explicitly state the opening year, the annotated answer is not strictly derivable from the provided evidence\. A semantically aligned revision would instead ask, for example, “In which year was the construction of this building completed?”
Figure 11: Qualitative example of human evaluation on *Nuremberg palace of justice*\. Completion of construction in 1916 does not guarantee that the building opened in the same year\.
For these cases, it is hard to define what a perfect fix would be\. Although some of them may still be repairable, doing so would often require substantial rewriting of the question instead of minimal local edits\. This is not fully consistent with the minimal\-change principle of our proposed fixing pipeline in Section [4](https://arxiv.org/html/2607.00159#S4), since the meaning of the question may change significantly, for example, from asking when a building officially opened to when its construction was completed\. We leave such more aggressive remedies to future work\. A possible direction is to use more and stronger proprietary models to determine when this kind of fundamental revision is necessary\.
## Appendix 0\.GHuman Evaluation for Augmentation Outcomes
Following prior evaluation practice\[su2025skvqa\], we further conduct human evaluation to assess the quality of the augmented datasets\. Specifically, we hire two PhD\-level annotators to answer the augmented VQA instances using the provided oracle evidence of the anchor entities\. For each instance, annotators are instructed to answer the question based on the image and the associated evidence passage, and their responses are compared against the annotated ground\-truth answers\. When disagreement occurs, annotators are also asked to optionally report the main reason\.
Overall, human evaluation suggests that the augmented datasets are of good quality\. Human accuracy reaches 87\.5±2\.1% and 96\.0±1\.4% on E\-VQA intra\- and inter\-augmentation, respectively, and 86\.0±1\.4% and 89\.5±3\.5% on InfoSeek intra\- and inter\-augmentation, respectively\. These results provide supporting evidence that the augmented samples are generally answerable under evaluation protocols similar to those used in prior work \[su2025skvqa\]\.
At the same time, the remaining misaligned cases reveal several residual issues\. Most of them are not caused by the answer being unsupported by the evidence, but rather by ambiguity in identifying the visual target after augmentation\. In particular, the main challenge is that the original anchor image may already contain multiple plausible entities, as in Figure [12](https://arxiv.org/html/2607.00159#Pt0.A7.F12), or the added distractor image may introduce visually confusing foreground or background content, as in Figure [13](https://arxiv.org/html/2607.00159#Pt0.A7.F13)\. These issues could be reduced further, but doing so would likely require much stronger image\-specific models or operations, such as more reliable object detection, grounding, or image cropping to accurately locate the target entity\.
Figure[12](https://arxiv.org/html/2607.00159#Pt0.A7.F12)shows a representative case where the ambiguity already exists in the original anchor image\. In the anchor example on*30 St Mary Axe*, multiple buildings are visible, and the question asks, “Who occupies this building?” After augmentation, the question remains difficult because the target building is still not uniquely specified\. Even in the intra\-augmented case, where the phrase “on the right” is added, annotators noted that the right side still contains multiple plausible buildings, including both a lower building and the more salient skyscraper\. This suggests that spatial cues alone may not always be sufficient when the anchor image itself contains multiple visually plausible targets\. A possible improvement would be to use stronger grounding or cropping methods to isolate the intended entity more clearly, or to further refine the question wording \(*e\.g*\., “Who occupies the skyscraper?”\)\.
Figure 12: Qualitative example of human evaluation on augmented examples on *30 St Mary Axe*\. Multiple buildings are shown in the query image of the original anchor question, so the target entity remains unclear in the augmented images\. The annotators suggest further improving the question, *e\.g*\., ‘Who occupies the skyscraper?’
We also observe cases where the distractor image introduces additional foreground or background entities that interfere with target identification\. Figure [13](https://arxiv.org/html/2607.00159#Pt0.A7.F13) gives one such example on *Asplenium oblongifolium*\. In the intra\-augmented case, the distractor image is *Coprosma lucida*, and the spatial cue “on the left” is sufficient to keep the question clear\. However, in the inter\-augmented case, the distractor image of *Mont Aiguille* also contains visible trees and vegetation\. As a result, annotators reported that it becomes less clear which plant the question refers to\. This example shows that inter\-category distractors can still accidentally introduce same\-type visual content, making the target harder to identify without additional grounding signals\. A possible remedy is to use more careful distractor filtering or stronger image manipulation tools to suppress irrelevant visual entities\.
Figure 13: Qualitative example of human evaluation on augmented examples on *Asplenium oblongifolium*\. For the intra\-augmented case, the distractor is an image of *Coprosma lucida*\. Since the location ‘On the left’ is provided, the question is still clear\. However, for the inter\-augmented case, the distractor image of *Mont Aiguille* contains trees\. Therefore, without further instructions, it is hard to decide the target entity\.
Meanwhile, in very rare cases \(only 2 of the 100 sampled InfoSeek augmented instances\), we observe a more fundamental issue: the original anchor image itself does not clearly match the textual subject\. As shown in Figure [14](https://arxiv.org/html/2607.00159#Pt0.A7.F14), the example on *Musée Bolo* uses an image of a computer exhibited in the museum rather than the museum building itself, while the question asks, “Which country is this building located in?” This creates a mismatch between the query image and the target subject, which then carries over into the augmented versions, *e\.g*\., even when the question includes a spatial cue, the only visually available building is the distractor entity, which may contradict the original intent\. Addressing this type of problem may require an additional verification stage with powerful proprietary VLMs to ensure that the anchor image is visually aligned with the textual subject\.
Figure 14: Qualitative example of human evaluation on augmented examples on *Musée Bolo*\. In this very rare case, the query image and the textual subject are misaligned: the query image in the original anchor question is a computer collected in the museum instead of the building itself\. Hence, the target entity could be ambiguous in the augmented images\. For example, in the intra\-augmented image, even though “on the left” is indicated, the only visually available building is the ‘Mor Gabriel Monastery’, which is introduced as a distractor\.
Overall, for these cases it is hard to define what a flawless fix would be\. They may still be fixable by removing the irrelevant entities in the foreground or background or by locating the intended object, but doing so would often require stronger image\-specific processing with specific models and supervision\. These remedies go beyond the scope of our current augmentation pipeline in Section [5](https://arxiv.org/html/2607.00159#S5), which aims to enforce grounded disambiguation, and are therefore left to future work\. A possible direction is to use more powerful multi\-modal models or specialized vision modules to decide when such stronger intervention is necessary and how the visual input should be revised\.
## Appendix 0\.HImplementation Details
For EchoSight \[Echosight\], we use the provided code and re\-ranker checkpoint to conduct the experiments; the answer generator is LLaMA\-3\.1\-8B\-Instruct \[grattafiori2024llama\]\. For IBA \[IBA\], we use Qwen2\.5\-VL\-7B\-Instruct to narrow the entity scope from the 20 initially retrieved entities to 3 and use the bge\-reranker\-v2\-m3 textual re\-ranker\. For ReflectiVA and CoMEM, we use the top\-5 and top\-10 retrieved articles, respectively, to conduct experiments\. For Wiki\-PRF \[Wiki\-PRF\], we use the released checkpoint and code; the post\-retrieval content consists of the top\-1 initial visual retrieval result and the top\-3 re\-ranked sections obtained via tool\-calling retrieval \(*e\.g*\., grounding or captioning\)\. Following the design of the original paper, the post\-retrieved content is further filtered to support final answer generation\.
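The two\-stage IBA configuration described above \(narrow 20 retrieved entities to 3, then re\-rank their sections\) can be sketched as follows\. The model calls are replaced by caller\-supplied scoring functions, so `entity_score`, `section_score`, the toy data, and the `top_sections` parameter are all hypothetical stand\-ins rather than the interface of the released code\.

```python
def identify_then_rerank(candidates, entity_score, section_score,
                         keep_entities=3, top_sections=1):
    """Shortlist entities first, then rank only the sections of shortlisted entities.

    candidates: list of (entity_name, [section_text, ...]) from initial retrieval.
    entity_score: stand-in for the VLM-based identification step.
    section_score: stand-in for the textual re-ranker.
    """
    shortlisted = sorted(candidates, key=lambda c: entity_score(c[0]), reverse=True)
    shortlisted = shortlisted[:keep_entities]
    sections = [(name, text) for name, texts in shortlisted for text in texts]
    sections.sort(key=lambda s: section_score(s[1]), reverse=True)
    return sections[:top_sections]


# Toy data: four retrieved entities with hypothetical identification scores.
candidates = [
    ("A", ["built in 1900", "located in Paris"]),
    ("B", ["opened in 1950"]),
    ("C", ["height is 300 m"]),
    ("D", ["opened in 1999 to the public"]),
]
id_scores = {"A": 0.9, "B": 0.5, "C": 0.4, "D": 0.1}
question_tokens = set("when was it opened".split())


def overlap(text):
    """Crude relevance proxy: number of question tokens that occur in the section."""
    return len(question_tokens & set(text.split()))


best = identify_then_rerank(candidates, id_scores.get, overlap)
print(best)
```

In this toy run, entity D is dropped during identification, so its section never reaches the re\-ranker even though it matches the question; this mirrors why an early grounding error cannot be repaired by later evidence selection\.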