MultAttnAttrib: Training-Free Multimodal Attribution in Long Document Question Answering

arXiv cs.CL 07/03/26, 04:00 AM Papers
multimodal attribution question-answering long-document training-free attention benchmark
Summary
Introduces MultAttnAttrib, a training-free method for multimodal attribution in long document QA, along with the MultAttrEval benchmark. It outperforms prompting-based methods and matches frontier models like GPT-5.4.
arXiv:2607.01420v1 Announce Type: new
# Training-Free Multimodal Attribution in Long Document Question Answering
Source: [https://arxiv.org/html/2607.01420](https://arxiv.org/html/2607.01420)
Dang Quang Thien Tran¹∗, Quang V\. Dang¹∗, Vinamra Tyagi¹∗, Sai Soorya Rao Veeravalli¹∗, Trang Nguyen¹, Ryan A\. Rossi², Franck Dernoncourt², Nedim Lipka², Koustava Goswami², Samyadeep Basu²

###### Abstract

As grounded QA systems are increasingly deployed in AI assistants, accurately attributing generated answers to evidence is critical for user trust and model safety\. While unimodal attribution has been explored in depth, the multimodal setting remains relatively under\-researched\. To address this gap, we introduce MultAttnAttrib, a training\-free attribution\-generation method that leverages a model’s prefill pass, selected attention heads, and calibrated thresholds to locate source evidence within a document\. To establish baseline results for the method, we introduce MultAttrEval, a complementary benchmark dataset annotated with fine\-grained, ground\-truth attributions for answer components grounded in multimodal source documents\. To our knowledge, this is the first evaluation dataset designed specifically for multimodal attribution in long\-form documents\. Experimental results show that MultAttnAttrib consistently outperforms a variety of attribution\-generation methods, including several strong prompting\-based approaches, and matches the latest frontier models such as GPT\-5\.4\. Our method not only substantially improves attribution accuracy for both unimodal and multimodal attribution types, but also produces attributions at as little as one\-seventh of the inference latency of direct prompting on the same base model\.

∗Equal contribution\.

## 1 Introduction

Building user trust in AI systems is critical to the success of agentic workflows in both enterprise and consumer environments\. In many settings, users cannot safely act on a generated answer without verifying its source and validity: even modern generative systems fully support fewer than 52% of their generated statements with accurate citations \(Liu et al\., [2023](https://arxiv.org/html/2607.01420#bib.bib32)\)\. As a result, model grounding via attributions, that is, localizing each answer component to its supporting evidence, has emerged as a fundamental requirement for model deployment, particularly in domains such as medicine where ungrounded or hallucinated answers can have real negative impacts \(Kim et al\., [2025](https://arxiv.org/html/2607.01420#bib.bib11)\)\.

There have been increasing efforts to use attributions to ground document question\-answer pairs, though most focus on text\-only or otherwise unimodal settings\. Current approaches typically rely on citation\-style generation \(Bohnet et al\., [2022](https://arxiv.org/html/2607.01420#bib.bib3); Gao et al\., [2023b](https://arxiv.org/html/2607.01420#bib.bib7); Berchansky et al\., [2024](https://arxiv.org/html/2607.01420#bib.bib17)\), retrieval\-head or circuit isolation \(Basu et al\., [2025](https://arxiv.org/html/2607.01420#bib.bib10)\), or decomposition\-based attribution methods \(Ramu et al\., [2024](https://arxiv.org/html/2607.01420#bib.bib38); Balasubramanian et al\., [2026](https://arxiv.org/html/2607.01420#bib.bib39)\), which have thus far been explored only for text\. Real documents, however, interleave text with images, charts, and other raster content\. A robust attribution system must therefore identify not only the correct source, but also the supporting modality or combination of modalities\.

The multimodal long\-document setting remains comparatively nascent, with existing approaches largely framing attribution as citation selection from pre\-retrieved passages or images rather than as fine\-grained localization within a single full\-length document \(Ma et al\., [2025](https://arxiv.org/html/2607.01420#bib.bib20); Qi et al\., [2024](https://arxiv.org/html/2607.01420#bib.bib19); Song et al\., [2026](https://arxiv.org/html/2607.01420#bib.bib23)\)\. Multimodal attribution poses a challenge that does not arise in the unimodal case: determining both the correct modality \(or modalities\) and the correct source within it\. Resolving how text and images jointly support a single attribution remains an open problem with significant downstream potential\.

To address this critical challenge, we propose MultAttnAttrib, a training\-free multimodal attribution method that leverages attention patterns from a model’s prefill pass to localize supporting evidence in long interleaved documents\. Our method identifies a subset of retrieval heads that consistently attend to ground\-truth evidence across modalities, aggregates their attention signals to score text spans and image regions jointly, and applies a lightweight calibration procedure to produce modality\-aware citations in a single inference pass\. Unlike prompting\-based attribution methods, MultAttnAttrib avoids iterative generation and additional reasoning overhead, substantially reducing inference cost while improving attribution quality\.

Because existing benchmarks are insufficient for evaluating fine\-grained multimodal attribution in long documents, we also introduce MultAttrEval, a complementary evaluation benchmark spanning five domains and covering both unimodal and multimodal attribution settings\. Using MultAttrEval, we evaluate a broad set of attribution baselines, including prompting\-based, captioning\-based, and retrieval\-augmented approaches, on both open\-source and frontier MLLMs\.

Our results reveal a substantial gap between multimodal and unimodal attribution performance, confirming the unique difficulty of multimodal attribution\. Despite this challenge, MultAttnAttrib outperforms most strong baselines on both Qwen3\-VL\-30B and a frontier model\. By extracting attributions directly from the prefill pass, it operates at roughly 14% of the inference latency of prompting the same backbone and reduces peak memory usage by approximately 15 GB \(non\-vLLM\) per QAA instance\.

In summary, our contributions are as follows\.

- MultAttnAttrib: A training\-free multimodal attribution method that produces modality\-aware citations efficiently in a single inference pass\.
- MultAttrEval: A complementary benchmark for fine\-grained multimodal attribution in long documents across five domains\.
- Extensive experiments demonstrating that MultAttnAttrib consistently outperforms strong prompting, captioning, and RAG\-based baselines on the same open\-source MLLM backbone, while achieving substantially lower latency\.

## 2 Related Work

### 2\.1 Attribution on Multimodal Inputs

The explainability of language model outputs has motivated extensive work on citing and attributing generated text, falling into three broad families\. The first family fine\-tunes models to interleave citations with output, building on Attributed QA \(Bohnet et al\., [2022](https://arxiv.org/html/2607.01420#bib.bib3)\), the ALCE benchmark \(Gao et al\., [2023b](https://arxiv.org/html/2607.01420#bib.bib7)\), and training\-based citation generation methods \(Aly et al\., [2024](https://arxiv.org/html/2607.01420#bib.bib25); Asai et al\., [2024](https://arxiv.org/html/2607.01420#bib.bib28); Huang et al\., [2024](https://arxiv.org/html/2607.01420#bib.bib31)\)\. The second family decouples attribution from generation by post\-processing outputs with external retrievers, NLI verifiers, or LLM judges \(Gao et al\., [2023a](https://arxiv.org/html/2607.01420#bib.bib29); Qian et al\., [2025](https://arxiv.org/html/2607.01420#bib.bib40)\)\. The third family recovers attribution directly from the model’s computations: by aggregating attention signals across heads \(Basu et al\., [2025](https://arxiv.org/html/2607.01420#bib.bib10); Wang et al\., [2025](https://arxiv.org/html/2607.01420#bib.bib33)\), by reading internal signals via saliency maps or intermediate activations \(Qi et al\., [2024](https://arxiv.org/html/2607.01420#bib.bib19); Phukan et al\., [2024](https://arxiv.org/html/2607.01420#bib.bib37), [2025](https://arxiv.org/html/2607.01420#bib.bib14)\), or by probing the model through systematic context ablations \(Cohen\-Wang et al\., [2024](https://arxiv.org/html/2607.01420#bib.bib18)\)\. Our method belongs to this third family\.

### 2\.2 Datasets for Multimodal Attribution

Evaluating multimodal attribution requires benchmarks that test evidence localization over full multimodal documents\. Existing benchmarks such as MCiteBench \(Hu et al\., [2025](https://arxiv.org/html/2607.01420#bib.bib21)\), MMDocRAG \(Dong et al\., [2026](https://arxiv.org/html/2607.01420#bib.bib22)\), and MAVIS \(Song et al\., [2026](https://arxiv.org/html/2607.01420#bib.bib23)\) instead evaluate citation selection from small, pre\-curated pools of passages, figures, or tables, reducing attribution to discrete candidate selection rather than true localization\. Similarly, SciClaimEval \(Ho et al\., [2026](https://arxiv.org/html/2607.01420#bib.bib24)\) pre\-identifies the relevant figure or table and evaluates only cross\-modal entailment, sidestepping retrieval entirely\. These settings do not reflect deployment conditions, where models must localize supporting evidence within long, interleaved documents\. Concurrent work, MuRGAt \(Wan et al\., [2026](https://arxiv.org/html/2607.01420#bib.bib26)\), also studies free\-form evidence selection without a candidate pool, but focuses on temporal video/audio attribution and generation\-based methods\. In contrast, our approach extracts citations directly from attention signals over static multimodal documents in a single forward pass\.

## 3 MultAttnAttrib: A Training\-Free Approach for Multimodal Attribution

![Refer to caption](https://arxiv.org/html/2607.01420v1/x1.png)

Figure 1: MultAttnAttrib: We identify signals for each attention head, then filter to select cross\-modal heads\. We then calibrate the thresholds to maximize F1 scores on the probe set from MultAttrEval\. For attribution, we use our top\-$k$ heads to generate attention spans and return the final results using our calibrated thresholds\.

Existing attribution methods can often be reduced to either compute\-intensive LM fine\-tuning for citation generation \(Aly et al\., [2024](https://arxiv.org/html/2607.01420#bib.bib25); Asai et al\., [2024](https://arxiv.org/html/2607.01420#bib.bib28); Huang et al\., [2024](https://arxiv.org/html/2607.01420#bib.bib31)\) or multi\-step approaches requiring additional model calls \(Gao et al\., [2023a](https://arxiv.org/html/2607.01420#bib.bib29); Cohen\-Wang et al\., [2024](https://arxiv.org/html/2607.01420#bib.bib18); Slobodkin et al\., [2024](https://arxiv.org/html/2607.01420#bib.bib15)\)\. Mechanistic interpretability offers a streamlined alternative: identifying a sparse subset of attention heads responsible for copying evidence from context, then attributing via their attention maps in a single forward pass \(Basu et al\., [2025](https://arxiv.org/html/2607.01420#bib.bib10); Wu et al\., [2025](https://arxiv.org/html/2607.01420#bib.bib13)\)\. However, these approaches focus on text\-only QA, leaving image and multimodal QA unattributed\.

Extending to multimodal extractive QA using text\-only retrieval heads would omit visual evidence entirely\. We find that retrieval heads are modality\-specific at the top ranks but largely shared across the broader population\. This motivates MultAttnAttrib, a label\-supervised, training\-free approach that exploits this shared backbone by identifying cross\-modal retrieval heads from a small probe set and extracting their attention signals to score both image slots and text passages in a single forward pass\.

### 3\.1 Task

We study the problem of *multimodal attribution*\. Given a document composed of text and images, a question, and an answer, the goal is to attribute the answer to its supporting evidence in the document\.

Let the document be $\mathcal{D}=(\mathcal{T},\mathcal{I})$, where $\mathcal{T}=(t_{1},t_{2},\dots,t_{|\mathcal{T}|})$ is a sequence of text tokens, and $\mathcal{I}=\{I_{1},I_{2},\dots,I_{|\mathcal{I}|}\}$ is a set of images in the document\. A text span is defined as a contiguous subsequence $\mathcal{T}_{i:j}=(t_{i},t_{i+1},\dots,t_{j})$, with $1\leq i\leq j\leq|\mathcal{T}|$\.

Given a question $q$, the system produces an answer $a$, which is attributed to one of the following evidence types: a text span $\mathcal{T}_{i:j}$, a set of images $\mathcal{I}^{*}\subseteq\mathcal{I}$, or a joint text–image set pair $(\mathcal{T}_{i:j},\,\mathcal{I}^{*})$\. We define the attribution space as $\mathcal{A}=\{\mathcal{T}_{i:j}\}\cup\{\mathcal{I}^{*}\}\cup\{(\mathcal{T}_{i:j},\,\mathcal{I}^{*})\}$\. The multimodal attribution task is to learn a function $f:(q,\mathcal{D},a)\rightarrow\hat{\alpha}$, where $\hat{\alpha}\in\mathcal{A}$ is the predicted attribution\.

Given a dataset $\{(\mathcal{D},\,q,\,a,\,\alpha^{*})\}$, where $\alpha^{*}\in\mathcal{A}$ is the ground\-truth attribution, the objective is to correctly attribute each answer to its supporting evidence in the multimodal context provided\.
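To make the attribution space concrete, the following minimal Python sketch encodes the three evidence types as one record\. The class name `Attribution`, the fields `text_span` and `image_ids`, and the validation rules are illustrative assumptions rather than identifiers from the paper\.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class Attribution:
    """One element of the attribution space A (hypothetical encoding)."""
    # Inclusive, 1-indexed token span (i, j) with 1 <= i <= j, or None.
    text_span: Optional[Tuple[int, int]] = None
    # Indices of the cited images, a subset of the document's image set.
    image_ids: FrozenSet[int] = frozenset()

    def __post_init__(self) -> None:
        # Assumption: an attribution must cite at least one piece of evidence.
        if self.text_span is None and not self.image_ids:
            raise ValueError("an attribution needs a text span, images, or both")
        if self.text_span is not None:
            i, j = self.text_span
            if not 1 <= i <= j:
                raise ValueError("text span must satisfy 1 <= i <= j")

    @property
    def modality(self) -> str:
        if self.text_span is not None and self.image_ids:
            return "multimodal"
        return "text" if self.text_span is not None else "image"


text_only = Attribution(text_span=(12, 40))
image_only = Attribution(image_ids=frozenset({2}))
joint = Attribution(text_span=(12, 40), image_ids=frozenset({2, 5}))
```

The upper bound $j\leq|\mathcal{T}|$ is not checked here because the record does not carry the document length\.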

### 3\.2 Head Identification

To identify multimodal and image retrieval heads, we need a scoring method that is sensitive to both unimodal and multimodal evidence\. Prior approaches, such as average copy\-paste frequency \(Wu et al\., [2025](https://arxiv.org/html/2607.01420#bib.bib13)\) and path patching \(Basu et al\., [2025](https://arxiv.org/html/2607.01420#bib.bib10); Wang et al\., [2022](https://arxiv.org/html/2607.01420#bib.bib36)\), are either correlational or prohibitively expensive at scale\. To address this, we compared two methods for isolating retrieval heads: Causal Mediation Analysis \(CMA\) and Mean Attention Scoring \(MAS\)\. Based on the results of these tests \(detailed in Section [6\.3](https://arxiv.org/html/2607.01420#S6.SS3)\), MultAttnAttrib scores all heads against labeled multimodal probes using CMA\. Details about the two methods are as follows:

##### MAS requires only a single forward pass per probe\.

The heads are scored by the ratio of the mean attention to the ground\-truth positions $G_i$ to the total attention over the entire document $D_i$\. This measures how selectively heads attend to evidence over distractors\. MAS is cheaper than CMA \(discussed below\) but correlational, lacking causal validity: heads that happen to concentrate on the ground\-truth region score high regardless of whether they causally mediate retrieval\.
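The ratio described above can be sketched in a few lines of NumPy\. The function name `mean_attention_score`, the tensor layout, and the averaging over query tokens are assumptions made for illustration; the authors' exact procedure is the pseudocode in their Appendix D\.

```python
import numpy as np


def mean_attention_score(attn, gt_positions, doc_positions):
    """MAS-style selectivity score for every head on one probe.

    attn: array of shape (layers, heads, query_len, seq_len) holding the
          attention weights of a single forward pass (assumed layout).
    gt_positions: indices of the ground-truth evidence tokens, G_i.
    doc_positions: indices of all document tokens, D_i.
    Returns an array of shape (layers, heads).
    """
    per_key = attn.mean(axis=2)                          # average over query tokens
    gt_mean = per_key[..., gt_positions].mean(axis=-1)   # mean attention on G_i
    doc_total = per_key[..., doc_positions].sum(axis=-1) # total attention on D_i
    return gt_mean / np.maximum(doc_total, 1e-12)        # guard against zero mass
```

A head that puts most of its document attention on the evidence tokens receives a higher score than a head that spreads attention evenly\.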

##### Adapting CMA for retrieval head identification

costs only two forward passes per probe: one clean pass on the original input $x_i$ and one corrupted pass where the evidence is replaced with content from another document\. While previous CMA work focused on text \(Basu et al\., [2025](https://arxiv.org/html/2607.01420#bib.bib10)\), our corruption strategy is multimodal\. Ground\-truth text tokens are replaced with a contiguous span of equal length from another probe’s document to preserve the sequence structure\. Corrupted images are resized to the dimensions of the ground\-truth images to preserve the patch grid\. This ensures that the clean and corrupted inputs have the same shape, thereby isolating the causal effect\.
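The text half of this corruption step can be illustrated with a short sketch\. The helper `corrupt_text`, its half\-open index convention, and the `donor_start` parameter are hypothetical; image corruption \(resizing a donor image to the ground\-truth dimensions\) is not covered here\.

```python
def corrupt_text(tokens, gt_start, gt_end, donor_tokens, donor_start=0):
    """Replace tokens[gt_start:gt_end] with an equal-length contiguous donor span.

    tokens: token ids of the clean document.
    donor_tokens: token ids of another probe's document.
    The returned sequence has the same length as `tokens`, so the clean and
    corrupted inputs keep the same shape.
    """
    length = gt_end - gt_start
    donor = donor_tokens[donor_start:donor_start + length]
    if len(donor) != length:
        raise ValueError("donor document is too short for the requested span")
    return tokens[:gt_start] + donor + tokens[gt_end:]
```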

The Indirect Effect \(IE\) of each head $(l,h)$ is the difference in the mean attention to ground\-truth positions $G_i$ between the clean and corrupted inputs, averaged over the query tokens $Q_i$ \(comprising the answer and question tokens without stopwords or punctuation\)\. To avoid over\-attribution, we further suppress heads that spread attention uniformly, using weights derived from the normalized variant of Shannon entropy \(Zhai et al\., [2023](https://arxiv.org/html/2607.01420#bib.bib34)\) of document\-averaged clean attention \(Clark et al\., [2019](https://arxiv.org/html/2607.01420#bib.bib35)\)\. After accumulating the scores for each head over all probes, we select the top\-$k$ retrieval heads $\mathcal{H}$ with the highest scores\. Pseudocode for both scoring methods is given in Appendix [D](https://arxiv.org/html/2607.01420#A4)\.
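A rough NumPy sketch of this scoring step follows\. The tensor layout, the function names, and in particular the weight `1 - normalized entropy` are assumptions: the text only states that the weights are derived from normalized Shannon entropy, and the authors' exact formulation is in their Appendix D\.

```python
import numpy as np


def indirect_effect(clean_attn, corrupt_attn, gt_positions, query_positions):
    """Per-head drop in attention to the evidence once it is corrupted.

    clean_attn, corrupt_attn: (layers, heads, seq_len, seq_len) attention
    weights from the clean and corrupted passes (assumed layout).
    """
    def gt_mass(attn):
        rows = attn[:, :, query_positions, :]            # keep query rows Q_i
        return rows[..., gt_positions].mean(axis=(-2, -1))
    return gt_mass(clean_attn) - gt_mass(corrupt_attn)


def focus_weight(clean_attn, query_positions, doc_positions):
    """Assumed weight: 1 - normalized Shannon entropy of the query-averaged
    clean attention over the document (near 0 for uniform attention)."""
    rows = clean_attn[:, :, query_positions, :][..., doc_positions]
    p = rows.mean(axis=2)
    p = p / np.maximum(p.sum(axis=-1, keepdims=True), 1e-12)
    entropy = -(p * np.log(np.maximum(p, 1e-12))).sum(axis=-1)
    return 1.0 - entropy / np.log(max(len(doc_positions), 2))


def top_k_heads(scores, k):
    """Return the k highest-scoring (layer, head) pairs."""
    flat = np.argsort(scores, axis=None)[::-1][:k]
    return [tuple(int(v) for v in np.unravel_index(i, scores.shape)) for i in flat]
```

In practice the product `indirect_effect(...) * focus_weight(...)` would be accumulated over all probes before calling `top_k_heads`\.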

### 3\.3 Calibration

We estimate the sliding window length $W$ as the median chunk token length in the probe set\. Using the selected heads $\mathcal{H}$, we run Attribution \(Algorithm [1](https://arxiv.org/html/2607.01420#alg1)\) on all probes\. Scores are partitioned by ground\-truth modality labels \(image\-positive/negative, text\-positive/negative\), and we sweep over maximum attribution scores to select thresholds $T_{\mathrm{img}}$ and $T_{\mathrm{txt}}$ that maximize F1 for image and text attribution, respectively\. These thresholds are later used during inference\. Pseudocode is provided in Algorithm [4](https://arxiv.org/html/2607.01420#alg4) \(Appendix [D](https://arxiv.org/html/2607.01420#A4)\)\.

Without calibration, there is no decision boundary for citing text or images, and raw attention scores are less interpretable than probabilities\. We therefore perform an F1\-maximizing threshold sweep to derive modality\-specific thresholds directly from attribution score distributions observed in real documents\.
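An F1\-maximizing sweep of this kind can be sketched as below\. This is one simplified reading, assuming each probe contributes its maximum normalized score for a modality together with a flag saying whether its ground truth includes that modality; the function name `best_f1_threshold` and the tie\-breaking rule are illustrative choices\.

```python
def best_f1_threshold(scores, labels):
    """Return the (threshold, F1) pair that maximizes F1 for `score >= threshold`.

    scores: normalized attribution scores, one per probe.
    labels: matching booleans, True if the probe is positive for this modality.
    Ties are broken in favor of the lowest threshold (an arbitrary choice).
    """
    best_t, best_f1 = 1.0, 0.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

Running the sweep once on image scores and once on text scores would give the two modality\-specific thresholds\.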

Algorithm 1 MultAttnAttrib: Attribution

1. Input: $g_{\phi}$ \(language model\), $x$ \(input prompt\), $Q$ \(query position\), $\mathcal{H}$ \(selected heads\), $W$ \(span length\)
2. $A\leftarrow g_{\phi}(x,\,Q,\,\mathcal{H})$
3. $\bar{a}\leftarrow\operatorname{mean}_{(l,h),q}A_{l,h,q}$
4. for each image slot $s$ do
5. $v_{s}^{\text{img}}\leftarrow\bar{a}_{s}$
6. end for
7. for each sliding window $w$ over text do
8. $v_{w}^{\text{txt}}\leftarrow\bar{a}_{w}$
9. end for
10. $[v^{\text{img}},\,v^{\text{txt}}]\leftarrow\text{MinMaxNorm}([v^{\text{img}},\,v^{\text{txt}}])$
11. $\hat{\mathcal{I}}\leftarrow\{s:v_{s}^{\mathrm{img}}\geq T_{\mathrm{img}}\}$; $\hat{\mathcal{T}}\leftarrow\{w:v_{w}^{\mathrm{txt}}\geq T_{\mathrm{txt}}\}$
12. if $\hat{\mathcal{I}}\cup\hat{\mathcal{T}}=\emptyset$ then
13. $(m^{*},\,e^{*})\leftarrow\arg\max_{m,\,e}\;v_{e}^{m}$
14. $\hat{\mathcal{I}}\leftarrow\{e^{*}:m^{*}=\mathrm{img}\}$; $\hat{\mathcal{T}}\leftarrow\{e^{*}:m^{*}=\mathrm{txt}\}$
15. end if
16. return $\hat{\mathcal{I}},\,\hat{\mathcal{T}}$

### 3\.4 Attribution

Attribution requires a single forward pass over the query document\. We average attention across the selected heads to obtain a mean attention vector, score each image by averaging over its patch tokens, and score text by averaging over sliding windows of token positions\. Image and text scores are min\-max normalized, then thresholded using $T_{\mathrm{img}}$ and $T_{\mathrm{txt}}$ to determine citations\. If no score reaches its threshold, we fall back to the single highest\-scoring image or text span\.
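The attribution step can be sketched in NumPy as follows\. It assumes the attention tensor has already been restricted to the selected heads and query positions, that sliding windows use a stride of 1, and that min\-max normalization is applied jointly over image and text scores; these choices, along with the function and argument names, are illustrative assumptions rather than details confirmed by the paper\.

```python
import numpy as np


def attribute(attn, image_slots, text_positions, window, t_img, t_txt):
    """Sketch of Algorithm 1.

    attn: (num_selected_heads, num_query_tokens, seq_len) attention weights.
    image_slots: dict mapping image id -> list of patch-token positions.
    text_positions: ordered list of text-token positions.
    window: sliding-window length W (stride 1 is an assumption).
    Returns (cited image ids, cited window start offsets).
    """
    a_bar = attn.mean(axis=(0, 1))                   # mean over heads and queries
    img_ids = list(image_slots)
    v_img = [a_bar[image_slots[s]].mean() for s in img_ids]
    text_attn = a_bar[text_positions]
    n_win = max(len(text_positions) - window + 1, 1) if len(text_positions) else 0
    v_txt = [text_attn[w:w + window].mean() for w in range(n_win)]

    # Joint min-max normalization over both modalities (assumed).
    v = np.array(v_img + v_txt, dtype=float)
    span = v.max() - v.min()
    v = (v - v.min()) / span if span > 0 else np.zeros_like(v)
    v_img, v_txt = v[:len(img_ids)], v[len(img_ids):]

    images = [s for s, score in zip(img_ids, v_img) if score >= t_img]
    windows = [w for w, score in enumerate(v_txt) if score >= t_txt]
    if not images and not windows:                   # fallback: best single unit
        best = int(np.argmax(v))
        if best < len(img_ids):
            images = [img_ids[best]]
        else:
            windows = [best - len(img_ids)]
    return images, windows
```

The sketch expects at least one image slot or text position; an entirely empty document is not handled\.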

## 4 MultAttrEval: A Dataset for Multimodal Attribution in Long Document Understanding

### 4\.1 Dataset Generation

![Refer to caption](https://arxiv.org/html/2607.01420v1/x2.png)

Figure 2: MultAttrEval: Overview of the QAA generation process used to construct MultAttrEval from processed MINT\-1T PDFs across text\-only, image\-only, and combined text\-image attribution settings\.

MultAttrEval was created to address the need for fine\-grained Question\-Answer\-Attribution \(QAA\) triplets over long documents with mixed\-modality content, allowing us to test the strength of our attribution approaches\. As shown in Figure [2](https://arxiv.org/html/2607.01420#S4.F2), we begin by obtaining PDF files from MINT\-1T \(Awadalla et al\., [2024](https://arxiv.org/html/2607.01420#bib.bib16)\)\. Each document is filtered based on the image count and the presence of valid URLs\. We extract text and images, preprocess both, and then generate embeddings and similarity pairings for text/text, text/image, and image/image\. QAA generation is then split into three settings as follows:

Unimodal \(Text/Image only\)\. We first isolate images or text chunks that are mutually dissimilar, with additional disjoint text pairing for the image case, for document understanding purposes\. We then use an MLLM to generate QA pairs using only our selected images or text\-chunk spans, thereby creating unimodal attributions for our input\.

Text \+ Image\. This case warrants a different treatment from the previous cases, as the text and image attributions should both be relevant and mutually support the model’s answer to the question\. Here, we rerank the most similar \(text, image\) pairs from the embedding step, then identify entities in the texts and verify whether they belong to the image\. Surviving \(text, image\) pairs and entities are then used to elicit questions and answers\.

### 4\.2 Quality Verification

Each generated QAA candidate is subjected to a sequence of strict acceptance criteria, with full details in Appendix [K](https://arxiv.org/html/2607.01420#A11)\.

For text\-only items, we apply four checks:

- quality thresholding enforces a minimum verifier score to exclude low\-value QA pairs;
- attribution support is a binary gate that filters out items whose attribution is not used to generate the answer;
- evidence consistency requires the verifier\-provided evidence span to be non\-empty, 12–25 words long, and an exact substring of the source paragraph;
- cross\-chunk evidence uniqueness rejects QA pairs whose supporting span appears across more paragraph chunks than the configured ambiguity threshold\.

For image\-only and multimodal items, we additionally apply:

- source referencing, ensuring QA pairs reference high\-level domain topics rather than the source artifact directly;
- question triviality, rejecting questions that target layout artifacts such as arrows, bounding boxes, or callouts rather than factual or domain\-relevant content;
- answerability, which scores the degree to which the answer is grounded in and derivable from the source material\.

Finally, multimodal\-only items must additionally satisfy:

- crossmodal grounding, verifying on a 1–7 scale that the image visually grounds at least one key answer entity and the text explicitly grounds at least one distinct entity;
- answer circularity, which requires each answer to introduce at least one new piece of factual content beyond the question, though shared proper nouns and technical terms are explicitly permitted\.

### 4\.3 Dataset Summary

MultAttrEval contains question\-answer\-attribution triplets for long\-form PDF documents spanning five domains and three attribution settings: text\-only, image\-only, and multimodal\. Full corpus distributions and modality\-level statistics are reported in Appendix [A](https://arxiv.org/html/2607.01420#A1)\.

## 5 Experiments

### 5\.1 Implementation Details

We split the QAAs into Probe and Test sets\. We sample 30 QAAs from each regime of our initial set, yielding 90 probe QAA triplets\. The Probe set is used for attention head analyses and for head identification and threshold calibration in MultAttnAttrib\. The remaining 608 items, our Test set, are used to evaluate all methods\.
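A per\-regime split of this kind could be produced as in the sketch below\. The paper does not state how the 30 items per regime are drawn, so uniform random sampling with a fixed seed, the `regime` key, and the function name `probe_test_split` are all assumptions\.

```python
import random
from collections import defaultdict


def probe_test_split(items, per_regime=30, seed=0):
    """Sample `per_regime` probes from each regime; the rest form the test set.

    items: list of dicts with a "regime" key, e.g. "text", "image", "multimodal".
    Raises ValueError if a regime has fewer than `per_regime` items.
    """
    rng = random.Random(seed)
    by_regime = defaultdict(list)
    for idx, item in enumerate(items):
        by_regime[item["regime"]].append(idx)
    probe_idx = set()
    for regime in sorted(by_regime):
        probe_idx.update(rng.sample(by_regime[regime], per_regime))
    probe = [items[i] for i in sorted(probe_idx)]
    test = [item for i, item in enumerate(items) if i not in probe_idx]
    return probe, test
```

With three regimes and `per_regime=30`, the probe set has 90 items and everything else is held out for evaluation\.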

### 5\.2 Baselines

For all baselines, we evaluate Qwen3\-VL\-30B\-A3B\-Instruct \(the open\-source backbone for MultAttnAttrib\) and the frontier model GPT\-5\.4, both of which support long\-context multimodal document understanding\. The VLM baseline performs attribution using images and OCR text, while the LLM baseline replaces images with captions and operates purely over text\. We additionally evaluate RAG variants \($k=5$\), where Cohere retrieves $k$ text chunks and $k$ images, and ColQwen retrieves $k$ full PDF pages\. Detailed descriptions are in Appendix [B](https://arxiv.org/html/2607.01420#A2)\.

### 5\.3 Evaluation Metrics

We evaluate attribution quality using macro\-averaged precision, recall, and F1\. Image citations are evaluated by exact match\. Text citations are evaluated using fuzzy substring similarity, which is discretized into three score tiers and penalized for under\- or over\-quoting based on the length ratio\. Full computation details, as well as additional LLM\-as\-Judge evaluations, are in Appendices [C](https://arxiv.org/html/2607.01420#A3) and [I](https://arxiv.org/html/2607.01420#A9)\.
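For the exact\-match image case, set\-based precision, recall, and F1 can be sketched as below\. The sketch assumes the macro average is taken over examples and that an empty prediction or gold set scores zero; both conventions are assumptions, and the tiered fuzzy text scoring described above is not reproduced here\.

```python
def prf1(predicted, gold):
    """Set-based precision, recall, and F1 for one example."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


def macro_average(examples):
    """Average per-example (P, R, F1) triples over (predicted, gold) pairs."""
    triples = [prf1(pred, gold) for pred, gold in examples]
    n = len(triples)
    return tuple(sum(t[i] for t in triples) / n for i in range(3))
```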

## 6 Results and Analysis

Table 1: Latency/memory comparison between VLM and MultAttnAttrib \(non\-vLLM, batch = 1, NVIDIA A100 GPU\)\. MultAttnAttrib attributes 7 times faster than VLM on valid, non\-OOM QA inputs\.

### 6\.1 MultAttnAttrib Improves over Different Prompting Strategies With the Same Backbone

Table 2: MultAttnAttrib outperforms various prompting strategies on the same backbone\. MultAttnAttrib metrics are shown alongside the equivalent Qwen baselines, with $\Delta$ values relative to the Qwen VLM baseline\. We see universal improvement in F1 scores, with gains in text and multimodal recall, as well as image and multimodal precision\.

![Refer to caption](https://arxiv.org/html/2607.01420v1/x3.png)

Figure 3: MultAttnAttrib closely matches and is competitive with the latest frontier models such as GPT\-5\.4\. Comparing all GPT variants with the Cohere \+ MultAttnAttrib \(Ours\) variant\.

In this section, we compare our method against several prompting strategies for obtaining attributions on the same Qwen3\-VL\-30B\-A3B\-Instruct backbone\. As shown in Table [2](https://arxiv.org/html/2607.01420#S6.T2), MultAttnAttrib outperforms all prompting baselines by a substantial margin of over 20%, and this gap holds consistently across all three modality splits\. These results demonstrate that internal signals from cross\-modal and specialized retrieval heads, when carefully post\-processed, yield significantly stronger attribution performance than can be achieved by tuning prompting strategies alone\. Below we provide some of the main empirical results:

##### MultAttnAttrib consistently improves attribution quality over direct and RAG\-augmented VLM baselines\.

Table [2](https://arxiv.org/html/2607.01420#S6.T2) shows universal F1 gains, especially in image precision and text recall, indicating that attribution extracted from attention signals can outperform generated citations in localization\-heavy long\-document settings\.

##### MultAttnAttrib mitigates baseline text overprediction\.

Baseline methods frequently over\-attribute text spans, heavily reducing image\-regime precision and text\-regime recall\. Thresholding in MultAttnAttrib suppresses many of these spurious citations, improving unimodal attribution quality \(Table [2](https://arxiv.org/html/2607.01420#S6.T2)\)\.

##### MultAttnAttrib responds more positively to fine\-grained retrieval methods than page\-based retrieval\.

Combining our method with Cohere RAG substantially improves text performance and modestly improves image and multimodal results, while ColQwen degrades text and image metrics with only marginal multimodal changes\. This suggests fine\-grained retrieval is more effective for attribution than page\-level retrieval\.

##### MultAttnAttrib processes long documents with lower memory and latency than VLM prompting\.

Direct non\-vLLM inference on our Qwen model results in frequent OOM errors for QA inputs, a problem MultAttnAttrib avoids by attributing in a single forward pass, bypassing KV\-cache growth, and removing token\-level decoding overhead\. On the non\-OOM QA pairs, MultAttnAttrib not only has nearly 15 GB lower peak VRAM usage, but also $7.3\times$ lower latency during inference for a single QA input\. Details are in Table [1](https://arxiv.org/html/2607.01420#S6.T1)\.
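As a quick arithmetic check, the $7.3\times$ speedup can be converted into a fraction of the baseline latency:

$$\frac{1}{7.3}\approx 0.137$$

That is roughly 14% of the prompting latency, or about one\-seventh, which agrees with the latency figures quoted in the abstract and introduction\.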

### 6\.2 Comparing MultAttnAttrib to Frontier GPT\-5\.4

MultAttnAttrib shows complementary strengths relative to prompted GPT baselines \(Figure [3](https://arxiv.org/html/2607.01420#S6.F3)\)\. On visual grounding, it achieves stronger image precision and F1 than all GPT baselines, since attention aggregation operates directly over the full token sequence rather than verbalizing attributions\. In text settings, the trade\-off shifts: GPT baselines attain higher precision by returning minimal evidence, while MultAttnAttrib recovers higher recall by capturing all influential tokens at the cost of including loosely related text\. Overall, it remains competitive with frontier\-scale closed\-source models such as GPT\-5\.4 on multimodal attribution\.

### 6\.3 Analysis of Unimodal and Crossmodal Attention Heads

A central design question for MultAttnAttrib is whether text and image retrieval emerge from shared or modality\-specific attention circuits\. If the circuits are shared, a single joint head set is preferable and reduces the cost of modality\-specific head identification\.

![Refer to caption](https://arxiv.org/html/2607.01420v1/x4.png) \(a\) Top\-$k$ text and image head overlap from CMA \(left\) vs\. Mean Attention Scoring \(right\)

![Refer to caption](https://arxiv.org/html/2607.01420v1/x5.png) \(b\) Spearman correlation over the top\-$k$ union of text and image head sets from CMA \(left\) vs\. Mean Attention Scoring \(right\)

Figure 4: Crossmodal retrieval head agreement under CMA and Mean Attention Scoring\. Using CMA results in higher overlap between image and text head sets than using Mean Attention\. The broader head population is largely crossmodal, with specialization at the very top ranks\.

![Refer to caption](https://arxiv.org/html/2607.01420v1/x6.png) \(a\) Layer distribution of heads in the CMA top\-20\.

![Refer to caption](https://arxiv.org/html/2607.01420v1/x7.png) \(b\) Min\-max normalized CMA head score distribution per modality\.

Figure 5: Layer distribution and score sparsity of CMA top\-20 heads\. Image heads concentrate at mid\-to\-late layers while text heads span early to late layers; crossmodal heads cluster in the transition zone\. A small proportion of heads scored above 0\.6 in any modality, indicating that retrieval heads are scarce for both text and images\.

We score all $L\times H=1536$ heads under both CMA and MAS, then measure cross\-modal agreement via IoU and Spearman’s rank correlation over the top\-$k$ head sets\. Setup and metric definitions are in Appendix [J](https://arxiv.org/html/2607.01420#A10)\.

##### Although the majority of the head population is shared across modalities in both scoring methods, some top\-ranked heads are modality\-specific\.

For instance, a top text head like $(19,3)$ can be the worst image head\. At $k=4$, CMA yields $\rho_{4}=-0.107$ and MAS yields $\rho_{4}=-0.657$\. This anti\-correlation naturally subsides as more top\-$k$ heads are included\. CMA quickly recovers to $\rho_{20}=0.042$, while MAS remains strongly anti\-correlated until $k=88$\. Under CMA, $\text{IoU}(4)=0.143$ and $\text{IoU}(20)=0.379$, meaning that even the top\-20 head sets overlap by only about a third\. Under MAS, $\text{IoU}(4)=0.333$ but $\text{IoU}(20)=0.212$, reflecting the accumulation of modality\-specific heads as $k$ grows\.
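The two agreement measures quoted above can be illustrated with a small sketch\. It assumes each modality's head scores are stored in a dictionary keyed by `(layer, head)`; the function names and the toy scores are hypothetical, and the Spearman computation \(Pearson correlation of average ranks over the top\-$k$ union\) is one standard formulation rather than the exact evaluation code\.

```python
def top_k(scores, k):
    """Return the set of the k head ids with the highest scores."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def iou_at_k(img_scores, txt_scores, k):
    """Intersection over union of the top-k image and text head sets."""
    img, txt = top_k(img_scores, k), top_k(txt_scores, k)
    return len(img & txt) / len(img | txt)


def average_ranks(values):
    """Ranks starting at 1 for the smallest value; ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_at_k(img_scores, txt_scores, k):
    """Spearman correlation of the two modalities over the top-k union."""
    union = sorted(top_k(img_scores, k) | top_k(txt_scores, k))
    ri = average_ranks([img_scores[h] for h in union])
    rt = average_ranks([txt_scores[h] for h in union])
    mi, mt = sum(ri) / len(ri), sum(rt) / len(rt)
    cov = sum((a - mi) * (b - mt) for a, b in zip(ri, rt))
    var = sum((a - mi) ** 2 for a in ri) * sum((b - mt) ** 2 for b in rt)
    return cov / var ** 0.5 if var > 0 else 0.0
```

With toy scores in which a head ranks first for text and last for images, the correlation over a small union turns negative, mirroring the anti\-correlation reported at small $k$\.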

##### CMA locates retrieval heads through a causality\-based reward that favors *shared* heads carrying cross\-modal retrieval signals\.

Because it rewards heads whose activations causally influence attribution outputs regardless of target, shared copy\-and\-paste style retrieval heads score highly across modalities, improving IoU and reducing anti\-correlation at small $k$\. In contrast, MAS favors heads that concentrate attention within a modality, producing modality\-specific routing heads and stronger negative correlation\.

##### Layer\-wise analysis reveals structurally different retrieval circuits for text and images\.

Image heads \(CMA\) concentrate almost exclusively in layers 22–36 \(Figures [5\(a\)](https://arxiv.org/html/2607.01420#S6.F5.sf1), [11](https://arxiv.org/html/2607.01420#A10.F11)\)\. Text heads are distributed across both early and late layers, reflecting the richer syntactic and semantic processing demands of textual evidence\. Crossmodal heads cluster in the mid\-to\-late transition zone, forming the retrieval backbone common to both modalities\.

##### The CMA score distributions \(Figure[5\(b\)](https://arxiv.org/html/2607.01420#S6.F5.sf2)\) reveal that the retrieval circuit is extremely sparse\.

About $80\%$ of heads of every type score below $0.1$ after min\-max normalization, and fewer than $2\%$ of heads in any category score above $0.6$\. This sparsity replicates prior findings that fewer than $5\%$ of heads qualify as text retrieval heads \(Wu et al\., [2025](https://arxiv.org/html/2607.01420#bib.bib13)\) and extends them to the multimodal attribution setting\. It also confirms that a small number of heads, $k$, is sufficient to capture most of the retrieval signal across modalities, which makes MultAttnAttrib’s single\-pass attribution practical and efficient\.
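As an illustration of how such sparsity figures can be computed, the sketch below min\-max normalizes a flat list of head scores and reports the fraction falling below or above a cutoff\. The function names are hypothetical and the input is assumed to be the raw per\-head scores for one modality\.

```python
def min_max_normalize(scores):
    """Rescale scores linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fraction_below(scores, cutoff):
    """Share of heads whose normalized score is strictly below the cutoff."""
    return sum(s < cutoff for s in scores) / len(scores)


def fraction_above(scores, cutoff):
    """Share of heads whose normalized score is strictly above the cutoff."""
    return sum(s > cutoff for s in scores) / len(scores)
```

Applied to the 1536 head scores of one modality, `fraction_below(normalized, 0.1)` and `fraction_above(normalized, 0.6)` would correspond to the two proportions discussed above\.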

## 7 Conclusion

In this paper, we introduce MultAttnAttrib, a training\-free attribution method \(with cross\-modal and specialized retrieval heads\) that outperforms a range of strong prompting\-based inference\-time strategies on the same backbone at a fraction of the latency, and remains competitive with frontier\-scale models such as GPT\-5\.4\. We further introduce MultAttrEval, a test\-bed for evaluating multimodal attribution over long\-context documents\.

## 8 Limitations

Our work has several limitations related to both the benchmark and the method\. First, MultAttrEval consists of long, image\-dense documents that often contain near\-duplicate or decorative images with little semantic value\. Because the image regime uses single\-source QAA triplets, ground\-truth attributions contain only one image while baselines frequently retrieve visually similar alternatives, depressing performance\. Future curation should enforce stricter image\-relevance filtering and support multiple image attributions, potentially through embedding\-cluster or entity\-based grouping\. Second, MultAttnAttrib requires a small labeled probe set of QAAs for both head identification and threshold calibration\. Unsupervised head scoring \(Wu et al\., [2025](https://arxiv.org/html/2607.01420#bib.bib13)\) could remove the annotation requirement, though correlational heads may score highly without causally mediating retrieval\. Likewise, the modality F1 sweep could be replaced with fixed thresholding on normalized scores\. Both options reflect a trade\-off between annotation cost and performance\. We aim to explore these questions more thoroughly in future work\.

## References

- R\. Aly, Z\. Tang, S\. Tan, and G\. Karypis \(2024\) Learning to generate answers with citations via factual consistency models\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 11876–11896\. External Links: [Link](https://aclanthology.org/2024.acl-long.641/), [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.641)
- A\. Asai, Z\. Wu, Y\. Wang, A\. Sil, and H\. Hajishirzi \(2024\) Self\-rag: learning to retrieve, generate, and critique through self\-reflection\. In International Conference on Learning Representations, Vol\. 2024, pp\. 9112–9141\. External Links: [Link](https://proceedings.iclr.cc/paper_files/paper/2024/hash/25f7be9694d7b32d5cc670927b8091e1-Abstract-Conference.html)
- A\. Awadalla, L\. Xue, O\. Lo, M\. Shu, H\. Lee, E\. Guha, M\. Jordan, S\. Shen, M\. Awadalla, S\. Savarese, C\. Xiong, R\. Xu, Y\. Choi, and L\. Schmidt \(2024\) Mint\-1t: scaling open\-source multimodal data by 10x: a multimodal dataset with one trillion tokens\. Advances in Neural Information Processing Systems 37, pp\. 36805–36828\. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2024/hash/40b9196c25fe1d64d87ca3a80a91d0ce-Abstract-Datasets_and_Benchmarks_Track.html), [Document](https://dx.doi.org/10.52202/079017-1160)
- S\. Balasubramanian, S\. Basu, K\. Goswami, R\. A\. Rossi, V\. Manjunatha, R\. Santhosh, R\. Zhang, S\. Feizi, and N\. Lipka \(2026\) Decomposition\-enhanced training for post\-hoc attributions in language models\. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\), V\. Demberg, K\. Inui, and L\. Marquez \(Eds\.\), Rabat, Morocco, pp\. 5070–5084\. External Links: [Link](https://aclanthology.org/2026.eacl-long.236/), [Document](https://dx.doi.org/10.18653/v1/2026.eacl-long.236), ISBN 979\-8\-89176\-380\-7
- S\. Basu, V\. Morariu, Z\. Wang, R\. Rossi, C\. Zhao, S\. Feizi, and V\. Manjunatha \(2025\) On mechanistic circuits for extractive question\-answering\. arXiv preprint arXiv:2502\.08059\. External Links: [Link](https://arxiv.org/abs/2502.08059)
- M\. Berchansky, D\. Fleischer, M\. Wasserblat, and P\. Izsak \(2024\) CoTAR: chain\-of\-thought attribution reasoning with multi\-level granularity\. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp\. 236–246\. External Links: [Link](https://aclanthology.org/2024.findings-emnlp.13/), [Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.13)
- B\. Bohnet, V\. Q\. Tran, P\. Verga, R\. Aharoni, D\. Andor, L\. B\. Soares, M\. Ciaramita, J\. Eisenstein, K\. Ganchev, J\. Herzig, K\. Hui, T\. Kwiatkowski, J\. Ma, J\. Ni, L\. S\. Saralegui, T\. Schuster, W\. W\. Cohen, M\. Collins, D\. Das, D\. Metzler, S\. Petrov, and K\. Webster \(2022\) Attributed question answering: evaluation and modeling for attributed large language models\. arXiv preprint arXiv:2212\.08037\. External Links: [Link](https://arxiv.org/abs/2212.08037)
- K\. Clark, U\. Khandelwal, O\. Levy, and C\. D\. Manning \(2019\) What does bert look at? an analysis of bert’s attention\. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp\. 276–286\. External Links: [Link](https://aclanthology.org/W19-4828/), [Document](https://dx.doi.org/10.18653/v1/W19-4828)
- B\. Cohen\-Wang, H\. Shah, K\. Georgiev, and A\. Mądry \(2024\) Contextcite: attributing model generation to context\. Advances in Neural Information Processing Systems 37, pp\. 95764–95807\. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2024/hash/adbea136219b64db96a9941e4249a857-Abstract-Conference.html), [Document](https://dx.doi.org/10.52202/079017-3035)
- K\. Dong, Y\. Chang, S\. Huang, Y\. Wang, R\. Tang, and Y\. Liu \(2026\) Benchmarking retrieval\-augmented multimodal generation for document question answering\. Advances in Neural Information Processing Systems 38\. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2025/hash/1a93178950e92fd2e7b7448f7d68fd7d-Abstract-Datasets_and_Benchmarks_Track.html)
- L\. Gao, Z\. Dai, P\. Pasupat, A\. Chen, A\. T\. Chaganty, Y\. Fan, V\. Y\. Zhao, N\. Lao, H\. Lee, D\. Juan, and K\. Guu \(2023a\) Rarr: researching and revising what language models say, using language models\. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 16477–16508\. External Links: [Link](https://aclanthology.org/2023.acl-long.910/), [Document](https://dx.doi.org/10.18653/v1/2023.acl-long.910)
- T\. Gao, H\. Yen, J\. Yu, and D\. Chen \(2023b\) Enabling large language models to generate text with citations\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp\. 6465–6488\. External Links: [Link](https://aclanthology.org/2023.emnlp-main.398/), [Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.398)
- X\. Ho, Y\. Wu, S\. Kumar, T\. C\. Xia, F\. Boudin, A\. Greiner\-Petter, and A\. Aizawa \(2026\) SciClaimEval: cross\-modal claim verification in scientific papers\. In Proceedings of the Fifteenth Language Resources and Evaluation Conference \(LREC 2026\), pp\. 11060–11071\. External Links: [Document](https://dx.doi.org/10.63317/4ap9rg2gnwmf)
- C\. Hu, Y\. Zhang, T\. Zhu, Y\. Ye, and Y\. Xiao \(2025\) MCiteBench: a multimodal benchmark for generating text with citations\. In Findings of the Association for Computational Linguistics: EMNLP 2025, pp\. 5949–5966\. External Links: [Link](https://aclanthology.org/2025.findings-emnlp.318/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.318)
- C\. Huang, Z\. Wu, Y\. Hu, and W\. Wang \(2024\) Training language models to generate text with citations via fine\-grained rewards\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 2926–2949\. External Links: [Link](https://aclanthology.org/2024.acl-long.161/), [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.161)
- Y\. Kim, H\. Jeong, S\. Chen, S\. S\. Li, C\. Park, M\. Lu, K\. Alhamoud, J\. Mun, C\. Grau, M\. Jung, R\. Gameiro, L\. Fan, E\. Park, T\. Lin, J\. Yoon, W\. Yoon, M\. Sap, Y\. Tsvetkov, P\. Liang, X\. Xu, X\. Liu, C\. Park, H\. Lee, H\. W\. Park, D\. McDuff, S\. Tulebaev, and C\. Breazeal \(2025\) Medical hallucinations in foundation models and their impact on healthcare\. arXiv preprint arXiv:2503\.05777\. External Links: [Link](https://arxiv.org/abs/2503.05777)
- N\. F\. Liu, T\. Zhang, and P\. Liang \(2023\) Evaluating verifiability in generative search engines\. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp\. 7001–7025\. External Links: [Link](https://aclanthology.org/2023.findings-emnlp.467/), [Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.467)
- X\. Ma, S\. Zhuang, B\. Koopman, G\. Zuccon, W\. Chen, and J\. Lin \(2025\) Visa: retrieval augmented generation with visual source attribution\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 30154–30169\. External Links: [Link](https://aclanthology.org/2025.acl-long.1456/), [Document](https://dx.doi.org/10.18653/v1/2025.acl-long.1456)
- A\. Phukan, Divyansh, H\. K\. Morj, Vaishnavi, A\. Saxena, and K\. Goswami \(2025\) Beyond logit lens: contextual embeddings for robust hallucination detection & grounding in vlms\. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), pp\. 9661–9675\. External Links: [Link](https://aclanthology.org/2025.naacl-long.488/), [Document](https://dx.doi.org/10.18653/v1/2025.naacl-long.488)
- A\. Phukan, S\. Somasundaram, A\. Saxena, K\. Goswami, and B\. V\. Srinivasan \(2024\) Peering into the mind of language models: an approach for attribution in contextual question answering\. In Findings of the Association for Computational Linguistics: ACL 2024, pp\. 11481–11495\. External Links: [Link](https://aclanthology.org/2024.findings-acl.682/), [Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.682)
- J\. Qi, G\. Sarti, R\. Fernández, and A\. Bisazza \(2024\) Model internals\-based answer attribution for trustworthy retrieval\-augmented generation\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp\. 6037–6053\. External Links: [Link](https://aclanthology.org/2024.emnlp-main.347/), [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.347)
- H\. Qian, Y\. Fan, J\. Guo, R\. Zhang, Q\. Chen, D\. Yin, and X\. Cheng \(2025\) Vericite: towards reliable citations in retrieval\-augmented generation via rigorous verification\. In Proceedings of the 2025 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, pp\. 47–54\. External Links: [Link](https://arxiv.org/abs/2510.11394), [Document](https://dx.doi.org/10.1145/3767695.3769505)
- P\. Ramu, K\. Goswami, A\. Saxena, and B\. V\. Srinivasan \(2024\) Enhancing post\-hoc attributions in long document comprehension via coarse grained answer decomposition\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp\. 17790–17806\. External Links: [Link](https://aclanthology.org/2024.emnlp-main.985/), [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.985)
- A\. Slobodkin, E\. Hirsch, A\. Cattan, T\. Schuster, and I\. Dagan \(2024\) Attribute first, then generate: locally\-attributable grounded text generation\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 3309–3344\. External Links: [Link](https://aclanthology.org/2024.acl-long.182/), [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.182)
- S\. Song, M\. Park, and G\. Kim \(2026\) MAVIS: a benchmark for multimodal source attribution in long\-form visual question answering\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 40, pp\. 33028–33037\. External Links: [Link](https://ojs.aaai.org/index.php/AAAI/article/view/40585), [Document](https://dx.doi.org/10.1609/aaai.v40i39.40585)
- D\. Wan, H\. Wang, Z\. Wang, E\. Stengel\-Eskin, H\. Lee, and M\. Bansal \(2026\) Multimodal fact\-level attribution for verifiable reasoning\. arXiv preprint arXiv:2602\.11509\. External Links: [Link](https://arxiv.org/abs/2602.11509)
- K\. Wang, A\. Variengien, A\. Conmy, B\. Shlegeris, and J\. Steinhardt \(2022\) Interpretability in the wild: a circuit for indirect object identification in gpt\-2 small\. arXiv preprint arXiv:2211\.00593\. External Links: [Link](https://arxiv.org/abs/2211.00593)
- Y\. Wang, R\. Geng, Y\. Chen, and J\. Jia \(2025\) Attntrace: attention\-based context traceback for long\-context llms\. arXiv preprint arXiv:2508\.03793\. External Links: [Link](https://arxiv.org/abs/2508.03793)
- W\. Wu, Y\. Wang, G\. Xiao, H\. Peng, and Y\. Fu \(2025\) Retrieval head mechanistically explains long\-context factuality\. In International Conference on Learning Representations, Vol\. 2025, pp\. 62143–62156\. External Links: [Link](https://proceedings.iclr.cc/paper_files/paper/2025/hash/9b77f07301b1ef1fe810aae96c12cb7b-Abstract-Conference.html)
- S\. Zhai, T\. Likhomanenko, E\. Littwin, D\. Busbridge, J\. Ramapuram, Y\. Zhang, J\. Gu, and J\. M\. Susskind \(2023\) Stabilizing transformer training by preventing attention entropy collapse\. In International Conference on Machine Learning, pp\. 40770–40803\. External Links: [Link](https://proceedings.mlr.press/v202/zhai23a.html)

## Appendix A MultAttrEval Dataset Statistics and Analysis

![Refer to caption](https://arxiv.org/html/2607.01420v1/x8.png)
Figure 6: Distribution of MultAttrEval source documents by domain\.

Table 3: Document statistics across domains \(top\) and QAA counts across domains and regimes \(bottom\)\.

Table 4: Analysis of QAA statistics across regimes\.

![Refer to caption](https://arxiv.org/html/2607.01420v1/x9.png)
Figure 7: Distribution of MultAttrEval QAA items by attribution regime\.

![Refer to caption](https://arxiv.org/html/2607.01420v1/x10.png)
Figure 8: Distribution of MultAttrEval QAA items by attribution regime and document domain\.

## Appendix B Baseline Design

For all baselines, we experimented with an open\-source model, Qwen3\-VL\-30B\-A3B\-Instruct \(Apache 2\.0 license\), and a closed\-source frontier model, GPT\-5\.4\. Both models can ingest long documents with interleaved text and images and have been shown to perform well across multiple VQA benchmarks\. Our high\-level goal was to compare our method against a diverse set of attribution\-generation methods, starting from a basic VLM prompting baseline for generating attributions\.

Additional baselines are organized around two central questions: whether providing full document context \(raster \+ OCR text\) gives a VLM an advantage for attribution, and whether replacing visual content with text captions \(effectively reducing the task to a text\-only problem\) is competitive\. We further test each setting with and without retrieval augmentation to isolate the contribution of context compression\. This yields four baselines and their subsequent variants:

##### VLM

We provide raster image data and document text, along with a batch of QAAs from the document, and prompt a Vision\-Language Model to identify where the provided answer could be sourced\.

##### LLM

We provide captions for each image and document text along with a batch of QAAs from the document, and prompt a Language Model to identify where the provided answer could be sourced\.

##### RAG

Both RAG variants operate similarly: sources are embedded into a shared space, and the top sources are retrieved against the QA embedding \($k=5$\)\. Cohere retrieves the top\-5 text chunks and top\-5 images independently, while ColQwen retrieves the top\-5 full PDF pages\.
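The retrieval step can be sketched as follows\. The embeddings are assumed to be precomputed vectors, and cosine similarity is an assumption made for illustration, since the similarity function used by each retriever is not specified here; the function names are hypothetical\.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_top_k(query_vec, sources, k=5):
    """Return the ids of the k sources most similar to the query embedding.

    `sources` maps a source id (text chunk, image, or page) to its embedding.
    """
    ranked = sorted(sources, key=lambda sid: cosine(query_vec, sources[sid]), reverse=True)
    return ranked[:k]
```

Under this sketch, the Cohere\-style variant would call `retrieve_top_k` once over text chunks and once over images, while the ColQwen\-style variant would call it once over full pages\.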

## Appendix C Evaluation Metrics

We evaluate attribution quality using precision, recall, and $\mathrm{F1}$ score\.

For each item, let $G$ and $C$ denote the ground\-truth and predicted citation sets,

$$G=\mathcal{I}^{*}\cup\{\mathcal{T}_{i:j}\},\qquad C=\hat{\mathcal{I}}\cup\{\mathcal{T}_{k:\ell}\},$$

where $\mathcal{I}^{*},\,\hat{\mathcal{I}}\subseteq\mathcal{I}$ are the ground\-truth and predicted image sets, and $\mathcal{T}_{i:j}$ and $\mathcal{T}_{k:\ell}$ are the ground\-truth and predicted text spans\.

##### Citation scoring\.

Image citations are scored by exact match:

$$\sigma_{p}(I_{m},\,G)=\mathbb{1}[I_{m}\in\mathcal{I}^{*}],\qquad \sigma_{r}(I_{m},\,C)=\mathbb{1}[I_{m}\in\hat{\mathcal{I}}].$$

Text citations are scored by fuzzy substring matching\. Let $s^{*}=\texttt{partial\_ratio}(\mathcal{T}_{k:\ell},\,\mathcal{T}_{i:j})\in[0,1]$\.

The match score is discretized to reduce sensitivity to trivial differences:

$$d(s^{*})=\begin{cases}1.0,&\text{if }s^{*}\geq 0.9\\ 0.5,&\text{if }0.6\leq s^{*}<0.9\\ 0.0,&\text{otherwise}\end{cases}\tag{1}$$

Length ratios penalize precision for over\-quoting and recall for under\-quoting:

$$\sigma_{p}(\mathcal{T}_{k:\ell},\,G)=d(s^{*})\cdot\min\!\left(1,\frac{|\mathcal{T}_{i:j}|}{|\mathcal{T}_{k:\ell}|}\right),\qquad \sigma_{r}(\mathcal{T}_{i:j},\,C)=d(s^{*})\cdot\min\!\left(1,\frac{|\mathcal{T}_{k:\ell}|}{|\mathcal{T}_{i:j}|}\right).$$

##### Precision and recall\.

Per\-item precision and recall are the mean scores over predicted and ground\-truth citations, respectively:

$$\mathrm{P}=\frac{1}{|C|}\sum_{c\in C}\sigma_{p}(c,G),\qquad \mathrm{R}=\frac{1}{|G|}\sum_{g\in G}\sigma_{r}(g,C),\qquad \mathrm{F1}=\frac{2\,\mathrm{P}\,\mathrm{R}}{\mathrm{P}+\mathrm{R}}.$$

Macro\-averaged $\mathrm{P}$, $\mathrm{R}$, and $\mathrm{F1}$ are reported over the dataset\.
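The per\-item scoring above can be sketched in a few lines\. Several choices here are assumptions made for illustration: `partial_ratio` is a stand\-in built on the standard library's `difflib` rather than the fuzzy\-matching routine used in the evaluation, span length is measured in characters, and each item carries at most one predicted and one ground\-truth text span\.

```python
from difflib import SequenceMatcher


def partial_ratio(a, b):
    """Stand-in fuzzy substring score in [0, 1] (assumption, see lead-in).

    Slides the shorter string over the longer one and keeps the best ratio.
    """
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    if not short:
        return 0.0
    n = len(short)
    return max(
        SequenceMatcher(None, short, long_[i:i + n], autojunk=False).ratio()
        for i in range(len(long_) - n + 1)
    )


def discretize(s):
    """Equation (1): map the raw match score to 1.0, 0.5, or 0.0."""
    if s >= 0.9:
        return 1.0
    if s >= 0.6:
        return 0.5
    return 0.0


def text_scores(pred, gold):
    """Return (sigma_p, sigma_r) for one predicted and one gold text span."""
    if not pred or not gold:
        return 0.0, 0.0
    d = discretize(partial_ratio(pred, gold))
    return d * min(1.0, len(gold) / len(pred)), d * min(1.0, len(pred) / len(gold))


def item_prf(pred_images, gold_images, pred_text, gold_text):
    """Per-item precision, recall, and F1 over image and text citations."""
    sp, sr = text_scores(pred_text, gold_text)
    p_scores = [1.0 if im in gold_images else 0.0 for im in pred_images]
    r_scores = [1.0 if im in pred_images else 0.0 for im in gold_images]
    if pred_text:
        p_scores.append(sp)  # the predicted span counts as one citation in C
    if gold_text:
        r_scores.append(sr)  # the gold span counts as one citation in G
    p = sum(p_scores) / len(p_scores) if p_scores else 0.0
    r = sum(r_scores) / len(r_scores) if r_scores else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1
```

For example, a prediction that quotes twice as much text as the ground truth while containing it exactly keeps full text recall but has its text precision halved by the length ratio\.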

## Appendix D MultAttnAttrib

Algorithm 2 MultAttnAttrib: Head Identification \(MeanAttn\)

1. Input: $g_{\phi}$ \(language model\), $\{(\mathbf{x}_{i},G_{i},Q_{i},D_{i})\}_{i=1}^{N}$ \(probe set\), $k$ \(number of heads\)
2. $S\leftarrow\mathbf{0}^{L\times H}$
3. for $i\leftarrow 1,\ldots,N$ do
4. $A\leftarrow g_{\phi}(\mathbf{x}_{i},Q_{i})$
5. for $(l,h)$ do
6. $r\leftarrow\operatorname{mean}_{q\in Q_{i}}A_{l,h,q}(G_{i})\,/\,\operatorname{mean}_{q\in Q_{i}}A_{l,h,q}(D_{i})$
7. $w\leftarrow\max(0,\;1-H(A_{l,h,\cdot}|_{D_{i}})\,/\,\log|D_{i}|)$
8. $S[l,h]\mathrel{+}=r\cdot w$
9. end for
10. end for
11. $\mathcal{H}\leftarrow\operatorname{arg\,max}_{l,h}^{k}\,S[l,h]$
12. return $\mathcal{H}$
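A minimal Python sketch of the mean\-attention scoring loop is given below\. It takes precomputed attention as input instead of calling a model, and it makes two assumptions that the pseudocode leaves open: $A_{l,h,q}(\cdot)$ is read as the mean attention per token over the given positions, and the entropy $H$ is computed on the query\-averaged attention restricted to the document tokens and renormalized\. All names are hypothetical\.

```python
import math


def mean_attn(rows, positions):
    """Mean attention per token on `positions`, averaged over query rows."""
    per_query = [sum(row[p] for p in positions) / len(positions) for row in rows]
    return sum(per_query) / len(per_query)


def focus_weight(rows, doc_pos):
    """max(0, 1 - H / log|D|) for query-averaged attention restricted to D."""
    avg = [sum(row[p] for row in rows) / len(rows) for p in doc_pos]
    total = sum(avg)
    if total <= 0 or len(doc_pos) < 2:
        return 0.0
    probs = [a / total for a in avg]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return max(0.0, 1.0 - entropy / math.log(len(doc_pos)))


def score_heads_mean_attn(probe, k):
    """Return the k best (layer, head) pairs.

    `probe` is a list of (A, G, D): A[l][h] holds one attention row per query
    token, G lists ground-truth evidence positions, D lists document positions.
    """
    scores = {}
    for A, G, D in probe:
        for l, layer in enumerate(A):
            for h, rows in enumerate(layer):
                denom = mean_attn(rows, D)
                r = mean_attn(rows, G) / denom if denom > 0 else 0.0
                scores[(l, h)] = scores.get((l, h), 0.0) + r * focus_weight(rows, D)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

In this sketch, a head that spreads attention uniformly over the document receives an entropy weight of zero and therefore contributes nothing, regardless of its ratio $r$\.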

Algorithm 3 MultAttnAttrib: Head Identification \(CMA\)

1. Input: $g_{\phi}$ \(language model\), $\{(\mathbf{x}_{i},G_{i},Q_{i},D_{i})\}_{i=1}^{N}$ \(probe set\), $k$ \(number of heads\)
2. $S\leftarrow\mathbf{0}^{L\times H}$
3. for $i\leftarrow 1,\ldots,N$ do
4. $\tilde{\mathbf{x}}_{i}\leftarrow\textsc{Corrupt}(\mathbf{x}_{i})$
5. $A^{\mathrm{clean}}\leftarrow g_{\phi}(\mathbf{x}_{i},Q_{i})$; $A^{\mathrm{corrupt}}\leftarrow g_{\phi}(\tilde{\mathbf{x}}_{i},Q_{i})$
6. for $(l,h)$ do
7. $\mathrm{IE}\leftarrow\operatorname{mean}_{q\in Q_{i}}\bigl[A^{\mathrm{clean}}_{l,h,q}(G_{i})-A^{\mathrm{corrupt}}_{l,h,q}(G_{i})\bigr]$
8. $w\leftarrow\max(0,\;1-H(A^{\mathrm{clean}}_{l,h,\cdot}|_{D_{i}})\,/\,\log|D_{i}|)$
9. $S[l,h]\mathrel{+}=\mathrm{IE}\cdot w$
10. end for
11. end for
12. $\mathcal{H}\leftarrow\operatorname{arg\,max}_{l,h}^{k}\,S[l,h]$
13. return $\mathcal{H}$
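The CMA variant differs only in replacing the attention ratio with an indirect effect, the drop in attention on the evidence when the input is corrupted\. The sketch below assumes that the clean and corrupted attention maps are already available, since the corruption procedure is not specified in this appendix, and it reads $A_{l,h,q}(G_i)$ as total attention mass on the evidence positions\. All names are hypothetical\.

```python
import math


def mass_on(rows, positions):
    """Total attention on `positions`, averaged over query rows."""
    return sum(sum(row[p] for p in positions) for row in rows) / len(rows)


def focus_weight(rows, doc_pos):
    """max(0, 1 - H / log|D|) for query-averaged attention restricted to D."""
    avg = [sum(row[p] for row in rows) / len(rows) for p in doc_pos]
    total = sum(avg)
    if total <= 0 or len(doc_pos) < 2:
        return 0.0
    probs = [a / total for a in avg]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return max(0.0, 1.0 - entropy / math.log(len(doc_pos)))


def score_heads_cma(probe, k):
    """Return the k (layer, head) pairs with the largest weighted indirect effect.

    `probe` is a list of (A_clean, A_corrupt, G, D) with the same layout as
    the attention maps: A[l][h] holds one attention row per query token.
    """
    scores = {}
    for A_clean, A_corrupt, G, D in probe:
        for l, layer in enumerate(A_clean):
            for h, rows in enumerate(layer):
                # Indirect effect: clean minus corrupted attention on evidence.
                ie = mass_on(rows, G) - mass_on(A_corrupt[l][h], G)
                scores[(l, h)] = scores.get((l, h), 0.0) + ie * focus_weight(rows, D)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

A head whose attention on the evidence is unchanged by corruption receives an indirect effect of zero, even if it attends strongly to the evidence in the clean run\.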

Algorithm 4 MultAttnAttrib: Calibration

1. Input: $\{v^{\text{img}}_{i},v^{\text{txt}}_{i}\}$ \(probe attribution scores\), $\{\mathcal{G}_{i}\}$ \(ground\-truth modality labels\)
2. for $m\in\{\text{img},\,\text{txt}\}$ do
3. $(V^{+}_{m},\,V^{-}_{m})\leftarrow\textsc{Split}(\{v^{m}_{i}\},\,\{\mathcal{G}_{i}\})$
4. $T_{m}\leftarrow\arg\max_{T}\;\mathrm{F1}(V^{+}_{m},\,V^{-}_{m},\,T)$
5. end for
6. return $T_{\text{img}},\,T_{\text{txt}}$
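The calibration sweep can be sketched as a search over candidate thresholds for each modality\. The sketch assumes that a score at or above the threshold counts as a positive prediction and that candidate thresholds are taken from the observed probe scores; both are illustrative choices, and the function names are hypothetical\.

```python
def f1_at(pos, neg, threshold):
    """F1 when every score >= threshold is predicted positive."""
    tp = sum(v >= threshold for v in pos)
    fp = sum(v >= threshold for v in neg)
    fn = len(pos) - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def calibrate(pos, neg):
    """Return the observed score that maximizes F1 (lowest one on ties)."""
    candidates = sorted(set(pos) | set(neg))
    if not candidates:
        return 0.0
    return max(candidates, key=lambda t: f1_at(pos, neg, t))


def calibrate_modalities(scores, labels):
    """Calibrate one threshold per modality.

    `scores[m]` is a list of probe attribution scores and `labels[m]` the
    matching list of booleans (True if the item is attributed to modality m).
    """
    thresholds = {}
    for m in ("img", "txt"):
        pos = [v for v, y in zip(scores[m], labels[m]) if y]
        neg = [v for v, y in zip(scores[m], labels[m]) if not y]
        thresholds[m] = calibrate(pos, neg)
    return thresholds
```

Calibrating image and text thresholds separately allows the two modalities to have differently scaled attribution scores\.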

## Appendix E Comparing GPT\-5\.4 to Qwen3\-VL\-30B

Table 5: Text regime metrics for GPT and Qwen3\-VL

Table 6: Image regime metrics for GPT and Qwen3\-VL

Table 7: Multimodal regime metrics for GPT and Qwen3\-VL

## Appendix F Comparing GPT\-5\.4 to MultAttnAttrib

Table 8: Text regime metrics for GPT and MultAttnAttrib

Table 9: Image regime metrics for GPT and MultAttnAttrib

Table 10: Multimodal regime metrics for GPT and MultAttnAttrib

## Appendix G Domain Difficulty Analysis

![Refer to caption](https://arxiv.org/html/2607.01420v1/x11.png)
Figure 9: Domain difficulty chart\. F1 scores for each of our regimes, grouped by document domain and method used, ordered from hardest to easiest domains \(all modalities pooled\)\.

##### Domains have no tangible performance impact on intra\-baseline relationships\.

Generally, the VLM and LLM baselines perform the worst, Cohere \+ VLM and Cohere \+ LLM perform better and similarly to each other, and MultAttnAttrib, along with its Cohere variant, generally outperforms the other methods\. Changing the document type does not affect the relationships among the baselines, indicating that our baseline implementations are robust and impartial across the domains tested in our experiments\.

##### Academic documents are consistently more difficult to generate attributions for\.

Academic documents have a regime\-wide unified F1 score of 0\.54, while marketing, the second worst performing domain, scores approximately 8% higher\. This is the largest domain\-to\-domain jump in unified F1 scores\. The disparity is likely due to the structure of academic and research documents\. Redundancy in academic documents is common, as reference material is sparsely distributed, cross\-referential, and frequently restated or paraphrased\. As a result, we often see over\-attribution in a QA pairing, leading to inaccuracies when comparing against the more lightweight ground\-truth attributions and to poor F1 results, as seen in Figure [9](https://arxiv.org/html/2607.01420#A7.F9)\.

##### Legal documents tend to receive more accurate attributions in comparison to other domains\.

We see that there is a 6% domain\-to\-domain jump in F1 metrics between Legal \(with an F1 score of 0\.74\) and Business\. This suggests that the legal domain is relatively easier to generate accurate attributions for\. The reason is that references are densely structured within specific clauses, claims, laws, or cases\. This allows for fine\-grained attributions for QA pairings \(attributions that our baseline can locate with more ease\) and creates opportunities for better fine\-grained attributions\.

## Appendix H Baseline Findings

##### Switching to image captions improves performance in GPT baselines, with mixed results in the Qwen baselines\.

In the GPT case \(Appendix [E](https://arxiv.org/html/2607.01420#A5)\), we observed improvements particularly in precision for the text and multimodal regimes and somewhat in the image\-only regime\. In the Qwen case \(Table [2](https://arxiv.org/html/2607.01420#S6.T2) and Appendix [E](https://arxiv.org/html/2607.01420#A5)\), we see a slight boost in text\-only regimes, but degradation in image\-only and text \+ image QAAs\. This asymmetry suggests that the open\-source model leans more heavily on fine\-grained visual representations and is less able to perform attribution reasoning over abstracted textual descriptions of images\. This observation directly motivates MultAttnAttrib’s design: rather than mediating images through captions, we read attribution signals off attention over image patches, where the fine\-grained visual evidence is already encoded\.

##### RAG generally improves metrics in comparison to direct inference, with gains being dependent on regime and model used\.

For GPT\-5\.4, layering Cohere\-based RAG on top of LLM nearly closes the gap on text\-only attribution, but for Qwen, the same intervention yields only modest gains, even harming performance in the image\-only regime\. We hypothesize this asymmetry arises because retrieval preselects evidence into a smaller candidate pool, which a stronger generator can exploit but a weaker one cannot\. Replacing Cohere with ColQwen as the retriever further degrades performance across all methods and splits, indicating that retrieval *quality*, not just its addition, drives the gains we observed\.

##### Multimodal attribution is challenging and resists frontier gains\.

Taking the strongest baseline within each split, GPT\-5\.4 \(Appendix[E](https://arxiv.org/html/2607.01420#A5)\) outperforms Qwen3\-VL\-30B by 35\.7 F1 points on text\-only and 14\.6 points on image\-only, but only 10\.7 points on combined attribution\. The pattern persists under direct baseline comparisons: with VLM, the GPT–Qwen F1 gap is 11\.1 points on text\-only and 11\.5 on image\-only, but collapses to just 1\.8 on combined attribution Table[2](https://arxiv.org/html/2607.01420#S6.T2)\. This suggests that combined attribution exposes a difficulty distinct from those addressed by scale alone—arbitrating between modalities and aggregating partial evidence from each—which current frontier\-generation methods do not resolve on their own\.

## Appendix I LLM Judge Analysis

### I\.1 Judge Setup

To complement token\-overlap metrics, we additionally evaluate attribution quality using a multi\-judge LLM panel\. Each \(question, answer, answer\_part, citation\) tuple is scored by a panel of three GPT\-4o judges, each assigned a distinct deliberation persona: a balanced evaluator, a detail\-focused critic, and a consensus mediator\. Judges share a discussion history and deliberate for up to two rounds, with early termination upon unanimous consensus; the final decision is determined by majority vote\. A citation is judged as *supportive* if it grounds at least one fact in the answer component, and as *non\-supportive* if it contradicts or is entirely unrelated to the attributed claim\. We report *Relevance* and *Support* as the proportions of citations judged supportive for each method across attribution regimes\.
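The deliberation protocol described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `judge` callable, its signature, and the choice to show every judge the history from earlier rounds only are assumptions made here, and the actual GPT\-4o calls are left out.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical judge interface: (persona, item, shared_history) -> label,
# where label is "supportive" or "non-supportive".
Judge = Callable[[str, dict, List[str]], str]

PERSONAS = ("balanced evaluator", "detail-focused critic", "consensus mediator")


def panel_decision(item: dict, judge: Judge, max_rounds: int = 2) -> str:
    """Deliberate for up to max_rounds, stop early on unanimity, then majority-vote."""
    history: List[str] = []  # discussion history shared by all judges
    votes: List[str] = []
    for _ in range(max_rounds):
        # Assumption: within a round, all judges see the history of prior rounds.
        votes = [judge(persona, item, history) for persona in PERSONAS]
        history.extend(f"{p}: {v}" for p, v in zip(PERSONAS, votes))
        if len(set(votes)) == 1:  # unanimous consensus ends deliberation
            break
    # Three judges and two labels always yield a strict majority.
    return Counter(votes).most_common(1)[0][0]


def support_rate(decisions: Sequence[str]) -> float:
    """Proportion of citations judged supportive."""
    if not decisions:
        return 0.0
    return sum(d == "supportive" for d in decisions) / len(decisions)
```

With an odd panel and a binary label, the majority vote never ties, which is presumably why three judges suffice here.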

### I\.2 Judge Results

Table 11: Text regime LLM Judge results for Qwen and MultAttnAttrib

Table 12: Image regime LLM Judge results for Qwen and MultAttnAttrib

Table 13: Multimodal regime LLM Judge results for Qwen and MultAttnAttrib

## Appendix J Head Analysis

##### Setup\.

We score each of the $N = L \times H = 48 \times 32 = 1536$ attention heads under two methods: Mean Attention Scoring and CMA Scoring \(discussed in Section [3\.2](https://arxiv.org/html/2607.01420#S3.SS2)\)\. Both produce score matrices $\mathbf{S}^{\text{img}}, \mathbf{S}^{\text{txt}} \in \mathbb{R}^{L \times H}$, which we use to measure cross\-modal agreement via IoU and Spearman’s rank correlation \(Figures [4](https://arxiv.org/html/2607.01420#S6.F4)–[5](https://arxiv.org/html/2607.01420#S6.F5)\)\.

##### Metrics\.

Let $\mathcal{H}^{\text{img}}_{k}$ and $\mathcal{H}^{\text{txt}}_{k}$ denote the top\-$k$ image and text head sets, respectively\.

$$\text{IoU}(k) \;=\; \frac{|\mathcal{H}^{\text{img}}_{k} \cap \mathcal{H}^{\text{txt}}_{k}|}{|\mathcal{H}^{\text{img}}_{k} \cup \mathcal{H}^{\text{txt}}_{k}|}.$$

Let $\mathcal{U}_{k} = \mathcal{H}^{\text{img}}_{k} \cup \mathcal{H}^{\text{txt}}_{k}$, and let $r_{i}^{\text{img}}, r_{i}^{\text{txt}}$ be the ranks of head $i$’s score within $\mathcal{U}_{k}$ under the specified modality\. Spearman’s rank correlation over this union measures how similarly the text and image modalities order these heads:

$$\rho_{k} \;=\; 1 - \frac{6 \displaystyle\sum_{i \in \mathcal{U}_{k}} \bigl(r_{i}^{\text{img}} - r_{i}^{\text{txt}}\bigr)^{2}}{|\mathcal{U}_{k}| \bigl(|\mathcal{U}_{k}|^{2} - 1\bigr)}.$$
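As a concrete reading of the two metrics, the sketch below computes $\text{IoU}(k)$ and $\rho_{k}$ from a pair of score matrices. It is an illustration under stated assumptions: heads are identified by a flat index (layer times $H$ plus head), rank 1 is the highest score, ties are broken by index (the closed\-form $\rho_{k}$ assumes no ties), and $k \geq 1$. All function names are hypothetical.

```python
from typing import Dict, List, Set

Matrix = List[List[float]]  # L rows (layers) by H columns (heads)


def _flatten(scores: Matrix) -> List[float]:
    return [s for row in scores for s in row]


def top_k_heads(scores: Matrix, k: int) -> Set[int]:
    """Flat indices (layer * H + head) of the k highest-scoring heads."""
    flat = _flatten(scores)
    order = sorted(range(len(flat)), key=lambda i: -flat[i])  # stable on ties
    return set(order[:k])


def iou_at_k(s_img: Matrix, s_txt: Matrix, k: int) -> float:
    """|H_img ∩ H_txt| / |H_img ∪ H_txt| for the top-k sets; assumes k >= 1."""
    h_img, h_txt = top_k_heads(s_img, k), top_k_heads(s_txt, k)
    return len(h_img & h_txt) / len(h_img | h_txt)


def _ranks_within(scores: Matrix, heads: List[int]) -> Dict[int, int]:
    """Rank each head among `heads` only (1 = highest score)."""
    flat = _flatten(scores)
    order = sorted(heads, key=lambda i: -flat[i])
    return {head: rank for rank, head in enumerate(order, start=1)}


def spearman_at_k(s_img: Matrix, s_txt: Matrix, k: int) -> float:
    """Spearman's rho over the union U_k of the two top-k sets."""
    union = sorted(top_k_heads(s_img, k) | top_k_heads(s_txt, k))
    n = len(union)
    if n < 2:
        raise ValueError("need at least two heads in the union")
    r_img = _ranks_within(s_img, union)
    r_txt = _ranks_within(s_txt, union)
    d2 = sum((r_img[i] - r_txt[i]) ** 2 for i in union)
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```

For two $2 \times 2$ matrices whose orderings are exact reverses of each other, `iou_at_k(..., k=2)` is 0 and `spearman_at_k(..., k=2)` is \-1, the two extremes of disagreement.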

##### Score heatmaps\.

![Refer to caption](https://arxiv.org/html/2607.01420v1/x12.png)

Figure 10: CMA attribution score heatmap for heads that jointly attend to both image and text\. Diamonds mark the top\-20 cross\-modal heads\.

![Refer to caption](https://arxiv.org/html/2607.01420v1/x13.png)

Figure 11: CMA attribution score heatmap for heads that attend to image sources\. Circles mark the top\-20 image heads, and diamonds mark the top\-20 cross\-modal heads\.

![Refer to caption](https://arxiv.org/html/2607.01420v1/x14.png)

Figure 12: CMA attribution score heatmap for heads that attend to text sources\. Circles mark the top\-20 text heads, and diamonds mark the top\-20 cross\-modal heads\.

![Refer to caption](https://arxiv.org/html/2607.01420v1/x15.png)

Figure 13: Relative modality specialization of CMA\-scored heads, measured as the normalized rank difference $(r_{\text{img}} - r_{\text{txt}})/N$\. Blue cells indicate image\-dominant heads, while orange cells indicate text\-dominant heads\.
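The normalized rank difference plotted in the specialization heatmap can be computed along the following lines. This is a sketch only: it assumes ranks are taken over all $N$ heads with rank 1 as the highest score, so that a negative value marks a head ranked better for images than for text. The source does not state the rank direction, and the function name is hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # L rows (layers) by H columns (heads)


def normalized_rank_difference(s_img: Matrix, s_txt: Matrix) -> Matrix:
    """Return an L x H matrix of (r_img - r_txt) / N, with N = L * H."""
    flat_img = [s for row in s_img for s in row]
    flat_txt = [s for row in s_txt for s in row]
    n = len(flat_img)
    width = len(s_img[0])

    def ranks(flat: List[float]) -> List[int]:
        # Assumption: rank 1 is the highest score; ties are broken by index.
        order = sorted(range(n), key=lambda i: -flat[i])
        out = [0] * n
        for rank, idx in enumerate(order, start=1):
            out[idx] = rank
        return out

    r_img, r_txt = ranks(flat_img), ranks(flat_txt)
    diff = [(a - b) / n for a, b in zip(r_img, r_txt)]
    # Reshape the flat list back into L rows of H values.
    return [diff[i:i + width] for i in range(0, n, width)]
```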

## Appendix K QAA Rubrics

Listing 1: QAA Answerability rubric\.

1: Not answerable from channel

\- Answer is unsupported, contradictory, or mostly hallucinated\.

2: Severely weak support

\- Only a tiny fragment is grounded; core claim remains unsupported\.

3: Partially answerable

\- Some grounded signal exists but major claim elements are missing or uncertain\.

4: Moderately answerable

\- Core claim is plausible and partly supported, but specificity/precision is limited\.

5: Strongly answerable

\- Main claim is supported with minor uncertainty or missing detail\.

6: Very strongly answerable

\- Precise and well\-supported with clear channel evidence\.

7: Near\-certain answerability

\- Exact, unambiguous, and fully supported by clear, legible evidence \(rare\)\.

Listing 2: Verifier quality score rubric for text\-only QAA\.

1: Poor

\- Unsupported or weakly grounded answer; evidence is missing, contradictory, or largely hallucinated\.

2: Acceptable

\- Some support is present, but grounding is limited in precision, completeness, or clarity\.

3: Good

\- Answer is clearly supported by the paragraph with reasonably specific and relevant evidence\.

4: Excellent

\- Fully supported, precise, and unambiguous; evidence directly and convincingly grounds the answer\.

Listing 3: Rubric for Multimodal QA Entity Verification\.

1: Absent

\- No visual evidence of the entity in the image\.

2: Unlikely

\- Faint or ambiguous trace; probably refers to something else\.

3: Possible

\- Entity may be present, but evidence is weak or unclear\.

4: Probable

\- Entity appears to be present with only minor uncertainty\.

5: Clear

\- Entity is unambiguously and prominently visible\.

Listing 4: Rubric for Multimodal QA Verification\.

1: Redundant

\- One modality alone contains everything needed to answer the question; the other adds nothing essential\.

2: Weak synergy

\- One modality provides most of the answer; the other contributes only minor corroborating detail\.

3: Good synergy

\- Both modalities contribute meaningfully and neither alone is sufficient, but the split is somewhat uneven\.

4: Strong synergy

\- Each modality provides essential, non\-overlapping information; the answer can only be constructed by combining both\.

## Appendix L Prompts

### L\.1 QAA Generation

#### L\.1\.1 Image\-only

Listing 5: Image QAA System Prompt

You are an expert technical Q&A generator for an \*\*image\-only\*\* benchmark\.

\*\*Contract\*\* You receive non\-text rasters plus grounding text chunks selected by image\-text similarity from the same document\. Use grounding chunks \*\*only\*\* to infer document domain, terminology, and what is in\-scope\. \*\*Do not\*\* treat grounding text as evidence for answers: every \*\*specific factual claim\*\* in each answer \(numbers, units, labels, named entities, relationships\) must be \*\*legibly visible or unambiguously readable in the rasters\*\*\. Do \*\*not\*\* copy long phrases or whole sentences from grounding chunks into answers \-\-\- paraphrase minimally and anchor claims in what the raster shows\.

\*\*General\-topic questions\*\*

Questions must sound like ordinary domain questions \(e\.g\. clinical, engineering, policy\) a reader would ask without knowing a figure exists \-\-\- see Q1\. They may name domain constructs \(pathways, metrics, components\) but must \*\*not\*\* point at layout, panels, or the carrier medium\.

\*\*Generation goals\*\*

\- Generate high\-value QA coverage across distinct labels, values, structures, mechanisms, relationships, comparisons, and findings\.

\- Prefer reasoning\-first questions \(mechanism, causality, comparison, procedure, quantity, trade\-off, failure mode, subsystem links\)\.

\- Use only entities/readouts that are reliably legible; omit blurry or ambiguous anchors\.

\*\*QUESTION RULES\*\*

Q1 \-\-\- SOURCE\-AGNOSTIC & STANDALONE

The user does NOT know that this specific document, page, figure, or paragraph exists\.

The question must make sense on its own outside of this document and must NOT assume or mention access to a particular figure, page, section, table, report, image, text, passage, or document\.

Forbidden phrasing includes \(non\-exhaustive\):

"in the image", "in this image", "from the image", "in the figure", "in this figure", "from the figure", "in the diagram", "in this diagram", "from the diagram", "in the table", "in this infographic", "in the infographic", "from this infographic", "shown in", "depicted", "illustrated", "pictured", "visual representation", "in the text", "in this text", "in the passage", "according to", "mentioned in", "described in", "as seen", "as shown", "the figure shows", "the image shows", "the diagram shows", "the infographic shows", "the document states", "the report says", "in this document", "in this report", "in this section", "on this page", "on the page", "in the screenshot", "as displayed"\.

GOOD examples \(general topic, no carrier\):

\- "Which operating mode corresponds to the highest throughput value?"

\- "How does the reported failure rate change after the calibration step?"

\- "Under the stated inclusion criteria, which comorbidity category is excluded from randomization?"

BAD examples:

\- "What does the graph show about yield?" \-\> drop\.

\- "What label appears in the lower panel?" \-\> drop\.

\- "What anatomical landmark does the measurement line terminate at?" \-\> drop\.

\- "According to this infographic, which region had the highest demand?" \-\> drop\.

\- "What trend is shown in this screenshot for month\-over\-month growth?" \-\> drop\.

\- "From the figure above, what is the operating temperature?" \-\> drop\.

Q2 \-\-\- DOMAIN GROUNDED

Questions must target verifiable facts \(measurements, labels, values, relationships, mechanisms, comparisons, classifications, findings\)\. Ask as if the reader already knows there is source material and wants scientific/technical content\.

GOOD examples:

\- "What was the peak activation observed under condition X?"

\- "Which region showed the greatest fold\-change between groups A and B?"

Q3 \-\-\- NO PERCEPTUAL QUESTIONS

Do not ask about colors, spatial layout, positions, background, lighting, shadows, aesthetics, textures, or appearance\-only size judgments\.

GOOD examples:

\- "What pressure range is specified for safe operation?"

\- "Which subsystem is identified as the bottleneck in the described workflow?"

BAD examples:

\- "What color is the highlighted region?" \-\> drop\.

\- "What two objects appear together on the left?" \-\> drop\.

\- "What color are the tiles on the structure?" \-\> drop\.

Q4 \-\-\- NO CO\-OCCURRENCE QUESTIONS

Do not ask questions where the only answer is that two things appear together\.

GOOD examples:

\- "What functional dependency is described between the valve setting and outlet flow?"

\- "Which component failure would most directly explain the observed pressure drop?"

BAD examples:

\- "What concept is the researcher shown alongside?" \-\> drop\.

\- "What is the relationship between the tractor and Food?" \-\> drop\.

Q5 \-\-\- NO HALLUCINATION

Do not assert facts not supported by legible raster content\.

Q6 \-\-\- MAXIMIZE DIVERSITY

Cover as many distinct supported facts as possible; avoid near\-duplicate rewordings\.

GOOD examples:

\- Ask one quantitative question, one mechanism question, and one comparison question when support exists\.

\- Prefer new facts over paraphrasing an earlier question about the same value or label\.

Hard uniqueness requirement:

\- Do not emit duplicate or near\-duplicate questions \(including paraphrases with the same answer target\)\.

\- If two candidates ask essentially the same thing, keep only the more specific one\.

\*\*ANSWER RULES\*\*

A0 \-\-\- RASTER\-ANCHORED SPECIFICS

Concrete claims in the answer must be justified by \*\*legible raster content\*\* \(printed text, axis tick values, table cells, diagram labels\)\. If grounding text suggests a fact but the raster does not clearly show it, \*\*omit\*\* that pair\.

A1 \-\-\- FACTUAL AND PRECISE

State exact value/label/name/relationship using domain terminology\.

GOOD examples:

\- "Peak torque is 245 N·m at 1800 rpm\."

\- "The limiting stage is the heat\-exchanger loop, which caps flow to 3\.2 L/min\."

A2 \-\-\- SELF\-CONTAINED

Answer must be informative to a domain expert without seeing the source raster\.

GOOD examples:

\- "The alarm state indicates over\-temperature protection triggered by sustained inlet values above 90 C\."

\- "The procedure requires depressurization before seal replacement to prevent cavitation damage\."

\*\*Process\*\*

1\) Produce `domain_grounding` \(2\-\-4 sentences\) summarizing subject matter and terminology from \*\*raster\-legible\*\* content, aligned with grounding chunks for domain only\.

2\) Set `is_relevant` false only for blank/decorative/unusably degraded content\.

3\) If relevant, emit all strong non\-redundant `qa_pairs`\. Each pair \*\*must\*\* include:

\- `question`, `answer`, `type` in \{relational, inferential, procedural, quantitative\}

\- `answer_evidence`: one of `"visual"` \(specific values/labels/readouts in the raster are \*\*essential\*\* to justify the answer\) or `"visual_plus_general"` \(answer combines one raster\-specific fact with a short domain\-general clause that is still consistent with the raster\)

\- optional `evidence_anchor`

\*\*Output\*\* Raw JSON only\. Relevant: `domain_grounding`, `is_relevant` true, `relevance_rationale`, `qa_pairs`\. Else `is_relevant` false, `qa_pairs` \[\]\.

#### L\.1\.2 Text\-only

Listing 6: Text\-only QAA System Prompt

You are a high\-quality QA data generator\.

Given a single paragraph of text, you must generate question\-answer pairs for reading\-comprehension style evaluation\.

Each triplet must satisfy ALL of these rules:

1\. The answer MAY be paraphrased \(it does not need to be copied verbatim\)\.

2\. The answer MUST be fully supported by the paragraph\. Do NOT add facts not present in the paragraph\.

3\. The answer MUST be between 12 and 25 words long \(inclusive\)\.

4\. The question MUST require reading comprehension of the paragraph, not just simple word or name lookup\.

5\. Each question MUST be answerable solely from the given paragraph, without any external knowledge\.

6\. Triplets must be diverse: do NOT ask multiple questions that can be answered with nearly the same statement\.

7\. NEVER refer to 'the paragraph', 'this paragraph', 'the text', 'the document', or similar meta wording in the question\.

You MUST output valid JSON only, with a top\-level key 'triplets' containing a list of objects with keys: 'question' and 'answer'\.
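The mechanically checkable parts of this prompt (the JSON shape, rule 3, and rule 7) lend themselves to a post\-hoc filter on the generator's output. The sketch below is an assumption about how such a filter might look, not part of the paper's pipeline: `validate_triplets` is a hypothetical name, words are counted by whitespace splitting, and only the four meta phrases named in rule 7 are matched. Rules 2, 4, 5, and 6 need semantic judgment and are not covered.

```python
import json
import re
from typing import Dict, List

# Meta wording named explicitly in rule 7 (the prompt also bans "similar" phrasing,
# which this simple pattern does not attempt to cover).
META_WORDING = re.compile(
    r"\b(the paragraph|this paragraph|the text|the document)\b", re.IGNORECASE
)


def validate_triplets(raw: str) -> List[Dict[str, str]]:
    """Keep only triplets that pass the length rule and the meta-wording rule."""
    data = json.loads(raw)  # raises ValueError if the output is not valid JSON
    kept = []
    for triplet in data.get("triplets", []):
        question = triplet.get("question", "")
        answer = triplet.get("answer", "")
        n_words = len(answer.split())  # assumption: whitespace-delimited words
        if not 12 <= n_words <= 25:
            continue  # rule 3: answer length, inclusive bounds
        if META_WORDING.search(question):
            continue  # rule 7: no references to the source paragraph
        kept.append({"question": question, "answer": answer})
    return kept
```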

#### L\.1\.3 Multimodal

Listing 7: Multimodal QAA System Prompt

You are an expert at creating challenging, non\-trivial questions about scientific and technical content\.

Your questions will be used for MULTIMODAL RETRIEVAL evaluation: a user has a genuine information need, submits their question, and the system must find the right document\.

For this to work, the question must be something a user would ACTUALLY ASK \-\- not a pure blank\-fill where the answer adds nothing the question did not already contain\.

\*\*QUESTION RULES\*\*

RULE 1 \-\- SOURCE\-AGNOSTIC & STANDALONE

The user does NOT know that this specific document, page, figure, or paragraph exists\.

They only have an information need in the world\. The question MUST make sense on its own, outside of this document, and MUST NOT assume or mention that the user has access to a particular figure, page, section, table, or report\.

Forbidden behaviour:

\- Do NOT reference the image, text, figure, diagram, passage, page, or document\.

\- Do NOT write questions that would only make sense if the user could see "this page", "this figure", "this document", or "this section"\.

Forbidden phrases \(non\-exhaustive, always rewrite if they appear\):

\- "in the image", "in this image", "in the figure", "in this figure", "in the table", "shown in", "depicted", "illustrated", "pictured", "visual representation"

\- "in the text", "in this text", "in the passage", "according to", "mentioned in", "described in", "as seen", "as shown", "the figure shows", "the document states", "in this document", "in this report", "in this section", "on this page"\.

RULE 2 \-\- ANSWER MUST ADD NEW FACTUAL CONTENT

The answer must introduce at least one piece of information the question did not already contain: a specific number, measurement, date, named entity, mechanism, comparison, or qualifying detail that cannot be read directly out of the question\.

Shared entity names, technical terms, and proper nouns between question and answer are FINE \- these are what retrieval systems use to find the right document and image\.

What is forbidden is an answer that is a pure blank\-filling completion with nothing new\.

JEOPARDY \(WRONG\) \- answer adds nothing new:

Q: "Which VLT drive controls the high\-pressure pump on the ROV?"

A: "The VLT drive controls the high\-pressure pump on the ROV\."

Why wrong: the answer merely echoes the question with no new fact added\.

GENUINE \(RIGHT\) \- answer adds a new fact using shared terminology:

Q: "What does the VLT drive control on the ROV?"

A: "The VLT drive controls the high\-pressure pump mounted directly on the ROV, delivering jetting water to the sword at up to 200 bar\."

Why right: "mounted directly on the ROV", "delivering jetting water", "200 bar" are all new facts not present in the question\. Shared terms like "VLT drive" and "ROV" are expected and help retrieval\.

RULE 3 \-\- REQUIRES BOTH MODALITIES

The answer must be impossible to construct from either the image OR the text alone\.

Ask for things that only exist at the intersection: a number visible in a diagram but explained in the text; a species identified by visual features but located by the text; a mechanism shown in a schematic but described in prose; a comparison between what is labelled and what is measured\.

RULE 3B \-\- NO ANNOTATION\-DEPENDENT QUESTIONS

Do not generate questions that are only answerable because of a specific graphical annotation \- an arrow, measurement line, bounding box, bracket, callout, or pointer \- and that ask what the annotation points to, originates from, or terminates at\.

The question must be independently answerable from domain knowledge, not from knowing where a graphical mark happens to appear\.

Forbidden examples:

\- "What anatomical landmark does the measurement line terminate at?" \-\> DROP

\- "What does the arrow on the left indicate?" \-\> DROP

\- "What component is the callout pointing to?" \-\> DROP

RULE 4 \-\- ASK FOR SPECIFIC FACTS, NOT ENTITY NAMES

Prefer questions that ask HOW, HOW MANY, WHY, WHAT DOES X DO, WHAT DISTINGUISHES X FROM Y, UNDER WHAT CONDITIONS, rather than WHICH X / WHAT IS THE NAME OF X\.

If you must ask "what is X", make sure the question does not already describe X so fully that only one possible answer exists\.

RULE 5 \-\- ANSWER MUST ADD NEW FACTUAL CONTENT

The answer must introduce at least one new fact not already stated in the question: a number, measurement, mechanism, named entity not in the question, or qualifying detail\.

Shared proper nouns and technical terms between question and answer are EXPECTED and FINE \-\- they are the vocabulary retrieval systems use to find the right document\. Rewrite only when the answer is a pure blank\-fill that adds nothing new at all\.

Generate exactly \{n\_questions\} question\-answer pairs\.

Output ONLY raw JSON \-\- no markdown fences, no preamble\.

### L\.2 QAA Filtering

#### L\.2\.1 Image\-only

You certify \*\*image\-only\*\* Q&A using the bundled non\-text raster and the QUESTION \+ REFERENCE ANSWER \(no separate document text\)\.

\*\*GLOBAL \-\-\- rationales\*\* No image/figure/page/diagram/photo/chart/graph/table/slide/panel/process/map/screenshot/infographic/; no shown/depicted/visible/here/this/that/left/right/above/below/look at\. Use "question text", "reference answer", "supporting channel", "pair"\.

\*\*Ranking\*\* Answerability 1\-\-7 is the primary rank key; 7 is rare; typical acceptable pairs 4\-\-6\. Optional per\-document \*\*retention cap\*\* is configured outside this prompt \(0 = keep all certified rows\)\.

\*\*What answerability means \(operational definition\)\*\*

Answerability is the extent to which the QUESTION can be answered correctly, specifically, and unambiguously from the bundled raster channel, with the REFERENCE ANSWER aligned to what that channel supports\.

\- Judge support from legible technical content only \(readable labels, values, structures, and explicit relationships\)\.

\- Penalize when the answer relies on speculation, unstated assumptions, weak visual impressions, or information not recoverable from the raster\.

\- Penalize when question scope and answer scope do not match \(overclaiming, added details, wrong granularity\)\.

\- Penalize when the REFERENCE ANSWER could be written as \*\*generic domain boilerplate\*\* without checking \*\*specific marks, numbers, or labels\*\* in the raster \(no image\-tied specifics\)\.

\- This is not fluency scoring; a well\-written but unsupported answer should still score low\.

\*\*Answerability score rubric \(1\-7\)\*\*

\- 1: Not answerable from channel; answer is unsupported, contradictory, or mostly hallucinated\.

\- 2: Severely weak support; only tiny fragment is grounded, core claim remains unsupported\.

\- 3: Partially answerable; some grounded signal exists but major claim elements are missing or uncertain\.

\- 4: Moderately answerable; core claim is plausible and partly supported, but specificity/precision is limited\.

\- 5: Strongly answerable; main claim is supported with minor uncertainty or missing detail\.

\- 6: Very strongly answerable; precise and well\-supported with clear channel evidence\.

\- 7: Near\-certain answerability; exact, unambiguous, and fully supported by clear legible evidence \(rare\)\.

\*\*Hard floors\*\* If triggered: `answerability` = 1; align passes; cite rule id \(QR1\-\-QR8\) in a rationale\.

QR1 question violates SOURCE\-AGNOSTIC & STANDALONE rule \(references/assumes a specific image/text/figure/document/page/section/table/report or uses forbidden source phrases\) · QR2 circular answer · QR3 annotation\-dependent/layout\-dependent question · QR4 trivial/low\-value fact · QR5 instance facts not in image channel · QR6 perceptual or structure\-appearance question · QR7 co\-occurrence\-only question · QR8 panel\-specific reference\.

\*\*Axes\*\*

1 \*\*Answerability\*\* \-\-\- channel vs Q\+A; downgrade unsupported claims\.

2 \*\*source\_ref\_pass\*\* \-\-\- evaluate the QUESTION only using SOURCE\-AGNOSTIC & STANDALONE\.

Fail if the question references the attributed source/carrier \(image, figure, diagram, infographic, screenshot, text, passage, document, page, section, table, report\), assumes access to "this" material, or uses forbidden source phrases such as "in the image", "from the image", "in this diagram", "the diagram shows", "in this infographic", "the infographic shows", "shown in", "according to", "as shown", "the document states", "the report says", or "on this page" \(fail → QR1\)\.

Do not fail merely because the question mentions domain entities; fail only when the wording depends on or points to a specific source artifact\.

3 \*\*image\_quality\_pass\*\* \-\-\- fail if channel lacks readable domain signal where the pair needs it\.

4 \*\*triviality\_pass\*\* \-\-\- fail shallow naming/label echo without reasoning\.

\*\*Hard\-floor examples \(pairs that should trigger floors / very low answerability\)\*\*

\- QR1: "What does the graph show about yield?", "What is the label in the lower panel?"

\- QR1: "According to this infographic, what category dominates?", "From the screenshot, what value is displayed in the top\-right widget?"

\- QR3: "What does the arrow on the left indicate?", "What anatomical landmark is the measurement line originating from?"

\- QR6: "What color is the highlighted region?", "What color are the tiles on the structure?"

\- QR7: "What concept is the person shown in relation to?", "What is the relationship between the tractor and Food?"

\- QR8: "What is the significance of label IVV in the lower panel?", "What structure is labeled in panel B?"

\- Visual\-layout answer penalty: "The cell is shown in the context of apoptosis\." \-\> score in 1\-\-2 range severity\.

\*\*GOOD certification examples\*\* \(question text avoids carrier/deictic/source phrasing; answer matches legible technical content; expect answerability \~5\-\-6\+, passes aligned with content\)

\- Q: "What is the maximum rated load in kilonewtons?" · REF: "Maximum rated load is 120 kN\." → Strong if the value appears clearly on a label/spec table in the raster\.

\- Q: "Which intervention arm achieved the higher median survival at 24 months?" · REF: "Arm B had higher median survival at 24 months than Arm A\." → Strong if survival or a chart encodes the comparison without guessing\.

\- Q: "What step immediately precedes the pressure\-relief sequence in the flow diagram?" · REF: "The purge cycle completes immediately before the pressure\-relief sequence\." → Strong if the workflow order is unambiguous in the figure\.

\*\*BAD certification examples\*\* \(expect floor QR codes, `source_ref_pass` false, or answerability 1\-\-3 as appropriate\)

\- Q: "What trend does the figure illustrate for cohort 2?" · REF: "Cohort 2 declines after week 6\." → BAD \(QR1 \+ weak grounding if 'figure' is the only anchor\)\.

\- Q: "What is written on the sticker in the photo?" · REF: "The sticker says 'authorized personnel only'\." → BAD \(entity/carrier wording in question; fail source\_ref / QR1\-class\)\.

\- Q: "According to this infographic, which segment has the highest share?" · REF: "The enterprise segment has the highest share\." → BAD \(QR1 source\-reference phrasing\)\.

\- Q: "From the screenshot, what error code is shown?" · REF: "Error code E\-417 is shown\." → BAD \(QR1 source\-reference phrasing\)\.

\- Q: "What does the image above say about the pressure threshold?" · REF: "It says the threshold is 2\.5 bar\." → BAD \(QR1 source\-reference \+ deictic wording\)\.

\- Q: "Is the arrow pointing up or down?" · REF: "The arrow points up\." → BAD \(layout/deictic; QR3/QR6\-class\)\.

\- Q: "What color is the safety railing?" · REF: "The railing is yellow\." → BAD \(perceptual; QR6\)\.

\- Q: "What two logos appear side by side in the header?" · REF: "Company A and Company B\." → BAD \(co\-occurrence/appearance; QR7\)\.

\- REF alone: "The measurement is taken at the inlet port shown on the right\." → BAD answer pattern \(spatial/deictic in the reference answer; penalize answerability and align rationales\)\.

\*\*Output\*\* Raw JSON: answerability \(int\), answerability\_rationale, source\_ref\_pass, source\_ref\_rationale, image\_quality\_pass, image\_quality\_rationale, triviality\_pass, triviality\_rationale\.

#### L\.2\.2 Text\-only

Listing 8: Text\-only Verification System Prompt

You are a strict verifier for QA data\.

Given a paragraph, a question, and an answer, decide if the answer is fully supported by the paragraph\.

Also identify the exact verbatim evidence span inside the paragraph that supports the answer\.

Rules:

\- Output JSON ONLY with keys: supported \(boolean\), quality\_score \(int\), extractive\_answer \(string\)\.

\- Do NOT rewrite the question or answer\. Judge only\.

\- extractive\_answer MUST be a verbatim substring of the paragraph and MUST be 12\-25 words\.

\- If supported=true, extractive\_answer is the evidence span supporting the answer\.

\- If supported=false, extractive\_answer should still be the best contiguous verbatim span that relates to the question \(or empty string if none\)\.

\- quality\_score is on a 1\-4 scale: 1 poor, 2 acceptable, 3 good, 4 excellent\.
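Because the verifier's output contract is fully specified, it can be checked programmatically before a row is accepted. The following is a minimal sketch of such a check, assuming whitespace word counting; `check_verifier_output` is a hypothetical helper and not something the paper describes.

```python
import json
from typing import List

EXPECTED_KEYS = {"supported", "quality_score", "extractive_answer"}


def check_verifier_output(raw: str, paragraph: str) -> List[str]:
    """Return a list of contract violations; an empty list means well-formed."""
    out = json.loads(raw)
    if not isinstance(out, dict):
        return ["output is not a JSON object"]
    problems = []
    if set(out) != EXPECTED_KEYS:
        problems.append("keys differ from supported/quality_score/extractive_answer")
    if not isinstance(out.get("supported"), bool):
        problems.append("supported must be a boolean")
    score = out.get("quality_score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not (isinstance(score, int) and not isinstance(score, bool) and 1 <= score <= 4):
        problems.append("quality_score must be an integer from 1 to 4")
    span = out.get("extractive_answer")
    if not isinstance(span, str):
        problems.append("extractive_answer must be a string")
    elif span == "":
        # The empty string is only allowed when nothing relates to the question.
        if out.get("supported") is True:
            problems.append("a supported answer needs an evidence span")
    else:
        if span not in paragraph:
            problems.append("extractive_answer is not a verbatim substring")
        if not 12 <= len(span.split()) <= 25:
            problems.append("extractive_answer must be 12-25 words")
    return problems
```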

#### L\.2\.3 Multimodal

Listing 9: Multimodal Quality System Prompt

You are a QA quality evaluator for multimodal scientific document comprehension\.

You will rate question\-answer pairs on a 1\-7 scale based purely on QA quality: how well\-formed, specific, accurate, and non\-trivial the pair is\.

Do NOT factor in whether both modalities are required \-\- that is assessed separately\.

Use the FULL range of scores\. Most well\-formed pairs should score 4\-6; reserve 7 for truly exemplary pairs and 1\-2 for pairs that should be excluded\.

\-\- HARD FLOOR RULES \(immediately set score=1, stop evaluation\) \-\-

QR1 \- SOURCE\-REFERENCE VIOLATION

The question MUST NOT reference the source document, image, figure, table, page, or section\.

Banned phrases: "in this image", "in the figure", "shown in", "according to this document", "on this page", "in the text", "the diagram shows", "as depicted", "based on", "as per", and any equivalent phrasing that assumes the reader can see a specific source\.

\-\> ANY occurrence of source\-referencing language: HARD FLOOR score=1\.

QR2 \- CIRCULAR ANSWER \(no new factual content\)

The answer must introduce at least one new fact not already stated in the question: a new number, measurement, named entity, mechanism, or qualifying detail\.

Shared proper nouns and technical terms between question and answer are PERMITTED \- they are useful for retrieval and expected in domain\-specific QA\.

The floor triggers only when the answer is a pure blank\-filling completion that adds no new information whatsoever\. \-\> HARD FLOOR score=1\.

QR3 \- ANNOTATION\-DEPENDENT QUESTION

A question whose answer is only knowable from a graphical annotation \(arrow, callout, measurement line, bounding box\) rather than domain knowledge\. \-\> HARD FLOOR score=1\.

You MUST provide 'Rule:' \(the violated rule ID or "PASS"\), 'Evaluation:', and 'Total rating:' in your answer\.
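A downstream consumer of this evaluator has to read the `Rule:` and `Total rating:` fields back out of free text and enforce the hard floor itself. The sketch below shows one way this could be done; the regular expressions and the function name are assumptions, since the paper does not describe its parsing code.

```python
import re
from typing import Dict, Union

RULE_RE = re.compile(r'Rule:\s*"?(PASS|QR[1-3])"?', re.IGNORECASE)
RATING_RE = re.compile(r"Total rating:\s*([1-7])\b")


def parse_quality_response(text: str) -> Dict[str, Union[str, int]]:
    """Extract the rule ID and rating, forcing score=1 when a hard floor fired."""
    rule = RULE_RE.search(text)
    rating = RATING_RE.search(text)
    if rule is None or rating is None:
        raise ValueError("response is missing 'Rule:' or 'Total rating:'")
    rule_id = rule.group(1).upper()
    score = int(rating.group(1))
    if rule_id != "PASS":
        score = 1  # hard floor: any violated rule overrides the stated rating
    return {"rule": rule_id, "score": score}
```

Overriding the stated rating guards against a response that names a violated rule but still reports a higher score.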

Listing 10: Multimodal Crossmodal System Prompt

You are a strict multimodal grounding evaluator for document QA pairs\.

Your task is to verify whether each modality \(image and text\) individually grounds the specific entities and claims in the answer \-\- NOT just whether they are topically related\.

Critical distinction:

\- An image of product packaging does NOT ground an answer about a biological process, even if both relate to the same general topic\.

\- An image must VISUALLY DEPICT, LABEL, or MEASURE the specific entities named in the answer to count as meaningful image grounding\.

\- The score reflects the WEAKEST modality: if the image does not ground any key answer entity, the overall score is low even if the text is perfect\.

You MUST provide 'Image grounding:', 'Text grounding:', 'Evaluation:', 'Reason:', and 'Total rating:' in your answer\.

The 'Reason:' field must be exactly one sentence, max 30 words, starting with the weakest modality or "PASS"\.

### L\.3 Baselines

#### L\.3\.1 General Baseline Prompt

Given the custom preambles for VLM/LLM approaches, our baseline system prompt is as follows:

Do not force use of a source type\. Include text evidence only when supportive text is present; include image indices only when supportive visual evidence is present\.

\#\# Evidence extraction

\- `text_quote`: copy the \*\*complete paragraph\*\* from the provided text that most directly supports the reference answer\. Use the paragraph exactly as it appears \-\-\- no truncation, no trimming to sub\-paragraph fragments\. If the paragraph exceeds 150 words, copy the most relevant contiguous sentences within it, but include at least the full sentence containing the key fact plus one sentence of context on each side\. Must be an exact substring of the provided text\. If no text evidence is used, return `""`\.

\- `image_indices`: list of integer indices that refer to the provided image labels `[Image k]` only\. Use unique integers with at most 2 entries\. If no image evidence is used, return `[]`\.

\#\# Constraints

\- `image_indices`: sorted, unique integers\. At most 2 entries\.

\- If text evidence is used, `text_quote` must be non\-empty\.

\- If image evidence is used, `image_indices` must be non\-empty\.

\- It is valid to use text only \(`image_indices=[]`\), images only \(`text_quote=""`\), or both\.

\#\# Output format \(strict\)

Reply with \*\*only\*\* a single JSON object \(no markdown fences, no commentary, no extra keys\)\.

The user message will repeat the exact question and reference answer; echo them back in your JSON as `question` and `reference_answer` for traceability\.

The JSON object must contain \*\*exactly\*\* these keys:

\- `question` \(string\)

\- `reference_answer` \(string\)

\- `text_quote` \(string\)

\- `image_indices` \(array of integers\): unique integers, at most 2 entries

\- `rationale` \(string\): One sentence describing which selected source material supports the answer\.
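The constraints in this prompt translate directly into a structural check on each baseline response. The sketch below is one possible validator under stated assumptions: image labels are zero\-based, as in the preamble's `[Image 0]`, `[Image 1]` convention, and `validate_attribution` and its `num_images` parameter are hypothetical names introduced for illustration.

```python
import json
from typing import List

REQUIRED_KEYS = {"question", "reference_answer", "text_quote", "image_indices", "rationale"}


def validate_attribution(raw: str, document_text: str, num_images: int) -> List[str]:
    """Return a list of constraint violations; an empty list means the reply conforms."""
    out = json.loads(raw)
    if not isinstance(out, dict):
        return ["reply is not a single JSON object"]
    problems = []
    if set(out) != REQUIRED_KEYS:
        problems.append("keys differ from the required set")
    quote = out.get("text_quote", "")
    # The empty string is a substring of any text, so text_quote="" passes here.
    if not isinstance(quote, str) or quote not in document_text:
        problems.append("text_quote is not an exact substring of the provided text")
    indices = out.get("image_indices", [])
    all_ints = isinstance(indices, list) and all(
        isinstance(i, int) and not isinstance(i, bool) for i in indices
    )
    if not all_ints:
        problems.append("image_indices must be an array of integers")
    else:
        if indices != sorted(set(indices)):
            problems.append("image_indices must be sorted and unique")
        if len(indices) > 2:
            problems.append("image_indices may hold at most 2 entries")
        if any(i < 0 or i >= num_images for i in indices):
            problems.append("image_indices refers to an image that was not provided")
    return problems
```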

#### L\.3\.2 VLM Preamble

When using raw raster content, we can apply the modification below, followed by the general prompt\.

You are given \*\*document images\*\* \(labels `[Image 0]`, `[Image 1]`, \.\.\. in order\) and \*\*document text\*\*\. Use only this provided context as evidence for attribution\.

The user message ends with a \*\*benchmark question\*\* and a \*\*reference answer\*\*\.

Your job is \*\*not\*\* to re\-answer the question\. Your job is to select the source material from the provided images and/or text that supports the reference answer, then attribute that evidence\.

\#\# Source material selection

Review all provided evidence and pick what directly supports the reference answer:

1\. \*\*Scan document images\*\* and identify indices whose visual content \(figures, tables, charts, diagrams, schematics, photos, labeled components, layouts\) is relevant and supportive of the reference answer\.

2\. \*\*Scan document text\*\* and find the passage that most directly states or supports the key fact\(s\) in the reference answer\.

3\. Use whichever evidence is actually supportive:

\- text only,

\- images only, or

\- both text and images\.

#### L\.3\.3 LLM Preamble

When using captions instead of raw raster content, we can apply the modification below, followed by the general prompt\.

You are given \*\*document image captions\*\* \(for labels `[Image 0]`, `[Image 1]`, \.\.\. in order\) and \*\*document text\*\*\. Use only this provided context as evidence for attribution\.

The user message ends with a \*\*benchmark question\*\* and a \*\*reference answer\*\*\.

Your job is \*\*not\*\* to re\-answer the question\. Your job is to select the source material from the provided captions and/or text that supports the reference answer, then attribute that evidence\.

\#\# Source material selection

Review all provided evidence and pick what directly supports the reference answer:

1\. \*\*Scan image captions\*\* tied to `[Image k]` and identify indices whose described visual content is relevant and supportive of the reference answer\.

2\. \*\*Scan document text\*\* and find the passage that most directly states or supports the key fact\(s\) in the reference answer\.

3\. Use whichever evidence is actually supportive:

\- text only,

\- images only, or

\- both text and images\.