The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm

Hugging Face Daily Papers Papers

Summary

This paper challenges the assumption that current Vision-Language Models faithfully synthesize multimodal data, proposing an information-theoretic Modality Translation Protocol with new metrics (Toll, Curse, Fallacy of Seeing) to evaluate trustworthiness over traditional multimodal gain.

The rapid proliferation of Vision-Language Models (VLMs) is often framed as enabling unified multimodal knowledge discovery but rests on an under-examined assumption: that current VLMs faithfully synthesise multimodal data. We argue they often do not, and this gap reflects a trustworthiness problem in the dominant Vision Encoder-Projector-LLM paradigm. Rather than extracting grounded knowledge from visual inputs, state-of-the-art models frequently exhibit functional blindness, i.e., exploiting strong language priors to bypass severe visual representation bottlenecks. In this work, we challenge the conventional methodology of multimodal evaluation, which relies on data ablation or new dataset creation and therefore conflates dataset biases with architectural incapacity. We propose an information-theoretic departure: the Modality Translation Protocol, designed to quantify what we call the Expense of Seeing. By translating semantic payloads rather than ablating them, we formulate three novel metrics -- the Toll (ToS), Curse (CoS), and Fallacy (FoS) of Seeing -- culminating in the Semantic Sufficiency Criterion (SSC). Furthermore, we hypothesise a Divergence Law of Multimodal Scaling: as the underlying language engines scale to unprecedented reasoning capabilities, the penalty of the visual knowledge bottleneck may increase rather than diminish. We argue the community should move beyond "multimodal gain" as a primary evaluation target. By elevating the SSC from a passive diagnostic constraint to an active architectural blueprint, we provide a foundation for guiding the next generation of AI systems toward genuine multimodal reasoning.
Original Article
View Cached Full Text

Cached at: 05/25/26, 06:36 AM

Paper page - The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm

Source: https://huggingface.co/papers/2604.20665

Abstract

Vision-Language Models often fail to faithfully synthesize multimodal data due to reliance on language priors over visual representation, necessitating new evaluation frameworks that prioritize semantic sufficiency over traditional multimodal gain metrics.

The rapid proliferation ofVision-Language Models(VLMs) is often framed as enabling unified multimodal knowledge discovery but rests on an under-examined assumption: that current VLMs faithfully synthesise multimodal data. We argue they often do not, and this gap reflects a trustworthiness problem in the dominantVision Encoder-Projector-LLM paradigm. Rather than extracting grounded knowledge from visual inputs, state-of-the-art models frequently exhibitfunctional blindness, i.e., exploiting strong language priors to bypass severe visual representation bottlenecks. In this work, we challenge the conventional methodology ofmultimodal evaluation, which relies on data ablation or new dataset creation and therefore conflates dataset biases with architectural incapacity. We propose an information-theoretic departure: theModality Translation Protocol, designed to quantify what we call theExpense of Seeing. By translating semantic payloads rather than ablating them, we formulate three novel metrics -- theToll (ToS),Curse (CoS), andFallacy (FoS)of Seeing -- culminating in theSemantic Sufficiency Criterion (SSC). Furthermore, we hypothesise aDivergence Law of Multimodal Scaling: as the underlying language engines scale to unprecedented reasoning capabilities, the penalty of the visual knowledge bottleneck may increase rather than diminish. We argue the community should move beyond “multimodal gain” as a primary evaluation target. By elevating the SSC from a passive diagnostic constraint to an active architectural blueprint, we provide a foundation for guiding the next generation of AI systems toward genuine multimodal reasoning.

View arXiv pageView PDFAdd to collection

Get this paper in your agent:

hf papers read 2604\.20665

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2604.20665 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2604.20665 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2604.20665 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Similar Articles

Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap

arXiv cs.CL

This paper introduces CrossMath, a controlled multimodal reasoning benchmark that reveals a critical limitation in current vision-language models: they perform reasoning primarily in textual space rather than genuine vision-grounded reasoning, with visual input often degrading performance compared to text-only baselines. The authors propose fine-tuning approaches to mitigate this modality gap and improve multimodal reasoning capabilities.

More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models

Papers with Code Trending

This paper uncovers that prolonged reasoning in vision-language models can impair perceptual grounding, causing recognition failures on basic visual questions. It proposes Vision-Anchored Policy Optimization (VAPO) to steer reasoning toward visually grounded trajectories, achieving state-of-the-art performance with the VAPO-Thinker-7B model.

When Vision Speaks for Sound

Hugging Face Daily Papers

This paper identifies that video-capable multimodal LLMs often appear to understand audio but actually rely on visual cues, a failure mode termed the audio-visual Clever Hans effect. It introduces Thud, an intervention-driven probing framework to diagnose this issue, and proposes an alignment recipe that improves audio-visual consistency by 28 percentage points.

Improving Multimodal Reasoning via Worst Dimension Optimization

arXiv cs.AI

This paper introduces Multimodal Multi-Dimensional Scalarization Process Reward Modeling (MMS-PRM), which enforces the worst dimension's robustness in multimodal reasoning to prevent failures like visual hallucinations from being masked by strong text logic.