Seeing Is Not Sharing: Some Vision-Language Models Overestimate Common Ground in Asymmetric Dialogue
Summary
This paper investigates a bias in which some vision-language models overestimate shared understanding in dialogue by confusing perceptual access with communicative grounding. The findings have implications for dialogue systems and VLM evaluation.
Cached at: 07/03/26, 03:52 AM
Source: https://huggingface.co/papers/2606.31719
This paper investigates a subtle but important distinction in collaborative dialogue: whether vision-language models can distinguish what could be shared (through shared perception) from what has been shared (through grounding in interaction). Using 13,077 annotated reference expressions from HCRC MapTask dialogues, we evaluate VLMs under controlled manipulations of dialogue context and map-information access.
A key finding is that providing authentic map images improves overall VLM performance but introduces a systematic bias toward over-predicting alignment between participants: models tend to assume that interlocutors share the same interpretation simply because they share the same visual input. Textual descriptions of the same map content reproduce this bias, while non-informative images suppress alignment predictions entirely, suggesting that the bias is driven by task-relevant content rather than by the visual modality itself.
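One way to make "over-predicting alignment" concrete is to compare, per input condition, how often a model predicts alignment with how often the annotations mark it. The sketch below is only an illustration under stated assumptions: the record layout, the condition names (`no_map`, `real_map`), and the rate-difference measure are hypothetical and are not taken from the paper, and the four toy records are invented solely to exercise the function.

```python
from collections import defaultdict


def alignment_bias(records):
    """Return {condition: predicted-aligned rate minus gold-aligned rate}.

    Each record is (condition, predicted_aligned, gold_aligned).
    A positive value means the model predicts alignment more often than
    the annotations do under that condition. This measure is an
    assumption made for illustration, not the paper's metric.
    """
    # Per condition: [number of items, predicted aligned, gold aligned]
    counts = defaultdict(lambda: [0, 0, 0])
    for condition, predicted_aligned, gold_aligned in records:
        entry = counts[condition]
        entry[0] += 1
        entry[1] += int(predicted_aligned)
        entry[2] += int(gold_aligned)
    return {cond: (pred - gold) / n for cond, (n, pred, gold) in counts.items()}


# Invented toy records (hypothetical conditions, not real results).
toy_records = [
    ("no_map", False, False),
    ("no_map", True, True),
    ("real_map", True, False),
    ("real_map", True, True),
]

bias = alignment_bias(toy_records)
for condition, value in sorted(bias.items()):
    print(f"{condition}: {value:+.2f}")
```

A rate difference of this kind separates a shift in how often alignment is predicted from overall accuracy, which matters here because the summary reports that map access improves performance while also introducing the bias.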
This has implications for anyone working on dialogue systems, grounded language understanding, or VLM evaluation: some current models conflate perceptual access with communicative grounding, which is precisely the kind of error that matters in real collaborative settings. We’d be curious to hear thoughts on how this bias might be mitigated, whether through training objectives that explicitly model asymmetric information states or through architectural changes that separate perceptual and discourse-level representations.
Similar Articles
Seeing Is Not Sharing: Some Vision-Language Models Overestimate Common Ground in Asymmetric Dialogue
This paper investigates whether vision-language models can distinguish potential from established common ground in asymmetric dialogue. Experiments on MapTask data show that providing task-relevant map content (visual or textual) biases models toward over-predicting alignment, as they rely on static referential cues rather than tracking grounding through dialogue history.
Vision-Default, Prior-Override: Causal Mechanisms of Perception-Knowledge Conflict in Vision-Language Models
This paper investigates how vision-language models resolve conflicts between visual evidence and world knowledge, revealing that visual grounding is the default while prior knowledge depends on a small set of late-layer attention heads. The authors perform causal analysis across three VLM families, demonstrating an asymmetric structure where ablating these heads shifts predictions from knowledge-grounded to visually grounded answers.
Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?
The paper introduces SpatialUncertain, a benchmark to evaluate whether vision-language models recognize when they cannot answer spatial questions due to occlusion or perspective ambiguity, revealing overconfidence and poor abstention behavior.
The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm
This paper challenges the assumption that current Vision-Language Models faithfully synthesize multimodal data, proposing an information-theoretic Modality Translation Protocol with new metrics (Toll, Curse, Fallacy of Seeing) to evaluate trustworthiness over traditional multimodal gain.
Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning
This paper introduces a reinforcement learning framework that improves perception-reasoning synergy in vision-language models by explicitly rewarding perceptual fidelity, using a 'blindfolded reasoning' proxy and structured verbal verification to address ambiguity in modality credit assignment.