This paper investigates whether vision-language models can distinguish potential from established common ground in asymmetric dialogue. Experiments on MapTask data show that providing task-relevant map content (visual or textual) biases models toward over-predicting alignment, as they rely on static referential cues rather than tracking grounding through dialogue history.
This paper investigates a bias in which vision-language models overestimate shared understanding in dialogue, confusing perceptual access with communicative grounding. The findings have implications for dialogue systems and VLM evaluation.