Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs
Summary
Research shows that Chain-of-Thought prompting harms visual spatial reasoning in multimodal LLMs, with models relying on shortcut learning and hallucinating visual details from text alone.
Cached at: 04/22/26, 10:35 AM
Source: https://huggingface.co/papers/2604.16060
Abstract
Chain-of-Thought prompting in multimodal reasoning models degrades performance in visual spatial reasoning due to shortcut learning and hallucination of visual details from text alone.
Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT) based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of seventeen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that MRMs and CoT prompted MLMs suffer from severe shortcut learning, and hallucinate visual details from textual priors even when the image is absent. These findings challenge the efficacy of text-only CoT for spatial tasks and underscore the need for vision-centric reasoning paradigms.
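To make the comparison in the abstract concrete, the following is a minimal sketch of how accuracy might be tallied under three prompting conditions: direct answering with the image, CoT with the image, and CoT with the image withheld. It is an illustration only, not the authors' evaluation code. The `Record` fields, the `summarize` helper, and the exact-match scoring are all assumptions, and the plain image-withheld condition here is a simplification rather than the paper's No-Image++ protocol, whose details are not given on this page.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One benchmark item with model answers under each condition (hypothetical schema)."""
    gold: str          # correct answer label
    direct: str        # answer with direct prompting, image present
    cot: str           # answer with CoT prompting, image present
    cot_no_image: str  # answer with CoT prompting, image withheld


def accuracy(preds, golds):
    """Fraction of predictions that exactly match the gold label."""
    if not golds:
        raise ValueError("no examples to score")
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def summarize(records):
    """Return per-condition accuracy and the CoT-minus-direct gap."""
    golds = [r.gold for r in records]
    acc_direct = accuracy([r.direct for r in records], golds)
    acc_cot = accuracy([r.cot for r in records], golds)
    acc_blind = accuracy([r.cot_no_image for r in records], golds)
    return {
        "direct": acc_direct,
        "cot": acc_cot,
        # Negative values mean CoT scored below direct prompting.
        "cot_minus_direct": acc_cot - acc_direct,
        # Accuracy well above chance without the image would suggest the
        # answers are recoverable from the text alone (a shortcut signal).
        "cot_no_image": acc_blind,
    }


if __name__ == "__main__":
    # Toy data invented for illustration; these are not results from the paper.
    toy = [
        Record("A", "A", "A", "A"),
        Record("B", "B", "B", "A"),
        Record("C", "C", "A", "A"),
        Record("D", "A", "A", "A"),
    ]
    print(summarize(toy))
```

In a real study the no-image accuracy would be compared against the chance rate for each benchmark's answer format, and the model's reasoning traces would be inspected for references to visual details it could not have seen.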
Similar Articles
Why LLMs Hallucinate on Structured Knowledge: A Mechanistic Analysis of Reasoning over Linearized Representations
This paper presents a mechanistic analysis of why LLMs hallucinate when reasoning over linearized structured knowledge, finding that hallucinations stem from systematic internal dynamics such as attention on shortcut cues and failures in semantic grounding in feed-forward layers, rather than random noise.
Look Light, Think Heavy: What Multimodal Chain-of-Thought Reasoning Can and Cannot Do
This paper systematically evaluates multimodal Chain-of-Thought reasoning across 12 tasks, finding it selectively effective for reasoning tasks but detrimental for perception tasks, and identifies a 'Look Light, Think Heavy' pattern where visual introspection declines during reasoning.
Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs
Researchers introduce Mind’s Eye, a benchmark of eight visual-cognitive tasks that reveals top multimodal LLMs score under 50% while humans reach 80%, exposing major gaps in visual abstraction, relation mapping and mental transformation.
Toxic HallucinAItions: Perturbing Prompts and Tracing LLM Circuits
This paper investigates how toxic lexical perturbations in prompts reduce the factual accuracy and increase uncertainty of LLMs, and uses attribution-graph analyses to trace internal changes. It finds that increasing toxicity amplifies perturbation-sensitive variant nodes while core reasoning nodes remain invariant.
The strange thing about LLM reasoning research: we're now trying to remove the chain-of-thought traces
The article discusses a shift in LLM reasoning research from making reasoning explicit via chain-of-thought to exploring latent reasoning that doesn't require language traces, questioning whether visibility is necessary for effective reasoning.