Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs

Hugging Face Daily Papers 04/17/26, 12:00 AM Papers

Summary

Research shows that Chain-of-Thought prompting harms visual spatial reasoning in multimodal LLMs, owing to shortcut learning and hallucination of visual details from text alone.

Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT) based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of seventeen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that MRMs and CoT-prompted MLMs suffer from severe shortcut learning and hallucinate visual details from textual priors even when the image is absent. These findings challenge the efficacy of text-only CoT for spatial tasks and underscore the need for vision-centric reasoning paradigms.

Original Article

Cached at: 04/22/26, 10:35 AM

Source: https://huggingface.co/papers/2604.16060

Abstract

Chain-of-Thought prompting in multimodal reasoning models degrades performance in visual spatial reasoning due to shortcut learning and hallucination of visual details from text alone.

Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT) based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of seventeen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that MRMs and CoT-prompted MLMs suffer from severe shortcut learning and hallucinate visual details from textual priors even when the image is absent. These findings challenge the efficacy of text-only CoT for spatial tasks and underscore the need for vision-centric reasoning paradigms.
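To make the comparison in the abstract concrete, here is a minimal sketch of how accuracy could be measured under direct versus CoT prompting, with and without the image. Everything in it is an assumption for illustration: the `Item` record, the `Model` callable interface, the two prompt templates, and the exact-match scoring are hypothetical and not taken from the paper. In particular, the plain image-dropping condition below is only a simple stand-in; the abstract does not describe how the No-Image++ ablation is constructed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence


@dataclass(frozen=True)
class Item:
    """One multiple-choice spatial question (hypothetical record layout)."""
    question: str
    image: Optional[bytes]
    answer: str  # gold option letter, e.g. "A"


# Hypothetical model interface: (prompt, image or None) -> predicted option letter.
Model = Callable[[str, Optional[bytes]], str]

# Illustrative prompt templates, not the prompts used in the paper.
DIRECT_TEMPLATE = "{question}\nAnswer with the option letter only."
COT_TEMPLATE = "{question}\nThink step by step, then give the option letter."


def accuracy(model: Model, items: Sequence[Item], template: str,
             drop_image: bool = False) -> float:
    """Exact-match accuracy of `model` on `items` under one prompt template.

    With drop_image=True the image is withheld, so any accuracy above
    chance must come from textual priors rather than from vision.
    """
    if not items:
        raise ValueError("items must not be empty")
    correct = 0
    for item in items:
        image = None if drop_image else item.image
        prediction = model(template.format(question=item.question), image)
        if prediction.strip().upper() == item.answer.strip().upper():
            correct += 1
    return correct / len(items)


def compare(model: Model, items: Sequence[Item]) -> Dict[str, float]:
    """Accuracy under all four conditions: {direct, CoT} x {image, no image}."""
    return {
        "direct": accuracy(model, items, DIRECT_TEMPLATE),
        "cot": accuracy(model, items, COT_TEMPLATE),
        "direct_no_image": accuracy(model, items, DIRECT_TEMPLATE, drop_image=True),
        "cot_no_image": accuracy(model, items, COT_TEMPLATE, drop_image=True),
    }
```

Under these assumptions, a `cot` score below `direct` would correspond to the degradation the abstract reports, and a no-image score well above chance would suggest the model is answering from textual shortcuts rather than from the image.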

Similar Articles

Why LLMs Hallucinate on Structured Knowledge: A Mechanistic Analysis of Reasoning over Linearized Representations

Look Light, Think Heavy: What Multimodal Chain-of-Thought Reasoning Can and Cannot Do

Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

Toxic HallucinAItions: Perturbing Prompts and Tracing LLM Circuits

The strange thing about LLM reasoning research: we're now trying to remove the chain-of-thought traces
