Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap

Summary

This paper introduces CrossMath, a controlled multimodal reasoning benchmark that reveals a critical limitation in current vision-language models: they reason primarily in textual space rather than grounding their reasoning in visual evidence, and adding visual input often degrades performance compared to text-only baselines. The authors propose fine-tuning approaches to mitigate this modality gap and improve multimodal reasoning capabilities.


# Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap

Source: https://arxiv.org/html/2604.16256

Yige Xu1,2,∗, Yongjie Wang2,∗, Zizhuo Wu1, Kaisong Song3, Jun Lin3, Zhiqi Shen1,†

1College of Computing and Data Science, Nanyang Technological University, Singapore
2Alibaba-NTU Global e-Sustainability CorpLab (ANGEL)
3Tongyi Lab, Alibaba Group, China

[email protected], {yongjie.wang,zqshen}@ntu.edu.sg

###### Abstract

Reasoning in vision-language models (VLMs) has recently attracted significant attention due to its broad applicability across diverse downstream tasks. However, it remains unclear whether the superior performance of VLMs stems from genuine vision-grounded reasoning or relies predominantly on the reasoning capabilities of their textual backbones. To systematically measure this, we introduce CrossMath, a novel multimodal reasoning benchmark designed for controlled cross-modal comparisons. Specifically, we construct each problem in text-only, image-only, and image+text formats, guaranteeing identical task-relevant information, verified by human annotators. This rigorous alignment effectively isolates modality-specific reasoning differences while eliminating confounding factors such as information mismatch. Extensive evaluation of state-of-the-art VLMs reveals a consistent phenomenon: a substantial performance gap between textual and visual reasoning. Notably, VLMs excel with text-only inputs, whereas incorporating visual data (image+text) frequently degrades performance compared to the text-only baseline. These findings indicate that current VLMs conduct reasoning primarily in the textual space, with limited genuine reliance on visual evidence. To mitigate this limitation, we curate a CrossMath training set for VLM fine-tuning. Empirical evaluations demonstrate that fine-tuning on this training set significantly boosts reasoning performance across all individual and joint modalities, while yielding robust gains on two general visual reasoning tasks. Source code is available at https://github.com/xuyige/CrossMath.

∗ The first two authors contributed equally.
† Corresponding author.

## 1 Introduction

Building upon the profound success of Large Language Models (LLMs) (OpenAI, 2023; Dubey et al., 2024; Yang et al., 2024; DeepSeek-AI, 2025; Qwen Team, 2025), recent advancements have rapidly propelled the development of Vision-Language Models (VLMs) (Liu et al., 2023; Qwen Team, 2026; Singh et al., 2026). By seamlessly integrating visual inputs with pure text, these models exhibit formidable potential in a diverse array of applications, ranging from image captioning and visual question answering to document understanding and visual grounding. To achieve this broad multimodal intelligence, modern VLMs typically rely on a standardized modular pipeline: a vision encoder extracts visual features, a cross-modal projector aligns these representations with the latent language space, and a pre-trained text decoder performs the final autoregressive generation (Liu et al., 2023; Qwen Team, 2026).
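
To make this modular pipeline concrete, the following is a minimal sketch of how visual features can be encoded, projected into the language latent space, and prepended to the text sequence before decoding. All module choices, dimensions, and names are illustrative assumptions, not the architecture of any particular VLM.

```python
# Minimal sketch of the modular VLM pipeline described above:
# vision encoder -> cross-modal projector -> autoregressive text decoder.
# Dimensions and module choices are illustrative, not those of any specific model.
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    def __init__(self, vision_dim=256, text_dim=512, vocab_size=1000):
        super().__init__()
        # Stand-in for a pretrained vision encoder that turns image patches into features.
        self.vision_encoder = nn.Linear(3 * 16 * 16, vision_dim)
        # Cross-modal projector aligning visual features with the language latent space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )
        # Stand-ins for the text decoder's embedding table, transformer stack, and LM head.
        self.token_embedding = nn.Embedding(vocab_size, text_dim)
        layer = nn.TransformerEncoderLayer(d_model=text_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(text_dim, vocab_size)

    def forward(self, image_patches, input_ids):
        # image_patches: (batch, num_patches, 3*16*16); input_ids: (batch, seq_len)
        visual_tokens = self.projector(self.vision_encoder(image_patches))
        text_tokens = self.token_embedding(input_ids)
        # Visual tokens are prepended to the text tokens, so the decoder attends to both.
        sequence = torch.cat([visual_tokens, text_tokens], dim=1)
        hidden = self.decoder(sequence)  # causal masking omitted for brevity
        return self.lm_head(hidden)

# Example usage with random inputs.
model = ToyVLM()
patches = torch.randn(1, 4, 3 * 16 * 16)
ids = torch.randint(0, 1000, (1, 8))
logits = model(patches, ids)  # shape: (1, 4 + 8, vocab_size)
```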

Despite their impressive performance across multimodal benchmarks, it remains largely unexplored whether these models genuinely engage in visual reasoning or merely exploit the inherent reasoning capabilities of their textual backbones. Disentangling genuine visual reasoning from textual reliance has thus emerged as a central problem in evaluating modern VLMs. However, existing benchmarks consistently fail to separate the two modalities. On one hand, many existing benchmarks (Yu et al., 2024; Yue et al., 2025; 2024) either assess only surface-level visual recognition or heavily exploit textual priors. They fail to satisfy the rigorous demands of visually intensive tasks that require multi-step spatial and geometric reasoning grounded entirely in the visual space. Consequently, these benchmarks fall short in capturing the nuanced differences in the genuine visual reasoning capabilities of VLMs. On the other hand, although newer benchmarks (Hao et al., 2025; Yao et al., 2025; Stogiannidis et al., 2025; Xu et al., 2026) introduce complex multimodal scenarios such as mathematics, physics, and chemistry problems, their problem formulations are often deeply entangled, requiring both visual and textual inputs simultaneously. Because the absence of either modality makes the question inherently unsolvable, these entangled tasks cannot be used to isolate and evaluate modality-specific reasoning capacities.

To rigorously analyze genuine visual reasoning ability, we argue that an effective evaluation must satisfy three core principles. First, **tasks must be intrinsically "vision-first."** Strong performance should depend heavily on reasoning over spatial, geometric, or physical dynamics. In other words, the tasks must provide both step-by-step signals to verify the intermediate visual reasoning process and definitive ground-truth answers to evaluate the correctness of the final output. Second, **the dataset should encompass a stratified distribution of problem difficulties.** Systematically controlling the difficulty prevents performance saturation or floor effects, thereby allowing the benchmark to differentiate the reasoning capacities of VLMs across varying parameter scales. Third, **the benchmark must provide strictly equivalent questions across visual and textual formats.** This guarantees that any performance differences stem entirely from the model's modality-specific reasoning capacities rather than from incomplete information. By eliminating information asymmetry, we ensure that the absence of either modality does not render the problem unsolvable.

Based on the aforementioned discussion, we introduce **CrossMath**, a rigorously designed multimodal reasoning benchmark for quantitatively isolating and evaluating visual-textual reasoning capabilities. CrossMath tasks VLMs with inferring the missing values in a 2D spatial grid of intersecting mathematical equations and outputting the predicted numbers sequentially (from top to bottom, left to right). This design explicitly satisfies our three evaluation principles. First, the 2D layout of intersecting equations intrinsically demands spatial geometric understanding and step-by-step logical deduction, providing clear intermediate signals and definitive ground-truth answers. Second, procedural generation allows us to precisely control difficulty by adjusting grid sizes, the number of missing equations, and the complexity of operators, thereby guaranteeing sufficient discriminative power to evaluate VLMs across diverse parameter scales. Finally, to eliminate modality confounding, each CrossMath puzzle is formulated in three strictly equivalent formats (an image-only grid, a text-only markdown table, and an image+text prompt), ensuring identical task-relevant information across all settings.
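
To illustrate the task format, the hypothetical example below renders a tiny grid of two intersecting equations as a text-only markdown table, with "?" marking the cells the model must infer. The actual CrossMath grid layouts, sizes, and prompt wording may differ from this sketch.

```python
# A hypothetical tiny CrossMath-style instance: a horizontal equation (2 + ? = 5)
# and a vertical equation (2 + 4 = ?) intersect at the shared cell "2".
# The real benchmark's grid layout and table format may differ from this sketch.
grid = [
    ["2", "+", "?", "=", "5"],
    ["+", " ", " ", " ", " "],
    ["4", " ", " ", " ", " "],
    ["=", " ", " ", " ", " "],
    ["?", " ", " ", " ", " "],
]

def render_markdown(grid):
    """Render the puzzle grid as a markdown table (the text-only format)."""
    width = len(grid[0])
    header = "| " + " | ".join(f"c{i}" for i in range(width)) + " |"
    separator = "|" + " --- |" * width
    rows = ["| " + " | ".join(row) + " |" for row in grid]
    return "\n".join([header, separator] + rows)

print(render_markdown(grid))
# Reading the missing values from top to bottom, left to right, the expected
# answer sequence is [3, 6]: row 0 gives 2 + ? = 5, and column 0 gives 2 + 4 = ?.
```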

To support rigorous evaluation and demonstrate the efficacy of post-training, we construct the CrossMath benchmark, featuring three difficulty levels with 5,000 training and 250 evaluation samples. To ensure strict quality control, human annotators were recruited to manually verify the cross-modal information equivalence across all 250 evaluation samples. Through extensive evaluations on state-of-the-art VLMs, we uncover a counterintuitive phenomenon: models achieve their highest performance with text-only inputs, experience unexpected degradation when visual data is integrated, and perform worst under vision-only conditions. This indicates that current VLMs rely predominantly on textual shortcuts rather than genuine visual reasoning. To mitigate this modality gap, we post-train Qwen3.5-9B on the CrossMath training set using Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) (DeepSeek-AI, 2025) with solely image-based inputs. Empirical results demonstrate that our post-training significantly boosts visual reasoning and effectively closes the performance gap across modalities. Furthermore, out-of-distribution evaluations show that this post-training preserves the model's original capabilities and yields consistent gains on external vision-based mathematical tasks.
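
As a rough illustration of how such post-training can be scored, the sketch below implements a rule-based reward that checks the predicted missing numbers against the ground truth in reading order. The extraction rule and partial-credit scheme are assumptions for illustration; the reward actually used for GRPO in the paper may differ.

```python
# A minimal, assumed rule-based reward for GRPO-style training on CrossMath:
# the completion is scored against the ground-truth missing values in reading
# order. The authors' actual reward design and answer format may differ.
import re
from typing import List

def extract_numbers(completion: str) -> List[int]:
    """Collect every integer in the completion (assumes the prompt asks the
    model to output only the final missing values)."""
    return [int(x) for x in re.findall(r"-?\d+", completion)]

def crossmath_reward(completion: str, ground_truth: List[int]) -> float:
    """1.0 for an exact match; otherwise partial credit per correct position."""
    predicted = extract_numbers(completion)
    if not ground_truth:
        return 0.0
    if predicted == ground_truth:
        return 1.0
    correct = sum(p == g for p, g in zip(predicted, ground_truth))
    return 0.5 * correct / len(ground_truth)  # assumed partial-credit scheme

# Using the toy puzzle from the earlier sketch, whose missing values are [3, 6]:
print(crossmath_reward("3 6", [3, 6]))  # 1.0
print(crossmath_reward("3 7", [3, 6]))  # 0.25
```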

The main contributions of this work are summarized as follows:

(1) **Rigorous Evaluation & Benchmark:** We propose a systematic methodology to measure modality-specific reasoning capacity in VLMs. To support this, we construct CrossMath, a strictly controlled, multimodal-equivalent dataset that provides step-wise visual annotations for fine-grained reasoning evaluation.

(2) **Exposure of the Modality Gap:** Through systematic evaluation of state-of-the-art VLMs, we empirically demonstrate that these models predominantly rely on text-level reasoning shortcuts, often treating visual inputs as secondary and detrimental to performance.

(3) **Effective Post-Training & Robust Transfer:** We establish that image-only post-training is highly effective in rectifying these deficits, not only fostering genuine visual grounding but also driving robust out-of-distribution transfer without compromising the model's inherent capabilities.

## 2 Related Works

### 2.1 Measuring the Visual-Textual Reasoning Gap in VLMs

Although textual reasoning has been widely explored by the community (Wei et al., 2022; Yao et al., 2023; Wang et al., 2023; Xu et al., 2025a; 2025b), a growing body of work suggests that strong language-side reasoning in Vision-Language Models (VLMs) does not automatically translate into visually grounded reasoning. Early studies connect failures in spatial reasoning to weak object localization and grounding, showing that perceptual imprecision can propagate into downstream reasoning errors (Rajabi & Kosecka, 2023; Chen et al., 2025). More recent benchmarks reinforce this limitation: state-of-the-art VLMs remain brittle on spatial reasoning, chart understanding, ARC-style transformations, and other settings in which success depends on visual structure rather than linguistic priors or knowledge recall (Stogiannidis et al., 2025; Unsal & Akkus, 2025; Tang et al., 2025; Xu et al., 2026). Related work on visualized text further shows that even semantically equivalent content can become substantially harder once it is rendered visually rather than provided as plain text, highlighting a persistent gap between language-space reasoning and image-grounded reasoning (Liu et al., 2026). Mechanistic analyses likewise suggest that perception and reasoning remain only weakly coupled in current VLMs (Chen et al., 2025; Li et al., 2025).

Despite these advances, existing studies do not yet provide a fully controlled measurement of modality-specific reasoning. Some benchmarks are diagnostic of visual failures, but do not offer strictly matched text-only and image-only versions of the same problem. Others evaluate multimodal reasoning in domains such as mathematics and science, but their tasks are inherently modality-entangled: the image and text are complementary rather than interchangeable, so removing either modality changes task solvability (Yue et al., 2024; 2025; Zhang et al., 2024; Hao et al., 2025; Yao et al., 2025). As a result, cross-modality performance differences are difficult to interpret, because they may reflect information asymmetry rather than modality-specific reasoning ability. CrossMath is designed to address this gap by constructing semantically equivalent text-only, image-only, and image+text versions of the same vision-first puzzle, enabling direct comparisons of reasoning performance across modalities.

### 2.2 Visual Reasoning Benchmarks

Visual reasoning benchmarks span a broad family of tasks, including inductive, analogical, algorithmic, deductive, and spatial/geometric reasoning (Lymperaiou et al., 2026). Early abstract-puzzle benchmarks such as PuzzleVQA deliberately minimize dependence on world knowledge and instead emphasize rule induction over attributes such as number, color, shape, and size (Chia et al., 2024). More recent datasets extend this agenda through knowledge-light visual puzzles, grid-based reasoning tasks, and ARC-style transformations that require multi-step inference and self-correction (Song et al., 2025; Ren et al., 2025; Unsal & Akkus, 2025). A complementary line of work focuses on concept-based and spatially grounded reasoning. Bongard-style datasets test whether models can infer latent concepts from sets of positive and negative visual examples (Wüst et al., 2025), while spatial reasoning benchmarks probe relative position, layout understanding, planning, and inference over partially observed scenes in both abstract and natural-image settings (Mayer et al., 2025; Lyu et al., 2025; Pothiraj et al., 2025; Khezresmaeilzadeh et al., 2026). Together, these benchmarks have shown that many VLMs struggle when reasoning depends on geometry, topology, or hidden structure rather than semantic priors. Related multimodal math and science benchmarks, including MMMU/MMMU-Pro, MathVerse, EMMA, and MMReason, push models toward more realistic expert-level reasoning over diagrams, figures, and textual context (Yue et al., 2024; 2025; Zhang et al., 2024; Hao et al., 2025; Yao et al., 2025). These datasets are valuable for evaluating end-to-end multimodal competence, but they are not designed to isolate modality-specific reasoning because they inherently require both visual and textual information simultaneously.

This paper investigates multilingual latent reasoning in large reasoning models across 11 languages, revealing that while latent reasoning capabilities exist, they are unevenly distributed—stronger in resource-rich languages and weaker in low-resource ones. The study finds that despite surface-level differences, the internal reasoning mechanisms are largely aligned with an English-centered pathway.