Why Far Looks Up: Probing Spatial Representation in Vision-Language Models
Summary
Investigates spatial representation in vision-language models and reveals a consistent bias in which models conflate vertical image position with distance. Introduces SpatialTunnel, a synthetic benchmark that exposes this shortcut, and finds that models with better-disentangled spatial representations are more robust.
Cached at: 05/29/26, 11:04 PM
Source: https://huggingface.co/papers/2605.30161
Abstract
Vision-language models exhibit entangled spatial representations that correlate vertical image position with distance, impacting reasoning robustness and performance across benchmarks.
Vision-language models (VLMs) achieve strong performance on spatial reasoning benchmarks, yet it remains unclear whether this reflects structured 3D understanding or reliance on statistical shortcuts in natural images. We introduce a representation-level analysis framework that constructs minimal contrastive pairs to measure how spatial axes are organized and disentangled within VLM embeddings. Our analysis across multiple model families reveals a consistent vertical-distance entanglement: models conflate vertical image position with distance, mirroring the perspective bias of natural photographs. This bias produces a significant accuracy gap between perspective-consistent and counter-heuristic examples, and intensifies under data scaling even as overall benchmark accuracy improves. We further show that models with similar benchmark scores can exhibit different internal representations, and that these differences predict accuracy and robustness across diverse spatial reasoning benchmarks. To isolate this bias from evaluation-set skew, we introduce SpatialTunnel, a synthetic benchmark designed to expose spatial shortcut biases by removing common correlations present in natural images. Experiments confirm that the entanglement is model-intrinsic, and that models with well-separated spatial axes exhibit greater robustness, suggesting that well-structured spatial representations lead to more reliable spatial reasoning across diverse benchmarks. Code and benchmark are available on the project page: https://cheolhong0916.github.io/whyfarlooksup.github.io/.
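The abstract does not specify how entanglement between spatial axes is scored, so the following is only an illustrative sketch of one plausible approach, not the paper's framework. It assumes each contrastive pair is a pair of embedding vectors that differ along a single spatial factor, averages the embedding change per factor, and reports the cosine similarity between the two average directions. The function names (`mean_difference`, `cosine`, `axis_entanglement`) and the toy 3-dimensional embeddings are hypothetical.

```python
import math

def mean_difference(pairs):
    """Average the embedding change (b - a) over a list of contrastive pairs."""
    dim = len(pairs[0][0])
    total = [0.0] * dim
    for a, b in pairs:
        for i in range(dim):
            total[i] += b[i] - a[i]
    return [t / len(pairs) for t in total]

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def axis_entanglement(pairs_a, pairs_b):
    """Cosine between the mean shift directions of two spatial factors.

    Values near 0 suggest the factors move the embedding in unrelated
    directions; values near +/-1 suggest they share a direction.
    """
    return cosine(mean_difference(pairs_a), mean_difference(pairs_b))

# Toy embeddings (assumed, for illustration only).
# Each pair is (embedding_before, embedding_after) for one controlled change.
vertical_pairs = [
    ([0.0, 0.0, 0.0], [1.0, 0.0, 0.0]),  # object moved up in the image
    ([0.0, 1.0, 0.0], [1.0, 1.0, 0.0]),
]
entangled_distance_pairs = [
    ([0.0, 0.0, 0.0], [1.0, 1.0, 0.0]),  # "farther" also shifts the vertical direction
    ([0.0, 0.0, 1.0], [1.0, 1.0, 1.0]),
]
separated_distance_pairs = [
    ([0.0, 0.0, 0.0], [0.0, 1.0, 0.0]),  # "farther" shifts an orthogonal direction
    ([1.0, 0.0, 0.0], [1.0, 1.0, 0.0]),
]

entangled_score = axis_entanglement(vertical_pairs, entangled_distance_pairs)
separated_score = axis_entanglement(vertical_pairs, separated_distance_pairs)
print(f"entangled: {entangled_score:.3f}, separated: {separated_score:.3f}")
```

In this toy setup the entangled case scores well above the separated one, which is the kind of contrast a vertical-distance entanglement would produce. Real embeddings would come from a VLM's hidden states rather than hand-written vectors.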
Datasets citing this paper: 1
cubec/spatialtunnel
Similar Articles
Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models
This paper presents a systematic frozen-feature probing study comparing vision-language models (VLMs) and video generation models (VGMs) on spatial intelligence tasks. It finds that VLMs excel at semantic tagging and instance grouping, while VGMs provide better dense geometry and camera motion signals, and a naive fusion of both yields strong performance across all axes.
Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning
This paper proposes GASP, a framework that injects geometric priors into vision-language models via deep supervision with contrastive and depth consistency losses, achieving significant improvements on 3D spatial reasoning benchmarks without using 3D VQA data.
Reinforcing Dual-Path Reasoning in Spatial Vision Language Models
This paper introduces SR-REAL, a unified framework for spatial vision-language models that combines linguistic deduction and 3D geometric reasoning via reinforcement learning, enabling robust multi-step spatial reasoning across diverse tasks.
Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap
This paper introduces CrossMath, a controlled multimodal reasoning benchmark that reveals a critical limitation in current vision-language models: they perform reasoning primarily in textual space rather than genuine vision-grounded reasoning, with visual input often degrading performance compared to text-only baselines. The authors propose fine-tuning approaches to mitigate this modality gap and improve multimodal reasoning capabilities.
PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models
This paper introduces PlanBench-V, the first comprehensive benchmark for evaluating Vision-Language Models on spatial planning map interpretation, including an expert-annotated dataset and a four-dimension evaluation framework. Experiments show significant progress but highlight persistent challenges in implementation-oriented tasks.