Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

Hugging Face Daily Papers 05/28/26, 12:00 AM Papers

Summary

Investigates spatial representation in vision-language models, revealing a consistent bias in which models conflate vertical image position with distance, and introduces SpatialTunnel, a synthetic benchmark that exposes this shortcut; finds that models with better-disentangled spatial representations are more robust.

Vision-language models (VLMs) achieve strong performance on spatial reasoning benchmarks, yet it remains unclear whether this reflects structured 3D understanding or reliance on statistical shortcuts in natural images. We introduce a representation-level analysis framework that constructs minimal contrastive pairs to measure how spatial axes are organized and disentangled within VLM embeddings. Our analysis across multiple model families reveals a consistent vertical-distance entanglement: models conflate vertical image position with distance, mirroring the perspective bias of natural photographs. This bias produces a significant accuracy gap between perspective-consistent and counter-heuristic examples, and intensifies under data scaling even as overall benchmark accuracy improves. We further show that models with similar benchmark scores can exhibit different internal representations, and that these differences predict accuracy and robustness across diverse spatial reasoning benchmarks. To isolate this bias from evaluation-set skew, we introduce SpatialTunnel, a synthetic benchmark designed to expose spatial shortcut biases by removing common correlations present in natural images. Experiments confirm that the entanglement is model-intrinsic, and that models with well-separated spatial axes exhibit greater robustness, suggesting that well-structured spatial representations lead to more reliable spatial reasoning across diverse benchmarks. Code and benchmark are available on the project page: https://cheolhong0916.github.io/whyfarlooksup.github.io/.
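The idea of measuring how spatial axes are organized with minimal contrastive pairs can be illustrated with a small sketch. This is not the authors' framework: it assumes that an axis direction can be estimated as the mean embedding difference over pairs that differ in a single attribute, and that entanglement can be read off as the cosine similarity between two such directions. The function names, the toy embeddings, and the planted 0.8 overlap between the vertical and distance directions are all hypothetical.

```python
import numpy as np


def axis_direction(emb_a, emb_b):
    """Unit-norm mean difference between paired embeddings (one pair per row)."""
    diff = (emb_a - emb_b).mean(axis=0)
    return diff / np.linalg.norm(diff)


def axis_entanglement(axis_u, axis_v):
    """Cosine similarity between two unit-norm axis directions."""
    return float(np.dot(axis_u, axis_v))


# Toy embeddings standing in for VLM features (an assumption, not real model output).
rng = np.random.default_rng(0)
dim, n_pairs = 64, 200

# Planted ground-truth directions: the distance axis overlaps the vertical axis by 0.8.
true_vertical = np.zeros(dim)
true_vertical[0] = 1.0
true_distance = np.zeros(dim)
true_distance[0] = 0.8
true_distance[1] = 0.6


def noise():
    """Small per-sample perturbation so the pairs are not perfectly clean."""
    return 0.05 * rng.normal(size=(n_pairs, dim))


# Each pair shares a base embedding and differs only along one planted axis.
base_v = rng.normal(size=(n_pairs, dim))
base_d = rng.normal(size=(n_pairs, dim))
up, down = base_v + true_vertical + noise(), base_v - true_vertical + noise()
far, near = base_d + true_distance + noise(), base_d - true_distance + noise()

vertical_axis = axis_direction(up, down)
distance_axis = axis_direction(far, near)
entanglement = axis_entanglement(vertical_axis, distance_axis)
print(f"vertical-distance cosine similarity: {entanglement:.2f}")
```

Under these assumptions, a similarity near 0 would indicate well-separated axes, while a value near 1 would indicate that the two axes are nearly indistinguishable in the embedding space.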

Cached at: 05/29/26, 11:04 PM

Source: https://huggingface.co/papers/2605.30161

Abstract

Vision-language models exhibit entangled spatial representations that correlate vertical image position with distance, impacting reasoning robustness and performance across benchmarks.
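One way to make the effect on robustness concrete is to split evaluation examples by whether they agree with the perspective heuristic and compare accuracy across the two groups. The sketch below is a minimal illustration, not the paper's evaluation code. It assumes that "perspective-consistent" means the farther object appears higher in the image; the `Example` record, its field names, and the toy results are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Example:
    farther_is_higher: bool  # assumed meaning of "perspective-consistent"
    correct: bool            # whether the model answered this example correctly


def accuracy(examples):
    """Fraction of examples answered correctly."""
    if not examples:
        raise ValueError("cannot compute accuracy of an empty group")
    return sum(e.correct for e in examples) / len(examples)


def shortcut_gap(examples):
    """Accuracy on perspective-consistent minus accuracy on counter-heuristic examples."""
    consistent = [e for e in examples if e.farther_is_higher]
    counter = [e for e in examples if not e.farther_is_higher]
    return accuracy(consistent) - accuracy(counter)


# Toy results (made up): 4 of 5 consistent examples correct, 2 of 5 counter-heuristic correct.
examples = (
    [Example(True, True)] * 4 + [Example(True, False)]
    + [Example(False, True)] * 2 + [Example(False, False)] * 3
)

print(f"overall accuracy: {accuracy(examples):.2f}")
print(f"shortcut gap:     {shortcut_gap(examples):.2f}")
```

A large positive gap alongside a reasonable overall accuracy is the pattern one would expect if a model leans on vertical position as a proxy for distance.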

Datasets citing this paper: 1

cubec/spatialtunnel

Similar Articles

Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

Reinforcing Dual-Path Reasoning in Spatial Vision Language Models

Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap

PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models
