Tag
Investigates spatial representation in vision-language models, revealing a consistent bias where models conflate vertical image position with distance, and introduces SpatialTunnel synthetic benchmark to expose this shortcut; finds that better disentangled spatial representations improve robustness.