SPACENUM: Revisiting Spatial Numerical Understanding in VLMs
Summary
This paper presents SpaceNum, a unified framework to evaluate how vision-language models (VLMs) understand numerical values in spatial contexts, finding that current models largely fail to ground numbers spatially and often perform close to random guessing.
Cached at: 05/25/26, 08:59 AM
# SpaceNum: Revisiting Spatial Numerical Understanding in VLMs
Source: [https://arxiv.org/html/2605.23898](https://arxiv.org/html/2605.23898)
Jianshu Zhang \(Northwestern\), Yijiang Li \(UCSD\), Huifeixin Chen \(USC\), Haoran Lu \(Northwestern\), Letian Xue \(Northwestern\), Bingyang Wang \(GaTech\), Han Liu \(Northwestern\)
###### Abstract
Vision\-Language Models \(VLMs\) are increasingly deployed in embodied environments, where they need to produce numerical outputs such as action magnitudes and spatial coordinates\. Although these numbers appear meaningful, it remains unclear whether they are genuinely grounded in spatial perception\. In this work, we therefore revisit spatial numerical understanding through SpaceNum, a unified framework that captures two complementary settings: numbers as dynamic transitions during spatial exploration, and numbers as static layouts in spatial reasoning\. We formulate two bidirectional tasks, Num2Space and Space2Num, to evaluate how well VLMs map between vision\-side spatial structure and language\-side numerical representations\. We systematically study whether current VLMs truly understand numerical values in spatial settings\. Across dynamic transitions and static layouts, we find that models largely fail to ground numbers in spatial meaning and often perform close to random guessing\. Through error analysis, reasoning trace analysis, and controlled interventions, we show that current VLMs rely heavily on shallow spatial cues, struggle to build stable coordinate\-aware representations, and fail to abstract structured spatial layouts from visual observations\. We further show that explicit reasoning provides only marginal gains, while tuning can partially improve spatial numerical understanding and transfer to external spatial reasoning benchmarks\.
## 1 Introduction
Vision\-language models \(VLMs\) have recently progressed from describing what is directly visible in images \[[6](https://arxiv.org/html/2605.23898#bib.bib29),[16](https://arxiv.org/html/2605.23898#bib.bib15),[22](https://arxiv.org/html/2605.23898#bib.bib28)\] to actively exploring and understanding complex spatial environments \[[24](https://arxiv.org/html/2605.23898#bib.bib10),[11](https://arxiv.org/html/2605.23898#bib.bib11),[30](https://arxiv.org/html/2605.23898#bib.bib12),[9](https://arxiv.org/html/2605.23898#bib.bib18),[19](https://arxiv.org/html/2605.23898#bib.bib27)\]\. Two representative spatial task scenarios have emerged: \(1\) spatial exploration, where a VLM\-based agent navigates an environment by generating actions conditioned on its observations to actively gather information; and \(2\) spatial understanding, where VLMs infer the global structure of a scene and answer spatially grounded questions by constructing an internal representation of the environment\. As illustrated in Figure [1](https://arxiv.org/html/2605.23898#S1.F1), despite their different objectives, both paradigms share a common requirement: VLMs must produce explicit numerical values whose meanings are grounded in spatial context\.
In spatial exploration \[[31](https://arxiv.org/html/2605.23898#bib.bib32),[27](https://arxiv.org/html/2605.23898#bib.bib33)\], a VLM\-based agent may output an action such as “rotate\_left\(20°\)”\. The value $20$ does not describe the current observation, nor does it directly specify the next observation\. Instead, it specifies the magnitude of a state change and serves as a transition quantity between consecutive observations\. Here, numbers naturally function as dynamic transition magnitudes\.
In contrast, in spatial understanding, prior work has shown that constructing explicit spatial representations \[[32](https://arxiv.org/html/2605.23898#bib.bib31),[30](https://arxiv.org/html/2605.23898#bib.bib12),[10](https://arxiv.org/html/2605.23898#bib.bib30)\], often in the form of cognitive maps, improves performance on spatial reasoning tasks\. Here, numbers encode relative spatial relationships and correspond to static relative spatial layouts\. A single object’s coordinates in isolation carry limited semantic meaning; spatial information becomes interpretable only when multiple objects are considered within a shared coordinate system, where numerical values define their relative positions and overall layout\.
This naturally raises a key question: do VLMs genuinely understand numbers as metric quantities in space, and do they generate numbers that are grounded in the metric properties of space? Across both spatial exploration and spatial understanding, Num2Space evaluates whether a language\-side numerical value can be correctly grounded in its corresponding spatial outcome, while Space2Num tests whether an appropriate numerical value can be inferred from a given spatial configuration\. Together, these two tasks assess numerical understanding from both directions, enabling a systematic examination of whether VLMs merely generate plausible numbers or genuinely ground them in spatial meaning\.
To systematically study spatial numerical understanding, we investigate a series of progressively deeper questions\. We first evaluate 18 VLMs across dynamic transitions and static layouts, showing that current models largely fail to ground numerical values in spatial meaning and often perform close to random guessing\. We then analyze how these failures differ across scenarios and mapping directions, revealing strong asymmetries between vision\-to\-number and number\-to\-vision grounding\. To further understand the source of these failures, we conduct structured error analysis, reasoning trace analysis, and controlled interventions\. Our results show that current VLMs often rely on shallow spatial cues, fail to construct stable coordinate\-aware representations, and struggle to abstract structured spatial layouts from visual observations\. Surprisingly, enabling explicit reasoning brings only marginal improvements, suggesting that the main limitation is not the absence of reasoning traces, but the lack of spatially calibrated reasoning operations\. Finally, we show that spatial numerical understanding can be partially improved through tuning, and that the improvement transfers to external spatial reasoning benchmarks\.
Figure 1: Overview of SpaceNum\. We study spatial numerical understanding under two settings: numbers as dynamic transition in spatial exploration \(left\) and numbers as static layout in spatial understanding \(right\)\. We further investigate the mapping between vision\-side space and language\-side numbers via two tasks: Num2Space, which maps numbers to visual outcomes \(top\), and Space2Num, which maps visual inputs to numbers \(bottom\)\.
## 2 SpaceNum Data Curation
#### Data Source and Platform\.
We set up simulator\-based pipelines to enable controllable data generation\. For dynamic transitions, data is generated in AI2\-THOR \[[13](https://arxiv.org/html/2605.23898#bib.bib1)\], which supports embodied agents executing parameterized actions across diverse indoor environments\. For static layouts, scenes are built in NVIDIA Isaac Sim \[[20](https://arxiv.org/html/2605.23898#bib.bib2)\] using assets from BlenderKit \[[2](https://arxiv.org/html/2605.23898#bib.bib3)\], allowing controlled layout generation with access to ground\-truth spatial annotations for cognitive map construction\.
### 2\.1 Number as Dynamic Transition
#### Data Collection\.
We construct the dataset with careful control over action coverage, transition continuity, visual anchoring, and data validity\. \(i\) Action coverage: We define a set of primitive actions that induce spatial transitions, including translations \(Move Forward \(F\) / Backward \(B\); Left \(L\) / Right \(R\)\) and rotations \(Rotate Up \(U\) / Down \(D\); Left \(L\) / Right \(R\)\)\. \(ii\) Transition continuity: The action magnitudes are chosen to ensure sufficient overlap between consecutive observations, as summarized in Table [1](https://arxiv.org/html/2605.23898#S2.T1), maintaining visual continuity while introducing meaningful spatial changes and avoiding abrupt or ambiguous transitions\. \(iii\) Visual anchoring: To ensure transitions are visually identifiable, we filter out observations with insufficient anchors by discarding frames containing fewer than 3 object instances\. \(iv\) Data validity: To avoid invalid transitions caused by random initialization or action execution \(e\.g\., identical frames or empty observations\), we leverage occupancy maps to constrain both the initial agent state and the post\-action state to be valid, ensuring all collected samples correspond to informative transitions\.
Table 1: Action parameter ranges\.
#### Task Definition\.
Let $o_t$ denote the initial observation, $o_{t+1}$ the resulting observation, $a$ the action type, and $n$ the numerical parameter representing the transition magnitude\.
Num2Space\. The model is given $(o_t, a, n)$ and is required to select the correct resulting observation $o_{t+1}$ from a set of candidates\. The distractor candidates are constructed by fixing the same initial observation $o_t$ and action type $a$, while varying the numerical value $n$, resulting in alternative observations $\tilde{o}_{t+1}$ that correspond to different transition magnitudes\.
Space2Num\. The model is given $(o_t, o_{t+1}, a)$ and is required to infer the numerical value $n$ that explains the transition\. This task requires grounding visual differences between $o_t$ and $o_{t+1}$ to the corresponding transition magnitude\.
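The distractor construction for Num2Space can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Num2SpaceItem` container, the `build_num2space_item` helper, and the `render` callback (a stand-in for a simulator call that returns an observation ID) are all hypothetical names introduced here.

```python
from dataclasses import dataclass
import random


@dataclass
class Num2SpaceItem:
    initial_obs: str      # ID of the initial observation o_t
    action: str           # action type a, e.g. "rotate_left"
    magnitude: float      # queried numerical value n
    candidates: list      # candidate observation IDs, in shuffled order
    answer_index: int     # index of the true o_{t+1} within candidates


def build_num2space_item(initial_obs, action, true_n, distractor_ns, render, rng):
    """Build one multiple-choice item.

    `render(initial_obs, action, n)` is a hypothetical simulator hook that
    returns the observation reached after applying `action` with magnitude `n`.
    Distractors keep o_t and a fixed and vary only n.
    """
    magnitudes = [true_n] + list(distractor_ns)
    rng.shuffle(magnitudes)
    candidates = [render(initial_obs, action, n) for n in magnitudes]
    return Num2SpaceItem(initial_obs, action, true_n, candidates,
                         magnitudes.index(true_n))


# Usage with a fake renderer that just encodes its arguments in a string.
fake_render = lambda obs, act, n: f"{obs}|{act}|{n}"
item = build_num2space_item("frame0", "rotate_left", 20, [10, 40, 70],
                            fake_render, random.Random(0))
```

Because every candidate shares the same start frame and action type, a model can only pick the right option by relating the number to the size of the visual change.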
### 2\.2 Number as Static Layout
#### Data Collection\.
We build the layout dataset with controlled generation, covering the reference system, layout construction, scene scale, and representation\. \(i\) Coordinate system construction\. Each scene uses a clear coordinate system defined by two anchor objects\. One anchor sets the origin\. The relative position of the two anchors defines a consistent direction\. This fixes the coordinate frame \(up to scale\) and removes ambiguity\. The anchors stay fixed across samples in the same scene\. \(ii\) Layout generation\. Given the coordinate system, we place a third object with different positions and sizes\. We enforce simple constraints: objects do not overlap, and distances are within a reasonable range\. Under the same reference frame, we create three types of changes: \(a\) position only, \(b\) size only, and \(c\) both position and size\. This lets us study each factor in a controlled way\. \(iii\) Scene scale\. We include both desktop\-scale and room\-scale scenes\. This changes the spatial extent and the distribution of objects, and adds diversity\. \(iv\) Representation variation\. For each layout, we build multiple coordinate\-based representations with different dimensions \(1D, 2D, and 3D\)\. These representations describe the same layout in different forms, from simple to more complete ones\. This helps us study how models handle spatial information under different representations\.
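To make the anchor-defined coordinate system concrete, here is a small 2D sketch. It rests on assumptions the text does not state: the distance between the two anchors is taken as the unit length (one way to resolve "up to scale"), the second axis is obtained by a 90 degree counter-clockwise rotation, and the function name `to_anchor_frame` is hypothetical.

```python
import math


def to_anchor_frame(origin_anchor, direction_anchor, target):
    """Express a 2D world point in the frame defined by two anchors.

    The origin anchor maps to (0, 0) and the direction anchor to (1, 0).
    Both conventions are assumptions made for this illustration.
    """
    ax = direction_anchor[0] - origin_anchor[0]
    ay = direction_anchor[1] - origin_anchor[1]
    scale = math.hypot(ax, ay)
    if scale == 0:
        raise ValueError("anchors must be at distinct positions")
    ux, uy = ax / scale, ay / scale   # unit x-axis: origin -> direction anchor
    vx, vy = -uy, ux                  # unit y-axis: x-axis rotated 90 deg CCW
    dx = target[0] - origin_anchor[0]
    dy = target[1] - origin_anchor[1]
    # Project onto each axis, then divide by the anchor distance.
    return ((dx * ux + dy * uy) / scale, (dx * vx + dy * vy) / scale)


coords = to_anchor_frame((1.0, 1.0), (3.0, 1.0), (2.0, 2.0))
```

The same world point gets different coordinates if the anchors move, which is why reading "left in the image" as "smaller x" is not a valid shortcut in this setting.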
#### Task Definition\.
Let $\mathcal{M}$ denote a number\-based cognitive map, $o$ the layout observation, and $p$ the numerical coordinates of a target object under a given reference frame\.
Num2Space\. The model is given a cognitive map $\mathcal{M}$ and is required to select the observation $o$ that is consistent with the specified layout\. Distractor candidates are constructed by varying object positions or sizes while preserving the same reference frame\.
Space2Num\. The model is given an observation $o$ and is required to infer the numerical coordinates $p$ of a target object under the reference coordinate system\. This task requires grounding visual spatial structure into numerical representations\.
Figure 2: Dataset statistics\.
### 2\.3 Statistics
Figure [2](https://arxiv.org/html/2605.23898#S2.F2) summarizes the composition of the benchmark, which contains 3,800 samples\. We further use the same fully automatic pipeline to generate an additional 77,412 training samples for later training\-based explorations\. The detailed breakdown of this larger training set is also shown in gray in Figure [2](https://arxiv.org/html/2605.23898#S2.F2)\.
| Methods | Rank | Avg. | Dyn. Num2Space Move F/B | Dyn. Num2Space Move L/R | Dyn. Num2Space Rotate U/D | Dyn. Num2Space Rotate L/R | Dyn. Space2Num Move F/B | Dyn. Space2Num Move L/R | Dyn. Space2Num Rotate U/D | Dyn. Space2Num Rotate L/R | Static Num2Space 1D-Map D | Static Num2Space 1D-Map R | Static Num2Space 2D-Map D | Static Num2Space 2D-Map R | Static Num2Space 3D-Map D | Static Num2Space 3D-Map R | Static Space2Num 1D-Map D | Static Space2Num 1D-Map R | Static Space2Num 2D-Map D | Static Space2Num 2D-Map R | Static Space2Num 3D-Map D | Static Space2Num 3D-Map R |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random Guess | - | 30.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 50.0 | 50.0 | 25.0 | 25.0 | 25.0 | 25.0 | 50.0 | 50.0 | 25.0 | 25.0 | 25.0 | 25.0 |
| Qwen2.5-VL-72B | 1 | 39.8 | 34.0 | 38.0 | 34.0 | 37.0 | 40.0 | 37.0 | 44.0 | 41.0 | 69.0 | 64.5 | 28.0 | 24.2 | 36.0 | 26.8 | 60.0 | 51.2 | 33.0 | 33.8 | 31.0 | 32.8 |
| InternVL3.5-38B | 2 | 39.5 | 38.0 | 27.0 | 30.0 | 29.0 | 42.0 | 38.0 | 47.0 | 42.0 | 69.0 | 52.8 | 31.0 | 24.2 | 35.0 | 23.2 | 53.0 | 54.5 | 43.0 | 32.5 | 40.0 | 38.2 |
| Qwen2.5-VL-32B | 3 | 38.5 | 32.0 | 30.0 | 36.0 | 22.0 | 37.0 | 33.0 | 41.0 | 38.0 | 71.0 | 67.0 | 25.0 | 23.2 | 37.0 | 25.2 | 63.0 | 55.8 | 38.0 | 28.5 | 34.0 | 33.2 |
| InternVL3.5-14B | 4 | 38.2 | 36.0 | 32.0 | 37.0 | 27.0 | 40.0 | 35.0 | 53.0 | 48.0 | 71.0 | 66.8 | 20.0 | 24.0 | 27.0 | 25.5 | 53.0 | 54.8 | 30.0 | 27.5 | 34.0 | 23.0 |
| Qwen3-VL-32B | 5 | 35.9 | 26.0 | 30.0 | 36.0 | 25.0 | 36.0 | 49.0 | 44.0 | 32.0 | 68.0 | 50.2 | 30.0 | 20.8 | 32.0 | 22.8 | 58.0 | 57.2 | 28.0 | 23.0 | 29.0 | 20.8 |
| InternVL3.5-8B | 6 | 34.8 | 30.0 | 28.0 | 35.0 | 29.0 | 45.0 | 30.0 | 38.0 | 28.0 | 64.0 | 64.8 | 21.0 | 22.5 | 31.0 | 22.0 | 53.0 | 52.8 | 36.0 | 19.2 | 25.0 | 20.8 |
| Ovis2.5-9B | 7 | 34.7 | 22.0 | 32.0 | 31.0 | 23.0 | 36.0 | 44.0 | 41.0 | 27.0 | 70.0 | 66.2 | 17.0 | 25.0 | 21.0 | 24.8 | 53.0 | 58.5 | 24.0 | 28.7 | 26.0 | 24.5 |
| InternVL3.5-4B | 8 | 34.5 | 26.0 | 29.0 | 25.0 | 21.0 | 35.0 | 29.0 | 34.0 | 36.0 | 70.0 | 61.0 | 30.0 | 23.2 | 30.0 | 22.8 | 56.0 | 58.2 | 30.0 | 18.8 | 38.0 | 18.0 |
| Qwen3-VL-8B | 9 | 33.4 | 26.0 | 33.0 | 30.0 | 25.0 | 35.0 | 33.0 | 43.0 | 30.0 | 37.0 | 43.8 | 26.0 | 30.0 | 24.0 | 26.0 | 57.0 | 49.5 | 39.0 | 22.0 | 35.0 | 22.8 |
| Ovis2.5-2B | 10 | 33.2 | 26.0 | 22.0 | 29.0 | 31.0 | 27.0 | 27.0 | 23.0 | 24.0 | 71.0 | 67.0 | 28.0 | 22.0 | 27.0 | 24.5 | 51.0 | 49.5 | 28.0 | 26.2 | 33.0 | 27.8 |
| Cosmos-Reason2-8B | 11 | 33.1 | 24.0 | 37.0 | 29.0 | 25.0 | 31.0 | 26.0 | 27.0 | 33.0 | 57.0 | 53.5 | 20.0 | 28.0 | 20.0 | 27.0 | 58.0 | 50.7 | 34.0 | 23.8 | 30.0 | 27.3 |
| Qwen2.5-VL-7B | 12 | 33.0 | 37.0 | 22.0 | 30.0 | 32.0 | 29.0 | 29.0 | 27.0 | 30.0 | 71.0 | 67.0 | 21.0 | 26.0 | 25.0 | 27.5 | 46.0 | 47.5 | 29.0 | 23.5 | 20.0 | 20.5 |
| Qwen3-VL-4B | 13 | 32.1 | 22.0 | 29.0 | 26.0 | 26.0 | 31.0 | 35.0 | 29.0 | 32.0 | 41.0 | 55.2 | 28.0 | 23.5 | 23.0 | 24.5 | 57.0 | 56.0 | 33.0 | 20.2 | 31.0 | 19.5 |
| Qwen2.5-VL-3B | 14 | 31.9 | 24.0 | 20.0 | 23.0 | 29.0 | 26.0 | 16.0 | 25.0 | 20.0 | 71.0 | 67.0 | 19.0 | 24.8 | 30.0 | 28.0 | 55.0 | 41.5 | 34.0 | 25.5 | 41.0 | 17.8 |
| Cosmos-Reason2-2B | 15 | 31.6 | 28.0 | 22.0 | 23.0 | 25.0 | 23.0 | 26.0 | 24.0 | 26.0 | 71.0 | 67.0 | 13.0 | 27.0 | 13.0 | 23.5 | 48.0 | 55.2 | 25.0 | 27.0 | 39.0 | 27.3 |
| Gemma-3-27B | 16 | 31.2 | 27.0 | 25.0 | 34.0 | 16.0 | 24.0 | 29.0 | 25.0 | 27.0 | 50.0 | 43.2 | 25.0 | 23.8 | 22.0 | 22.8 | 54.0 | 49.0 | 32.0 | 24.5 | 41.0 | 29.0 |
| Gemma-3-12B | 17 | 30.6 | 21.0 | 26.0 | 35.0 | 21.0 | 28.0 | 29.0 | 27.0 | 21.0 | 67.0 | 55.8 | 27.0 | 22.5 | 24.0 | 22.0 | 48.0 | 42.2 | 25.0 | 19.5 | 25.0 | 25.8 |
| Gemma-3-4B | 18 | 28.5 | 38.0 | 19.0 | 25.0 | 21.0 | 20.0 | 25.0 | 24.0 | 26.0 | 35.0 | 34.0 | 24.0 | 23.2 | 22.0 | 21.2 | 56.0 | 45.8 | 28.0 | 27.8 | 30.0 | 24.5 |
Table 2: Results on the SpaceNum benchmark\. Accuracy \(%\) is reported under two major categories: Dynamic Transition and Static Layout\. Each category contains both Num2Space and Space2Num\. Avg\. denotes the macro\-average\. Bold and underline denote the best and second\-best results, and gray values indicate performance below random guessing\.
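As a quick arithmetic check, the Avg. column appears to be the unweighted mean over the 20 sub-task columns. That reading is an assumption (the caption says only "macro-average"), but it reproduces the reported values for the Random Guess and Qwen2.5-VL-72B rows:

```python
def macro_average(scores):
    """Unweighted mean over sub-task accuracies (assumed definition of Avg.)."""
    return sum(scores) / len(scores)


# Sub-task values copied from Table 2, in column order:
# 8 dynamic-transition columns, then 6 + 6 static-layout columns.
random_guess = [25.0] * 8 + [50.0, 50.0, 25.0, 25.0, 25.0, 25.0] * 2
qwen25_vl_72b = [
    34.0, 38.0, 34.0, 37.0, 40.0, 37.0, 44.0, 41.0,   # dynamic transition
    69.0, 64.5, 28.0, 24.2, 36.0, 26.8,               # static Num2Space
    60.0, 51.2, 33.0, 33.8, 31.0, 32.8,               # static Space2Num
]

random_avg = macro_average(random_guess)              # matches the 30.0 baseline
best_avg = round(macro_average(qwen25_vl_72b), 1)     # matches the reported 39.8
```

Under this reading, the 30.0% baseline mixes 25% chance columns with four 50% chance columns (the 1D-Map settings), so gaps to the baseline are best judged per column.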
## 3 Experiments
#### Experimental Setup\.
We evaluate 18 VLMs from 6 model families on SpaceNum, ranging from 2B to 72B parameters \[[1](https://arxiv.org/html/2605.23898#bib.bib4),[25](https://arxiv.org/html/2605.23898#bib.bib5),[26](https://arxiv.org/html/2605.23898#bib.bib6),[17](https://arxiv.org/html/2605.23898#bib.bib7),[21](https://arxiv.org/html/2605.23898#bib.bib8),[8](https://arxiv.org/html/2605.23898#bib.bib9)\]\. All models are evaluated with the same prompt format, where they are instructed to directly output the option letter without explanations or intermediate reasoning\. We run inference in bfloat16 precision with Flash Attention 2 for efficient evaluation, with temperature set to 0\.7, top\-p to 0\.9, and top\-k to 50\. All experiments are run on 4 NVIDIA H100 \(80GB\) GPUs\.
### 3\.1 Overall Results
#### Do VLMs possess spatial numerical understanding?
As shown in Table [2](https://arxiv.org/html/2605.23898#S2.T2), current VLMs struggle to genuinely understand numerical values in spatial settings\. Their performance remains close to random guessing \(30\.0%\), with the best model reaching only 39\.8% on average, and several models even falling below the random baseline on individual sub\-tasks\. These results suggest that current models only capture shallow spatial\-number correlations instead of truly grounding numerical values in spatial meaning\.
#### What patterns emerge across different spatial scenarios?
Dynamic transitions and static layouts exhibit fundamentally different difficulty structures\. In dynamic transitions, performance remains consistently low across all action types, with strong models achieving only around 40\.0%, just 10 points above the random baseline \(30\.0%\)\. Models show little preference or specialization across actions, suggesting a broad failure to model transition dynamics\. In contrast, static layouts exhibit much clearer structural patterns: models perform relatively well in simpler settings such as 1D layouts and desk\-scale scenes, but degrade substantially in higher\-dimensional and room\-scale settings, often only marginally above the 25\.0% random baseline\. This suggests that layout reasoning difficulty grows systematically with spatial complexity and scene scale\.
#### How does spatial numerical mapping differ across scenarios?
The preferred mapping direction differs substantially across scenarios\. In dynamic transitions, models consistently perform better in Space2Num than in Num2Space, suggesting that dynamic transitions are more vision\-dependent: models benefit from observing spatial changes directly, but struggle to predict future visual outcomes from numerical actions alone\. In contrast, static layouts show the opposite trend, where Num2Space consistently outperforms Space2Num\. This suggests that static layouts rely more on language\-side spatial priors, where models can project numerical structures into space more easily than recovering structured numerical representations from visual scenes\.
\(a\) Error proximity in dynamic transitions\.
\(b\) Error decomposition in static layouts\.
Figure 3: Structured analysis of model errors across spatial scenarios\. Left: larger models tend to make numerically closer mistakes in dynamic transitions\. Right: static layout failures are dominated by coupled position\-and\-size errors rather than isolated attribute errors\.
### 3\.2 Structured Analysis of Output Patterns
#### Are larger models making better mistakes?
Beyond standard multiple\-choice accuracy, SpaceNum enables a more structured analysis of model behavior by leveraging the semantic relations among answer choices in different spatial scenarios\. For dynamic transitions, we analyze not only exact\-match accuracy but also the semantic proximity between the selected answer and the ground truth\. Specifically, we assign scores of \{100, 70, 40, 0\} to exact, near, moderate, and far errors according to the numerical distance between the predicted and correct transition magnitudes\. Figure [3\(a\)](https://arxiv.org/html/2605.23898#S3.F3.sf1) shows a clear trend: as model size increases, predictions become progressively closer to the correct answer even when exact\-match accuracy changes only slightly\. Larger models make less severe transition errors, suggesting that scaling improves coarse spatial sensitivity even when precise numerical grounding remains difficult\.
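One way to realize the \{100, 70, 40, 0\} scoring is to rank the candidate magnitudes and score by how many ranks separate the prediction from the ground truth. The rank-based binning below is an assumption made for illustration; the text says only that scores depend on numerical distance, and `proximity_score` is a hypothetical helper.

```python
# Rank distance -> score; any larger distance is treated as a "far" error.
PROXIMITY_SCORES = {0: 100, 1: 70, 2: 40}


def proximity_score(candidate_magnitudes, predicted, correct):
    """Score a multiple-choice answer by rank distance to the correct magnitude."""
    ordered = sorted(candidate_magnitudes)
    distance = abs(ordered.index(predicted) - ordered.index(correct))
    return PROXIMITY_SCORES.get(distance, 0)


options = [10, 30, 50, 70]   # hypothetical rotation magnitudes in degrees
scores = [proximity_score(options, p, correct=30) for p in options]
```

Averaging this score over a test set separates a model that is usually one option off from a model that guesses uniformly, even when both have the same exact-match accuracy.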
#### Do spatial errors decompose across attributes?
For static layouts, we categorize errors according to whether the predicted layout contains incorrect position, incorrect size, or both\. Surprisingly, models consistently favor joint position\-and\-size errors over single\-factor errors across model families, as shown in Figure [3\(b\)](https://arxiv.org/html/2605.23898#S3.F3.sf2)\. Static layout failures are strongly coupled across spatial attributes: once models fail to establish a coherent layout, errors tend to propagate jointly across position and scale rather than remain isolated\. This suggests that current VLMs rely more on coarse holistic matching than disentangled spatial reasoning\.
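The error decomposition amounts to comparing the chosen candidate with the ground-truth layout attribute by attribute. A minimal sketch, assuming each candidate is described by a position tuple and a scalar size (the dictionary keys and the function name are hypothetical):

```python
from collections import Counter


def classify_layout_error(chosen, truth, tol=1e-6):
    """Label a chosen layout as correct, or wrong in position, size, or both."""
    position_wrong = any(abs(c - t) > tol
                         for c, t in zip(chosen["position"], truth["position"]))
    size_wrong = abs(chosen["size"] - truth["size"]) > tol
    if position_wrong and size_wrong:
        return "position+size"
    if position_wrong:
        return "position"
    if size_wrong:
        return "size"
    return "correct"


truth = {"position": (1.0, 2.0), "size": 0.5}
picks = [
    {"position": (1.0, 2.0), "size": 0.5},   # correct
    {"position": (1.5, 2.0), "size": 0.5},   # position only
    {"position": (1.0, 2.0), "size": 0.8},   # size only
    {"position": (0.0, 3.0), "size": 0.8},   # both
]
tally = Counter(classify_layout_error(p, truth) for p in picks)
```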
### 3\.3 Does Reasoning Help Spatial Numerical Understanding?
To answer this question, we compare reasoning\-enabled \(think\) and standard \(non\-think\) inference across InternVL3\.5\-4B/8B/14B and Qwen3\-VL\-4B/8B/32B\. Surprisingly, enabling reasoning produces only marginal changes on SpaceNum, with performance differences typically remaining within 1%\. This suggests that simply generating longer reasoning traces does not substantially improve spatial numerical understanding\. We therefore further analyze model traces and identify several recurring failure patterns that explain why reasoning often fails\.
#### Models stop at coarse spatial cues instead of performing fine\-grained comparison\.
A common failure is that models identify a plausible spatial cue and terminate reasoning too early\. For example, in dynamic transition tasks, a model may observe that “a new wooden sculpture becomes visible on the left” and immediately select the corresponding candidate\. However, the correct solution requires one more step: comparing how far objects shift across candidates to determine the correct transition magnitude\. Similarly, in static layout tasks, models often correctly identify cues such as “the sofa is left of the tree,” but fail to compare object size across candidates\. In both settings, the model performs coarse cue matching but misses the finer comparison needed to disambiguate similar options\.
#### Models fail to reason counterfactually about motion magnitude\.
Successful Space2Num reasoning often depends on counterfactual magnitude comparison\. Correct traces do not only check what changed, but also whether the observed change is large enough to support a candidate magnitude\. For example, when estimating a small rotation, correct models explicitly reason that “most objects remain aligned across the two views,” and therefore “a 70° rotation would produce much larger layout changes\.” In contrast, incorrect traces often map any noticeable visual change directly to a large number, e\.g\., “the perspective changes noticeably, suggesting a large right rotation\.” These traces focus only on changed evidence while ignoring stable evidence\.
#### Models reason in image space instead of the defined coordinate system\.
Another recurring failure is that models rely on generic image\-space priors rather than constructing the coordinate system defined by the anchor objects\. For instance, some traces directly map “left in the image” to a smaller $x$ value, reasoning that “the piano is positioned on the left side of the image, so it should have a smaller x\-coordinate\.” However, the correct solution requires first establishing the coordinate frame using the provided anchors and then reasoning relative to that frame\. Similarly, models may correctly describe an object as “behind” another object but still assign the wrong depth direction because they fail to align the scene with the task\-defined coordinate system\.
\(a\) Blind testing\.
\(b\) Per\-action mapping asymmetry\.
\(c\) Rotational symmetry analysis\.
Figure 4: Additional analyses under dynamic transitions\. Top left: blind testing by masking visual inputs\. Top right: per\-action comparison between Num2Space and Space2Num\. Bottom: rotational symmetry analysis under equivalent transformations\.
### 3\.4 Modality Asymmetry in Spatial Numerical Understanding
#### How much do models rely on visual information?
To examine whether models truly depend on visual grounding, we conduct a blind testing study by replacing images with fully black inputs while keeping the task format unchanged\. As shown in Figure [4\(a\)](https://arxiv.org/html/2605.23898#S3.F4.sf1), masking visual inputs causes a substantial performance drop for number as dynamic transition, while the effect is much smaller for number as static layout\. Dynamic transitions are significantly more vision\-dependent, whereas static layouts can often be partially solved through language\-side priors or shortcut patterns without fully grounding the visual scene\.
#### Is spatial numerical mapping balanced across actions?
We further compare Num2Space and Space2Num at the level of individual actions\. Figure [4\(b\)](https://arxiv.org/html/2605.23898#S3.F4.sf2) shows that Space2Num outperforms Num2Space for almost every action type\. The asymmetry between the two mapping directions persists even under the same underlying action dynamics, suggesting that models are systematically better at grounding numbers from observed visual changes than predicting future visual outcomes from numerical actions\.
#### Do models learn geometrically consistent spatial mappings?
Finally, we probe Space2Num under rotational symmetry transformations\. Ideally, equivalent actions such as rotating left by $20^{\circ}$ and rotating right by $340^{\circ}$ should lead to consistent numerical predictions\. However, Figure [4\(c\)](https://arxiv.org/html/2605.23898#S3.F4.sf3) shows substantial performance drops under these symmetric transformations\. The mapping from vision to numbers lacks geometric consistency and invariance, suggesting that models fail to build stable numerical representations from visual spatial changes\.
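The equivalence between a left rotation of $20^{\circ}$ and a right rotation of $340^{\circ}$ follows from reducing the signed yaw change modulo 360. A minimal sketch (the sign convention and function names are choices made here, not taken from the paper):

```python
def net_yaw(direction, degrees):
    """Net yaw change in [0, 360), counting left rotations as positive."""
    sign = {"left": 1, "right": -1}[direction]
    return (sign * degrees) % 360


def same_final_heading(action_a, action_b):
    """True if two (direction, degrees) rotations end at the same heading."""
    return net_yaw(*action_a) == net_yaw(*action_b)


equivalent = same_final_heading(("left", 20), ("right", 340))
different = same_final_heading(("left", 20), ("right", 20))
```

Both actions produce the same final view, so a geometrically consistent model should answer the two phrasings of the question with equal accuracy.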

| Model | Add Anchor (Transition) | Reduce Objects (Layout) |
| --- | --- | --- |
| InternVL3.5-4B | -0.3% | -1.6% |
| InternVL3.5-8B | -1.3% | -0.0% |
| InternVL3.5-14B | -2.3% | -0.6% |
| InternVL3.5-38B | +0.9% | +0.1% |
| Qwen3-VL-4B | +0.5% | -1.1% |
| Qwen3-VL-8B | -2.5% | -0.1% |
| Qwen3-VL-32B | -1.0% | -0.3% |
Figure 5: Visual\-side interventions\. Left: adding anchors for dynamic transitions and reducing objects for static layouts\. Right: both interventions lead to only minor and inconsistent performance changes\.
\(a\) Numerical representation changes\.
\(b\) Visual abstraction for layouts\.
Figure 6: Representation\-side interventions\. Left: changing numerical representations in dynamic transitions and layouts\. Right: simplifying layouts into structured visual abstractions\.
### 3\.5 Disentangling Factors in Spatial Numerical Understanding
#### Can simple visual interventions improve spatial grounding?
We first modify visual inputs in both scenarios\. For dynamic transitions, we add explicit visual anchors to help models measure spatial changes\. For static layouts, we reduce irrelevant objects to simplify visual grounding\. However, Figure [5](https://arxiv.org/html/2605.23898#S3.F5) shows that both interventions lead to only minor and inconsistent performance changes, many of them slightly negative\. The core limitation is not caused by missing visual references or cluttered scenes\.
#### Does the numerical representation itself matter?
We then vary how numerical values are expressed\. Converting numbers into natural language yields negligible gains, while integer\-scaled representations \(e\.g\., meters to centimeters\) provide only limited improvements for larger models in transition tasks\. As shown in Figure [6](https://arxiv.org/html/2605.23898#S3.F6)\(a\), performance in layout reasoning remains largely unchanged\. The bottleneck does not primarily lie in the surface form of numerical representations\.
#### Do models struggle to abstract spatial structure from images?
Since neither visual simplification nor numerical reformulation resolves the issue, we further investigate whether models fail to extract structured spatial representations from raw images\. We therefore replace layout images with progressively more structured abstractions, including points, 2D boxes, and 3D boxes\. Figure [6](https://arxiv.org/html/2605.23898#S3.F6)\(b\) shows that this substantially improves Space2Num, while having a smaller effect on Num2Space\. The main bottleneck lies in vision\-to\-structure abstraction: current VLMs struggle to transform raw visual observations into structured spatial representations suitable for numerical reasoning\.
### 3\.6 Tuning Spatial Numerical Understanding
\(a\) Cross\-dimension tuning transfer\.
\(b\) Training data mixture and scaling\.
Figure 7: Tuning analysis for spatial numerical understanding\. Left: transfer patterns across different spatial dimensions\. Right: effects of data mixture ratios and training scale\.
#### Can spatial reasoning transfer across dimensions?
We fine\-tune Qwen3\-VL\-4B and Qwen3\-VL\-8B with LoRA using a learning rate of $1\times 10^{-4}$, cosine decay with a 0\.1 warmup ratio, bfloat16 precision, a maximum sequence length of 2048, LoRA rank 8 and alpha 16, and an effective batch size of 128 for 3 epochs\. Figure [7](https://arxiv.org/html/2605.23898#S3.F7)\(a\) shows a clear diagonal pattern: tuning on a particular dimension yields the largest improvement on the same dimension, suggesting that different dimensions encode distinct spatial structures\. At the same time, tuning on 1D data also improves performance on 2D and 3D settings, especially for larger models and more clearly in Num2Space\. Lower\-dimensional spatial reasoning can partially transfer to higher\-dimensional settings, although the transfer remains limited\.
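For readers unfamiliar with the schedule, the sketch below computes a learning rate with a 0.1 warmup ratio followed by cosine decay from a peak of $1\times 10^{-4}$. The linear shape of the warmup and the decay all the way to zero are assumptions borrowed from common trainer defaults; the paper states only the peak rate, the decay type, and the warmup ratio. `lr_at` is a hypothetical helper.

```python
import math


def lr_at(step, total_steps, peak_lr=1e-4, warmup_ratio=0.1):
    """Learning rate at `step` under linear warmup then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1000   # hypothetical number of optimizer steps
schedule = [lr_at(s, total) for s in (0, 50, 100, 550, 1000)]
```

With 1000 steps, the rate ramps up over the first 100 steps, peaks at step 100, falls to half the peak at step 550, and reaches zero at the end.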
#### What data recipe leads to the best spatial reasoning ability?
We next vary the ratio between transition and layout data\. As shown in Figure [7](https://arxiv.org/html/2605.23898#S3.F7)\(b\), the best overall performance consistently emerges when transition data accounts for roughly 25% and layout data accounts for roughly 75%\. Increasing the total amount of training data further improves performance under the same ratio\. Both data composition and training scale substantially affect spatial numerical understanding, with layout\-heavy mixtures producing the strongest overall capability\.
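A mixture at a fixed ratio can be drawn as sketched below. The helper `mix_datasets`, the sampling-without-replacement strategy, and the toy pools are illustrative assumptions; the paper does not describe how the mixtures were sampled.

```python
import random


def mix_datasets(transition, layout, total, transition_ratio=0.25, seed=0):
    """Sample `total` examples with the given share of transition data."""
    rng = random.Random(seed)
    n_transition = round(total * transition_ratio)
    n_layout = total - n_transition
    # Sample without replacement from each pool, then shuffle the union.
    mixed = rng.sample(transition, n_transition) + rng.sample(layout, n_layout)
    rng.shuffle(mixed)
    return mixed


# Toy pools standing in for the two data sources.
transition_pool = [("transition", i) for i in range(100)]
layout_pool = [("layout", i) for i in range(300)]
mixed = mix_datasets(transition_pool, layout_pool, total=200)
n_transition = sum(1 for kind, _ in mixed if kind == "transition")
```

Scaling `total` while holding `transition_ratio` fixed corresponds to the scaling axis described above.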
Table 3:Performance improvement under two different reward designs\.
#### Does RL help, and does reward design matter?
We further study RL tuning on the 4B model using GRPO with LoRA rank 64 and alpha 64, a learning rate of $1\times 10^{-5}$, rollout batch size 128, actor batch size 64, and 5 rollouts per prompt\. We compare a strict exact\-match reward and a graded reward based on error magnitude\. As shown in Table [3](https://arxiv.org/html/2605.23898#S3.T3), RL brings only limited gains overall, while graded rewards perform slightly better than strict rewards\.
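The two reward designs and the group-relative normalization used by GRPO can be sketched as follows. The linear form of the graded reward and the `max_error` cutoff are assumptions, since the paper says only that the graded reward depends on error magnitude. Dividing by the population standard deviation is one common choice; implementations differ on this detail.

```python
import statistics


def strict_reward(pred, target):
    """1 for an exact match, 0 otherwise."""
    return 1.0 if pred == target else 0.0


def graded_reward(pred, target, max_error):
    """Reward that decays linearly with absolute error (assumed form)."""
    return max(0.0, 1.0 - abs(pred - target) / max_error)


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize rewards within one prompt's rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Five hypothetical rollouts for one prompt whose correct answer is 20 degrees.
preds = [20, 30, 40, 70, 10]
strict = [strict_reward(p, 20) for p in preds]
graded = [graded_reward(p, 20, max_error=40) for p in preds]
advantages = group_advantages(graded)
```

Under the strict reward, four of the five rollouts are indistinguishable, whereas the graded reward still ranks near misses above far misses, which gives the policy a denser learning signal.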
Table 4: Transferred performance\.
#### Does the learned ability generalize beyond SpaceNum?
Finally, we evaluate tuned models on external spatial reasoning benchmarks\. Table [4](https://arxiv.org/html/2605.23898#S3.T4) shows consistent improvements across all tasks\. Gains on OmniSpatial Motion \[[11](https://arxiv.org/html/2605.23898#bib.bib11)\] indicate better understanding of camera movement, while improvements on SAT Action Consequence and Object Movement \[[24](https://arxiv.org/html/2605.23898#bib.bib10)\] demonstrate stronger reasoning about action outcomes and object dynamics\. The improvements are particularly large for the 8B model\. The learned capability transfers beyond our benchmark, suggesting that tuning improves general spatial reasoning ability rather than merely overfitting to our settings\.
## 4 Related Works
#### Spatial reasoning in dynamic and embodied environments\.
Recent works study whether VLMs can reason about spatial changes caused by actions, motion, and embodied interactions\. SAT evaluates dynamic spatial aptitude through action consequence prediction, object movement, perspective taking, and spatial aiming tasks\[[24](https://arxiv.org/html/2605.23898#bib.bib10)\]\. OmniSpatial provides a comprehensive benchmark for spatial reasoning over camera motion, object motion, perspective transformation, and interaction\-centered scenarios\[[11](https://arxiv.org/html/2605.23898#bib.bib11)\]\. VSI\-Bench evaluates whether MLLMs can see, remember, and recall spatial environments from sequential visual observations\[[30](https://arxiv.org/html/2605.23898#bib.bib12)\]\. MVoT improves spatial reasoning by encouraging models to imagine intermediate visual states during reasoning\[[14](https://arxiv.org/html/2605.23898#bib.bib13)\]\. SpaceTools studies tool\-augmented spatial reasoning through interactive reinforcement learning with external spatial tools\[[4](https://arxiv.org/html/2605.23898#bib.bib14)\]\. These works show that current VLMs struggle with dynamic spatial reasoning and spatial transformations\. However, they mainly focus on whether models understand spatial changes themselves, rather than whether the numerical values parameterizing these transitions are truly grounded in spatial meaning\.
#### Spatial understanding and structured spatial reasoning\.
Another line of work studies whether VLMs can infer spatial relations, metric structure, and 3D layouts from visual observations\. Early benchmarks evaluate relations such as left/right, above/below, and object\-centric configurations, showing that VLMs often struggle with spatial prepositions despite strong object recognition ability \[[16](https://arxiv.org/html/2605.23898#bib.bib15), [12](https://arxiv.org/html/2605.23898#bib.bib16), [23](https://arxiv.org/html/2605.23898#bib.bib17), [9](https://arxiv.org/html/2605.23898#bib.bib18)\]\. More recent works extend this evaluation to metric reasoning, geometric reasoning, open\-space understanding, and domain\-specific 3D reasoning \[[7](https://arxiv.org/html/2605.23898#bib.bib19), [33](https://arxiv.org/html/2605.23898#bib.bib20), [28](https://arxiv.org/html/2605.23898#bib.bib21), [29](https://arxiv.org/html/2605.23898#bib.bib22)\]\. Beyond evaluation, several works inject explicit spatial structures into VLMs through spatial annotations, region\-level grounding, coordinates, distances, layouts, and 3D priors \[[3](https://arxiv.org/html/2605.23898#bib.bib23), [5](https://arxiv.org/html/2605.23898#bib.bib24), [18](https://arxiv.org/html/2605.23898#bib.bib25), [15](https://arxiv.org/html/2605.23898#bib.bib26), [7](https://arxiv.org/html/2605.23898#bib.bib19), [10](https://arxiv.org/html/2605.23898#bib.bib30)\]\. More recently, SpatialReasoner studies explicit and generalizable 3D spatial reasoning through structured spatial representations \[[19](https://arxiv.org/html/2605.23898#bib.bib27)\]\. Together, these works improve structured spatial understanding and reasoning ability in VLMs, but they mainly treat numbers as auxiliary labels or outputs, rather than directly studying whether numerical values themselves are grounded as meaningful spatial quantities\.
In contrast to prior work, SpaceNum directly studies spatial numerical understanding: whether VLMs can ground numerical values as meaningful spatial quantities across both dynamic transitions and static layouts\. Beyond benchmark evaluation, we further analyze the asymmetry, failure patterns, reasoning behaviors, and tuning characteristics of spatial numerical grounding in current VLMs\.
## 5 Conclusion
In this work, we study whether current Vision\-Language Models \(VLMs\) truly understand numerical values in spatial settings through SpaceNum, a unified benchmark covering both dynamic transitions and static layouts\. Our experiments show that current VLMs largely fail to ground numbers in spatial meaning, often relying on shallow spatial cues instead of stable spatial reasoning\. Through systematic analyses, we further show that these failures arise from weak spatial abstraction, asymmetric vision\-number mappings, and the inability to build structured coordinate\-aware representations\. Although tuning partially improves performance and transfers to related benchmarks, substantial gaps still remain\. We hope SpaceNum can serve as a useful benchmark and diagnostic framework for future research on spatial numerical understanding in VLMs\.
#### Limitations and future work\.
Our study mainly focuses on controlled spatial settings with discrete candidate\-based evaluation and simulated environments\. Extending spatial numerical understanding to more open\-ended real\-world scenes, embodied interactions, and continuous spatial prediction settings remains an important direction for future work\. We also mainly analyze failures from the vision and language sides, while how VLMs internally perform spatial reasoning remains largely unexplored\. Although we conduct preliminary attention\-based analyses, severe attention collapse in current VLMs makes it difficult to obtain clear conclusions\. Understanding the internal mechanisms behind spatial numerical reasoning therefore remains an important future direction\.
## References
- \[1\]S\. Bai, K\. Chen, X\. Liu, J\. Wang, W\. Ge, S\. Song, K\. Dang, P\. Wang, S\. Wang, J\. Tang, H\. Zhong, Y\. Zhu, M\. Yang, Z\. Li, J\. Wan, P\. Wang, W\. Ding, Z\. Fu, Y\. Xu, J\. Ye, X\. Zhang, T\. Xie, Z\. Cheng, H\. Zhang, Z\. Yang, H\. Xu, and J\. Lin\(2025\)Qwen2\.5\-vl technical report\.arXiv preprint arXiv:2502\.13923\.Cited by:[§3](https://arxiv.org/html/2605.23898#S3.SS0.SSS0.Px1.p1.1)\.
- \[2\]BlenderKit\(2023\)BlenderKit: online asset library for blender\.Note:[https://www\.blenderkit\.com/](https://www.blenderkit.com/)Cited by:[§2](https://arxiv.org/html/2605.23898#S2.SS0.SSS0.Px1.p1.1)\.
- \[3\]B\. Chen, Z\. Xu, S\. Kirmani, B\. Ichter, D\. Sadigh, L\. Guibas, and F\. Xia\(2024\)Spatialvlm: endowing vision\-language models with spatial reasoning capabilities\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 14455–14465\.Cited by:[§4](https://arxiv.org/html/2605.23898#S4.SS0.SSS0.Px2.p1.1)\.
- \[4\]S\. Chen, M\. A\. Uy, C\. H\. Song, F\. Ladhak, A\. Murali, Q\. Qu, S\. Birchfield, V\. Blukis, and J\. Tremblay\(2025\)SpaceTools: tool\-augmented spatial reasoning via double interactive rl\.arXiv preprint arXiv:2512\.04069\.Cited by:[§4](https://arxiv.org/html/2605.23898#S4.SS0.SSS0.Px1.p1.1)\.
- \[5\]A\. Cheng, H\. Yin, Y\. Fu, Q\. Guo, R\. Yang, J\. Kautz, X\. Wang, and S\. Liu\(2024\)Spatialrgpt: grounded spatial reasoning in vision\-language models\.Advances in Neural Information Processing Systems37,pp\. 135062–135093\.Cited by:[§4](https://arxiv.org/html/2605.23898#S4.SS0.SSS0.Px2.p1.1)\.
- \[6\]W\. Dai, J\. Li, D\. Li, A\. Tiong, J\. Zhao, W\. Wang, B\. Li, P\. N\. Fung, and S\. Hoi\(2023\)Instructblip: towards general\-purpose vision\-language models with instruction tuning\.Advances in neural information processing systems36,pp\. 49250–49267\.Cited by:[§1](https://arxiv.org/html/2605.23898#S1.p1.1)\.
- \[7\]E\. Daxberger, N\. Wenzel, D\. Griffiths, H\. Gang, J\. Lazarow, G\. Kohavi, K\. Kang, M\. Eichner, Y\. Yang, A\. Dehghan,et al\.\(2025\)Mm\-spatial: exploring 3d spatial understanding in multimodal llms\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 7395–7408\.Cited by:[§4](https://arxiv.org/html/2605.23898#S4.SS0.SSS0.Px2.p1.1)\.
- \[8\]Google DeepMind\(2025\)Gemma 3\.Note:[https://deepmind\.google/models/gemma/gemma\-3/](https://deepmind.google/models/gemma/gemma-3/)Accessed: 2026\-05\-01Cited by:[§3](https://arxiv.org/html/2605.23898#S3.SS0.SSS0.Px1.p1.1)\.
- \[9\]M\. Du, B\. Wu, Z\. Li, X\. Huang, and Z\. Wei\(2024\)Embspatial\-bench: benchmarking spatial understanding for embodied tasks with large vision\-language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\),pp\. 346–355\.Cited by:[§1](https://arxiv.org/html/2605.23898#S1.p1.1),[§4](https://arxiv.org/html/2605.23898#S4.SS0.SSS0.Px2.p1.1)\.
- \[10\]Y\. Huang, W\. Xu, W\. Zhang, H\. Zhi, J\. Huang, Y\. Xu, Y\. Sun, C\. Zhu, and T\. Zhao\(2025\)Video2Layout: recall and reconstruct metric\-grounded cognitive map for spatial reasoning\.arXiv preprint arXiv:2511\.16160\.Cited by:[§1](https://arxiv.org/html/2605.23898#S1.p3.1),[§4](https://arxiv.org/html/2605.23898#S4.SS0.SSS0.Px2.p1.1)\.
- \[11\]M\. Jia, Z\. Qi, S\. Zhang, W\. Zhang, X\. Yu, J\. He, H\. Wang, and L\. Yi\(2025\)Omnispatial: towards comprehensive spatial reasoning benchmark for vision language models\.arXiv preprint arXiv:2506\.03135\.Cited by:[§1](https://arxiv.org/html/2605.23898#S1.p1.1),[§3\.6](https://arxiv.org/html/2605.23898#S3.SS6.SSS0.Px4.p1.1),[§4](https://arxiv.org/html/2605.23898#S4.SS0.SSS0.Px1.p1.1)\.
- \[12\]A\. Kamath, J\. Hessel, and K\. Chang\(2023\)What’s “up” with vision\-language models? investigating their struggle with spatial reasoning\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,pp\. 9161–9175\.Cited by:[§4](https://arxiv.org/html/2605.23898#S4.SS0.SSS0.Px2.p1.1)\.
- \[13\]E\. Kolve, R\. Mottaghi, W\. Han, E\. VanderBilt, L\. Weihs, A\. Herrasti, M\. Deitke, K\. Ehsani, D\. Gordon, Y\. Zhu,et al\.\(2017\)Ai2\-thor: an interactive 3d environment for visual ai\.arXiv preprint arXiv:1712\.05474\.Cited by:[§2](https://arxiv.org/html/2605.23898#S2.SS0.SSS0.Px1.p1.1)\.
- \[14\]C\. Li, W\. Wu, H\. Zhang, Y\. Xia, S\. Mao, L\. Dong, I\. Vulić, and F\. Wei\(2025\)Imagine while reasoning in space: multimodal visualization\-of\-thought\.arXiv preprint arXiv:2501\.07542\.Cited by:[§4](https://arxiv.org/html/2605.23898#S4.SS0.SSS0.Px1.p1.1)\.
- \[15\]Y\. Liao, R\. Mahmood, S\. Fidler, and D\. Acuna\(2024\)Reasoning paths with reference objects elicit quantitative spatial reasoning in large vision\-language models\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,pp\. 17028–17047\.Cited by:[§4](https://arxiv.org/html/2605.23898#S4.SS0.SSS0.Px2.p1.1)\.
- \[16\]F\. Liu, G\. Emerson, and N\. Collier\(2023\)Visual spatial reasoning\.Transactions of the Association for Computational Linguistics11,pp\. 635–651\.Cited by:[§1](https://arxiv.org/html/2605.23898#S1.p1.1),[§4](https://arxiv.org/html/2605.23898#S4.SS0.SSS0.Px2.p1.1)\.
- \[17\]S\. Lu, Y\. Li, Y\. Xia, Y\. Hu, S\. Zhao, Y\. Ma, Z\. Wei, Y\. Li, L\. Duan, J\. Zhao,et al\.\(2025\)Ovis2\.5 technical report\.arXiv preprint arXiv:2508\.11737\.Cited by:[§3](https://arxiv.org/html/2605.23898#S3.SS0.SSS0.Px1.p1.1)\.
- \[18\]C\. Ma, K\. Lu, T\. Cheng, N\. Trigoni, and A\. Markham\(2024\)Spatialpin: enhancing spatial reasoning capabilities of vision\-language models through prompting and interacting 3d priors\.Advances in neural information processing systems37,pp\. 68803–68832\.Cited by:[§4](https://arxiv.org/html/2605.23898#S4.SS0.SSS0.Px2.p1.1)\.
- \[19\]W\. Ma, Y\. Chou, Q\. Liu, X\. Wang, C\. de Melo, J\. Xie, and A\. Yuille\(2025\)Spatialreasoner: towards explicit and generalizable 3d spatial reasoning\.arXiv preprint arXiv:2504\.20024\.Cited by:[§1](https://arxiv.org/html/2605.23898#S1.p1.1),[§4](https://arxiv.org/html/2605.23898#S4.SS0.SSS0.Px2.p1.1)\.
- \[20\]NVIDIA Corporation\(2023\)NVIDIA isaac sim\.Note:[https://developer\.nvidia\.com/isaac\-sim](https://developer.nvidia.com/isaac-sim)Cited by:[§2](https://arxiv.org/html/2605.23898#S2.SS0.SSS0.Px1.p1.1)\.
- \[21\]NVIDIA\(2026\)Cosmos\-reason2: open reasoning vision\-language models for physical ai\.Note:[https://huggingface\.co/collections/nvidia/cosmos\-reason2](https://huggingface.co/collections/nvidia/cosmos-reason2)Accessed: 2026\-05\-01Cited by:[§3](https://arxiv.org/html/2605.23898#S3.SS0.SSS0.Px1.p1.1)\.
- \[22\]R\. Pi, J\. Zhang, J\. Zhang, R\. Pan, Z\. Chen, and T\. Zhang\(2024\)Image textualization: an automatic framework for creating accurate and detailed image descriptions\.arXiv preprint arXiv:2406\.07502\.Cited by:[§1](https://arxiv.org/html/2605.23898#S1.p1.1)\.
- \[23\]N\. Rajabi and J\. Kosecka\(2024\)Gsr\-bench: a benchmark for grounded spatial reasoning evaluation via multimodal llms\.arXiv preprint arXiv:2406\.13246\.Cited by:[§4](https://arxiv.org/html/2605.23898#S4.SS0.SSS0.Px2.p1.1)\.
- \[24\]A\. Ray, J\. Duan, E\. Brown, R\. Tan, D\. Bashkirova, R\. Hendrix, K\. Ehsani, A\. Kembhavi, B\. A\. Plummer, R\. Krishna,et al\.\(2024\)Sat: dynamic spatial aptitude training for multimodal language models\.arXiv preprint arXiv:2412\.07755\.Cited by:[§1](https://arxiv.org/html/2605.23898#S1.p1.1),[§3\.6](https://arxiv.org/html/2605.23898#S3.SS6.SSS0.Px4.p1.1),[§4](https://arxiv.org/html/2605.23898#S4.SS0.SSS0.Px1.p1.1)\.
- \[25\]Qwen Team\(2025\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[§3](https://arxiv.org/html/2605.23898#S3.SS0.SSS0.Px1.p1.1)\.
- \[26\]W\. Wang, Z\. Gao, L\. Gu, H\. Pu, L\. Cui, X\. Wei, Z\. Liu, L\. Jing, S\. Ye, J\. Shao,et al\.\(2025\)Internvl3\.5: advancing open\-source multimodal models in versatility, reasoning, and efficiency\.arXiv preprint arXiv:2508\.18265\.Cited by:[§3](https://arxiv.org/html/2605.23898#S3.SS0.SSS0.Px1.p1.1)\.
- \[27\]Z\. Wang, H\. Fang, S\. Wang, Y\. Luo, H\. Dong, W\. Li, and Y\. Gan\(2026\)Hydra\-nav: object navigation via adaptive dual\-process reasoning\.arXiv preprint arXiv:2602\.09972\.Cited by:[§1](https://arxiv.org/html/2605.23898#S1.p2.2)\.
- \[28\]H\. Wu, X\. Huang, Y\. Chen, Y\. Zhang, Y\. Wang, and W\. Xie\(2025\)Spatialscore: towards unified evaluation for multimodal spatial understanding\.arXiv e\-prints,pp\. arXiv–2505\.Cited by:[§4](https://arxiv.org/html/2605.23898#S4.SS0.SSS0.Px2.p1.1)\.
- \[29\]Z\. Xu, Y\. Zhang, S\. Adhikari, S\. Islam, T\. Xiao, Z\. Liu, S\. Chen, D\. Yan, and Z\. Jiang\(2026\)EarthSpatialBench: benchmarking spatial reasoning capabilities of multimodal llms on earth imagery\.arXiv preprint arXiv:2602\.15918\.Cited by:[§4](https://arxiv.org/html/2605.23898#S4.SS0.SSS0.Px2.p1.1)\.
- \[30\]J\. Yang, S\. Yang, A\. W\. Gupta, R\. Han, L\. Fei\-Fei, and S\. Xie\(2025\)Thinking in space: how multimodal large language models see, remember, and recall spaces\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 10632–10643\.Cited by:[§1](https://arxiv.org/html/2605.23898#S1.p1.1),[§1](https://arxiv.org/html/2605.23898#S1.p3.1),[§4](https://arxiv.org/html/2605.23898#S4.SS0.SSS0.Px1.p1.1)\.
- \[31\]Y\. Yang, J\. Liu, Z\. Zhang, S\. Zhou, R\. Tan, J\. Yang, Y\. Du, and C\. Gan\(2025\)MindJourney: test\-time scaling with world models for spatial reasoning\.arXiv preprint arXiv:2507\.12508\.Cited by:[§1](https://arxiv.org/html/2605.23898#S1.p2.2)\.
- \[32\]B\. Yin, Q\. Wang, P\. Zhang, J\. Zhang, K\. Wang, Z\. Wang, J\. Zhang, K\. Chandrasegaran, H\. Liu, R\. Krishna,et al\.\(2025\)Spatial mental modeling from limited views\.InStructural Priors for Vision Workshop at ICCV’25,Cited by:[§1](https://arxiv.org/html/2605.23898#S1.p3.1)\.
- \[33\]W\. Zhang, Z\. Zhou, X\. Zeng, X\. Liu, J\. Fang, C\. Gao, Y\. Li, J\. Cui, X\. Chen, and X\. Zhang\(2025\)Open3D\-vqa: a benchmark for comprehensive spatial reasoning with multimodal large language model in open space\.arXiv preprint arXiv:2503\.11094\.Cited by:[§4](https://arxiv.org/html/2605.23898#S4.SS0.SSS0.Px2.p1.1)\.