AlloSpatial: Agentic Harness Framework for Spatial Reasoning in Foundation Models
Summary
AlloSpatial is an agentic framework that improves spatial reasoning in foundation models by converting egocentric observations into structured allocentric representations through cognitive mapping and tool-use reasoning. In a training-free setting it improves proprietary models by 5%-18% on VSI-Bench and MindCube, and agents trained with cold-start reinforcement learning outperform larger general-purpose models.
Cached at: 06/15/26, 12:58 PM
Source: https://huggingface.co/papers/2606.08952 Published on Jun 8
Submitted by RSW (https://huggingface.co/RSW233) on Jun 15
Abstract
AlloSpatial framework enhances spatial reasoning in foundation models by converting egocentric observations into structured allocentric representations and enabling reliable spatial cognition through cognitive mapping and tool-use reasoning.
Multimodal Foundation Models (MFMs) have made substantial progress, yet remain fragile in spatial reasoning over the physical world. A key bottleneck lies in their inability to transform local egocentric observations into a global allocentric spatial representation. To address this, we propose AlloSpatial, an agentic framework for allocentric spatial cognition in foundation models. AlloSpatial introduces World2Mind, a plug-and-play cognitive mapping sandbox that converts egocentric observations into structured allocentric priors, including Allocentric-Spatial Trees and route maps that support querying object topology, geometric relations, passability, and trajectories. To utilize these priors reliably under noisy reconstruction and ambiguous visual evidence, AlloSpatial introduces a Spatial Reasoning Harness for tool-use judgment, modality-decoupled cue collection, and geometry-semantic arbitration. We further internalize this process in Qwen3-VL through cold-start reinforcement learning with a harness-gated trajectory-level reward. Experiments on VSI-Bench and MindCube show that AlloSpatial improves proprietary models by 5%-18% in a training-free setting, while ASTs alone support strong spatial reasoning even when visual inputs are removed. The trained AlloSpatial agents further outperform larger general-purpose models and competitive spatial baselines, suggesting that structured allocentric representations, active tool use, and verifiable reasoning offer a promising route toward spatially capable foundation models.
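The egocentric-to-allocentric step described in the abstract can be illustrated with a standard 2D rigid transform. The sketch below is not World2Mind or the authors' Allocentric-Spatial Tree; the names `Pose2D`, `ego_to_allo`, `build_map`, and `distance` are hypothetical, and it assumes the camera pose for each view is already known. It only shows how objects seen from different viewpoints can be placed in one shared world frame, after which a geometric relation between them can be queried.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose2D:
    """Hypothetical camera pose in the world frame (heading in radians, 0 = +x axis)."""
    x: float
    y: float
    heading: float


def ego_to_allo(pose: Pose2D, forward: float, left: float) -> tuple[float, float]:
    """Map a point `forward` metres ahead and `left` metres to the left of the
    camera into world (allocentric) coordinates with a 2D rigid transform."""
    cos_h, sin_h = math.cos(pose.heading), math.sin(pose.heading)
    world_x = pose.x + forward * cos_h - left * sin_h
    world_y = pose.y + forward * sin_h + left * cos_h
    return world_x, world_y


def build_map(observations):
    """Merge (pose, label, forward, left) observations into one world-frame map."""
    return {label: ego_to_allo(pose, fwd, left) for pose, label, fwd, left in observations}


def distance(allo_map, a: str, b: str) -> float:
    """Euclidean distance between two mapped objects (an assumed example query)."""
    (ax, ay), (bx, by) = allo_map[a], allo_map[b]
    return math.hypot(ax - bx, ay - by)


# Two views from different poses; the objects never appear in the same frame.
views = [
    (Pose2D(0.0, 0.0, 0.0), "chair", 2.0, 0.0),          # facing +x, chair 2 m ahead
    (Pose2D(2.0, 2.0, -math.pi / 2), "lamp", 1.0, 0.0),  # facing -y, lamp 1 m ahead
]
allo_map = build_map(views)
print(allo_map)
print(distance(allo_map, "chair", "lamp"))
```

Once both objects are in the same frame, the chair-to-lamp relation is a direct lookup, even though no single egocentric view contains both objects.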
Similar Articles
SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning
SpatialClaw is a training-free framework that uses code as an action interface to enable flexible, stateful spatial reasoning in vision-language models, achieving superior performance across diverse 3D/4D spatial reasoning tasks.
What Objects Enable, Not What They Are: Functional Latent Spaces for Affordance Reasoning
This paper introduces A4D, a framework that maps visual observations into a shared latent space structured around affordances (e.g., 'movable') for robot planning. It achieves 94% inference accuracy on existing affordances, outperforming state-of-the-art by 15%, and enables 100x faster inference with superior generalization to unseen object functionalities.
Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators
The paper proposes Astra, an agentic spatial reasoning framework that couples a reinforcement learning-trained VLM policy with a world simulator to generate novel-view observations for improved spatial reasoning in Vision-Language Models.
SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes
SpatialAct is a new simulator-grounded benchmark that probes whether VLM agents can perform coherent spatial reasoning and translate it into actions in 3D environments across multi-turn feedback settings. Experiments reveal a significant reasoning-to-action gap, with current VLMs struggling to maintain spatial beliefs and produce reliable actions despite performing well on isolated reasoning tasks.
From Model Scaling to System Scaling: Scaling the Harness in Agentic AI
This paper argues that advancing agentic AI requires scaling the system architecture around foundation models, focusing on auditable, modular, and verifiable components. The authors introduce CheetahClaws, a reference harness, and outline bottlenecks in context governance, trustworthy memory, and dynamic skill routing.