SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning
Summary
SpatialClaw is a training-free framework that uses code as an action interface to enable flexible, stateful spatial reasoning in vision-language models, achieving superior performance across diverse 3D/4D spatial reasoning tasks.
Cached at: 06/12/26, 02:52 AM
Source: https://huggingface.co/papers/2606.13673
Abstract
Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by augmenting VLMs with specialist perception modules, yet their effectiveness is bounded by the action interface through which those tools are invoked. In this work, we study how the design of this interface shapes the agent’s capacity for open-ended spatial reasoning. Existing spatial agents either employ single-pass code execution, which commits to a full analysis strategy before any intermediate result is observed, or rely on a structured tool-call interface that often offers less flexibility for freely composing operations or tailoring the analysis to each task. Both designs offer limited flexibility for open-ended, complex 3D/4D spatial reasoning. We therefore propose SpatialClaw, a training-free framework for spatial reasoning that adopts code as the action interface. SpatialClaw maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives, letting a VLM-backed agent write one executable cell per step conditioned on all prior outputs, enabling the agent to flexibly compose and manipulate perception results and adapt its analysis to both intermediate text and visual observations and the demands of each problem. Evaluated across 20 spatial reasoning benchmarks spanning a broad range of static and dynamic 3D/4D spatial reasoning tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the recent spatial agent by +11.2 points, with consistent gains across six VLM backbones from two model families without any benchmark- or model-specific adaptation.
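The "one executable cell per step" loop described in the abstract can be illustrated with a toy sketch. This is not SpatialClaw's implementation: `StatefulKernel`, `euclidean_distance`, `scripted_policy`, and `run_agent` are hypothetical names invented here, the hard-coded coordinates stand in for real perception outputs, and the scripted policy replaces the VLM (it only counts prior steps, whereas a real agent would read the prior outputs before writing its next cell). The point it tries to show is that variables defined in one cell remain available to later cells, and that errors come back as observations instead of ending the run.

```python
import contextlib
import io
import traceback


class StatefulKernel:
    """Toy stand-in for a persistent Python kernel: all cells share one namespace."""

    def __init__(self, preloaded=None):
        # Names injected before the first step (e.g. frames, helper primitives).
        self.namespace = dict(preloaded or {})

    def run_cell(self, source):
        """Execute one cell and return what it printed (or its traceback)."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            try:
                exec(source, self.namespace)
            except Exception:
                # Errors are captured as text so the next step can react to them.
                print(traceback.format_exc())
        return buffer.getvalue()


def euclidean_distance(p, q):
    """Hypothetical geometry primitive: straight-line distance between 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def scripted_policy(history):
    """Stand-in for the VLM: returns the next cell, or None when finished."""
    cells = [
        # Step 1: pretend a perception tool localized two objects (made-up values).
        "chair = (0.0, 0.0, 0.0)\ntable = (3.0, 4.0, 0.0)\nprint('located 2 objects')",
        # Step 2: reuse the variables that step 1 left in the kernel.
        "d = euclidean_distance(chair, table)\nprint(f'distance={d:.1f}')",
        # Step 3: build the answer from the intermediate result.
        "answer = 'far' if d > 2.0 else 'near'\nprint(answer)",
    ]
    return cells[len(history)] if len(history) < len(cells) else None


def run_agent(kernel, policy, max_steps=8):
    """Alternate between asking the policy for a cell and executing it."""
    history = []  # list of (cell, output) pairs shown to the policy each step
    for _ in range(max_steps):
        cell = policy(history)
        if cell is None:
            break
        history.append((cell, kernel.run_cell(cell)))
    return history


kernel = StatefulKernel(preloaded={"euclidean_distance": euclidean_distance})
history = run_agent(kernel, scripted_policy)
for _, output in history:
    print(output.strip())
```

By contrast, a single-pass design would have to emit all three cells as one script before seeing that two objects were located, which is the limitation the abstract attributes to single-pass code execution.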
Similar Articles
SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes
SpatialAct is a new simulator-grounded benchmark that probes whether VLM agents can perform coherent spatial reasoning and translate it into actions in 3D environments across multi-turn feedback settings. Experiments reveal a significant reasoning-to-action gap, with current VLMs struggling to maintain spatial beliefs and produce reliable actions despite performing well on isolated reasoning tasks.
RS-Claw: Progressive Active Tool Exploration via Hierarchical Skill Trees for Remote Sensing Agents
RS-Claw proposes an active tool exploration paradigm for remote sensing agents using hierarchical skill trees, enabling on-demand sequential decision-making and achieving up to 86% input token compression while outperforming passive selection baselines on Earth-Bench.
Embodied-BenchClaw: An Autonomous Multi-Agent System for Embodied Spatial Intelligence Benchmark Construction
This paper proposes Embodied-BenchClaw, an autonomous multi-agent system that automatically constructs embodied spatial intelligence benchmarks from user intent through a five-stage pipeline with process quality control and an extensible Skill Library.
VisualClaw: A Real-Time, Personalized Agent for the Physical World
VisualClaw is a self-evolving multimodal agent that reduces deployment costs through hybrid encoding and skill evolution, while improving video-QA accuracy across multiple benchmarks.
AlloSpatial: Agentic Harness Framework for Spatial Reasoning in Foundation Models
AlloSpatial is an agentic framework that enhances spatial reasoning in foundation models by converting egocentric observations into structured allocentric representations, using cognitive mapping and tool-use reasoning. It improves performance by 5-18% on benchmarks and outperforms larger models through cold-start reinforcement learning.