SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

Hugging Face Daily Papers 06/11/26, 12:00 AM Papers

Summary

SpatialClaw is a training-free framework that uses code as an action interface to enable flexible, stateful spatial reasoning in vision-language models, achieving superior performance across diverse 3D/4D spatial reasoning tasks.

Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by augmenting VLMs with specialist perception modules, yet their effectiveness is bounded by the action interface through which those tools are invoked. In this work, we study how the design of this interface shapes the agent's capacity for open-ended spatial reasoning. Existing spatial agents either employ single-pass code execution, which commits to a full analysis strategy before any intermediate result is observed, or rely on a structured tool-call interface that often offers less flexibility for freely composing operations or tailoring the analysis to each task. Both designs offer limited flexibility for open-ended, complex 3D/4D spatial reasoning. We therefore propose SpatialClaw, a training-free framework for spatial reasoning that adopts code as the action interface. SpatialClaw maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives, letting a VLM-backed agent write one executable cell per step conditioned on all prior outputs, enabling the agent to flexibly compose and manipulate perception results and adapt its analysis to both intermediate text and visual observations and the demands of each problem. Evaluated across 20 spatial reasoning benchmarks spanning a broad range of static and dynamic 3D/4D spatial reasoning tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the recent spatial agent by +11.2 points, with consistent gains across six VLM backbones from two model families without any benchmark- or model-specific adaptation.

Original Article


Cached at: 06/12/26, 02:52 AM


Source: https://huggingface.co/papers/2606.13673

Abstract

Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by augmenting VLMs with specialist perception modules, yet their effectiveness is bounded by the action interface through which those tools are invoked. In this work, we study how the design of this interface shapes the agent’s capacity for open-ended spatial reasoning. Existing spatial agents either employ single-pass code execution, which commits to a full analysis strategy before any intermediate result is observed, or rely on a structured tool-call interface that often offers less flexibility for freely composing operations or tailoring the analysis to each task. Both designs offer limited flexibility for open-ended, complex 3D/4D spatial reasoning. We therefore propose SpatialClaw, a training-free framework for spatial reasoning that adopts code as the action interface. SpatialClaw maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives, letting a VLM-backed agent write one executable cell per step conditioned on all prior outputs, enabling the agent to flexibly compose and manipulate perception results and adapt its analysis to both intermediate text and visual observations and the demands of each problem. Evaluated across 20 spatial reasoning benchmarks spanning a broad range of static and dynamic 3D/4D spatial reasoning tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the recent spatial agent by +11.2 points, with consistent gains across six VLM backbones from two model families without any benchmark- or model-specific adaptation.
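To make the "one executable cell per step" idea concrete, here is a minimal sketch of a stateful code-as-action loop. It is not SpatialClaw's implementation: `StatefulKernel`, `run_agent`, and `scripted_policy` are hypothetical names, the "kernel" is a plain `exec` namespace rather than a real Python kernel, the policy replays fixed cells instead of querying a VLM, and the pre-loaded `boxes` are hard-coded stand-ins for perception output. Only text output is fed back here; the abstract also mentions visual observations, which this sketch omits.

```python
import contextlib
import io
import traceback


class StatefulKernel:
    """Toy stand-in for a persistent kernel: variables survive across cells."""

    def __init__(self, preloaded=None):
        # Names placed here are visible to every cell (assumed analogue of
        # the pre-loaded frames and primitives described in the abstract).
        self.namespace = dict(preloaded or {})

    def run_cell(self, source):
        """Execute one cell and return its printed output (or a traceback)."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(source, self.namespace)
        except Exception:
            # Errors are returned as text so the next step can react to them.
            buffer.write(traceback.format_exc())
        return buffer.getvalue()


def run_agent(policy, kernel, max_steps=8):
    """Ask the policy for one cell per step, conditioned on all prior outputs."""
    history = []
    for _ in range(max_steps):
        cell = policy(history)
        if cell is None:  # the policy signals that it is done
            break
        history.append((cell, kernel.run_cell(cell)))
    return history


def scripted_policy(history):
    """Hypothetical policy: replays fixed cells; a real agent would call a VLM."""
    cells = [
        "centers = {n: ((x0 + x1) / 2, (y0 + y1) / 2)"
        " for n, (x0, y0, x1, y1) in boxes.items()}\n"
        "print(centers)",
        # The second cell reuses `centers`, which was defined by the first cell.
        "answer = 'left' if centers['cup'][0] < centers['laptop'][0] else 'right'\n"
        "print(answer)",
    ]
    return cells[len(history)] if len(history) < len(cells) else None


# Made-up 2D boxes (x0, y0, x1, y1) standing in for detector output.
kernel = StatefulKernel(preloaded={"boxes": {"cup": (10, 40, 30, 60),
                                             "laptop": (50, 20, 110, 70)}})
history = run_agent(scripted_policy, kernel)
for cell, output in history:
    print(output, end="")
```

The contrast with single-pass execution is that the second cell is chosen only after the first cell's output is available, and it can reuse `centers` without recomputing it because the namespace persists between steps.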





Similar Articles

SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes

RS-Claw: Progressive Active Tool Exploration via Hierarchical Skill Trees for Remote Sensing Agents

Embodied-BenchClaw: An Autonomous Multi-Agent System for Embodied Spatial Intelligence Benchmark Construction

VisualClaw: A Real-Time, Personalized Agent for the Physical World

AlloSpatial: Agentic Harness Framework for Spatial Reasoning in Foundation Models
