4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding
Summary
4DThinker is a new framework that enables vision-language models to perform dynamic spatial reasoning using 4D latent mental imagery. The paper introduces scalable data generation and novel fine-tuning methods, including 4D Reinforcement Learning, to improve model performance on complex dynamic tasks.
Source: https://huggingface.co/papers/2605.05997
Published on May 7
Submitted by jankin (https://huggingface.co/jankin123) on May 11
#3 Paper of the day
Abstract
4DThinker enables vision-language models to perform dynamic spatial reasoning through 4D latent mental imagery, using scalable data generation and novel fine-tuning methods that outperform existing approaches.
Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatial-temporal reasoning entirely as text, which is inherently verbose and imprecise for complex dynamics, or rely on external geometric modules that increase inference complexity without fostering intrinsic model capability. In this paper, we present 4DThinker, the first framework that enables VLMs to “think with 4D” through dynamic latent mental imagery, i.e., internally simulating how scenes evolve within the continuous hidden space. Specifically, we first introduce a scalable, annotation-free data generation pipeline that synthesizes 4D reasoning data from raw videos. We then propose Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises textual tokens and 4D latents to ground the model in dynamic visual semantics. Building on this, 4D Reinforcement Learning (4DRL) further tackles complex reasoning tasks via outcome-based rewards, restricting policy gradients to text tokens to ensure stable optimization. Extensive experiments across multiple dynamic spatial reasoning benchmarks demonstrate that 4DThinker consistently outperforms strong baselines and offers a new perspective toward 4D reasoning in VLMs. Our code is available at https://github.com/zhangquanchen/4DThinker.
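The abstract does not give the loss functions, so the following is only a minimal sketch of the two training ideas it names, under stated assumptions. It assumes that DIFT's joint supervision can be approximated as text cross-entropy plus a weighted mean-squared error on latent vectors, and that 4DRL's restriction of policy gradients to text tokens can be approximated as a REINFORCE-style loss averaged only over text positions. All function names, the MSE choice, and the `latent_weight` parameter are hypothetical and are not taken from the paper or its repository.

```python
import math
from typing import Sequence


def text_cross_entropy(logits: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Mean cross-entropy over text positions (numerically stable log-sum-exp)."""
    total = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[target]
    return total / len(targets)


def latent_mse(pred: Sequence[Sequence[float]], target: Sequence[Sequence[float]]) -> float:
    """Mean-squared error over all latent dimensions (assumed latent objective)."""
    squared = [(p - q) ** 2 for pv, tv in zip(pred, target) for p, q in zip(pv, tv)]
    return sum(squared) / len(squared)


def joint_loss(text_logits, text_targets, latent_pred, latent_target,
               latent_weight: float = 1.0) -> float:
    """Assumed joint objective: text cross-entropy plus weighted latent MSE."""
    return (text_cross_entropy(text_logits, text_targets)
            + latent_weight * latent_mse(latent_pred, latent_target))


def text_only_policy_loss(log_probs: Sequence[float], is_text: Sequence[bool],
                          advantage: float) -> float:
    """REINFORCE-style loss averaged over text positions only.

    Latent positions are masked out, so they contribute no policy gradient.
    """
    kept = [lp for lp, keep in zip(log_probs, is_text) if keep]
    if not kept:
        return 0.0
    return -advantage * sum(kept) / len(kept)


# Hypothetical toy inputs: one text token over a 2-word vocabulary, one 2-d latent.
sft = joint_loss([[0.0, 0.0]], [0], [[1.0, 2.0]], [[1.0, 4.0]], latent_weight=0.5)
# Three generated positions; the middle one is a latent and is excluded.
rl = text_only_policy_loss([-1.0, -5.0, -3.0], [True, False, True], advantage=2.0)
```

In this sketch the latent position's log-probability never enters the reinforcement-learning loss, which is one simple way to read "restricting policy gradients to text tokens"; the actual 4DRL objective may differ.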
Similar Articles
D4RT: Teaching AI to see the world in four dimensions
DeepMind introduces D4RT, a unified AI model for dynamic 4D scene reconstruction and tracking that is up to 300x more efficient than previous methods. The model uses a query-based Transformer architecture to solve complex spatial and temporal tasks for robotics and AR applications.
Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators
The paper proposes Astra, an agentic spatial reasoning framework that couples a reinforcement learning-trained VLM policy with a world simulator to generate novel-view observations for improved spatial reasoning in Vision-Language Models.
What Objects Enable, Not What They Are: Functional Latent Spaces for Affordance Reasoning
This paper introduces A4D, a framework that maps visual observations into a shared latent space structured around affordances (e.g., 'movable') for robot planning. It achieves 94% inference accuracy on existing affordances, outperforming state-of-the-art by 15%, and enables 100x faster inference with superior generalization to unseen object functionalities.
Helix4D: Complex 4D Mesh Generation
Helix4D introduces a framework for high-quality dynamic 4D mesh generation from video by extending Trellis2 with cross-frame attention and a 4D temporal encoding that repurposes redundant spatial RoPE bands without adding parameters.
Thinking with images
OpenAI releases o3 and o4-mini models that can reason with images in their chain-of-thought process, enabling visual understanding through native image manipulation tools like cropping and zooming without separate specialized models. These models achieve state-of-the-art performance on multimodal benchmarks including STEM questions, chart reading, and visual search tasks.