4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

Hugging Face Daily Papers 05/07/26, 12:00 AM Papers

Summary

4DThinker is a new framework that enables vision-language models to perform dynamic spatial reasoning using 4D latent mental imagery. The paper introduces scalable data generation and novel fine-tuning methods, including 4D Reinforcement Learning, to improve model performance on complex dynamic tasks.

Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatial-temporal reasoning entirely as text, which is inherently verbose and imprecise for complex dynamics, or rely on external geometric modules that increase inference complexity without fostering intrinsic model capability. In this paper, we present 4DThinker, the first framework that enables VLMs to "think with 4D" through dynamic latent mental imagery, i.e., internally simulating how scenes evolve within the continuous hidden space. Specifically, we first introduce a scalable, annotation-free data generation pipeline that synthesizes 4D reasoning data from raw videos. We then propose Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises textual tokens and 4D latents to ground the model in dynamic visual semantics. Building on this, 4D Reinforcement Learning (4DRL) further tackles complex reasoning tasks via outcome-based rewards, restricting policy gradients to text tokens to ensure stable optimization. Extensive experiments across multiple dynamic spatial reasoning benchmarks demonstrate that 4DThinker consistently outperforms strong baselines and offers a new perspective toward 4D reasoning in VLMs. Our code is available at https://github.com/zhangquanchen/4DThinker.
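As a rough illustration of what "jointly supervises textual tokens and 4D latents" could mean in practice, the sketch below combines a cross-entropy term over text positions with a regression term over latent positions. It is an assumption-laden toy, not the paper's DIFT objective: the step layout, the mean-squared-error choice for latents, and the `latent_weight` parameter are all hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(steps, latent_weight=1.0):
    """Average text loss plus weighted average latent loss.

    Each step is a dict (hypothetical layout):
      {"kind": "text",   "logits": [...], "target": int}
      {"kind": "latent", "pred":   [...], "target": [...]}
    """
    text = [cross_entropy(s["logits"], s["target"])
            for s in steps if s["kind"] == "text"]
    latent = [mse(s["pred"], s["target"])
              for s in steps if s["kind"] == "latent"]
    text_term = sum(text) / len(text) if text else 0.0
    latent_term = sum(latent) / len(latent) if latent else 0.0
    return text_term + latent_weight * latent_term


# One text position and one latent position in the same sequence.
steps = [
    {"kind": "text", "logits": [0.0, 0.0], "target": 0},
    {"kind": "latent", "pred": [1.0, 0.0], "target": [0.0, 0.0]},
]
loss = joint_loss(steps)
```

Setting `latent_weight=0.0` reduces this toy to ordinary text-only fine-tuning, which is one way to see what the latent term adds.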

Cached at: 05/11/26, 02:43 AM

Source: https://huggingface.co/papers/2605.05997 Published on May 7

Submitted by jankin (https://huggingface.co/jankin123) on May 11

#3 Paper of the day Authors:

Abstract

4DThinker enables vision-language models to perform dynamic spatial reasoning through 4D latent mental imagery, using scalable data generation and novel fine-tuning methods that outperform existing approaches.

Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatial-temporal reasoning entirely as text, which is inherently verbose and imprecise for complex dynamics, or rely on external geometric modules that increase inference complexity without fostering intrinsic model capability. In this paper, we present 4DThinker, the first framework that enables VLMs to “think with 4D” through dynamic latent mental imagery, i.e., internally simulating how scenes evolve within the continuous hidden space. Specifically, we first introduce a scalable, annotation-free data generation pipeline that synthesizes 4D reasoning data from raw videos. We then propose Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises textual tokens and 4D latents to ground the model in dynamic visual semantics. Building on this, 4D Reinforcement Learning (4DRL) further tackles complex reasoning tasks via outcome-based rewards, restricting policy gradients to text tokens to ensure stable optimization. Extensive experiments across multiple dynamic spatial reasoning benchmarks demonstrate that 4DThinker consistently outperforms strong baselines and offers a new perspective toward 4D reasoning in VLMs. Our code is available at https://github.com/zhangquanchen/4DThinker.
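The idea of "restricting policy gradients to text tokens" can be pictured with a small masked REINFORCE-style surrogate. This is only a sketch under stated assumptions, not the paper's 4DRL algorithm: the scalar baseline, the mean over text positions, and the names `masked_policy_loss` and `position_weights` are hypothetical.

```python
def masked_policy_loss(log_probs, is_text, reward, baseline=0.0):
    """REINFORCE-style surrogate loss over text positions only.

    log_probs: log-probability the policy assigned to each sampled position
    is_text:   True for text tokens, False for latent (imagery) positions
    reward:    scalar outcome reward for the whole response
    baseline:  scalar subtracted from the reward (assumed; e.g. a running mean)
    """
    advantage = reward - baseline
    text_lp = [lp for lp, t in zip(log_probs, is_text) if t]
    if not text_lp:
        return 0.0
    return -advantage * sum(text_lp) / len(text_lp)


def position_weights(is_text, reward, baseline=0.0):
    """Derivative of the surrogate loss with respect to each log-prob.

    Latent positions get exactly zero, so no policy gradient flows
    through them in this toy setup.
    """
    advantage = reward - baseline
    n_text = sum(1 for t in is_text if t)
    if n_text == 0:
        return [0.0 for _ in is_text]
    return [(-advantage / n_text) if t else 0.0 for t in is_text]


# Three positions: text, latent, text.
log_probs = [-1.0, -2.0, -3.0]
is_text = [True, False, True]
loss = masked_policy_loss(log_probs, is_text, reward=1.0, baseline=-1.0)
weights = position_weights(is_text, reward=1.0, baseline=-1.0)
```

In this toy the latent position contributes nothing to the loss or its gradient, which matches the abstract's stated motivation of keeping optimization stable by updating only on text tokens.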

Get this paper in your agent:

hf papers read 2605.05997

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

D4RT: Teaching AI to see the world in four dimensions

Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

What Objects Enable, Not What They Are: Functional Latent Spaces for Affordance Reasoning

Helix4D: Complex 4D Mesh Generation

Thinking with images
