4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

Hugging Face Daily Papers

Summary

4DThinker is a new framework that enables vision-language models to perform dynamic spatial reasoning using 4D latent mental imagery. The paper introduces scalable data generation and novel fine-tuning methods, including 4D Reinforcement Learning, to improve model performance on complex dynamic tasks.

Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatial-temporal reasoning entirely as text, which is inherently verbose and imprecise for complex dynamics, or rely on external geometric modules that increase inference complexity without fostering intrinsic model capability. In this paper, we present 4DThinker, the first framework that enables VLMs to "think with 4D" through dynamic latent mental imagery, i.e., internally simulating how scenes evolve within the continuous hidden space. Specifically, we first introduce a scalable, annotation-free data generation pipeline that synthesizes 4D reasoning data from raw videos. We then propose Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises textual tokens and 4D latents to ground the model in dynamic visual semantics. Building on this, 4D Reinforcement Learning (4DRL) further tackles complex reasoning tasks via outcome-based rewards, restricting policy gradients to text tokens to ensure stable optimization. Extensive experiments across multiple dynamic spatial reasoning benchmarks demonstrate that 4DThinker consistently outperforms strong baselines and offers a new perspective toward 4D reasoning in VLMs. Our code is available at https://github.com/zhangquanchen/4DThinker.
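The two training ideas named in the abstract can be illustrated with a small sketch. The abstract does not give the loss formulas, so everything below is an assumption made for illustration: a mean-squared-error term stands in for the latent supervision in DIFT, a REINFORCE-style surrogate stands in for the 4DRL objective, and the function names `dift_loss` and `masked_policy_gradient_loss` are hypothetical and not taken from the 4DThinker code.

```python
def dift_loss(text_logprobs, pred_latents, target_latents, latent_weight=1.0):
    """Joint supervision sketch: text negative log-likelihood plus a latent term.

    Assumption: the latent term is a mean squared error between predicted
    and target latent vectors. The paper may use a different distance.
    """
    # Average negative log-likelihood over the supervised text tokens.
    nll = -sum(text_logprobs) / len(text_logprobs)

    # Mean squared error over every coordinate of every latent vector.
    squared_error = 0.0
    count = 0
    for pred, target in zip(pred_latents, target_latents):
        for p, t in zip(pred, target):
            squared_error += (p - t) ** 2
            count += 1
    mse = squared_error / count

    return nll + latent_weight * mse


def masked_policy_gradient_loss(logprobs, is_text, advantage):
    """Outcome-reward sketch: only text positions contribute to the loss.

    Assumption: a REINFORCE-style surrogate, -advantage * log pi(token),
    averaged over text tokens. Latent positions are skipped, so no policy
    gradient would flow through them.
    """
    terms = [-advantage * lp for lp, keep in zip(logprobs, is_text) if keep]
    if not terms:
        return 0.0
    return sum(terms) / len(terms)


if __name__ == "__main__":
    # Two text tokens and one 2-dimensional latent vector.
    print(dift_loss([-1.0, -3.0], [[1.0, 2.0]], [[1.0, 4.0]], latent_weight=0.5))
    # The middle position is a latent step and is masked out.
    print(masked_policy_gradient_loss([-1.0, -5.0, -3.0], [True, False, True], 2.0))
```

In this sketch the mask is the whole mechanism: positions flagged as latent never enter the reinforcement-learning loss, which is one plausible reading of "restricting policy gradients to text tokens".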

Cached at: 05/11/26, 02:43 AM


Source: https://huggingface.co/papers/2605.05997
Published on May 7

Submitted by jankin (https://huggingface.co/jankin123) on May 11

#3 Paper of the day

Abstract

4DThinker enables vision-language models to perform dynamic spatial reasoning through 4D latent mental imagery, using scalable data generation and novel fine-tuning methods that outperform existing approaches.



Get this paper in your agent:

hf papers read 2605.05997

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Similar Articles

D4RT: Teaching AI to see the world in four dimensions

Google DeepMind Blog

DeepMind introduces D4RT, a unified AI model for dynamic 4D scene reconstruction and tracking that is up to 300x more efficient than previous methods. The model uses a query-based Transformer architecture to solve complex spatial and temporal tasks for robotics and AR applications.

What Objects Enable, Not What They Are: Functional Latent Spaces for Affordance Reasoning

arXiv cs.LG

This paper introduces A4D, a framework that maps visual observations into a shared latent space structured around affordances (e.g., 'movable') for robot planning. It achieves 94% inference accuracy on existing affordances, outperforming state-of-the-art by 15%, and enables 100x faster inference with superior generalization to unseen object functionalities.

Helix4D: Complex 4D Mesh Generation

Hugging Face Daily Papers

Helix4D introduces a framework for high-quality dynamic 4D mesh generation from video by extending Trellis2 with cross-frame attention and a 4D temporal encoding that repurposes redundant spatial RoPE bands without adding parameters.

Thinking with images

OpenAI Blog

OpenAI releases o3 and o4-mini models that can reason with images in their chain-of-thought process, enabling visual understanding through native image manipulation tools like cropping and zooming without separate specialized models. These models achieve state-of-the-art performance on multimodal benchmarks including STEM questions, chart reading, and visual search tasks.