Fast 4D Mesh Generation by Spatio-Temporal Attention Chains
Summary
A training-free 4D mesh generation approach using Spatio-Temporal Attention Chains accelerates creation to 9 seconds (13x speedup) while improving temporal consistency and scaling to longer sequences, with zero-shot capabilities for tracking and camera estimation.
Source: https://huggingface.co/papers/2605.19786
Published on May 19
Submitted by https://huggingface.co/Dvir Samuel on May 20
Abstract
A training-free 4D mesh generation approach uses spatio-temporal attention chains to accelerate mesh creation while improving temporal correspondence quality and enabling scalable long-sequence processing.
4D mesh generation has recently emerged as a powerful paradigm for recovering dynamic 3D structure from videos, but existing methods remain slow, computationally expensive, and difficult to scale to longer sequences. We introduce a training-free approach that accelerates 4D mesh generation while improving temporal correspondence quality. Our key observation is that temporal correspondences emerge inside a 4D backbone long before its generated meshes become visually accurate. We exploit this with a general framework we call Spatio-Temporal Attention Chain, which propagates information across space and time. Starting from vertices on an anchor mesh, the chain maps vertices to latent tokens. It then follows temporal correspondences in latent space, and recovers frame-specific vertices through latent-to-vertex attention. This design avoids expensive explicit matching while preserving anchor mesh details and thereby improving dynamic mesh geometry and temporal consistency. Compared to state-of-the-art, our method generates a 4D mesh in 9 seconds, achieving a 13x speedup while producing higher-quality results. Moreover, our approach scales to videos up to 16x longer without degrading mesh quality. Beyond generation, the improved correspondences enable competitive zero-shot performance on two downstream tasks: 2D object tracking and 4D tracking. We further show that our framework enables reliable camera estimation, a capability not supported by prior 4D mesh generation methods.
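The chain described in the abstract (anchor vertices to latent tokens, latent tokens across time, latent tokens back to frame-specific vertices) can be pictured as a composition of attention maps. The following is a toy sketch of that idea only, not the authors' implementation: the function names, matrix shapes, and the use of plain softmax attention are all assumptions made for illustration, and the abstract does not specify how the real attention maps are obtained from the 4D backbone.

```python
import math

def softmax_rows(scores):
    """Turn each row of raw scores into attention weights that sum to 1."""
    out = []
    for row in scores:
        peak = max(row)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def chain_vertices(vertex_to_latent, latent_to_latent, latent_to_vertex,
                   frame_points):
    """Compose three attention maps and read out per-frame vertex positions.

    All arguments are hypothetical stand-ins:
      vertex_to_latent: anchor vertices -> anchor-frame latent tokens (V x L)
      latent_to_latent: anchor-frame tokens -> target-frame tokens   (L x L)
      latent_to_vertex: target-frame tokens -> target-frame points   (L x M)
      frame_points:     3D coordinates of the target-frame points    (M x 3)
    """
    chain = matmul(matmul(vertex_to_latent, latent_to_latent),
                   latent_to_vertex)
    return chain, matmul(chain, frame_points)

# Toy data: two anchor vertices, two latent tokens, two target-frame points.
sharp = 50.0  # large score gap, so each softmax row is nearly one-hot
vertex_to_latent = softmax_rows([[sharp, 0.0], [0.0, sharp]])
latent_to_latent = softmax_rows([[0.0, sharp], [sharp, 0.0]])  # tokens swap
latent_to_vertex = softmax_rows([[sharp, 0.0], [0.0, sharp]])
frame_points = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]]

chain, moved = chain_vertices(vertex_to_latent, latent_to_latent,
                              latent_to_vertex, frame_points)
```

Because each factor is row-stochastic, the composed `chain` is too, so every anchor vertex receives a weighted average of target-frame points. In this toy setup the temporal map swaps the two tokens, so anchor vertex 0 lands approximately on the second target point and anchor vertex 1 on the first, without any explicit vertex-to-vertex matching step.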
Similar Articles
Helix4D: Complex 4D Mesh Generation
Helix4D introduces a framework for high-quality dynamic 4D mesh generation from video by extending Trellis2 with cross-frame attention and a 4D temporal encoding that repurposes redundant spatial RoPE bands without adding parameters.
D4RT: Teaching AI to see the world in four dimensions
DeepMind introduces D4RT, a unified AI model for dynamic 4D scene reconstruction and tracking that is up to 300x more efficient than previous methods. The model uses a query-based Transformer architecture to solve complex spatial and temporal tasks for robotics and AR applications.
4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding
4DThinker is a new framework that enables vision-language models to perform dynamic spatial reasoning using 4D latent mental imagery. The paper introduces scalable data generation and novel fine-tuning methods, including 4D Reinforcement Learning, to improve model performance on complex dynamic tasks.
Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion
Pantheon360 introduces a 3D-aware 360° video diffusion framework that uses an explicit 3D cache to enforce geometric consistency, enabling high-fidelity digital twin generation from sparse 360° inputs.
NeuROK: Generative 4D Neural Object Kinematics
This paper introduces NeuROK, a data-driven approach for generative 4D neural object kinematics that learns a latent space and transformer-based encoder-decoder to simulate realistic temporal deformations of static objects under various physical conditions, overcoming limitations of predefined physical models.