Fast 4D Mesh Generation by Spatio-Temporal Attention Chains

Hugging Face Daily Papers

Summary

A training-free 4D mesh generation approach using Spatio-Temporal Attention Chains accelerates creation to 9 seconds (13x speedup) while improving temporal consistency and scaling to longer sequences, with zero-shot capabilities for tracking and camera estimation.

4D mesh generation has recently emerged as a powerful paradigm for recovering dynamic 3D structure from videos, but existing methods remain slow, computationally expensive, and difficult to scale to longer sequences. We introduce a training-free approach that accelerates 4D mesh generation while improving temporal correspondence quality. Our key observation is that temporal correspondences emerge inside a 4D backbone long before its generated meshes become visually accurate. We exploit this with a general framework we call Spatio-Temporal Attention Chain, which propagates information across space and time. Starting from vertices on an anchor mesh, the chain maps vertices to latent tokens. It then follows temporal correspondences in latent space and recovers frame-specific vertices through latent-to-vertex attention. This design avoids expensive explicit matching while preserving anchor mesh details, thereby improving dynamic mesh geometry and temporal consistency. Compared to the state of the art, our method generates a 4D mesh in 9 seconds, achieving a 13x speedup while producing higher-quality results. Moreover, our approach scales to videos up to 16x longer without degrading mesh quality. Beyond generation, the improved correspondences enable competitive zero-shot performance on two downstream tasks: 2D object tracking and 4D tracking. We further show that our framework enables reliable camera estimation, a capability not supported by prior 4D mesh generation methods.

Cached at: 05/20/26, 06:39 PM

Source: https://huggingface.co/papers/2605.19786 Published on May 19

Submitted by https://huggingface.co/Dvir Samuel on May 20

Abstract

A training-free 4D mesh generation approach uses spatio-temporal attention chains to accelerate mesh creation while improving temporal correspondence quality and enabling scalable long-sequence processing.

4D mesh generation has recently emerged as a powerful paradigm for recovering dynamic 3D structure from videos, but existing methods remain slow, computationally expensive, and difficult to scale to longer sequences. We introduce a training-free approach that accelerates 4D mesh generation while improving temporal correspondence quality. Our key observation is that temporal correspondences emerge inside a 4D backbone long before its generated meshes become visually accurate. We exploit this with a general framework we call Spatio-Temporal Attention Chain, which propagates information across space and time. Starting from vertices on an anchor mesh, the chain maps vertices to latent tokens. It then follows temporal correspondences in latent space and recovers frame-specific vertices through latent-to-vertex attention. This design avoids expensive explicit matching while preserving anchor mesh details, thereby improving dynamic mesh geometry and temporal consistency. Compared to the state of the art, our method generates a 4D mesh in 9 seconds, achieving a 13x speedup while producing higher-quality results. Moreover, our approach scales to videos up to 16x longer without degrading mesh quality. Beyond generation, the improved correspondences enable competitive zero-shot performance on two downstream tasks: 2D object tracking and 4D tracking. We further show that our framework enables reliable camera estimation, a capability not supported by prior 4D mesh generation methods.
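
The chain described in the abstract (anchor vertices to latent tokens, latent tokens across time, latent tokens back to frame-specific vertices) can be pictured as a composition of attention maps. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the function names, the toy attention scores, and the choice to compose the three maps by plain matrix multiplication are all hypothetical, and the abstract does not specify how the backbone's attention is actually extracted or combined.

```python
import math

def softmax_rows(scores):
    """Turn each row of raw scores into attention weights that sum to 1."""
    out = []
    for row in scores:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - peak) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def attention_chain(vertex_to_latent, latent_to_latent, latent_to_vertex):
    """Compose three attention maps into one anchor-vertex -> frame-point map.

    Assumed (hypothetical) shapes:
      vertex_to_latent : anchor vertices x anchor-frame latent tokens
      latent_to_latent : anchor-frame tokens x target-frame tokens
      latent_to_vertex : target-frame tokens x target-frame candidate points
    """
    return matmul(matmul(vertex_to_latent, latent_to_latent), latent_to_vertex)

def transfer_vertices(chain, frame_points):
    """Place each anchor vertex at the attention-weighted mean of frame points."""
    return matmul(chain, frame_points)

# Toy example: two anchor vertices, two latent tokens, two candidate points.
# The scores are made up; sharp scores make each map close to one-hot.
v2l = softmax_rows([[8.0, 0.0], [0.0, 8.0]])   # vertex i attends to token i
l2l = softmax_rows([[0.0, 8.0], [8.0, 0.0]])   # tokens swap between frames
l2v = softmax_rows([[8.0, 0.0], [0.0, 8.0]])   # token j attends to point j
frame_points = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

chain = attention_chain(v2l, l2l, l2v)
moved = transfer_vertices(chain, frame_points)
print(moved)
```

Because each map is row-stochastic, their product is too, so every transferred vertex is a convex combination of the target frame's candidate points. In this toy setup the anchor vertices follow the swapped tokens and end up near the opposite points, without any explicit vertex-to-vertex matching step.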

Get this paper in your agent:

hf papers read 2605.19786

Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

Helix4D: Complex 4D Mesh Generation

Hugging Face Daily Papers

Helix4D introduces a framework for high-quality dynamic 4D mesh generation from video by extending Trellis2 with cross-frame attention and a 4D temporal encoding that repurposes redundant spatial RoPE bands without adding parameters.

D4RT: Teaching AI to see the world in four dimensions

Google DeepMind Blog

DeepMind introduces D4RT, a unified AI model for dynamic 4D scene reconstruction and tracking that is up to 300x more efficient than previous methods. The model uses a query-based Transformer architecture to solve complex spatial and temporal tasks for robotics and AR applications.

4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

Hugging Face Daily Papers

4DThinker is a new framework that enables vision-language models to perform dynamic spatial reasoning using 4D latent mental imagery. The paper introduces scalable data generation and novel fine-tuning methods, including 4D Reinforcement Learning, to improve model performance on complex dynamic tasks.

NeuROK: Generative 4D Neural Object Kinematics

Hugging Face Daily Papers

This paper introduces NeuROK, a data-driven approach for generative 4D neural object kinematics that learns a latent space and transformer-based encoder-decoder to simulate realistic temporal deformations of static objects under various physical conditions, overcoming limitations of predefined physical models.