Helix4D: Complex 4D Mesh Generation
Summary
Helix4D introduces a framework for high-quality dynamic 4D mesh generation from video by extending Trellis2 with cross-frame attention and a 4D temporal encoding that repurposes redundant spatial RoPE bands without adding parameters.
Cached at: 05/26/26, 06:42 AM
Source: https://huggingface.co/papers/2605.26109
Abstract
Helix4D enables high-quality dynamic mesh generation by adapting Trellis2’s frame-local attention across frames and extending 3D positional encoding with 4D temporal information.
Current video-to-4D methods struggle with complex topology changes, transparent materials, thin structures, and inner surfaces. We present Helix4D, a dynamic mesh generation framework that inherits the expressive representation of Trellis2 and adapts it from image-to-3D to video-conditioned 4D generation. Our design arises from two key questions: (a) how to enable Trellis2’s frame-local attention to share information across frames while preserving its pretrained quality on rare cases such as transparent objects and inner surfaces, and (b) how to inject temporal information into a purely 3D positional encoding without breaking pretrained capabilities. We address (a) with a sliding-window cross-frame attention anchored on the first frame. The first frame is generated by the base Trellis2 model and injected into our model, letting it inherit Trellis2’s quality in rare cases through cross-frame attention. We address (b) with a 4D temporal encoding that repurposes redundant low-frequency spatial RoPE bands for time, extending the encoding from 3D to 4D with no additional parameters. Extensive experiments show the effectiveness of Helix4D for high-quality dynamic mesh generation on ActionBench and our own challenging complex dynamics set.
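The two mechanisms named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract gives no window size, window shape, band count, or time frequencies, so every function name and parameter below (`cross_frame_mask`, `rope_angles_4d`, `window`, `time_bands_per_axis`, the base of 10000) is a hypothetical choice made for illustration. The first function builds a frame-level attention mask with a sliding window plus the first frame as an anchor. The second shows how the slowest RoPE bands of each spatial axis could be driven by a time index while the total band count, and therefore the parameter count, stays unchanged.

```python
import math


def cross_frame_mask(num_frames, window):
    """allowed[i][j] is True when tokens of frame i may attend to frame j.

    Assumption: a symmetric window of +/- `window` frames, plus frame 0 as
    a global anchor. The source does not state the window shape or size.
    """
    return [
        [j == 0 or abs(i - j) <= window for j in range(num_frames)]
        for i in range(num_frames)
    ]


def band_frequencies(num_bands, base=10000.0):
    """Standard RoPE frequency ladder: band 0 is fastest, the last is slowest."""
    return [base ** (-k / num_bands) for k in range(num_bands)]


def rope_angles_3d(x, y, z, bands_per_axis):
    """Rotation angles of a purely spatial 3D RoPE (x bands, then y, then z)."""
    freqs = band_frequencies(bands_per_axis)
    return [c * f for c in (x, y, z) for f in freqs]


def rope_angles_4d(x, y, z, t, bands_per_axis, time_bands_per_axis):
    """Same band count as the 3D version, but the slowest bands of each axis
    are driven by the time index t instead of the spatial coordinate.

    Assumption: which bands are reassigned, and the frequencies used for
    time, are illustrative choices and are not taken from the paper.
    """
    freqs = band_frequencies(bands_per_axis)
    keep = bands_per_axis - time_bands_per_axis
    time_freqs = band_frequencies(time_bands_per_axis)
    angles = []
    for c in (x, y, z):
        angles.extend(c * f for f in freqs[:keep])  # untouched spatial bands
        angles.extend(t * f for f in time_freqs)    # repurposed bands carry time
    return angles


if __name__ == "__main__":
    for row in cross_frame_mask(num_frames=6, window=1):
        print("".join("x" if ok else "." for ok in row))
    print(len(rope_angles_3d(3, 5, 7, 8)), len(rope_angles_4d(3, 5, 7, 2, 8, 2)))
```

In this sketch the repurposed bands produce zero rotation at t = 0 and the remaining spatial bands match the 3D encoding exactly, which is one plausible reading of how a first frame produced by the base model could stay compatible with the extended encoding.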
Similar Articles
Fast 4D Mesh Generation by Spatio-Temporal Attention Chains
A training-free 4D mesh generation approach using Spatio-Temporal Attention Chains accelerates creation to 9 seconds (13x speedup) while improving temporal consistency and scaling to longer sequences, with zero-shot capabilities for tracking and camera estimation.
Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion
Pantheon360 introduces a 3D-aware 360° video diffusion framework that uses an explicit 3D cache to enforce geometric consistency, enabling high-fidelity digital twin generation from sparse 360° inputs.
4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding
4DThinker is a new framework that enables vision-language models to perform dynamic spatial reasoning using 4D latent mental imagery. The paper introduces scalable data generation and novel fine-tuning methods, including 4D Reinforcement Learning, to improve model performance on complex dynamic tasks.
HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds
HY-World 2.0 is a multi-modal world model framework that generates high-fidelity 3D Gaussian Splatting scenes from text, images, and videos through specialized modules for panorama generation, trajectory planning, and scene composition, achieving state-of-the-art performance among open-source approaches.
TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
TrackCraft3R repurposes video diffusion transformers for dense 3D tracking from monocular video, using dual-latent representation and temporal RoPE alignment to achieve state-of-the-art performance with 1.3x faster speed and 4.6x less peak memory than prior methods.