SpatialAvatar-0: High-Quality 4D Head Avatar with Multi-Stage Reconstruction
Summary
SpatialAvatar-0 introduces a multi-stage reconstruction method for high-quality 4D head avatars using a shared FLAME-mesh-bound Gaussian representation, achieving superior performance across benchmarks with reduced iterations.
Cached at: 06/22/26, 09:30 AM
Source: https://huggingface.co/papers/2606.15659
Abstract
SpatialAvatar-0 enables high-quality 4D head avatar generation by combining feed-forward prediction with per-subject refinement through a shared Gaussian representation, achieving superior performance across multiple benchmarks.
High-quality 4D head avatars from one or a few source portraits are central to telepresence, AR/VR, and digital-human interaction. 3D Gaussian Splatting (3DGS) has emerged as the dominant representation, with two complementary regimes (generalizable feed-forward predictors and per-subject refiners) maturing in parallel. However, existing feed-forward predictors are trained on a single dataset family with a hard-coded source count, inheriting the corresponding domain bias. Per-subject refiners require 300K-600K iterations and rely on adaptive densification that destroys upstream Gaussian layouts, preventing the two regimes from sharing a representation end-to-end. To bridge both regimes we propose SpatialAvatar-0 on a shared FLAME-mesh-bound Gaussian representation: a feed-forward generator with a parameter-free K-source mean-pool and a monocular-temporal to multi-view-spatial two-phase schedule that anchors against identity-prior collapse onto the smaller multi-view set. We further introduce a 10K-iter layout-preserving per-subject refinement loop that freezes the FLAME-binding and Gaussian count and replaces densification with a three-component anti-spike regularization. On VFHQ/HDTF cross-domain zero-shot we surpass the in-domain leader GAGAvatar by +1.5 dB PSNR despite never training on either test domain, and on the SplattingAvatar monocular benchmark we lead every reported metric, surpassing the 300K-iter GeoAvatar by +1.3 dB PSNR at up to 60x shorter per-subject schedule than common SOTA baselines. Website: https://spatialwalk.github.io/SpatialAvatar-0.
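The abstract does not say which quantities the parameter-free K-source mean-pool averages, so the following is only a minimal sketch of the general idea, assuming each source portrait has already been encoded into a fixed-length feature vector. The function name `mean_pool` and the toy feature values are hypothetical and not taken from the paper. The sketch shows why such a pool has no learned weights and no hard-coded source count: the same function accepts any K >= 1, and the result does not depend on the order of the sources.

```python
from typing import List, Sequence


def mean_pool(source_features: Sequence[Sequence[float]]) -> List[float]:
    """Average K per-source feature vectors into one fused vector.

    Hypothetical illustration: no learned parameters are involved,
    and K is read from the input rather than fixed in advance.
    """
    if not source_features:
        raise ValueError("need at least one source")
    dim = len(source_features[0])
    if any(len(f) != dim for f in source_features):
        raise ValueError("all sources must share one feature dimension")
    k = len(source_features)
    # Element-wise mean over the K sources.
    return [sum(f[i] for f in source_features) / k for i in range(dim)]


# Toy features (made up for illustration only).
one = mean_pool([[1.0, 2.0]])                                # K = 1
three = mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])      # K = 3
reordered = mean_pool([[5.0, 6.0], [1.0, 2.0], [3.0, 4.0]])  # same sources, shuffled

print(one, three, reordered)
```

With a single source the pool returns that source's features unchanged, and shuffling the three sources gives the same fused vector, which is the property that lets one model handle a variable number of input portraits.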
Similar Articles
FFAvatar: Few-Shot, Feed-Forward, and Generalizable Avatar Reconstruction
FFAvatar proposes a feed-forward framework for reconstructing high-quality, animatable 3D Gaussian head avatars from few unposed images in seconds, achieving a 5.5 PSNR improvement over state-of-the-art on the NeRSemble benchmark.
Multi-view Consistent 3D Gaussian Head Avatars 'without' Multi-view Generation
MVCHead is a novel method for generating 3D Gaussian head avatars from single 2D images without multi-view data, using hierarchical state space models and multi-view consistency enforcement.
Avatar V: Scaling Video-Reference Avatar Video Generation
Avatar V is a production-scale framework for generating behaviorally recognizable avatar videos conditioned on full video references, introducing sparse reference attention and motion representation streams to achieve state-of-the-art identity preservation and lip synchronization.
Fast 4D Mesh Generation by Spatio-Temporal Attention Chains
A training-free 4D mesh generation approach using Spatio-Temporal Attention Chains accelerates creation to 9 seconds (13x speedup) while improving temporal consistency and scaling to longer sequences, with zero-shot capabilities for tracking and camera estimation.
Lift4D: Harmonizing Single-View 3D Estimation for 4D Reconstruction In-the-Wild
Lift4D is a test-time optimization framework that reconstructs complete 4D geometry, appearance, and deformation of dynamic objects from a single monocular in-the-wild video, improving over prior methods on challenging sequences with occlusions and non-rigid motion.