SpatialAvatar-0: High-Quality 4D Head Avatar with Multi-Stage Reconstruction

Hugging Face Daily Papers

Summary

SpatialAvatar-0 introduces a multi-stage reconstruction method for high-quality 4D head avatars using a shared FLAME-mesh-bound Gaussian representation, achieving superior performance across benchmarks with reduced iterations.

High-quality 4D head avatars from one or a few source portraits are central to telepresence, AR/VR, and digital-human interaction. 3D Gaussian Splatting (3DGS) has emerged as the dominant representation, with two complementary regimes (generalizable feed-forward predictors and per-subject refiners) maturing in parallel. However, existing feed-forward predictors are trained on a single dataset family with a hard-coded source count, inheriting the corresponding domain bias. Per-subject refiners require 300K--600K iterations and rely on adaptive densification that destroys upstream Gaussian layouts, preventing the two regimes from sharing a representation end-to-end. To bridge both regimes we propose SpatialAvatar-0 on a shared FLAME-mesh-bound Gaussian representation: a feed-forward generator with a parameter-free K-source mean-pool and a monocular-temporal to multi-view-spatial two-phase schedule that anchors against identity-prior collapse onto the smaller multi-view set. We further introduce a 10K-iter layout-preserving per-subject refinement loop that freezes the FLAME-binding and Gaussian count and replaces densification with a three-component anti-spike regularization. On VFHQ/HDTF cross-domain zero-shot we surpass the in-domain leader GAGAvatar by +1.5 dB PSNR despite never training on either test domain, and on the SplattingAvatar monocular benchmark we lead every reported metric, surpassing the 300K-iter GeoAvatar by +1.3 dB PSNR at up to 60x shorter per-subject schedule than common SOTA baselines. Website: https://spatialwalk.github.io/SpatialAvatar-0.
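
The "parameter-free K-source mean-pool" can be pictured as a plain average over per-source features, which is why it does not tie the generator to a fixed source count. The following is a minimal sketch in plain Python, assuming each source portrait has already been encoded into a feature vector of the same length; the function name and the flat-vector layout are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence


def mean_pool_sources(source_features: Sequence[Sequence[float]]) -> List[float]:
    """Average K per-source feature vectors into one fused vector.

    Hypothetical helper: there are no learnable weights, so the same
    function accepts any K >= 1 and ignores the order of the sources.
    """
    if not source_features:
        raise ValueError("need at least one source")
    dim = len(source_features[0])
    if any(len(f) != dim for f in source_features):
        raise ValueError("all sources must share the same feature dimension")
    k = len(source_features)
    # Element-wise mean across the K sources.
    return [sum(f[d] for f in source_features) / k for d in range(dim)]


# K = 1 and K = 3 go through the same code path.
one = mean_pool_sources([[1.0, 2.0, 3.0]])
three = mean_pool_sources([[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 2.0, 2.0]])
```

Because the pooling is a mean, reordering the sources or changing K needs no retraining of the pooling step itself, in contrast to the hard-coded source count the abstract attributes to earlier predictors.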

Cached at: 06/22/26, 09:30 AM


Source: https://huggingface.co/papers/2606.15659

Abstract

SpatialAvatar-0 enables high-quality 4D head avatar generation by combining feed-forward prediction with per-subject refinement through a shared Gaussian representation, achieving superior performance across multiple benchmarks.
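
One way to picture a representation shared by both stages is a refinement step that only adjusts per-Gaussian values and never adds, removes, or re-binds Gaussians. The sketch below is a toy illustration in plain Python under stated assumptions: `BoundGaussians`, `face_index`, `local_offset`, and `refine_step` are hypothetical names, the update is ordinary gradient descent, and the paper's three-component anti-spike regularization is not modeled because its terms are not given here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BoundGaussians:
    # Frozen layout (assumed): the mesh triangle each Gaussian is attached to.
    face_index: Tuple[int, ...]
    # Trainable values (assumed): a per-Gaussian offset in the triangle's frame.
    local_offset: List[List[float]]


def refine_step(g: BoundGaussians, grads: List[List[float]], lr: float) -> BoundGaussians:
    """One gradient-descent step that keeps the binding and the count fixed."""
    if len(grads) != len(g.local_offset):
        raise ValueError("expected one gradient per Gaussian")
    new_offsets = [
        [v - lr * dv for v, dv in zip(offset, grad)]
        for offset, grad in zip(g.local_offset, grads)
    ]
    # face_index is passed through untouched: no densification, no pruning.
    return BoundGaussians(face_index=g.face_index, local_offset=new_offsets)


start = BoundGaussians(
    face_index=(0, 0, 1),
    local_offset=[[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
)
after = refine_step(start, grads=[[1.0, 1.0, 1.0]] * 3, lr=0.1)
```

Since the output has the same count and binding as the input, a feed-forward prediction could be handed to such a loop and the result would still be addressable by the same indices, which adaptive densification would not guarantee.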

High-quality 4D head avatars from one or a few source portraits are central to telepresence, AR/VR, and digital-human interaction. 3D Gaussian Splatting (3DGS) has emerged as the dominant representation, with two complementary regimes (generalizable feed-forward predictors and per-subject refiners) maturing in parallel. However, existing feed-forward predictors are trained on a single dataset family with a hard-coded source count, inheriting the corresponding domain bias. Per-subject refiners require 300K--600K iterations and rely on adaptive densification that destroys upstream Gaussian layouts, preventing the two regimes from sharing a representation end-to-end. To bridge both regimes we propose SpatialAvatar-0 on a shared FLAME-mesh-bound Gaussian representation: a feed-forward generator with a parameter-free K-source mean-pool and a monocular-temporal to multi-view-spatial two-phase schedule that anchors against identity-prior collapse onto the smaller multi-view set. We further introduce a 10K-iter layout-preserving per-subject refinement loop that freezes the FLAME-binding and Gaussian count and replaces densification with a three-component anti-spike regularization. On VFHQ/HDTF cross-domain zero-shot we surpass the in-domain leader GAGAvatar by +1.5 dB PSNR despite never training on either test domain, and on the SplattingAvatar monocular benchmark we lead every reported metric, surpassing the 300K-iter GeoAvatar by +1.3 dB PSNR at up to 60x shorter per-subject schedule than common SOTA baselines. Website: https://spatialwalk.github.io/SpatialAvatar-0.
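
The headline gains are quoted in dB of PSNR, which is logarithmic in mean squared error. The sketch below uses the standard PSNR definition to show what +1.5 dB and +1.3 dB would mean as an MSE ratio if the gain applied uniformly to a single image, and checks the "up to 60x" figure against the 10K and 300K--600K iteration counts quoted above. The function names and the [0, 1] pixel range are assumptions for illustration; the paper's evaluation protocol is not described on this page.

```python
import math
from typing import Sequence


def psnr(pred: Sequence[float], target: Sequence[float], max_val: float = 1.0) -> float:
    """Standard PSNR in dB between two equally sized pixel sequences."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio_for_psnr_gain(gain_db: float) -> float:
    """MSE_new / MSE_old implied by a PSNR improvement of gain_db."""
    return 10.0 ** (-gain_db / 10.0)


def schedule_speedup(baseline_iters: int, refine_iters: int) -> float:
    """How many times shorter the refinement schedule is than a baseline."""
    return baseline_iters / refine_iters


ratio_vs_gagavatar = mse_ratio_for_psnr_gain(1.5)   # about 0.708
ratio_vs_geoavatar = mse_ratio_for_psnr_gain(1.3)   # about 0.741
speedup_upper = schedule_speedup(600_000, 10_000)   # 60.0
speedup_lower = schedule_speedup(300_000, 10_000)   # 30.0
```

Under these assumptions, +1.5 dB corresponds to roughly 29% lower MSE and +1.3 dB to roughly 26% lower MSE, while the 60x figure matches the 600K-iteration end of the quoted baseline range (the 300K end gives 30x).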



