FFAvatar: Few-Shot, Feed-Forward, and Generalizable Avatar Reconstruction
Summary
FFAvatar is a feed-forward framework that reconstructs high-quality, animatable 3D Gaussian head avatars from a few unposed images in seconds, achieving a 5.5 PSNR gain over the state-of-the-art LAM on the NeRSemble benchmark.
Source: https://huggingface.co/papers/2605.15320
Abstract
FFAvatar enables fast, high-quality 3D head avatar reconstruction from few unposed images using a feed-forward approach with multi-view fusion and end-to-end FLAME parameter prediction.
Avatar reconstruction has traditionally relied on per-subject optimization that requires hours of computation or on expensive preprocessing that limits scalability. We introduce FFAvatar, a generalizable feed-forward framework that reconstructs high-quality, animatable 3D Gaussian head avatars from few-shot unposed portrait images in seconds. FFAvatar fuses information from multiple source images into a unified canonical Gaussian representation through Multi-View Query-Former, which is animated via FLAME parameters predicted end-to-end directly from pixels, eliminating the overhead of offline FLAME extraction. We further propose a three-stage training curriculum that achieves both broad generalization and high-fidelity reconstruction: (i) scalable pretraining on extensive monocular video data with over 1M identities to learn strong generalizable priors; (ii) multi-view fine-tuning on a small but high-quality dataset of 360-degree captures to enhance geometric fidelity and extreme-view awareness; and (iii) optional personalization that adapts to specific identities for maximum fidelity within 500 optimization steps. Extensive experiments demonstrate that FFAvatar sets a new standard for identity preservation, geometric consistency, and animation fidelity. On the NeRSemble benchmark, it outperforms the state-of-the-art LAM by a substantial 5.5 PSNR gain. Furthermore, FFAvatar enables real-time deployment, reconstructing avatars in 2 seconds without personalization and 10 seconds with personalization, while supporting 49 FPS animation on a single NVIDIA A100 GPU.
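To put the reported numbers in perspective, the short Python sketch below works through what a 5.5 PSNR gain and a 49 FPS animation rate imply arithmetically. It assumes the gain is measured in decibels with the standard PSNR definition and that pixel values are normalized to [0, 1]; the baseline error value is purely hypothetical and is not taken from the paper. The function names are illustrative only.

```python
import math


def psnr(mse: float, max_value: float = 1.0) -> float:
    """Standard PSNR in dB for a given mean squared error."""
    if mse <= 0:
        raise ValueError("mse must be positive")
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which MSE shrinks when PSNR rises by gain_db."""
    return 10.0 ** (gain_db / 10.0)


def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


# Hypothetical baseline error, chosen only to make the arithmetic concrete.
baseline_mse = 0.01
ratio = mse_ratio_for_gain(5.5)
improved_mse = baseline_mse / ratio

print(f"MSE reduction factor for +5.5 dB: {ratio:.2f}x")
print(f"baseline PSNR: {psnr(baseline_mse):.1f} dB")
print(f"improved PSNR: {psnr(improved_mse):.1f} dB")
print(f"per-frame budget at 49 FPS: {frame_budget_ms(49):.1f} ms")
```

Under these assumptions, a 5.5 dB gain corresponds to roughly a 3.5x reduction in mean squared error, and 49 FPS leaves about 20 ms per animated frame.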
Similar Articles
Lite3R: A Model-Agnostic Framework for Efficient Feed-Forward 3D Reconstruction
Lite3R is a model-agnostic framework that improves the efficiency of transformer-based 3D reconstruction using sparse linear attention and FP8-aware quantization. It reduces latency and memory usage by up to 2.4x while maintaining geometric accuracy on backbones like VGGT and DA3-Large.
Multi-view Consistent 3D Gaussian Head Avatars 'without' Multi-view Generation
MVCHead is a novel method for generating 3D Gaussian head avatars from single 2D images without multi-view data, using hierarchical state space models and multi-view consistency enforcement.
FaithfulFaces: Pose-Faithful Facial Identity Preservation for Text-to-Video Generation
FaithfulFaces is a new framework for text-to-video generation that preserves facial identity consistency across varying poses and occlusions using pose-shared alignment and Euler angle embeddings.
tencentarc/gfpgan
GFPGAN is a practical face restoration model by Tencent ARC, available on Replicate. It restores old or low-quality face images with high fidelity.
FRAPPE: Full Input, Residual Output Autoencoding with Projection Pursuit Encoder
FRAPPE is a novel autoencoding framework that uses a projection pursuit encoder to predict residuals from full input, enabling efficient variable-rate image compression with fast CPU-based encoding. At high compression ratios, FRAPPE-Image achieves higher perceptual quality than AVIF with 47x faster encoding, making real-time 1080p 30fps CPU-only encoding possible.