FFAvatar: Few-Shot, Feed-Forward, and Generalizable Avatar Reconstruction

Hugging Face Daily Papers

Summary

FFAvatar proposes a feed-forward framework for reconstructing high-quality, animatable 3D Gaussian head avatars from few unposed images in seconds, achieving a 5.5 PSNR improvement over state-of-the-art on the NeRSemble benchmark.

Avatar reconstruction has traditionally relied on per-subject optimization that requires hours of computation or on expensive preprocessing that limits scalability. We introduce FFAvatar, a generalizable feed-forward framework that reconstructs high-quality, animatable 3D Gaussian head avatars from few-shot unposed portrait images in seconds. FFAvatar fuses information from multiple source images into a unified canonical Gaussian representation through Multi-View Query-Former, which is animated via FLAME parameters predicted end-to-end directly from pixels, eliminating the overhead of offline FLAME extraction. We further propose a three-stage training curriculum that achieves both broad generalization and high-fidelity reconstruction: (i) scalable pretraining on extensive monocular video data with over 1M identities to learn strong generalizable priors; (ii) multi-view fine-tuning on a small but high-quality dataset of 360-degree captures to enhance geometric fidelity and extreme-view awareness; and (iii) optional personalization that adapts to specific identities for maximum fidelity within 500 optimization steps. Extensive experiments demonstrate that FFAvatar sets a new standard for identity preservation, geometric consistency, and animation fidelity. On the NeRSemble benchmark, it outperforms the state-of-the-art LAM by a substantial 5.5 PSNR gain. Furthermore, FFAvatar enables real-time deployment, reconstructing avatars in 2 seconds without personalization and 10 seconds with personalization, while supporting 49 FPS animation on a single NVIDIA A100 GPU.
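To put the reported numbers in perspective, the following is a minimal sketch of the standard PSNR definition, plus two small helpers that convert a PSNR gain into a mean-squared-error ratio and a frame rate into a per-frame time budget. It assumes the 5.5 gain is reported in dB, as is conventional for PSNR, and that images are scaled to a peak value of 1.0; the function names are illustrative and do not come from the paper.

```python
import numpy as np


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain at the same peak value."""
    return 10.0 ** (gain_db / 10.0)


def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


if __name__ == "__main__":
    print(f"MSE reduction for +5.5 dB: {mse_ratio_for_gain(5.5):.2f}x")
    print(f"Per-frame budget at 49 FPS: {frame_budget_ms(49):.1f} ms")
```

Under these assumptions, a 5.5 dB gain corresponds to roughly a 3.5x lower mean squared error, and 49 FPS leaves about 20 ms per animated frame.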

Cached at: 05/18/26, 02:23 AM


Source: https://huggingface.co/papers/2605.15320

Abstract

FFAvatar enables fast, high-quality 3D head avatar reconstruction from few unposed images using a feed-forward approach with multi-view fusion and end-to-end FLAME parameter prediction.

Avatar reconstruction has traditionally relied on per-subject optimization that requires hours of computation or on expensive preprocessing that limits scalability. We introduce FFAvatar, a generalizable feed-forward framework that reconstructs high-quality, animatable 3D Gaussian head avatars from few-shot unposed portrait images in seconds. FFAvatar fuses information from multiple source images into a unified canonical Gaussian representation through Multi-View Query-Former, which is animated via FLAME parameters predicted end-to-end directly from pixels, eliminating the overhead of offline FLAME extraction. We further propose a three-stage training curriculum that achieves both broad generalization and high-fidelity reconstruction: (i) scalable pretraining on extensive monocular video data with over 1M identities to learn strong generalizable priors; (ii) multi-view fine-tuning on a small but high-quality dataset of 360-degree captures to enhance geometric fidelity and extreme-view awareness; and (iii) optional personalization that adapts to specific identities for maximum fidelity within 500 optimization steps. Extensive experiments demonstrate that FFAvatar sets a new standard for identity preservation, geometric consistency, and animation fidelity. On the NeRSemble benchmark, it outperforms the state-of-the-art LAM by a substantial 5.5 PSNR gain. Furthermore, FFAvatar enables real-time deployment, reconstructing avatars in 2 seconds without personalization and 10 seconds with personalization, while supporting 49 FPS animation on a single NVIDIA A100 GPU.
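The abstract does not spell out how the Multi-View Query-Former works internally. As a rough illustration of the general idea of fusing a variable number of source images into a fixed-size representation, the sketch below uses single-head cross-attention from a fixed set of query vectors to the concatenated tokens of all views. Everything here is an assumption made for illustration: the function `fuse_views`, the random weights, the token counts, and the single-head design are hypothetical and should not be read as the paper's architecture.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_views(queries, view_tokens, w_q, w_k, w_v):
    """Hypothetical cross-attention fusion (not the paper's implementation).

    queries:     (Q, D) fixed set of query vectors
    view_tokens: list of (T_i, D) token arrays, one per source image
    returns:     (Q, D) fused tokens, independent of the number of views
    """
    tokens = np.concatenate(view_tokens, axis=0)  # (sum of T_i, D)
    q = queries @ w_q
    k = tokens @ w_k
    v = tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (Q, sum of T_i)
    return attn @ v


rng = np.random.default_rng(0)
D, Q = 16, 8  # illustrative feature width and number of queries
queries = rng.normal(size=(Q, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

two_views = [rng.normal(size=(10, D)) for _ in range(2)]
four_views = [rng.normal(size=(10, D)) for _ in range(4)]

fused_two = fuse_views(queries, two_views, w_q, w_k, w_v)
fused_four = fuse_views(queries, four_views, w_q, w_k, w_v)
print(fused_two.shape, fused_four.shape)
```

In this sketch the output size depends only on the number of queries, so two or four input views produce a fused representation of the same shape. Because no positional or view encoding is added, the result is also independent of the order in which the views are supplied.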



