FFAvatar: Few-Shot, Feed-Forward, and Generalizable Avatar Reconstruction
May 14, 2026
Authors: Thuan Hoang Nguyen, Jiahao Luo, Yinyu Nie, Hao Li, Gordon Guocheng Qian, Jian Wang
cs.AI
Abstract
Avatar reconstruction has traditionally relied either on per-subject optimization that requires hours of computation or on expensive preprocessing that limits scalability. We introduce FFAvatar, a generalizable feed-forward framework that reconstructs high-quality, animatable 3D Gaussian head avatars from a few unposed portrait images in seconds. FFAvatar fuses information from multiple source images into a unified canonical Gaussian representation through a Multi-View Query-Former. The resulting avatar is animated via FLAME parameters predicted end-to-end directly from pixels, which eliminates the overhead of offline FLAME extraction. We further propose a three-stage training curriculum that achieves both broad generalization and high-fidelity reconstruction: (i) scalable pretraining on extensive monocular video data with over 1M identities to learn strong generalizable priors; (ii) multi-view fine-tuning on a small but high-quality dataset of 360-degree captures to enhance geometric fidelity and extreme-view awareness; and (iii) optional personalization that adapts to a specific identity for maximum fidelity within 500 optimization steps. Extensive experiments demonstrate that FFAvatar sets a new standard for identity preservation, geometric consistency, and animation fidelity. On the NeRSemble benchmark, it outperforms the state-of-the-art LAM by 5.5 dB in PSNR. Furthermore, FFAvatar enables real-time deployment: it reconstructs an avatar in 2 seconds without personalization and in 10 seconds with personalization, and it supports 49 FPS animation on a single NVIDIA A100 GPU.
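To put the reported numbers in perspective, the short sketch below converts a PSNR gain into the equivalent reduction in mean squared error and converts a frame rate into a per-frame time budget. It uses only the standard PSNR definition; the function names and the assumption that images are normalized to [0, 1] are illustrative and are not taken from the paper.

```python
import math


def psnr(mse: float, max_val: float = 1.0) -> float:
    """Standard PSNR in dB for a given mean squared error.

    Assumes pixel values lie in [0, max_val]; max_val = 1.0 is an
    illustrative choice, not a detail stated in the abstract.
    """
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)


# A 5.5 dB PSNR gain corresponds to roughly 3.5x lower MSE.
ratio = mse_ratio_for_gain(5.5)

# 49 FPS leaves about 20.4 ms to animate and render each frame.
frame_budget_ms = 1000.0 / 49

print(f"MSE reduction factor for +5.5 dB: {ratio:.2f}")
print(f"Per-frame budget at 49 FPS: {frame_budget_ms:.1f} ms")
```

Because PSNR is logarithmic in the error, the 5.5 dB gap is independent of the baseline: it implies the same relative MSE reduction whatever absolute PSNR the compared method reaches.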