FFAvatar: Few-Shot, Feed-Forward, Generalizable Avatar Reconstruction

Abstract

Avatar reconstruction has traditionally relied either on per-subject optimization that takes hours of computation or on expensive preprocessing that limits scalability. We introduce FFAvatar, a generalizable feed-forward framework that reconstructs high-quality, animatable 3D Gaussian head avatars from a few unposed portrait images in seconds. FFAvatar fuses information from multiple source images into a unified canonical Gaussian representation through a Multi-View Query-Former. The representation is animated via FLAME parameters predicted end-to-end directly from pixels, which eliminates the overhead of offline FLAME extraction. We further propose a three-stage training curriculum that achieves both broad generalization and high-fidelity reconstruction: (i) scalable pretraining on monocular video data covering over 1M identities to learn strong generalizable priors; (ii) multi-view fine-tuning on a small but high-quality dataset of 360-degree captures to enhance geometric fidelity and extreme-view awareness; and (iii) optional personalization that adapts to a specific identity for maximum fidelity within 500 optimization steps. Extensive experiments demonstrate that FFAvatar sets a new standard for identity preservation, geometric consistency, and animation fidelity. On the NeRSemble benchmark, it outperforms the state-of-the-art LAM by 5.5 dB in PSNR. FFAvatar is also practical to deploy in real time: it reconstructs an avatar in 2 seconds without personalization and 10 seconds with personalization, and supports 49 FPS animation on a single NVIDIA A100 GPU.
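
The abstract does not spell out how the Multi-View Query-Former fuses its inputs, so the following is only a minimal sketch of the general idea, assuming a generic query-based cross-attention: a fixed set of query vectors attends to the tokens of all source images pooled together, so the output size does not depend on how many images are supplied. The names `fuse_views`, `queries`, and `views` are hypothetical, and learned projections, multiple heads, and the Gaussian decoding step are omitted.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(queries, views):
    """Cross-attend each query to the tokens of all views pooled together.

    queries: list of Q vectors of length d (illustrative stand-ins for learned queries)
    views:   list of views, each a list of token vectors of length d
    Returns Q fused vectors, regardless of the number of views.
    """
    # Pool tokens across views so every query sees every source image.
    tokens = [tok for view in views for tok in view]
    d = len(queries[0])
    fused = []
    for q in queries:
        # Scaled dot-product attention scores between this query and all tokens.
        scores = [sum(qi * ti for qi, ti in zip(q, tok)) / math.sqrt(d) for tok in tokens]
        weights = softmax(scores)
        # Attention-weighted average of the token vectors.
        fused.append([sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(d)])
    return fused


def random_view(num_tokens, d):
    """Random stand-in for the feature tokens of one source image."""
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(num_tokens)]


random.seed(0)
d, num_queries = 8, 4
queries = [[random.gauss(0, 1) for _ in range(d)] for _ in range(num_queries)]
two_views = [random_view(6, d), random_view(6, d)]
four_views = two_views + [random_view(6, d), random_view(6, d)]

out2 = fuse_views(queries, two_views)
out4 = fuse_views(queries, four_views)
print(len(out2), len(out4))  # both equal num_queries
```

Because the attention weights are computed over the pooled token set, the result is also insensitive to the order in which the source images are given, which is one reason this style of fusion suits a variable number of input portraits.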