FFAvatar: Few-Shot, Feed-Forward, and Generalizable Avatar Reconstruction

Abstract

Head avatar reconstruction has traditionally relied either on per-subject optimization, which takes hours of computation, or on expensive preprocessing, which limits scalability. We introduce FFAvatar, a generalizable feed-forward framework that reconstructs high-quality, animatable 3D Gaussian head avatars from a few unposed portrait images in seconds. FFAvatar fuses information from multiple source images into a unified canonical Gaussian representation through a Multi-View Query-Former. The representation is animated with FLAME parameters predicted end-to-end directly from pixels, which eliminates the overhead of offline FLAME extraction. We further propose a three-stage training curriculum that achieves both broad generalization and high-fidelity reconstruction: (i) scalable pretraining on extensive monocular video data covering over 1M identities, to learn strong generalizable priors; (ii) multi-view fine-tuning on a small but high-quality dataset of 360-degree captures, to improve geometric fidelity and awareness of extreme viewpoints; and (iii) an optional personalization step that adapts to a specific identity within 500 optimization steps for maximum fidelity. Extensive experiments demonstrate that FFAvatar sets a new standard for identity preservation, geometric consistency, and animation fidelity. On the NeRSemble benchmark, it outperforms the state-of-the-art method LAM by 5.5 dB in PSNR. FFAvatar also supports real-time deployment: it reconstructs an avatar in 2 seconds without personalization and in 10 seconds with personalization, and it renders animation at 49 FPS on a single NVIDIA A100 GPU.
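To put the reported 5.5 dB PSNR gain in perspective, the short sketch below converts a PSNR difference into the corresponding reduction in mean squared error. It assumes the standard PSNR definition with pixel values normalized to [0, 1]; the abstract does not state the evaluation protocol, so the function names and the normalization are illustrative assumptions, not the authors' code.

```python
import math


def psnr(mse: float, max_value: float = 1.0) -> float:
    """PSNR in dB for a given mean squared error.

    Assumes pixel values lie in [0, max_value] (hypothetical normalization).
    """
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which the MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)


if __name__ == "__main__":
    ratio = mse_ratio_for_gain(5.5)
    print(f"A 5.5 dB PSNR gain corresponds to an MSE about {ratio:.2f}x lower")
```

Under this definition, a 5.5 dB gain means the mean squared error is roughly 3.5 times lower than the baseline's, whatever the baseline's absolute PSNR is.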