FFAvatar: Few-Shot, Feed-Forward, and Generalizable Avatar Reconstruction

Abstract

Avatar reconstruction has traditionally relied either on per-subject optimization that takes hours of computation or on expensive preprocessing that limits scalability. We introduce FFAvatar, a generalizable feed-forward framework that reconstructs high-quality, animatable 3D Gaussian head avatars from a few unposed portrait images in seconds. FFAvatar fuses information from multiple source images into a unified canonical Gaussian representation through a Multi-View Query-Former. This representation is animated via FLAME parameters that are predicted end-to-end directly from pixels, which eliminates the overhead of offline FLAME extraction. We further propose a three-stage training curriculum that achieves both broad generalization and high-fidelity reconstruction: (i) scalable pretraining on large-scale monocular video data covering over 1M identities, to learn strong generalizable priors; (ii) multi-view fine-tuning on a small but high-quality dataset of 360-degree captures, to improve geometric fidelity and robustness to extreme viewpoints; and (iii) optional personalization, which adapts the model to a specific identity within 500 optimization steps for maximum fidelity. Extensive experiments demonstrate that FFAvatar sets a new standard for identity preservation, geometric consistency, and animation fidelity. On the NeRSemble benchmark, it outperforms the state-of-the-art LAM by 5.5 dB in PSNR. Furthermore, FFAvatar enables real-time deployment: it reconstructs an avatar in 2 seconds without personalization and in 10 seconds with personalization, and it supports animation at 49 FPS on a single NVIDIA A100 GPU.
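To put the reported PSNR margin in perspective, the following minimal sketch computes PSNR and converts a PSNR gain into the corresponding reduction in mean squared error. It is not the evaluation code used for the benchmark. The function names `psnr` and `mse_ratio_for_gain` are hypothetical, pixel values are assumed to be normalized to [0, 1], and the gain is assumed to be expressed in dB.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    Assumption: pixel values lie in [0, max_value].
    """
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)
```

Under these assumptions, `mse_ratio_for_gain(5.5)` is about 3.55, so a 5.5 dB gain corresponds to roughly 3.5 times lower mean squared error on the same image. When PSNR is averaged over many images, this conversion holds only approximately.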