FFAvatar: Few-Shot, Feed-Forward, Generalizable Avatar Reconstruction

Abstract

Avatar reconstruction has traditionally relied either on per-subject optimization that requires hours of computation or on expensive preprocessing that limits scalability. We introduce FFAvatar, a generalizable feed-forward framework that reconstructs high-quality, animatable 3D Gaussian head avatars from a few unposed portrait images in seconds. FFAvatar fuses information from multiple source images into a unified canonical Gaussian representation through a Multi-View Query-Former. This representation is animated via FLAME parameters predicted end-to-end directly from pixels, which eliminates the overhead of offline FLAME extraction. We further propose a three-stage training curriculum that achieves both broad generalization and high-fidelity reconstruction: (i) scalable pretraining on extensive monocular video data covering over 1M identities to learn strong generalizable priors; (ii) multi-view fine-tuning on a small but high-quality dataset of 360-degree captures to improve geometric fidelity and extreme-view awareness; and (iii) optional personalization that adapts to a specific identity within 500 optimization steps for maximum fidelity. Extensive experiments demonstrate that FFAvatar sets a new standard for identity preservation, geometric consistency, and animation fidelity. On the NeRSemble benchmark, it outperforms the state-of-the-art LAM by 5.5 dB in PSNR. Furthermore, FFAvatar enables real-time deployment: it reconstructs an avatar in 2 seconds without personalization and in 10 seconds with personalization, and it supports animation at 49 FPS on a single NVIDIA A100 GPU.
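
To give a sense of scale for the reported PSNR gain, the following is a minimal sketch of the standard PSNR definition, PSNR = 10 log10(MAX^2 / MSE), together with the mean-squared-error ratio implied by a gain of a given size. It is an illustration only and not the paper's evaluation code: the helper names (`mse`, `psnr`, `mse_ratio_for_gain`), the toy pixel values, and the assumption that intensities lie in [0, 1] are all hypothetical.

```python
import math


def mse(a, b):
    """Mean squared error between two equally sized flat pixel sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def psnr(a, b, max_val=1.0):
    """PSNR in dB for intensities in [0, max_val] (range is an assumption)."""
    err = mse(a, b)
    if err == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / err)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by gain_db at fixed max_val."""
    return 10.0 ** (gain_db / 10.0)


# Toy data (hypothetical): every rendered pixel is off by about 0.1.
target = [0.0, 0.5, 1.0, 0.25]
render = [0.1, 0.4, 0.9, 0.35]

print(psnr(target, render))     # close to 20 dB, since MSE is about 0.01
print(mse_ratio_for_gain(5.5))  # about 3.55
```

Under this definition, a 5.5 dB gain on a single image corresponds to roughly a 3.5-fold reduction in mean squared error. A benchmark score is typically an average of per-image PSNR values, so the aggregate error reduction need not match this ratio exactly.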