SpatialAvatar-0: High-Quality 4D Head Avatars via Multi-Stage Reconstruction

Abstract

High-quality 4D head avatars built from one or a few source portraits are central to telepresence, AR/VR, and digital-human interaction. 3D Gaussian Splatting (3DGS) has emerged as the dominant representation, with two complementary regimes maturing in parallel: generalizable feed-forward predictors and per-subject refiners. However, existing feed-forward predictors are trained on a single dataset family with a hard-coded source count, and so inherit that domain's bias. Per-subject refiners require 300K to 600K iterations and rely on adaptive densification, which destroys upstream Gaussian layouts and prevents the two regimes from sharing a representation end to end. To bridge the two regimes, we propose SpatialAvatar-0, built on a shared Gaussian representation bound to the FLAME mesh. Its feed-forward generator combines a parameter-free K-source mean-pool with a two-phase training schedule, moving from monocular-temporal to multi-view-spatial data, that anchors the identity prior against collapsing onto the smaller multi-view set. We further introduce a 10K-iteration, layout-preserving per-subject refinement loop that freezes the FLAME binding and the Gaussian count and replaces densification with a three-component anti-spike regularization. In cross-domain zero-shot evaluation on VFHQ/HDTF, we surpass the in-domain leader GAGAvatar by +1.5 dB PSNR despite never training on either test domain. On the SplattingAvatar monocular benchmark, we lead on every reported metric and surpass the 300K-iteration GeoAvatar by +1.3 dB PSNR, with a per-subject schedule up to 60x shorter than common state-of-the-art baselines. Website: https://spatialwalk.github.io/SpatialAvatar-0.
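
As an illustration of what a parameter-free K-source mean-pool can look like, the sketch below averages K per-source feature vectors into one fused vector. It is a minimal sketch under stated assumptions, not the SpatialAvatar-0 implementation: the function name `mean_pool_sources` and the flat per-source feature layout are hypothetical and do not come from the abstract. The point it shows is that averaging has no learned weights, so the same code accepts any K >= 1 and does not depend on source order.

```python
from typing import List, Sequence


def mean_pool_sources(source_features: Sequence[Sequence[float]]) -> List[float]:
    """Average K per-source feature vectors into one fused vector.

    Hypothetical helper for illustration only; the feature layout is an
    assumption. No learned parameters are involved, so any K >= 1 works.
    """
    if not source_features:
        raise ValueError("need at least one source")
    dim = len(source_features[0])
    if any(len(f) != dim for f in source_features):
        raise ValueError("all source feature vectors must have the same length")
    k = len(source_features)
    # Element-wise mean over the K sources.
    return [sum(f[i] for f in source_features) / k for i in range(dim)]


# K = 1: the pooled feature is the single source feature itself.
one_source = mean_pool_sources([[1.0, 2.0, 3.0]])

# K = 3: the same function, with no change in parameters.
three_sources = mean_pool_sources(
    [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 2.0, 2.0]]
)
```

Because the pooling step carries no weights tied to a particular K, a generator using it could in principle be queried with a different number of source portraits than it saw during training, which is the contrast with the hard-coded source count noted above.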