SpatialAvatar-0: High-Quality 4D Head Avatars via Multi-Stage Reconstruction

Abstract

High-quality 4D head avatars built from one or a few source portraits are central to telepresence, AR/VR, and digital-human interaction. 3D Gaussian Splatting (3DGS) has emerged as the dominant representation, with two complementary regimes maturing in parallel: generalizable feed-forward predictors and per-subject refiners. However, existing feed-forward predictors are trained on a single dataset family with a hard-coded source count, and so inherit that family's domain bias. Per-subject refiners require 300K to 600K iterations and rely on adaptive densification, which destroys upstream Gaussian layouts and prevents the two regimes from sharing a representation end-to-end. To bridge the two regimes, we propose SpatialAvatar-0, built on a shared FLAME-mesh-bound Gaussian representation. Its feed-forward generator combines a parameter-free K-source mean-pool with a two-phase schedule, monocular-temporal followed by multi-view-spatial, that anchors against identity-prior collapse onto the smaller multi-view set. We further introduce a layout-preserving per-subject refinement loop of 10K iterations that freezes the FLAME binding and the Gaussian count and replaces densification with a three-component anti-spike regularization. In cross-domain zero-shot evaluation on VFHQ/HDTF, we surpass the in-domain leader GAGAvatar by +1.5 dB PSNR despite never training on either test domain. On the SplattingAvatar monocular benchmark, we lead every reported metric and surpass the 300K-iteration GeoAvatar by +1.3 dB PSNR, with a per-subject schedule up to 60x shorter than common SOTA baselines. Website: https://spatialwalk.github.io/SpatialAvatar-0.
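
To illustrate why a mean-pool over K sources needs no learned parameters and no fixed source count, the following is a minimal sketch in plain Python. It is not the paper's implementation: the function name `mean_pool_sources`, the choice of flat per-source feature vectors as the thing being pooled, and the toy values are all assumptions made for illustration. The abstract does not state which quantities are pooled.

```python
from typing import List, Sequence


def mean_pool_sources(per_source_features: Sequence[Sequence[float]]) -> List[float]:
    """Average K per-source feature vectors into one vector of the same length.

    Hypothetical helper: the pooled quantity (a flat feature vector per
    source portrait) is an assumption, not taken from the paper.
    """
    k = len(per_source_features)
    if k == 0:
        raise ValueError("need at least one source")
    dim = len(per_source_features[0])
    if any(len(f) != dim for f in per_source_features):
        raise ValueError("all sources must share one feature dimension")
    # Element-wise sum across sources divided by K. There are no learned
    # weights, so the same function accepts any K at inference time.
    return [sum(f[i] for f in per_source_features) / k for i in range(dim)]


# Toy features for one source and for three sources (made-up numbers).
one = mean_pool_sources([[1.0, 2.0, 3.0]])
three = mean_pool_sources([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [5.0, 6.0, 7.0]])
print(one)    # [1.0, 2.0, 3.0]
print(three)  # [3.0, 4.0, 5.0]
```

In this sketch the output dimension is independent of K, and the result does not depend on the order of the sources up to floating-point rounding, which is what allows a single generator to accept one or several portraits without a hard-coded source count.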