Multi-view Consistent 3D Gaussian Head Avatars "without" Multi-view Generation

Abstract

High-fidelity 3D Gaussian head avatar generation is critical for applications such as AR/VR, telepresence, and digital humans. Existing methods depend on multi-view datasets, 3D captures, or intermediate 2D view synthesis. In contrast, we learn both conditional and unconditional 3D head models from randomly sampled 2D images alone, without multi-view data, 3D supervision, or intermediate view generation. We introduce MVCHead, a single-shot state space model that enforces multi-view consistency (MVC) directly in the 3D representation while regressing 3D Gaussians under these constraints. At its core is a Hierarchical State Space (HiSS) block that progressively refines Gaussians from coarse to fine while capturing long-range dependencies. Within each HiSS block, we replace Mamba's standard unidirectional scan with the proposed Hierarchical Bi-directional State Scan (HiBiSS), which aligns recurrence with the axes along which multi-view inconsistencies are strongest. Finally, we design an SE(3) Multi-view Critic that judges whether a set of self-renders arises from a single underlying 3D configuration, rewarding cross-view pixel alignment without observing real multi-view pairs. MVCHead achieves state-of-the-art perceptual quality, surpasses prior methods in both texture and geometric consistency, and maintains comparable shape consistency. To demonstrate scalability, we release FaceGS-10K, the first large-scale dataset of ready-to-use 3D Gaussian head assets for training and evaluation of 3D head models. Project page and code: https://humansensinglab.github.io/MVCHead/
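
As a rough illustration of what a bi-directional state scan means in general, the sketch below runs a simple linear recurrence over a sequence in both directions and sums the two outputs, so each position sees context from both sides. It is not the authors' HiBiSS: the function names `linear_scan` and `bidirectional_scan` and the fixed scalar parameters `a`, `b`, `c` are assumptions made for this example, whereas Mamba uses input-dependent (selective) parameters, and the hierarchy and axis alignment described in the abstract are not modeled here.

```python
from typing import List


def linear_scan(xs: List[float], a: float, b: float, c: float) -> List[float]:
    """Run h_t = a * h_{t-1} + b * x_t from left to right and emit y_t = c * h_t.

    a, b, c are fixed scalars here (an assumption for illustration only).
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def bidirectional_scan(xs: List[float], a: float, b: float, c: float) -> List[float]:
    """Sum a left-to-right scan and a right-to-left scan of the same sequence."""
    forward = linear_scan(xs, a, b, c)
    # Scan the reversed sequence, then flip the result back to the original order.
    backward = linear_scan(xs[::-1], a, b, c)[::-1]
    return [f + g for f, g in zip(forward, backward)]


if __name__ == "__main__":
    print(bidirectional_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=1.0))
```

With a unidirectional scan, the first position could never be influenced by later ones; summing the two directions removes that asymmetry, which is the general motivation for bi-directional scanning in state space models.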