Multi-View Consistent 3D Gaussian Head Avatars without Multi-View Generation

Abstract

High-fidelity 3D Gaussian head avatar generation is critical for applications such as AR/VR, telepresence, and digital humans. Existing methods depend on multi-view datasets, 3D captures, or intermediate 2D view synthesis. In contrast, we learn both conditional and unconditional 3D head models from randomly sampled 2D images alone, without multi-view data, 3D supervision, or intermediate view generation. We introduce MVCHead, a single-shot state space model that regresses 3D Gaussians under these constraints while enforcing multi-view consistency (MVC) directly in the 3D representation. At its core is a Hierarchical State Space (HiSS) block that progressively refines Gaussians from coarse to fine while capturing long-range dependencies. Within each HiSS block, we replace Mamba's standard unidirectional scan with the proposed Hierarchical Bi-directional State Scan (HiBiSS), which aligns the recurrence with the axes along which multi-view inconsistencies are strongest. Finally, we design an SE(3) Multi-view Critic that judges whether a set of self-renders arises from a single underlying 3D configuration, rewarding cross-view pixel alignment without ever observing real multi-view pairs. MVCHead achieves state-of-the-art perceptual quality and surpasses prior methods in both texture and geometric consistency, while maintaining comparable shape consistency. To demonstrate scalability, we release FaceGS-10K, the first large-scale dataset of ready-to-use 3D Gaussian head assets for training and evaluating 3D head models. Project page and code: https://humansensinglab.github.io/MVCHead/