Multi-view Consistent 3D Gaussian Head Avatars 'without' Multi-view Generation

Abstract

High-fidelity 3D Gaussian head avatar generation is critical for applications such as AR/VR, telepresence, and digital humans. Existing methods depend on multi-view datasets, 3D captures, or intermediate 2D view synthesis. In contrast, we learn both conditional and unconditional 3D head models from randomly sampled 2D images alone, without multi-view data, 3D supervision, or intermediate view generation. We introduce MVCHead, a single-shot state space model that regresses 3D Gaussians under these constraints while enforcing multi-view consistency (MVC) directly in the 3D representation. At its core is a Hierarchical State Space (HiSS) block that progressively refines Gaussians from coarse to fine while capturing long-range dependencies. Within each HiSS block, we replace Mamba's standard unidirectional scan with the proposed Hierarchical Bi-directional State Scan (HiBiSS), which aligns the recurrence with the axes along which multi-view inconsistencies are strongest. Finally, we design an SE(3) Multi-view Critic that judges whether a set of self-renders arises from a single underlying 3D configuration, rewarding cross-view pixel alignment without ever observing real multi-view pairs. MVCHead achieves state-of-the-art perceptual quality, surpasses prior methods in both texture and geometric consistency, and maintains comparable shape consistency. To demonstrate scalability, we release FaceGS-10K, the first large-scale dataset of ready-to-use 3D Gaussian head assets for training and evaluating 3D head models. Project page and code: https://humansensinglab.github.io/MVCHead/
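As a rough illustration of why a bidirectional scan differs from a unidirectional one, the sketch below runs a scalar linear recurrence over a token sequence in both directions and sums the results. It is a generic toy example, not the HiBiSS scan itself: the function names, the fixed coefficients `a` and `b`, and the choice to sum the two passes are all assumptions made for illustration, and the hierarchical, axis-aligned ordering described in the abstract is not modelled here.

```python
from typing import List


def linear_scan(xs: List[float], a: float, b: float) -> List[float]:
    """Causal scan of h_t = a * h_{t-1} + b * x_t, starting from h = 0.

    Each output depends only on the current and earlier inputs.
    """
    h = 0.0
    out = []
    for x in xs:
        h = a * h + b * x
        out.append(h)
    return out


def bidirectional_scan(xs: List[float], a: float, b: float) -> List[float]:
    """Hypothetical bidirectional variant: sum of a forward and a backward scan.

    The backward pass scans the reversed sequence and is flipped back, so
    every position also receives information from later positions.
    """
    forward = linear_scan(xs, a, b)
    backward = linear_scan(xs[::-1], a, b)[::-1]
    return [f + g for f, g in zip(forward, backward)]


if __name__ == "__main__":
    tokens = [0.0, 0.0, 1.0]  # only the last token carries signal
    print(linear_scan(tokens, a=0.5, b=1.0))         # first position sees nothing
    print(bidirectional_scan(tokens, a=0.5, b=1.0))  # first position sees the last token
```

In the unidirectional pass, the first position is unaffected by the signal at the end of the sequence. In the bidirectional pass it receives a decayed contribution from that signal, which is the kind of two-way dependency a bidirectional scan is meant to capture.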