Mixture of Global and Local Experts with Diffusion Transformer for Controllable Face Generation

Abstract

Controllable face generation is a hard problem in generative modeling because it requires a delicate balance between semantic controllability and photorealism. Existing approaches struggle to disentangle semantic controls from the generation pipeline, so we revisit the architectural potential of Diffusion Transformers (DiTs) through the lens of expert specialization. This paper introduces Face-MoGLE, a novel framework featuring: (1) semantic-decoupled latent modeling through mask-conditioned space factorization, which enables precise attribute manipulation; (2) a mixture of global and local experts that captures holistic structure and region-level semantics for fine-grained controllability; and (3) a dynamic gating network that produces time-dependent coefficients, which evolve with the diffusion steps and spatial locations. Face-MoGLE provides a powerful and flexible solution for high-quality, controllable face generation, with strong potential in generative modeling and security applications. Extensive experiments demonstrate its effectiveness in both multimodal and monomodal face generation settings, as well as its robust zero-shot generalization capability. The project page is available at https://github.com/XavierJiezou/Face-MoGLE.
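To make points (2) and (3) concrete, the following is a minimal sketch of how a gating function could weight one global expert and several local experts per spatial token, with the weights also depending on the diffusion step. It is not the authors' implementation: the function names (`gate_and_fuse`, `timestep_embedding`), the linear scoring weights `w_gate` and `w_time`, the sinusoidal step embedding, and the softmax gating form are all assumptions chosen for illustration.

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def timestep_embedding(t, dim):
    # Sinusoidal embedding of the diffusion step (assumed form; dim must be even).
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

def gate_and_fuse(expert_feats, t, w_gate, w_time):
    """Fuse expert outputs with per-token, per-step gate weights.

    expert_feats[e][n] is the D-dimensional output of expert e (index 0 could be
    the global expert, the rest local region experts) at spatial token n.
    Returns (fused, gates): fused[n] is a D-vector, gates[n] sums to 1 over experts.
    """
    num_experts = len(expert_feats)
    num_tokens = len(expert_feats[0])
    dim = len(w_gate)
    t_emb = timestep_embedding(t, len(w_time))
    # Hypothetical time-dependent bias: one scalar per expert.
    time_bias = [sum(t_emb[k] * w_time[k][e] for k in range(len(w_time)))
                 for e in range(num_experts)]
    fused, gates = [], []
    for n in range(num_tokens):
        # Spatial dependence: each token scores every expert's feature at that token.
        logits = [sum(x * w for x, w in zip(expert_feats[e][n], w_gate)) + time_bias[e]
                  for e in range(num_experts)]
        g = softmax(logits)
        gates.append(g)
        fused.append([sum(g[e] * expert_feats[e][n][d] for e in range(num_experts))
                      for d in range(dim)])
    return fused, gates

# Toy usage with random features: 3 experts, 4 tokens, 5 channels, 6-dim step embedding.
random.seed(0)
E, N, D, T = 3, 4, 5, 6
feats = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(N)] for _ in range(E)]
w_gate = [random.gauss(0, 1) for _ in range(D)]
w_time = [[random.gauss(0, 1) for _ in range(E)] for _ in range(T)]
fused_early, gates_early = gate_and_fuse(feats, 900, w_gate, w_time)
fused_late, gates_late = gate_and_fuse(feats, 50, w_gate, w_time)
```

In this toy setup the gate weights differ from token to token and between the two diffusion steps, which is the behavior the abstract attributes to the dynamic gating network. The actual gating architecture in Face-MoGLE may differ substantially.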
