Mixture of Global and Local Experts with Diffusion Transformer for Controllable Face Generation
August 30, 2025
Authors: Xuechao Zou, Shun Zhang, Xing Fu, Yue Li, Kai Li, Yushe Cao, Congyan Lang, Pin Tao, Junliang Xing
cs.AI
Abstract
Controllable face generation poses critical challenges in generative modeling
due to the intricate balance required between semantic controllability and
photorealism. While existing approaches struggle with disentangling semantic
controls from generation pipelines, we revisit the architectural potential of
Diffusion Transformers (DiTs) through the lens of expert specialization. This
paper introduces Face-MoGLE, a novel framework featuring: (1)
Semantic-decoupled latent modeling through mask-conditioned space
factorization, enabling precise attribute manipulation; (2) A mixture of global
and local experts that captures holistic structure and region-level semantics
for fine-grained controllability; (3) A dynamic gating network producing
time-dependent coefficients that evolve with diffusion steps and spatial
locations. Face-MoGLE provides a powerful and flexible solution for
high-quality, controllable face generation, with strong potential in generative
modeling and security applications. Extensive experiments demonstrate its
effectiveness in multimodal and monomodal face generation settings and its
robust zero-shot generalization capability. Project page is available at
https://github.com/XavierJiezou/Face-MoGLE.
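To make the abstract's third component concrete, the sketch below illustrates the general idea of a dynamic gating network that mixes one global and several local expert outputs with coefficients depending on both the diffusion timestep and the spatial position. This is a minimal NumPy illustration of the generic mechanism, not the paper's implementation: the function names, shapes, and the choice of a single linear gating layer over `[token, t]` are all assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mixture_of_experts(tokens, t, experts, W_gate):
    """Combine expert outputs with per-token, time-dependent coefficients.

    tokens:  (N, D) latent tokens, one per spatial position (hypothetical layout)
    t:       scalar diffusion timestep, e.g. normalized to [0, 1]
    experts: list of K callables mapping (N, D) -> (N, D); by convention here,
             experts[0] plays the role of the global expert, the rest local ones
    W_gate:  (D + 1, K) gating weights; the gate sees [token, t], so the mixing
             coefficients vary with both spatial location and diffusion step
    """
    N, D = tokens.shape
    gate_in = np.concatenate([tokens, np.full((N, 1), t)], axis=1)  # (N, D+1)
    coeffs = softmax(gate_in @ W_gate)                              # (N, K), rows sum to 1
    outs = np.stack([e(tokens) for e in experts], axis=1)           # (N, K, D)
    return (coeffs[:, :, None] * outs).sum(axis=1)                  # (N, D)

# Toy usage with random linear experts (purely illustrative).
rng = np.random.default_rng(0)
N, D, K = 4, 8, 3
experts = [(lambda W: (lambda x: x @ W))(rng.standard_normal((D, D)))
           for _ in range(K)]
W_gate = rng.standard_normal((D + 1, K))
tokens = rng.standard_normal((N, D))
out = mixture_of_experts(tokens, t=0.5, experts=experts, W_gate=W_gate)
```

Because the timestep is part of the gate input, re-evaluating the gate at each denoising step yields coefficients that evolve over the diffusion trajectory, matching the behavior the abstract attributes to the dynamic gating network.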