Mixture of Global and Local Experts with Diffusion Transformer for Controllable Face Generation
August 30, 2025
Authors: Xuechao Zou, Shun Zhang, Xing Fu, Yue Li, Kai Li, Yushe Cao, Congyan Lang, Pin Tao, Junliang Xing
cs.AI
Abstract
Controllable face generation poses critical challenges in generative modeling
due to the intricate balance required between semantic controllability and
photorealism. While existing approaches struggle with disentangling semantic
controls from generation pipelines, we revisit the architectural potential of
Diffusion Transformers (DiTs) through the lens of expert specialization. This
paper introduces Face-MoGLE, a novel framework featuring: (1)
Semantic-decoupled latent modeling through mask-conditioned space
factorization, enabling precise attribute manipulation; (2) A mixture of global
and local experts that captures holistic structure and region-level semantics
for fine-grained controllability; (3) A dynamic gating network producing
time-dependent coefficients that evolve with diffusion steps and spatial
locations. Face-MoGLE provides a powerful and flexible solution for
high-quality, controllable face generation, with strong potential in generative
modeling and security applications. Extensive experiments demonstrate its
effectiveness in multimodal and monomodal face generation settings and its
robust zero-shot generalization capability. Project page is available at
https://github.com/XavierJiezou/Face-MoGLE.
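To make the abstract's third component concrete, the sketch below illustrates the general idea of a dynamic gating network that mixes one global and several local expert outputs with coefficients depending on both the diffusion timestep and the spatial position. This is a minimal NumPy illustration of the generic mechanism, not the paper's implementation: the function names, shapes, and the choice of a single linear gating layer over `[token, t]` are all assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mixture_of_experts(tokens, t, experts, W_gate):
    """Combine expert outputs with per-token, time-dependent coefficients.

    tokens:  (N, D) latent tokens, one per spatial position (hypothetical layout)
    t:       scalar diffusion timestep, e.g. normalized to [0, 1]
    experts: list of K callables mapping (N, D) -> (N, D); by convention here,
             experts[0] plays the role of the global expert, the rest local ones
    W_gate:  (D + 1, K) gating weights; the gate sees [token, t], so the mixing
             coefficients vary with both spatial location and diffusion step
    """
    N, D = tokens.shape
    gate_in = np.concatenate([tokens, np.full((N, 1), t)], axis=1)  # (N, D+1)
    coeffs = softmax(gate_in @ W_gate)                              # (N, K), rows sum to 1
    outs = np.stack([e(tokens) for e in experts], axis=1)           # (N, K, D)
    return (coeffs[:, :, None] * outs).sum(axis=1)                  # (N, D)

# Toy usage with random linear experts (purely illustrative).
rng = np.random.default_rng(0)
N, D, K = 4, 8, 3
experts = [(lambda W: (lambda x: x @ W))(rng.standard_normal((D, D)))
           for _ in range(K)]
W_gate = rng.standard_normal((D + 1, K))
tokens = rng.standard_normal((N, D))
out = mixture_of_experts(tokens, t=0.5, experts=experts, W_gate=W_gate)
```

Because the timestep is part of the gate input, re-evaluating the gate at each denoising step yields coefficients that evolve over the diffusion trajectory, matching the behavior the abstract attributes to the dynamic gating network.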