Mixture of Global and Local Experts with Diffusion Transformer for Controllable Face Generation

August 30, 2025
Authors: Xuechao Zou, Shun Zhang, Xing Fu, Yue Li, Kai Li, Yushe Cao, Congyan Lang, Pin Tao, Junliang Xing
cs.AI

Abstract

Controllable face generation poses critical challenges in generative modeling due to the intricate balance required between semantic controllability and photorealism. While existing approaches struggle with disentangling semantic controls from generation pipelines, we revisit the architectural potential of Diffusion Transformers (DiTs) through the lens of expert specialization. This paper introduces Face-MoGLE, a novel framework featuring: (1) Semantic-decoupled latent modeling through mask-conditioned space factorization, enabling precise attribute manipulation; (2) A mixture of global and local experts that captures holistic structure and region-level semantics for fine-grained controllability; (3) A dynamic gating network producing time-dependent coefficients that evolve with diffusion steps and spatial locations. Face-MoGLE provides a powerful and flexible solution for high-quality, controllable face generation, with strong potential in generative modeling and security applications. Extensive experiments demonstrate its effectiveness in multimodal and monomodal face generation settings and its robust zero-shot generalization capability. Project page is available at https://github.com/XavierJiezou/Face-MoGLE.
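As an illustration of the mixture-of-experts design described in the abstract, the sketch below shows one plausible way a single global expert, several region-level local experts, and a gating network conditioned on token features and the diffusion timestep could be combined. All class, argument, and variable names here are assumptions introduced for illustration; this is a minimal sketch, not the authors' implementation of Face-MoGLE.

```python
import torch
import torch.nn as nn


class MixtureOfGlobalLocalExperts(nn.Module):
    """Illustrative global/local expert mixture with a gating network
    conditioned on per-token features and the diffusion timestep.
    Names and shapes are assumptions, not the paper's actual code."""

    def __init__(self, dim: int, num_local_experts: int):
        super().__init__()
        # One global expert applied to every token (holistic structure).
        self.global_expert = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )
        # Region-level experts, e.g. one per semantic mask region.
        self.local_experts = nn.ModuleList(
            [
                nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
                for _ in range(num_local_experts)
            ]
        )
        # Gate maps [token feature, timestep embedding] to expert weights.
        self.gate = nn.Linear(dim + dim, 1 + num_local_experts)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) latent tokens from the DiT backbone.
        # t_emb: (batch, dim) diffusion timestep embedding.
        t = t_emb.unsqueeze(1).expand(-1, x.size(1), -1)
        # Per-token, per-step mixing coefficients over all experts.
        weights = torch.softmax(self.gate(torch.cat([x, t], dim=-1)), dim=-1)
        outputs = [self.global_expert(x)] + [e(x) for e in self.local_experts]
        stacked = torch.stack(outputs, dim=-1)            # (B, N, D, experts)
        return (stacked * weights.unsqueeze(2)).sum(dim=-1)  # weighted sum
```

Because the gate sees both the per-token features and the timestep embedding, the mixing coefficients in this sketch vary across spatial locations and evolve over diffusion steps, mirroring the behavior the abstract attributes to the dynamic gating network.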