MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation
March 30, 2026
Authors: Bharath Krishnamurthy, Ajita Rattani
cs.AI
Abstract
Recent multimodal face generation models address the spatial control limitations of text-to-image diffusion models by augmenting text-based conditioning with spatial priors such as segmentation masks, sketches, or edge maps. This multimodal fusion enables controllable synthesis aligned with both high-level semantic intent and low-level structural layout. However, most existing approaches typically extend pre-trained text-to-image pipelines by appending auxiliary control modules or stitching together separate uni-modal networks. These ad hoc designs inherit architectural constraints, duplicate parameters, and often fail under conflicting modalities or mismatched latent spaces, limiting their ability to perform synergistic fusion across semantic and spatial domains. We introduce MMFace-DiT, a unified dual-stream diffusion transformer engineered for synergistic multimodal face synthesis. Its core novelty lies in a dual-stream transformer block that processes spatial (mask/sketch) and semantic (text) tokens in parallel, deeply fusing them through a shared Rotary Position-Embedded (RoPE) Attention mechanism. This design prevents modal dominance and ensures strong adherence to both text and structural priors to achieve unprecedented spatial-semantic consistency for controllable face generation. Furthermore, a novel Modality Embedder enables a single cohesive model to dynamically adapt to varying spatial conditions without retraining. MMFace-DiT achieves a 40% improvement in visual fidelity and prompt alignment over six state-of-the-art multimodal face generation models, establishing a flexible new paradigm for end-to-end controllable generative modeling. The code and dataset are available on our project page: https://vcbsl.github.io/MMFace-DiT/
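The abstract's central idea, fusing spatial and semantic token streams through a single attention operation with shared rotary position embeddings, can be illustrated in miniature. The sketch below is an illustrative single-head, NumPy-only approximation of that concept, not the paper's implementation; all function names, shapes, and the absence of multi-head projections, normalization, and MLP sublayers are simplifying assumptions.

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embedding to a (seq_len, dim) array, dim even."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = 1.0 / (base ** (np.arange(half) / half))   # per-pair rotation frequencies
    angles = np.arange(seq_len)[:, None] * freqs[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # rotate each (x1, x2) pair by its position-dependent angle
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def shared_rope_attention(spatial, text, Wq, Wk, Wv):
    """Joint attention over concatenated spatial (mask/sketch) and semantic
    (text) tokens, with RoPE applied to queries and keys of both streams.
    Returns the updated tokens split back into their two streams."""
    tokens = np.concatenate([spatial, text], axis=0)   # one shared sequence
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    q, k = rope(q), rope(k)                            # shared positional scheme
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over all tokens
    out = weights @ v
    n_spatial = spatial.shape[0]
    return out[:n_spatial], out[n_spatial:]

# Toy usage: 4 spatial tokens and 3 text tokens of width 8.
rng = np.random.default_rng(0)
d = 8
spatial = rng.standard_normal((4, d))
text = rng.standard_normal((3, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
s_out, t_out = shared_rope_attention(spatial, text, Wq, Wk, Wv)
```

Because both streams attend over the same concatenated sequence under one softmax, neither modality can be updated in isolation, which is one plausible reading of how the design discourages modal dominance.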