MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation
March 30, 2026
Authors: Bharath Krishnamurthy, Ajita Rattani
cs.AI
Abstract
Recent multimodal face generation models address the spatial control limitations of text-to-image diffusion models by augmenting text-based conditioning with spatial priors such as segmentation masks, sketches, or edge maps. This multimodal fusion enables controllable synthesis aligned with both high-level semantic intent and low-level structural layout. However, most existing approaches typically extend pre-trained text-to-image pipelines by appending auxiliary control modules or stitching together separate uni-modal networks. These ad hoc designs inherit architectural constraints, duplicate parameters, and often fail under conflicting modalities or mismatched latent spaces, limiting their ability to perform synergistic fusion across semantic and spatial domains. We introduce MMFace-DiT, a unified dual-stream diffusion transformer engineered for synergistic multimodal face synthesis. Its core novelty lies in a dual-stream transformer block that processes spatial (mask/sketch) and semantic (text) tokens in parallel, deeply fusing them through a shared Rotary Position-Embedded (RoPE) Attention mechanism. This design prevents modal dominance and ensures strong adherence to both text and structural priors to achieve unprecedented spatial-semantic consistency for controllable face generation. Furthermore, a novel Modality Embedder enables a single cohesive model to dynamically adapt to varying spatial conditions without retraining. MMFace-DiT achieves a 40% improvement in visual fidelity and prompt alignment over six state-of-the-art multimodal face generation models, establishing a flexible new paradigm for end-to-end controllable generative modeling. The code and dataset are available on our project page: https://vcbsl.github.io/MMFace-DiT/
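The abstract's core mechanism is a dual-stream block in which spatial (mask/sketch) and semantic (text) tokens are fused through a shared Rotary Position-Embedded (RoPE) attention. The paper's actual implementation is not reproduced here; the following is a minimal single-head NumPy sketch of the shared-attention idea only. The weight matrices `Wq`, `Wk`, `Wv`, the shared-projection simplification, and the 1-D position indexing are illustrative assumptions, not the authors' design.

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq, dim), dim even."""
    seq, dim = x.shape
    half = dim // 2
    freqs = 1.0 / (base ** (np.arange(half) / half))   # (half,)
    angles = np.outer(np.arange(seq), freqs)           # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dual_stream_attention(spatial_tokens, text_tokens, Wq, Wk, Wv):
    """Shared attention over both streams: tokens from the spatial and
    semantic streams are concatenated, so every query attends over the
    full joint sequence (no single modality can be ignored), then the
    output is split back into the two streams."""
    tokens = np.concatenate([spatial_tokens, text_tokens], axis=0)
    q = rope(tokens @ Wq)                 # rotary embedding on queries
    k = rope(tokens @ Wk)                 # and keys, shared across streams
    v = tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    out = attn @ v
    n_sp = spatial_tokens.shape[0]
    return out[:n_sp], out[n_sp:]         # per-stream outputs
```

In a full dual-stream block, each stream would keep its own projection and feed-forward weights, with only the attention map shared; the sketch collapses those per-stream weights into one set purely for brevity.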