MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation

Abstract

Recent multimodal face generation models address the spatial control limitations of text-to-image diffusion models by augmenting text-based conditioning with spatial priors such as segmentation masks, sketches, or edge maps. This multimodal fusion enables controllable synthesis that follows both high-level semantic intent and low-level structural layout. However, most existing approaches extend pre-trained text-to-image pipelines by appending auxiliary control modules or by stitching together separate uni-modal networks. These ad hoc designs inherit architectural constraints, duplicate parameters, and often fail when modalities conflict or latent spaces are mismatched, which limits synergistic fusion across the semantic and spatial domains. We introduce MMFace-DiT, a unified dual-stream diffusion transformer designed for synergistic multimodal face synthesis. Its core novelty is a dual-stream transformer block that processes spatial (mask/sketch) tokens and semantic (text) tokens in parallel and fuses them deeply through a shared Rotary Position-Embedded (RoPE) attention mechanism. This design prevents any single modality from dominating and enforces strong adherence to both the text and the structural priors, achieving unprecedented spatial-semantic consistency for controllable face generation. Furthermore, a novel Modality Embedder lets a single unified model adapt dynamically to different spatial conditions without retraining. MMFace-DiT achieves a 40% improvement in visual fidelity and prompt alignment over six state-of-the-art multimodal face generation models, establishing a flexible new paradigm for end-to-end controllable generative modeling. The code and dataset are available on the project page: https://vcbsl.github.io/MMFace-DiT/
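
To make the dual-stream idea concrete, the following is a minimal sketch of one joint attention step over two token streams, written in plain Python. It is not the authors' implementation: the abstract only states that spatial and text tokens are processed in parallel and fused through shared RoPE attention. The function names (`rope`, `dual_stream_attention`), the per-stream projection dictionaries, the single attention head, and the use of one running position index over the concatenated sequence are all assumptions made for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rope(x, positions, base=10000.0):
    """Rotary position embedding: rotate each consecutive feature pair
    of a token by an angle that depends on the token position."""
    out = []
    for vec, pos in zip(x, positions):
        rotated = []
        for i in range(0, len(vec), 2):
            theta = pos * base ** (-i / len(vec))
            c, s = math.cos(theta), math.sin(theta)
            rotated += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
        out.append(rotated)
    return out


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dual_stream_attention(spatial, text, w_spatial, w_text):
    """One shared attention step over two streams.

    Assumption: each stream keeps its own Q/K/V projections, while the
    attention itself runs over the concatenation of both streams, so
    every token can attend to tokens of either modality.
    """
    dim = len(spatial[0])
    q, k, v = [], [], []
    for tokens, weights in ((spatial, w_spatial), (text, w_text)):
        q += matmul(tokens, weights["q"])
        k += matmul(tokens, weights["k"])
        v += matmul(tokens, weights["v"])
    # Assumption: a single running index over the joint sequence.
    positions = list(range(len(q)))
    q, k = rope(q, positions), rope(k, positions)
    scale = 1.0 / math.sqrt(dim)
    scores = [[sum(a * b for a, b in zip(qi, kj)) * scale for kj in k] for qi in q]
    attn = [softmax(row) for row in scores]
    fused = matmul(attn, v)
    # Split the fused sequence back into the two streams.
    return fused[: len(spatial)], fused[len(spatial):]


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights or token features."""
    return [[rng.gauss(0.0, 0.2) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim = 8
spatial = random_matrix(4, dim, rng)  # e.g. 4 mask/sketch tokens
text = random_matrix(3, dim, rng)     # e.g. 3 text tokens
w_spatial = {name: random_matrix(dim, dim, rng) for name in "qkv"}
w_text = {name: random_matrix(dim, dim, rng) for name in "qkv"}
spatial_out, text_out = dual_stream_attention(spatial, text, w_spatial, w_text)
print(len(spatial_out), len(text_out))  # token counts are preserved: 4 3
```

Keeping separate projections per stream while sharing the attention is one way to let each modality retain its own parameters and still exchange information at every block. Whether MMFace-DiT uses exactly this layout should be checked against the released code on the project page.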

