UniDDT: Unifying Multimodal Understanding and Generation with a Decoupled Diffusion Transformer

Abstract

Unified Multimodal Models (UMMs), which integrate understanding and generation in a single framework, have emerged as a key direction for general-purpose multimodal intelligence. However, existing UMMs face three prominent challenges: (1) inherent learning conflicts between visual understanding and generation, which lead to suboptimal modeling of both tasks; (2) the use of different visual spaces for understanding and generation, which impedes scalability; and (3) over-reliance on task-specific data, which neglects the duality between text-image understanding and generation. To address these challenges, we propose UniDDT, which uses a Noisy ViT encoder together with a large language model (LLM) to unify semantic encoding for visual generation and understanding, while employing a separate diffusion decoder to decouple diffusion decoding from text decoding. With the Noisy ViT encoder, UniDDT uses the latent space as a unified visual representation, making understanding and generation seamlessly compatible and balancing the scalability of generation against the semantic expressiveness of understanding. In addition, we construct dual data structures from the same image-text pairs, fostering interdependence between generation and understanding data to exploit their inherent duality. Extensive experiments demonstrate that UniDDT effectively unifies multimodal understanding and generation with enhanced semantic consistency and scalability. On visual generation, UniDDT achieves a GenEval score of 0.87 and a DPG overall score of 86.9. On multimodal understanding, it achieves a score of 1699.5 on the MME benchmark and an overall score of 76.5 on SEEDbench.
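
The idea of building dual data from the same image-text pair can be illustrated with a minimal sketch. The abstract does not specify the actual data format, so everything below is an assumption made for illustration: the `Sample` record, its field names, and the `make_dual_samples` helper are hypothetical and are not taken from UniDDT. The sketch only shows that one pair can yield a generation sample (text to image) and a mirrored understanding sample (image to text).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Sample:
    """Hypothetical training record; field names are illustrative only."""
    task: str        # "generation" or "understanding"
    condition: str   # what the model is given
    target: str      # what the model is asked to produce


def make_dual_samples(image_path: str, caption: str) -> List[Sample]:
    """Turn one image-text pair into two mirrored samples (assumed scheme)."""
    return [
        # Text -> image: the caption is the condition, the image is the target.
        Sample(task="generation", condition=caption, target=image_path),
        # Image -> text: the same image is the condition, the caption is the target.
        Sample(task="understanding", condition=image_path, target=caption),
    ]


# Placeholder pair; the file name and caption are made up for the example.
pairs = [("img_0001.png", "a red cube on a blue sphere")]
dataset = [s for path, text in pairs for s in make_dual_samples(path, text)]

for s in dataset:
    print(s.task, "|", s.condition, "->", s.target)
```

Because both samples come from the same pair, the condition of one is always the target of the other, which is one simple way to read the "interdependence" between generation and understanding data described above.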