UniDDT: Unifying Multimodal Understanding and Generation via Decoupled Diffusion Transformers

Abstract

Unified Multimodal Models (UMMs) have emerged as a critical direction for general-purpose multimodal intelligence, integrating understanding and generation into a single framework. However, existing UMMs face three prominent challenges: (1) inherent learning conflicts between visual understanding and generation tasks, which lead to suboptimal modeling of both; (2) a mismatch between the visual spaces used for understanding and for generation, which impedes scalability; and (3) over-reliance on task-specific data, which neglects the duality of text-image understanding and generation. To address these challenges, we propose UniDDT, which uses a Noisy ViT encoder together with an LLM to unify semantic encoding for visual generation and understanding tasks, and employs a separate diffusion decoder to decouple diffusion decoding from text decoding. With the Noisy ViT encoder, UniDDT can use the latent space as a unified visual representation, making understanding and generation tasks seamlessly compatible. This balances scalability on the generation side with semantic expressiveness on the understanding side. In addition, we construct dual data structures from the same image-text pairs, fostering interdependence between the generation and understanding data to exploit their inherent duality. Extensive experiments demonstrate that UniDDT effectively unifies multimodal understanding and generation with enhanced semantic consistency and scalability. On visual generation tasks, UniDDT achieves a GenEval score of 0.87 and a DPG overall score of 86.9. On multimodal understanding tasks, UniDDT achieves a score of 1699.5 on the MME benchmark and an overall score of 76.5 on SEEDbench.
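
The dual data idea can be illustrated with a minimal sketch. The abstract does not specify the actual data format, so everything here is an assumption made for illustration: the `Sample` record, the `build_dual_samples` helper, and the use of image identifiers in place of real image tensors are all hypothetical. The sketch only shows the general pattern of deriving one understanding sample (image to text) and one generation sample (text to image) from the same image-text pair.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Sample:
    """Hypothetical training record; not the paper's actual schema."""
    task: str       # "understanding" or "generation"
    condition: str  # what the model is given
    target: str     # what the model must produce


def build_dual_samples(pairs: List[Tuple[str, str]]) -> List[Sample]:
    """Turn each (image_id, caption) pair into two training samples."""
    samples: List[Sample] = []
    for image_id, caption in pairs:
        # Understanding direction: image in, text out.
        samples.append(Sample("understanding", image_id, caption))
        # Generation direction: text in, image out.
        samples.append(Sample("generation", caption, image_id))
    return samples


# Illustrative input; the file name and caption are made up.
pairs = [("img_001.png", "a red cube on a table")]
dual = build_dual_samples(pairs)
for s in dual:
    print(s.task, "|", s.condition, "->", s.target)
```

Because both samples come from the same pair, the understanding and generation data stay aligned by construction, which is one plausible reading of the interdependence the abstract describes.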