UniDDT: Unifying Multimodal Understanding and Generation with a Decoupled Diffusion Transformer

Abstract

Unified Multimodal Models (UMMs) have emerged as a critical direction for general-purpose multimodal intelligence, integrating understanding and generation into a single framework. However, existing UMMs face three prominent challenges: (1) inherent learning conflicts between visual understanding and generation tasks, which lead to suboptimal modeling of both; (2) a mismatch between the visual spaces used for understanding and for generation, which impedes scalability; and (3) over-reliance on task-specific data, which neglects the duality between text-image understanding and generation. To address these challenges, we propose UniDDT, which uses a Noisy ViT encoder together with an LLM to unify semantic encoding for visual generation and understanding tasks, while employing a separate diffusion decoder to decouple diffusion decoding from text decoding. With the Noisy ViT encoder, UniDDT can use the latent space as a unified visual representation, making understanding and generation tasks seamlessly compatible. This balances scalability in generation tasks against semantic expressiveness in understanding tasks. We also construct dual data structures from the same image-text pairs, fostering interdependence between the generation and understanding data to exploit their inherent duality. Extensive experiments demonstrate that UniDDT effectively unifies multimodal understanding and generation with enhanced semantic consistency and scalability. For visual generation, UniDDT achieves a GenEval score of 0.87 and a DPG overall score of 86.9. For multimodal understanding, UniDDT achieves a score of 1699.5 on the MME benchmark and an overall score of 76.5 on SEEDbench.
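
The dual data construction mentioned in the abstract can be illustrated with a minimal sketch. The abstract states only that dual structures are built from the same image-text pairs, so everything concrete here is an assumption made for illustration: the `Sample` record, the `build_dual_samples` helper, and the `CAPTION_PROMPT` string are hypothetical names, not the authors' actual data format. The idea shown is that one pair yields an understanding sample (image in, text out) and a generation sample (text in, image out).

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical instruction used for the understanding direction (an assumption).
CAPTION_PROMPT = "Describe the image."


@dataclass(frozen=True)
class Sample:
    """One training example; a hypothetical record layout, not the paper's."""
    task: str                # "understanding" or "generation"
    inputs: Dict[str, str]   # what the model is conditioned on
    target_kind: str         # "text" or "image"
    target: str              # caption text, or a reference to the image


def build_dual_samples(image_path: str, caption: str) -> Tuple[Sample, Sample]:
    """Turn a single image-text pair into two mirrored samples."""
    # Understanding direction: the image is the input, the caption is the target.
    understanding = Sample(
        task="understanding",
        inputs={"image": image_path, "prompt": CAPTION_PROMPT},
        target_kind="text",
        target=caption,
    )
    # Generation direction: the caption is the input, the image is the target.
    generation = Sample(
        task="generation",
        inputs={"prompt": caption},
        target_kind="image",
        target=image_path,
    )
    return understanding, generation


if __name__ == "__main__":
    und, gen = build_dual_samples("cat.jpg", "a cat sleeping on a sofa")
    print(und)
    print(gen)
```

In this sketch the input of each sample is the target of the other, which is one simple way to make the two data streams interdependent; the actual construction used for UniDDT may differ.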