Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

Abstract

While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly rely on a conventional autoregressive architecture as their backbone, leaving significant room to explore effective and efficient architectural alternatives. Concurrently, recent studies have successfully applied discrete diffusion models to domains such as visual understanding and image generation, revealing their potential as a backbone for multimodal systems. Drawing inspiration from this pioneering research, we introduce Omni-Diffusion, the first any-to-any multimodal language model built entirely on mask-based discrete diffusion models, which unifies understanding and generation across text, speech, and images. Omni-Diffusion employs a unified mask-based discrete diffusion model to directly capture the joint distribution over discrete multimodal tokens. This approach supports not only bimodal tasks but also more complex scenarios involving multiple modalities. On a diverse set of benchmarks, our method outperforms or performs on par with existing multimodal systems that process two or more modalities, highlighting the significant promise of diffusion models in powering the next generation of multimodal foundation models. Project webpage: https://omni-diffusion.github.io.
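To make the idea of mask-based discrete diffusion over a shared multimodal token space more concrete, the following is a minimal sketch of the forward (masking) step only. It is not the authors' implementation: the vocabulary sizes, the offset scheme, the `MASK_ID` value, and all function names are assumptions chosen for illustration. The sketch places text, speech, and image token ids into one unified id space, then replaces each token with a shared mask token independently with probability t; a denoising model (not shown) would be trained to predict the original ids at the masked positions.

```python
import random

MASK_ID = 0  # hypothetical id reserved for the shared [MASK] token

# Hypothetical per-modality vocabulary sizes (illustrative, not from the paper).
VOCAB_SIZES = {"text": 1000, "speech": 500, "image": 2000}


def build_offsets(vocab_sizes, start=1):
    """Assign each modality a disjoint id range in one unified vocabulary."""
    offsets, cursor = {}, start
    for name, size in vocab_sizes.items():
        offsets[name] = cursor
        cursor += size
    return offsets, cursor


OFFSETS, TOTAL_VOCAB = build_offsets(VOCAB_SIZES)


def to_unified(modality, local_ids):
    """Map modality-local token ids into the unified id space."""
    base = OFFSETS[modality]
    return [base + i for i in local_ids]


def mask_tokens(tokens, t, rng):
    """Forward step: mask each token independently with probability t.

    Returns the noisy sequence and the positions a denoiser would
    have to reconstruct.
    """
    noisy = [MASK_ID if rng.random() < t else tok for tok in tokens]
    masked_positions = [i for i, tok in enumerate(noisy) if tok == MASK_ID]
    return noisy, masked_positions


# A toy sequence mixing text and image tokens in one stream.
sequence = to_unified("text", [3, 7, 11]) + to_unified("image", [5, 9, 42, 8])
noisy, masked_positions = mask_tokens(sequence, t=0.5, rng=random.Random(0))
```

Because every modality lives in the same id space and shares one mask token, the same masking routine applies regardless of which modalities appear in the sequence, which is one plausible reading of how a single model can cover any-to-any tasks.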
