Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
March 6, 2026
Authors: Lijiang Li, Zuwei Long, Yunhang Shen, Heting Gao, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He, Chaoyou Fu
cs.AI
Abstract
While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive architecture as their backbone, leaving significant room to explore effective and efficient alternatives in architectural design. Concurrently, recent studies have successfully applied discrete diffusion models to various domains, such as visual understanding and image generation, revealing their considerable potential as a promising backbone for multimodal systems. Drawing inspiration from these pioneering studies, we introduce Omni-Diffusion, the first any-to-any multimodal language model built entirely on mask-based discrete diffusion, which unifies understanding and generation across text, speech, and images. Omni-Diffusion employs a unified mask-based discrete diffusion model to directly capture the joint distribution over discrete multimodal tokens. This approach supports not only bimodal tasks but also more complex scenarios involving multiple modalities. On a diverse set of benchmarks, our method outperforms or performs on par with existing multimodal systems that process two or more modalities, highlighting the significant promise of diffusion models in powering the next generation of multimodal foundation models. Project webpage: https://omni-diffusion.github.io.
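The abstract does not spell out the training or decoding procedure. As a concrete illustration only, below is a minimal PyTorch sketch of the two mechanics that mask-based discrete diffusion over a shared token vocabulary commonly uses: training corrupts a sequence by randomly masking tokens and trains a denoiser to recover them, and generation iteratively unmasks a fully masked sequence. All names here (`MASK_ID`, `model`, `masked_diffusion_loss`, `iterative_unmask`) and the confidence-based unmasking schedule are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical id of the special [MASK] token in the shared vocabulary


def masked_diffusion_loss(model, tokens):
    """One training step of masked discrete diffusion (a sketch).

    tokens: (batch, seq_len) ids drawn from one shared multimodal
    vocabulary, i.e. text, speech, and image tokens in a single sequence.
    """
    b, n = tokens.shape
    # Sample a corruption level t ~ U(0, 1] per sequence: near t = 1 almost
    # every token is masked, near t = 0 almost none are.
    t = torch.rand(b, 1, device=tokens.device).clamp(min=1e-3)
    mask = torch.rand(b, n, device=tokens.device) < t  # mask each token w.p. t
    noised = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)
    logits = model(noised)  # (batch, seq_len, vocab_size)
    # The denoiser is trained to recover the original token at every masked
    # position; unmasked positions contribute no loss.
    return F.cross_entropy(logits[mask], tokens[mask])


@torch.no_grad()
def iterative_unmask(model, seq_len, steps=16, device="cpu"):
    """Generate by progressively unmasking an all-[MASK] sequence (a sketch)."""
    tokens = torch.full((1, seq_len), MASK_ID, dtype=torch.long, device=device)
    for step in range(steps):
        still_masked = tokens == MASK_ID
        if still_masked.sum() == 0:
            break
        logits = model(tokens)          # (1, seq_len, vocab_size)
        conf, pred = logits.softmax(-1).max(-1)
        # Reveal only the most confident masked positions this step, leaving
        # the rest masked for refinement in later steps.
        k = max(1, int(still_masked.sum().item() / (steps - step)))
        conf = conf.masked_fill(~still_masked, -1.0)
        reveal = conf.topk(k, dim=-1).indices
        tokens[0, reveal[0]] = pred[0, reveal[0]]
    return tokens
```

Because every position is predicted in parallel at each step, this style of decoder can emit many tokens per forward pass, which is one of the efficiency arguments for diffusion backbones over token-by-token autoregressive generation.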