Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities

Abstract

Recent years have seen remarkable progress in both multimodal understanding models and image generation models. Despite their respective successes, these two domains have evolved independently, leading to distinct architectural paradigms: autoregressive architectures have dominated multimodal understanding, while diffusion-based models have become the cornerstone of image generation. Recently, there has been growing interest in developing unified frameworks that integrate these tasks. The emergence of GPT-4o's new capabilities exemplifies this trend, highlighting the potential for unification. However, the architectural differences between the two domains pose significant challenges. To provide a clear overview of current efforts toward unification, we present a comprehensive survey aimed at guiding future research. First, we introduce the foundational concepts and recent advancements in multimodal understanding and text-to-image generation models. Next, we review existing unified models, categorizing them into three main architectural paradigms: diffusion-based, autoregressive-based, and hybrid approaches that fuse autoregressive and diffusion mechanisms. For each category, we analyze the structural designs and innovations introduced by related works. Additionally, we compile datasets and benchmarks tailored for unified models, offering resources for future exploration. Finally, we discuss the key challenges facing this nascent field, including tokenization strategy, cross-modal attention, and data. As this area is still in its early stages, we anticipate rapid advancements and will regularly update this survey. Our goal is to inspire further research and provide a valuable reference for the community. The references associated with this survey are available on GitHub (https://github.com/AIDC-AI/Awesome-Unified-Multimodal-Models).
