Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
May 5, 2025
Authors: Xinjie Zhang, Jintao Guo, Shanshan Zhao, Minghao Fu, Lunhao Duan, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang
cs.AI
Abstract
Recent years have seen remarkable progress in both multimodal understanding
models and image generation models. Despite their respective successes, these
two domains have evolved independently, leading to distinct architectural
paradigms: while autoregressive-based architectures have dominated multimodal
understanding, diffusion-based models have become the cornerstone of image
generation. Recently, there has been growing interest in developing unified
frameworks that integrate these tasks. The emergence of GPT-4o's new
capabilities exemplifies this trend, highlighting the potential for
unification. However, the architectural differences between the two domains
pose significant challenges. To provide a clear overview of current efforts
toward unification, we present a comprehensive survey aimed at guiding future
research. First, we introduce the foundational concepts and recent advancements
in multimodal understanding and text-to-image generation models. Next, we
review existing unified models, categorizing them into three main architectural
paradigms: diffusion-based, autoregressive-based, and hybrid approaches that
fuse autoregressive and diffusion mechanisms. For each category, we analyze the
structural designs and innovations introduced by related works. Additionally,
we compile datasets and benchmarks tailored for unified models, offering
resources for future exploration. Finally, we discuss the key challenges facing
this nascent field, including tokenization strategy, cross-modal attention, and
data. As this area is still in its early stages, we anticipate rapid
advancements and will regularly update this survey. Our goal is to inspire
further research and provide a valuable reference for the community. The
references associated with this survey are available on GitHub
(https://github.com/AIDC-AI/Awesome-Unified-Multimodal-Models).
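The two paradigms the survey contrasts can be caricatured in a few lines: autoregressive models generate a sequence token by token, each step conditioned on the tokens so far, while diffusion models start from noise and iteratively refine a sample toward the data distribution. A minimal toy sketch of the two sampling loops (the stand-in "models" `toy_logits` and `toy_denoiser` are purely illustrative, not from the survey):

```python
import numpy as np

rng = np.random.default_rng(0)

def autoregressive_sample(logits_fn, length):
    """Toy autoregressive decoding: greedily pick the most likely
    next token given the tokens generated so far."""
    tokens = []
    for _ in range(length):
        logits = logits_fn(tokens)      # model scores for each candidate token
        tokens.append(int(np.argmax(logits)))
    return tokens

def diffusion_denoise(x_noisy, denoise_fn, steps):
    """Toy diffusion sampling: repeatedly apply a learned denoiser
    to a noisy sample, one reverse step at a time."""
    x = x_noisy
    for t in reversed(range(steps)):
        x = denoise_fn(x, t)            # one reverse-diffusion step
    return x

# Illustrative stand-ins: a cyclic next-token scorer and a "denoiser"
# that simply shrinks the sample toward zero.
toy_logits = lambda toks: np.eye(4)[len(toks) % 4]
toy_denoiser = lambda x, t: 0.5 * x

print(autoregressive_sample(toy_logits, 4))                 # sequential generation
print(diffusion_denoise(rng.normal(size=3), toy_denoiser, 10))  # iterative refinement
```

The structural mismatch the abstract highlights is visible even here: the first loop is inherently sequential over discrete tokens, while the second refines a continuous sample in parallel across all dimensions, which is what makes unifying the two in one architecture non-trivial.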