MM-LLMs: Recent Advances in MultiModal Large Language Models
January 24, 2024
Authors: Duzhen Zhang, Yahan Yu, Chenxing Li, Jiahua Dong, Dan Su, Chenhui Chu, Dong Yu
cs.AI
Abstract
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or
outputs via cost-effective training strategies. The resulting models not only
preserve the inherent reasoning and decision-making capabilities of LLMs but
also enable a diverse range of MM tasks. In this paper, we provide a
comprehensive survey aimed at facilitating further research on MM-LLMs.
Specifically, we first outline general design formulations for model
architecture and training pipeline. Subsequently, we provide brief
introductions of 26 existing MM-LLMs, each characterized by its specific
formulations. Additionally, we review the performance of MM-LLMs on mainstream
benchmarks and summarize key training recipes to enhance the potency of
MM-LLMs. Lastly, we explore promising directions for MM-LLMs while concurrently
maintaining a real-time tracking website for the latest developments in the
field. We hope that this survey contributes to the ongoing advancement of the
MM-LLMs domain.