MM-LLMs: Recent Advances in MultiModal Large Language Models
January 24, 2024
Authors: Duzhen Zhang, Yahan Yu, Chenxing Li, Jiahua Dong, Dan Su, Chenhui Chu, Dong Yu
cs.AI
Abstract
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or
outputs via cost-effective training strategies. The resulting models not only
preserve the inherent reasoning and decision-making capabilities of LLMs but
also enable a diverse range of MM tasks. In this paper, we provide a
comprehensive survey aimed at facilitating further research on MM-LLMs.
Specifically, we first outline general design formulations for model
architecture and training pipeline. Subsequently, we provide brief
introductions of 26 existing MM-LLMs, each characterized by its specific
formulations. Additionally, we review the performance of MM-LLMs on mainstream
benchmarks and summarize key training recipes to enhance the potency of
MM-LLMs. Lastly, we explore promising directions for MM-LLMs while concurrently
maintaining a real-time tracking website for the latest developments in the
field. We hope that this survey contributes to the ongoing advancement of the
MM-LLMs domain.