MM-LLMs: Recent Advances in MultiModal Large Language Models

Abstract

In the past year, MultiModal Large Language Models (MM-LLMs) have advanced substantially, augmenting off-the-shelf LLMs to support MM inputs or outputs via cost-effective training strategies. The resulting models preserve the inherent reasoning and decision-making capabilities of LLMs while also enabling a diverse range of MM tasks. In this paper, we provide a comprehensive survey aimed at facilitating further research on MM-LLMs. Specifically, we first outline general design formulations for the model architecture and training pipeline. We then briefly introduce 26 existing MM-LLMs, each characterized by its specific formulations. Additionally, we review the performance of MM-LLMs on mainstream benchmarks and summarize key training recipes for enhancing the potency of MM-LLMs. Lastly, we explore promising directions for MM-LLMs and maintain a website that tracks the latest developments in the field in real time. We hope that this survey contributes to the ongoing advancement of the MM-LLMs domain.
