MM-LLMs: Recent Advances in MultiModal Large Language Models

Abstract

In the past year, MultiModal Large Language Models (MM-LLMs) have advanced substantially, augmenting off-the-shelf LLMs to support MM inputs or outputs via cost-effective training strategies. The resulting models not only preserve the inherent reasoning and decision-making capabilities of LLMs but also enable a diverse range of MM tasks. In this paper, we provide a comprehensive survey aimed at facilitating further research on MM-LLMs. Specifically, we first outline general design formulations for model architecture and training pipeline. Subsequently, we provide brief introductions to 26 existing MM-LLMs, each characterized by its specific formulations. Additionally, we review the performance of MM-LLMs on mainstream benchmarks and summarize key training recipes to enhance the potency of MM-LLMs. Lastly, we explore promising directions for MM-LLMs while concurrently maintaining a real-time tracking website for the latest developments in the field. We hope that this survey contributes to the ongoing advancement of the MM-LLMs domain.
