An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models

Abstract

Visual instruction tuning has recently shown encouraging progress with open-source large multimodal models (LMMs) such as LLaVA and MiniGPT-4. However, most existing studies of open-source LMMs use models with 13B parameters or fewer. In this paper we present an empirical study of scaling LLaVA up to 33B and 65B/70B, and share findings from our explorations of image resolution, data mixing, and parameter-efficient training methods such as LoRA/QLoRA. These are evaluated by their impact on multimodal and language capabilities when completing real-world tasks in the wild. We find that scaling LMMs consistently enhances model performance and improves language capabilities, and that the performance of LoRA/QLoRA tuning of LMMs is comparable to that of full-model fine-tuning. The study also highlights the importance of higher image resolutions and of mixing multimodal and language data for improving LMM performance, and shows that visual instruction tuning can sometimes improve an LMM's pure language capability. We hope that this study makes state-of-the-art LMM research at a larger scale more accessible, thus helping establish stronger baselines for future research. Code and checkpoints will be made public.
