Improved Baselines with Visual Instruction Tuning

Abstract

Large multimodal models (LMMs) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art results across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data samples and finishes full training in about 1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available.
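To illustrate what a "simple response formatting prompt" for academic-task-oriented VQA data might look like, here is a minimal sketch. The abstract does not quote the prompt wording, so the instruction string below is an assumption, and `format_vqa_prompt` and `SHORT_ANSWER_INSTRUCTION` are hypothetical names introduced only for this example, not part of the released code.

```python
# Minimal sketch of a response-format prompt for short-answer VQA data.
# Assumption: the instruction wording below is illustrative; the abstract
# does not state the exact prompt text.
SHORT_ANSWER_INSTRUCTION = "Answer the question using a single word or phrase."


def format_vqa_prompt(question: str, short_answer: bool = True) -> str:
    """Return the question, optionally followed by a format instruction.

    Appending an explicit instruction tells the model which answer style is
    expected, so short-answer VQA data and conversational data can be mixed
    in a single training set.
    """
    question = question.strip()
    if not short_answer:
        # Open-ended questions are passed through unchanged.
        return question
    # Put the format instruction on its own line after the question.
    return f"{question}\n{SHORT_ANSWER_INSTRUCTION}"


if __name__ == "__main__":
    print(format_vqa_prompt("What color is the bus?"))
```

The design choice sketched here is to state the desired output format in the prompt itself instead of leaving it implicit in the dataset, so the model can separate short-answer requests from open-ended ones at inference time.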
