What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?

Abstract

Recent advancements in Large Language Models (LLMs) such as GPT4 have displayed exceptional multi-modal capabilities in following open-ended instructions given images. However, the performance of these models relies heavily on design choices such as network structures, training data, and training strategies. These choices have not been extensively discussed in the literature, which makes it difficult to quantify progress in this field. To address this issue, this paper presents a systematic and comprehensive study, both quantitative and qualitative, of training such models. We implement over 20 variants with controlled settings. Concretely, for network structures, we compare different LLM backbones and model designs. For training data, we investigate the impact of data and sampling strategies. For instructions, we explore the influence of diversified prompts on the instruction-following ability of the trained models. For benchmarks, we contribute, through crowd-sourcing, what is to the best of our knowledge the first comprehensive evaluation set that includes both image and video tasks. Based on our findings, we present Lynx, which performs the most accurate multi-modal understanding while keeping the best multi-modal generation ability compared to existing open-source GPT4-style models.
