Should VLMs be Pre-trained with Image Data?

Abstract

Pre-trained LLMs that are further trained with image data perform well on vision-language tasks. While adding images during a second training phase effectively unlocks this capability, it is unclear how much of a gain or loss this two-step pipeline gives over VLMs that integrate images earlier in the training process. To investigate this, we train models spanning various datasets, scales, image-text ratios, and amounts of pre-training done before introducing vision tokens. We then fine-tune these models and evaluate their downstream performance on a suite of vision-language and text-only tasks. We find that pre-training with a mixture of image and text data allows models to perform better on vision-language tasks while maintaining strong performance on text-only evaluations. Averaged over 6 diverse tasks, we find that for a 1B model, introducing visual tokens 80% of the way through pre-training results in a 2% average improvement over introducing visual tokens to a fully pre-trained model.
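
The two quantities varied in the abstract, the point in pre-training at which visual tokens are introduced and the image-text ratio, can be pictured as a simple data-mixing schedule. The following is a minimal sketch under stated assumptions, not the authors' training code: the function name `image_fraction`, the step-function shape of the schedule, and the example ratio of 0.1 are all hypothetical and chosen only for illustration.

```python
def image_fraction(step: int, total_steps: int,
                   intro_fraction: float, image_ratio: float) -> float:
    """Return the share of image-text data in the batch at `step`.

    Hypothetical schedule: text-only until `intro_fraction` of
    pre-training has elapsed, then a constant `image_ratio` mix.
    """
    if not 0.0 <= intro_fraction <= 1.0:
        raise ValueError("intro_fraction must be in [0, 1]")
    if not 0.0 <= image_ratio <= 1.0:
        raise ValueError("image_ratio must be in [0, 1]")
    # First step at which image data enters the mix.
    intro_step = round(intro_fraction * total_steps)
    return image_ratio if step >= intro_step else 0.0


# Introduce images 80% of the way through a 1000-step run
# (the 0.1 ratio is an assumed value, not taken from the paper).
early = image_fraction(799, 1000, intro_fraction=0.8, image_ratio=0.1)
late = image_fraction(800, 1000, intro_fraction=0.8, image_ratio=0.1)
```

In this sketch, setting `intro_fraction=1.0` corresponds to the two-step pipeline, where no images are seen during pre-training and they are added only in a later phase.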
