Affordable Strategies for Improving Vision-Language Models

Abstract

In recent years, vision-language models have made significant strides, excelling in tasks such as optical character recognition and geometric problem-solving. However, several critical issues remain: 1) Proprietary models often lack transparency about their architectures, while open-source models need more detailed ablations of their training strategies. 2) Pre-training data in open-source works is under-explored, with datasets added empirically, which makes the process cumbersome. 3) Fine-tuning often focuses on adding datasets, leading to diminishing returns. To address these issues, we propose the following contributions: 1) We trained a robust baseline model using the latest advancements in vision-language models, introducing effective improvements and conducting comprehensive ablation and validation for each technique. 2) Inspired by recent work on large language models, we filtered pre-training data using perplexity, selecting the lowest-perplexity data for training. This approach allowed us to train on a curated 1M dataset and achieve competitive performance. 3) During visual instruction tuning, we applied model soup across models fine-tuned on different datasets when adding more datasets yielded only marginal improvements. These innovations resulted in a 9B-parameter model that performs competitively with state-of-the-art models. Our strategies are efficient and lightweight, making them easy for the community to adopt.
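The two data and merging strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy inputs, and the use of plain Python lists in place of real tensors are all assumptions made for illustration. Perplexity is computed here as the exponential of the negative mean token log-probability, and the soup shown is a uniform weight average, which is only one common form of model soup; the abstract does not say which variant was used.

```python
import math
from typing import Dict, List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of one sample: exp(-mean of its token log-probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_lowest_perplexity(
    samples: Dict[str, Sequence[float]], keep: int
) -> List[str]:
    """Return the ids of the `keep` samples with the lowest perplexity.

    `samples` maps a hypothetical sample id to the per-token log-probabilities
    that some reference model assigned to that sample.
    """
    ranked = sorted(samples, key=lambda sid: perplexity(samples[sid]))
    return ranked[:keep]


def uniform_soup(
    checkpoints: Sequence[Dict[str, List[float]]]
) -> Dict[str, List[float]]:
    """Average parameters element-wise across checkpoints.

    Each checkpoint maps a parameter name to a flat list of floats; all
    checkpoints are assumed to share the same names and shapes.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = checkpoints[0].keys()
    if any(ckpt.keys() != names for ckpt in checkpoints):
        raise ValueError("checkpoints must have identical parameter names")
    count = len(checkpoints)
    return {
        name: [sum(vals) / count for vals in zip(*(c[name] for c in checkpoints))]
        for name in names
    }


if __name__ == "__main__":
    # Toy log-probabilities standing in for scores from a reference model.
    pool = {"a": [-0.1, -0.2], "b": [-2.0, -3.0], "c": [-0.5, -0.5]}
    print(select_lowest_perplexity(pool, keep=2))

    # Two toy checkpoints, imagined as fine-tuned on different datasets.
    print(uniform_soup([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]))
```

In a real pipeline the log-probabilities would come from scoring each pre-training sample with a reference model, and the checkpoints would be full model state dictionaries sharing one architecture.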

POINTS: Improving Your Vision-language Model with Affordable Strategies