LOVM: Language-Only Vision Model Selection

Abstract

Pre-trained multi-modal vision-language models (VLMs) are becoming increasingly popular due to their exceptional performance on downstream vision applications, particularly in the few- and zero-shot settings. However, selecting the best-performing VLM for a given downstream application is non-trivial, as it is dataset- and task-dependent. Meanwhile, exhaustively evaluating all available VLMs on a novel application is not only time-consuming and computationally demanding but also requires collecting a labeled dataset for evaluation. As the number of open-source VLM variants increases, there is a need for an efficient model selection strategy that does not require access to a curated evaluation dataset. This paper proposes a novel task and benchmark for efficiently evaluating VLMs' zero-shot performance on downstream applications without access to the downstream task dataset. Specifically, we introduce a new task, LOVM (Language-Only Vision Model Selection), in which methods are expected to perform both model selection and performance prediction based solely on a text description of the desired downstream application. We then introduce an extensive LOVM benchmark consisting of ground-truth evaluations of 35 pre-trained VLMs on 23 datasets, on which methods are expected to rank the pre-trained VLMs and predict their zero-shot performance.
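
To make the two expected outputs concrete (a ranking of VLMs and a prediction of their zero-shot performance), the following is a minimal sketch of how a method's predictions for one dataset might be scored against ground-truth evaluations. The abstract does not name the metrics used, so the three shown here (top-k recall, Kendall's tau, and mean absolute error) are illustrative assumptions, and the model names and accuracy values are made up.

```python
from itertools import combinations


def rank_models(scores):
    """Return model names sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def top_k_recall(predicted, truth, k):
    """Fraction of the true top-k models that also appear in the predicted top-k."""
    pred_top = set(rank_models(predicted)[:k])
    true_top = set(rank_models(truth)[:k])
    return len(pred_top & true_top) / k


def kendall_tau(predicted, truth):
    """Kendall rank correlation (tau-a); tied pairs count as neither agreeing nor disagreeing."""
    concordant = discordant = 0
    for a, b in combinations(truth, 2):
        agreement = (predicted[a] - predicted[b]) * (truth[a] - truth[b])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    n_pairs = len(truth) * (len(truth) - 1) // 2
    return (concordant - discordant) / n_pairs


def mean_absolute_error(predicted, truth):
    """Average absolute gap between predicted and measured accuracy."""
    return sum(abs(predicted[m] - truth[m]) for m in truth) / len(truth)


# Hypothetical zero-shot accuracies for four placeholder models on one dataset.
truth = {"vlm_a": 0.70, "vlm_b": 0.60, "vlm_c": 0.50, "vlm_d": 0.40}
# Hypothetical predictions produced from a text description alone.
predicted = {"vlm_a": 0.65, "vlm_b": 0.50, "vlm_c": 0.55, "vlm_d": 0.30}

recall_at_2 = top_k_recall(predicted, truth, k=2)
tau = kendall_tau(predicted, truth)
mae = mean_absolute_error(predicted, truth)
print(rank_models(predicted), recall_at_2, tau, mae)
```

The ranking metrics address model selection, while the error metric addresses performance prediction; a benchmark covering many datasets would presumably average such scores across datasets.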