LOVM: Language-Only Vision Model Selection

June 15, 2023
Authors: Orr Zohar, Shih-Cheng Huang, Kuan-Chieh Wang, Serena Yeung
cs.AI

Abstract

Pre-trained multi-modal vision-language models (VLMs) are becoming increasingly popular due to their exceptional performance on downstream vision applications, particularly in the few- and zero-shot settings. However, selecting the best-performing VLM for a given downstream application is non-trivial, as the answer is dataset- and task-dependent. Meanwhile, exhaustively evaluating all available VLMs on a novel application is not only time-consuming and computationally demanding but also requires collecting a labeled dataset for evaluation. As the number of open-source VLM variants grows, there is a need for an efficient model selection strategy that does not require access to a curated evaluation dataset. This paper proposes a novel task and benchmark for efficiently evaluating VLMs' zero-shot performance on downstream applications without access to the downstream task dataset. Specifically, we introduce a new task, LOVM: Language-Only Vision Model Selection, where methods are expected to perform both model selection and performance prediction based solely on a text description of the desired downstream application. We then introduce an extensive LOVM benchmark consisting of ground-truth evaluations of 35 pre-trained VLMs on 23 datasets, where methods are expected to rank the pre-trained VLMs and predict their zero-shot performance.
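To make the task setup concrete, below is a minimal, hypothetical sketch of the LOVM interface: given only a text description of the target task (reduced here to its class names), a method must rank candidate VLMs by predicted zero-shot performance. The separability heuristic, candidate model list, and pretrained tags are illustrative assumptions for this sketch, not the benchmark's official method; the example uses the open_clip library.

```python
# A hypothetical text-only baseline for the LOVM setting: rank candidate VLMs
# for a downstream task using only each model's text encoder. The scoring
# heuristic (mean pairwise cosine distance between class prompts) is an
# illustrative proxy for class separability, not the paper's method.
import torch
import open_clip


def text_only_score(arch: str, pretrained: str, class_names: list[str]) -> float:
    """Score a VLM from text alone: larger mean pairwise cosine distance
    between class-prompt embeddings suggests better class separability."""
    model, _, _ = open_clip.create_model_and_transforms(arch, pretrained=pretrained)
    tokenizer = open_clip.get_tokenizer(arch)
    prompts = [f"a photo of a {c}" for c in class_names]
    with torch.no_grad():
        emb = model.encode_text(tokenizer(prompts))
        emb = emb / emb.norm(dim=-1, keepdim=True)     # unit-normalize embeddings
    sim = emb @ emb.T                                  # pairwise cosine similarities
    off_diag = sim[~torch.eye(len(class_names), dtype=torch.bool)]
    return (1.0 - off_diag).mean().item()              # mean inter-class distance


# Rank a few candidate VLMs for a hypothetical downstream task, text-only.
candidates = [("ViT-B-32", "openai"), ("ViT-B-16", "laion2b_s34b_b88k")]
class_names = ["airplane", "bird", "car", "cat", "dog"]
ranking = sorted(candidates,
                 key=lambda m: text_only_score(*m, class_names),
                 reverse=True)
print(ranking)
```

In the actual benchmark, such predicted rankings and accuracy estimates would be compared against the ground-truth zero-shot evaluations of the 35 VLMs on the 23 datasets.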