LOVM: Language-Only Vision Model Selection
June 15, 2023
Authors: Orr Zohar, Shih-Cheng Huang, Kuan-Chieh Wang, Serena Yeung
cs.AI
Abstract
Pre-trained multi-modal vision-language models (VLMs) are becoming
increasingly popular due to their exceptional performance on downstream vision
applications, particularly in the few- and zero-shot settings. However,
selecting the best-performing VLM for some downstream applications is
non-trivial, as it is dataset- and task-dependent. Meanwhile, exhaustively
evaluating all available VLMs on a novel application is not only time-consuming
and computationally demanding but also requires collecting a labeled dataset
for evaluation. As the number of open-source VLM variants increases,
there is a need for an efficient model selection strategy that does not require
access to a curated evaluation dataset. This paper proposes a novel task and
benchmark for efficiently evaluating VLMs' zero-shot performance on downstream
applications without access to the downstream task dataset. Specifically, we
introduce a new task LOVM: Language-Only Vision Model Selection, where methods
are expected to perform both model selection and performance prediction based
solely on a text description of the desired downstream application. We then
introduce an extensive LOVM benchmark consisting of ground-truth evaluations
of 35 pre-trained VLMs on 23 datasets, where methods are expected to rank the
pre-trained VLMs and predict their zero-shot performance.
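The setup described above lends itself to a simple evaluation interface: a LOVM method sees only a text description of the target task, scores each candidate VLM, and is judged on how well its ranking and predicted performance match the benchmark's ground-truth evaluations. Below is a minimal Python sketch of that protocol. Everything in it is an illustrative assumption rather than the benchmark's actual API: the function name lovm_method, the trivial baseline, the model names and accuracy values, and the use of Kendall's tau as the ranking metric.

```python
# Minimal sketch of the LOVM evaluation protocol. All names, values, and
# metric choices here are hypothetical illustrations, not the benchmark API.
from scipy.stats import kendalltau


def lovm_method(task_description: str, candidate_models: list[str]) -> dict[str, float]:
    """Hypothetical LOVM method: given only a text description of the
    downstream task (no images, no labels), predict each candidate VLM's
    zero-shot performance. This trivial baseline just scores models by
    their position in the candidate list."""
    return {m: float(i) for i, m in enumerate(candidate_models)}


# Hypothetical slice of the benchmark: ground-truth zero-shot accuracies of
# a few pre-trained VLMs on one dataset (values invented for this sketch).
ground_truth = {
    "ViT-B-32/openai": 0.633,
    "ViT-B-16/openai": 0.683,
    "ViT-L-14/openai": 0.753,
}

predicted = lovm_method(
    task_description="Classify natural photographs into 1000 object categories.",
    candidate_models=list(ground_truth),
)

# Model selection is scored by how well the predicted ranking agrees with the
# ground-truth ranking; performance prediction would additionally be scored
# by the error between predicted and true accuracies.
models = list(ground_truth)
tau, _ = kendalltau([predicted[m] for m in models],
                    [ground_truth[m] for m in models])
print(f"Ranking agreement (Kendall's tau): {tau:.3f}")
```

In practice, a LOVM method would derive its scores from the text description alone, for example by probing the candidate VLMs' text encoders; the sketch only fixes the input/output contract such a method would have to satisfy.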