LOVM: Language-Only Vision Model Selection

Abstract

Pre-trained multi-modal vision-language models (VLMs) are becoming increasingly popular due to their exceptional performance on downstream vision applications, particularly in the few- and zero-shot settings. However, selecting the best-performing VLM for a given downstream application is non-trivial, as the best choice is dataset- and task-dependent. Meanwhile, exhaustively evaluating all available VLMs on a novel application is not only time- and compute-intensive but also requires collecting a labeled dataset for evaluation. As the number of open-source VLM variants increases, there is a need for an efficient model selection strategy that does not require access to a curated evaluation dataset. This paper proposes a novel task and benchmark for efficiently evaluating VLMs' zero-shot performance on downstream applications without access to the downstream task dataset. Specifically, we introduce a new task, LOVM: Language-Only Vision Model Selection, where methods are expected to perform both model selection and performance prediction based solely on a text description of the desired downstream application. We then introduce an extensive LOVM benchmark consisting of ground-truth evaluations of 35 pre-trained VLMs on 23 datasets, where methods are expected to rank the pre-trained VLMs and predict their zero-shot performance.
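To make the two expected outputs concrete (a ranking of VLMs and a predicted zero-shot score for each), the following is a minimal sketch of how a method's predictions for one dataset could be compared against ground-truth evaluations. The metric choices here (Kendall rank correlation, top-k recall, mean absolute error) are common options assumed for illustration; the abstract does not specify which metrics the benchmark uses. The model names and all numbers are hypothetical placeholders, not results.

```python
from itertools import combinations


def kendall_tau(pred, truth):
    """Kendall tau-a between predicted and true scores (dicts keyed by model).

    Assumes both dicts share the same keys and contain at least two models.
    """
    models = sorted(pred)
    concordant = discordant = 0
    for a, b in combinations(models, 2):
        sign = (pred[a] - pred[b]) * (truth[a] - truth[b])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


def top_k_recall(pred, truth, k):
    """Fraction of the true top-k models that appear in the predicted top-k."""
    top_pred = set(sorted(pred, key=pred.get, reverse=True)[:k])
    top_true = set(sorted(truth, key=truth.get, reverse=True)[:k])
    return len(top_pred & top_true) / k


def mean_abs_error(pred, truth):
    """Mean absolute gap between predicted and true zero-shot accuracy."""
    return sum(abs(pred[m] - truth[m]) for m in truth) / len(truth)


# Hypothetical ground-truth accuracies for four models on one dataset.
truth = {"vlm_a": 0.70, "vlm_b": 0.60, "vlm_c": 0.50, "vlm_d": 0.40}
# Hypothetical predictions produced from a text description alone.
pred = {"vlm_a": 0.65, "vlm_b": 0.50, "vlm_c": 0.55, "vlm_d": 0.35}

tau = kendall_tau(pred, truth)           # ranking quality (model selection)
recall = top_k_recall(pred, truth, k=2)  # did we surface the best models?
mae = mean_abs_error(pred, truth)        # performance-prediction quality
print(f"tau={tau:.3f} top2_recall={recall:.2f} mae={mae:.4f}")
```

The rank-based scores address the model selection half of the task, while the absolute error addresses performance prediction; a method can do well on one and poorly on the other, which is why the sketch reports them separately.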