AutoCLIP: Auto-tuning Zero-Shot Classifiers for Vision-Language Models
September 28, 2023
Authors: Jan Hendrik Metzen, Piyapat Saranrittichai, Chaithanya Kumar Mummadi
cs.AI
Abstract
Classifiers built upon vision-language models such as CLIP have shown
remarkable zero-shot performance across a broad range of image classification
tasks. Prior work has studied different ways of automatically creating
descriptor sets for every class based on prompt templates, ranging from
manually engineered templates, through templates obtained from a large
language model, to templates built from random words and characters. In
contrast, the way zero-shot classifiers are derived from the respective
encoded class descriptors has remained nearly unchanged: classify the image to
the class that maximizes the cosine similarity between its averaged encoded
class descriptors and the encoded image. However, weighting all class
descriptors equally can be suboptimal when certain descriptors match visual
cues in a given image better than others. In this work, we propose AutoCLIP, a
method for auto-tuning zero-shot classifiers. AutoCLIP assigns to each prompt
template per-image weights, which are derived from statistics of class
descriptor-image similarities at inference time. AutoCLIP is fully
unsupervised, has very low overhead, and can be easily implemented in a few
lines of code. We show that for a broad range of vision-language models,
datasets, and prompt templates, AutoCLIP consistently outperforms baselines,
by up to 3 percentage points in accuracy.
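
To make the contrast concrete, below is a minimal PyTorch-style sketch of the two classification rules described in the abstract: the standard baseline that averages each class's encoded descriptors, and an AutoCLIP-style variant that reweights prompt templates per image. The specific per-template statistic (a logsumexp over descriptor-image similarities) and the softmax weighting are illustrative assumptions; the abstract only states that the weights are derived from class descriptor-image similarity statistics at inference time, and the paper's exact formulation may differ.

```python
# A minimal sketch of the idea described above, not the authors' reference
# implementation. Assumes image and descriptor embeddings are L2-normalized.
import torch

def baseline_zero_shot_logits(image_emb, descriptor_embs):
    """image_emb: (d,), descriptor_embs: (num_classes, num_templates, d)."""
    # Baseline: average each class's encoded descriptors, re-normalize, then
    # take cosine similarity with the encoded image.
    class_emb = descriptor_embs.mean(dim=1)
    class_emb = class_emb / class_emb.norm(dim=-1, keepdim=True)
    return class_emb @ image_emb  # (num_classes,)

def autoclip_style_logits(image_emb, descriptor_embs, temperature=100.0):
    """Weight templates per image by how well their descriptors match it."""
    # Cosine similarity of every class descriptor with the image:
    # sims[c, t] = <descriptor_embs[c, t], image_emb>
    sims = descriptor_embs @ image_emb  # (num_classes, num_templates)
    # Per-template statistic: here a smooth maximum (logsumexp) over classes;
    # this particular choice is an assumption made for illustration.
    template_scores = torch.logsumexp(temperature * sims, dim=0)  # (num_templates,)
    weights = torch.softmax(template_scores, dim=0)               # (num_templates,)
    # Class logits: per-class descriptor similarities averaged with the
    # image-dependent template weights instead of uniform weights.
    return sims @ weights  # (num_classes,)
```

Classification then picks the argmax over the returned class logits in either case; the only change is replacing the uniform average over templates with image-dependent weights, which keeps the method unsupervised and adds negligible overhead at inference time.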