Large Multimodal Models as General In-Context Classifiers
February 26, 2026
Authors: Marco Garosi, Matteo Farina, Alessandro Conti, Massimiliano Mancini, Elisa Ricci
cs.AI
Abstract
Which multimodal model should we use for classification? Previous studies suggest that the answer lies in CLIP-like contrastive Vision-Language Models (VLMs), due to their remarkable performance in zero-shot classification, while Large Multimodal Models (LMMs) are deemed more suitable for complex tasks. In this work, we argue that this answer overlooks an important capability of LMMs: in-context learning. We benchmark state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance is lower than CLIP's, LMMs with a few in-context examples can match or even surpass contrastive VLMs equipped with cache-based adapters, their "in-context" equivalent. We extend this analysis to the open-world setting, where the generative nature of LMMs makes them more suitable for the task. In this challenging scenario, LMMs struggle whenever they are provided with imperfect context information. To address this issue, we propose CIRCLE, a simple training-free method that assigns pseudo-labels to in-context examples and iteratively refines them using the available context itself. Through extensive experiments, we show that CIRCLE establishes a robust baseline for open-world classification, surpassing its VLM counterparts and highlighting the potential of LMMs to serve as unified classifiers and as a flexible alternative to specialized models.
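
For readers unfamiliar with the "in-context equivalent" the abstract refers to, the following is a minimal sketch of a cache-based adapter for a CLIP-like model, in the spirit of Tip-Adapter. The function name, hyperparameter defaults, and exact blending rule here are illustrative assumptions, not this paper's implementation.

```python
# A minimal sketch of a cache-based adapter for a CLIP-like VLM
# (Tip-Adapter style). All names and defaults are illustrative.
import torch

def cache_adapter_logits(query_feat, cache_keys, cache_values,
                         zero_shot_logits, alpha=1.0, beta=5.5):
    """Blend zero-shot logits with similarity to cached few-shot examples.

    query_feat:       (d,)   L2-normalized feature of the test image
    cache_keys:       (N, d) L2-normalized features of the few-shot images
    cache_values:     (N, C) one-hot labels of the few-shot images
    zero_shot_logits: (C,)   image-text similarities from the frozen VLM
    """
    # Cosine affinity between the query and every cached example.
    affinity = query_feat @ cache_keys.T              # (N,)
    # Sharpen affinities with an exponential reweighting.
    weights = torch.exp(-beta * (1.0 - affinity))     # (N,)
    # Cache logits: an affinity-weighted vote over cached labels.
    cache_logits = weights @ cache_values             # (C,)
    return zero_shot_logits + alpha * cache_logits
```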
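By contrast, in-context classification with a generative LMM amounts to interleaving labeled support images with the query in a single prompt, letting the model complete with a label. The sketch below builds such a prompt; the chat-message schema and field names are assumptions about a generic multimodal chat API, not a specific library.

```python
# A minimal sketch of few-shot in-context classification with a generative
# LMM via a chat-style API. The message schema below is an assumption.
def build_icl_messages(support_set, query_image, class_names):
    """Interleave labeled example images with the unlabeled query image."""
    messages = [{"role": "system",
                 "content": "Classify each image into exactly one of: "
                            + ", ".join(class_names) + "."}]
    for image, label in support_set:  # the in-context examples
        messages.append({"role": "user",
                         "content": [{"type": "image", "image": image}]})
        messages.append({"role": "assistant", "content": label})
    # The query comes last; the LMM is expected to complete with a label.
    messages.append({"role": "user",
                     "content": [{"type": "image", "image": query_image}]})
    return messages
```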
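The abstract only states that CIRCLE assigns pseudo-labels to in-context examples and refines them iteratively with the available context itself; the loop below is one plausible reading of that description under our own assumptions, not the authors' algorithm. The `classify` argument is a hypothetical LMM-backed predictor, for instance one built on `build_icl_messages` above.

```python
# One plausible reading of CIRCLE's training-free pseudo-labeling loop,
# under our own assumptions about the update rule (the abstract does not
# spell it out). `classify(image, class_names, context)` is a hypothetical
# LMM-backed predictor returning a class name.
def circle_pseudo_labels(classify, images, class_names, n_rounds=3):
    # Round 0: label every example with an empty context (zero-shot).
    labels = [classify(img, class_names, context=[]) for img in images]
    for _ in range(n_rounds):
        new_labels = []
        for i, img in enumerate(images):
            # Re-label each example using all *other* examples, paired
            # with their current pseudo-labels, as in-context demos.
            context = [(images[j], labels[j])
                       for j in range(len(images)) if j != i]
            new_labels.append(classify(img, class_names, context))
        labels = new_labels  # refined pseudo-labels feed the next round
    return labels
```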