Large Multimodal Models as General In-Context Classifiers
February 26, 2026
Authors: Marco Garosi, Matteo Farina, Alessandro Conti, Massimiliano Mancini, Elisa Ricci
cs.AI
Abstract
Which multimodal model should we use for classification? Previous studies suggest that the answer lies in CLIP-like contrastive Vision-Language Models (VLMs), due to their remarkable performance in zero-shot classification, while Large Multimodal Models (LMMs) are better suited to complex tasks. In this work, we argue that this answer overlooks an important capability of LMMs: in-context learning. We benchmark state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance is lower than CLIP's, LMMs with a few in-context examples can match or even surpass contrastive VLMs equipped with cache-based adapters, their "in-context" equivalent. We extend this analysis to the open-world setting, where the generative nature of LMMs makes them more suitable for the task. In this challenging scenario, LMMs struggle whenever the provided context information is imperfect. To address this issue, we propose CIRCLE, a simple training-free method that assigns pseudo-labels to in-context examples and iteratively refines them using the available context itself. Through extensive experiments, we show that CIRCLE establishes a robust baseline for open-world classification, surpassing its VLM counterparts and highlighting the potential of LMMs to serve as unified classifiers and a flexible alternative to specialized models.
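The abstract describes CIRCLE only at a high level: assign pseudo-labels to in-context examples, then refine them iteratively using the context itself. The sketch below is a hypothetical illustration of that loop, not the paper's actual implementation. The `classify` function stands in for an LMM in-context prediction call; here it is mocked as a nearest-neighbor classifier over 1-D features so the example is self-contained, and the function names, signatures, and the fixed round count are all assumptions.

```python
def classify(example, context):
    """Stand-in for an LMM in-context prediction (hypothetical).

    Predicts a label for `example` from a list of labeled
    (feature, label) context pairs via nearest neighbor.
    """
    nearest = min(context, key=lambda c: abs(c[0] - example))
    return nearest[1]

def circle_pseudo_label(unlabeled, seed_context, rounds=3):
    """Hypothetical CIRCLE-style loop: label in-context examples,
    then iteratively re-label each one using all the others as context."""
    # Initial pass: pseudo-label every example from the seed context alone.
    labels = [classify(x, seed_context) for x in unlabeled]
    for _ in range(rounds):
        # Build an enlarged context from seed examples plus current pseudo-labels.
        context = seed_context + list(zip(unlabeled, labels))
        # Refine: re-label each example, excluding itself from its own context.
        labels = [
            classify(x, [c for c in context if c[0] != x])
            for x in unlabeled
        ]
    return labels
```

For instance, with a two-class seed context `[(0.0, "cat"), (10.0, "dog")]`, the loop pseudo-labels unlabeled features near 0 as "cat" and those near 10 as "dog", and the refinement rounds let the growing pseudo-labeled pool reinforce those assignments.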