Large Multimodal Models as General In-Context Classifiers

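Abstract

Which multimodal model should be used for classification tasks? Prior work has suggested that the answer is contrastive Vision-Language Models (VLMs) such as CLIP, owing to their strong zero-shot classification performance, while Large Multimodal Models (LMMs) are better suited to complex tasks. In this work, we argue that this answer overlooks a key capability of LMMs: in-context learning. We evaluate state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance is lower than CLIP's, LMMs given a few in-context examples match, and in some cases surpass, contrastive VLMs equipped with cache-based adapters (their "in-context" equivalent). Extending this analysis to the open-world setting, we find that the generative nature of LMMs makes them better suited to the task. In this challenging scenario, however, LMMs struggle when given imperfect context information. To address this, we propose CIRCLE, a simple training-free method that assigns pseudo-labels to in-context examples and iteratively refines them using the available context itself. Through extensive experiments, we show that CIRCLE establishes a robust baseline for open-world classification, outperforms VLMs, and demonstrates the potential and flexibility of LMMs as unified classifiers that can replace specialized models.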

English

Which multimodal model should we use for classification? Previous studies suggest that the answer lies in CLIP-like contrastive Vision-Language Models (VLMs), due to their remarkable performance in zero-shot classification, whereas Large Multimodal Models (LMMs) are considered more suitable for complex tasks. In this work, we argue that this answer overlooks an important capability of LMMs: in-context learning. We benchmark state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance is lower than CLIP's, LMMs with a few in-context examples can match or even surpass contrastive VLMs with cache-based adapters, their "in-context" equivalent. We extend this analysis to the open-world setting, where the generative nature of LMMs makes them more suitable for the task. In this challenging scenario, however, LMMs struggle when provided with imperfect context information. To address this issue, we propose CIRCLE, a simple training-free method that assigns pseudo-labels to in-context examples and iteratively refines them with the available context itself. Through extensive experiments, we show that CIRCLE establishes a robust baseline for open-world classification, surpassing VLM counterparts and highlighting the potential of LMMs to serve as unified classifiers and as a flexible alternative to specialized models.
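The abstract describes CIRCLE only at a high level: pseudo-labels are assigned to in-context examples and then refined iteratively using the context itself. The Python sketch below illustrates one possible reading of that loop and is not the authors' implementation. The names `Labeler`, `refine_pseudo_labels`, and `toy_labeler`, the leave-one-out construction of the context, and the fixed number of rounds are all assumptions made for illustration; in practice the labeler would wrap a call to an LMM rather than the string heuristic used here.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface (an assumption, not from the paper): given a query
# item and a context of (item, label) pairs, return a free-form label.
Labeler = Callable[[str, Sequence[Tuple[str, str]]], str]


def refine_pseudo_labels(
    items: Sequence[str], labeler: Labeler, rounds: int = 3
) -> List[str]:
    """Pseudo-label unlabeled items, then re-label each one using the others."""
    # Initial pass: label every item with an empty context (zero-shot).
    labels = [labeler(item, []) for item in items]
    for _ in range(rounds):
        new_labels = []
        for i, item in enumerate(items):
            # Context = every other item paired with its current pseudo-label.
            context = [
                (other, label)
                for j, (other, label) in enumerate(zip(items, labels))
                if j != i
            ]
            new_labels.append(labeler(item, context))
        if new_labels == labels:  # labels are stable, stop early
            break
        labels = new_labels
    return labels


def toy_labeler(item: str, context: Sequence[Tuple[str, str]]) -> str:
    """Stand-in for an LMM: a deterministic string heuristic for the demo."""
    guess = item.lower()  # "zero-shot" guess is just the item text
    # With context, adopt the most frequent context label that shares the
    # same three-letter prefix; otherwise keep the zero-shot guess.
    candidates = [label for _, label in context if label[:3] == guess[:3]]
    if not candidates:
        return guess
    return Counter(candidates).most_common(1)[0][0]


if __name__ == "__main__":
    items = ["dog", "dogs", "dog", "dog", "cat"]
    print(refine_pseudo_labels(items, toy_labeler))
```

In this toy run the inconsistent open-vocabulary label "dogs" is pulled toward the majority label "dog" once the other examples are visible as context, while "cat" is left unchanged because nothing in its context supports a different label.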

