Large Multimodal Models as General In-Context Classifiers

Abstract

Which multimodal model should we use for classification? Previous studies suggest that the answer lies in CLIP-like contrastive Vision-Language Models (VLMs), owing to their remarkable zero-shot classification performance, whereas Large Multimodal Models (LMMs) are regarded as better suited to complex tasks. In this work, we argue that this answer overlooks an important capability of LMMs: in-context learning. We benchmark state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance is lower than CLIP's, LMMs given a few in-context examples can match or even surpass contrastive VLMs equipped with cache-based adapters, the VLMs' "in-context" equivalent. We extend this analysis to the open-world setting, where the generative nature of LMMs makes them better suited to the task. In this challenging scenario, however, LMMs struggle whenever they are given imperfect context information. To address this issue, we propose CIRCLE, a simple training-free method that assigns pseudo-labels to in-context examples and iteratively refines them using the available context itself. Through extensive experiments, we show that CIRCLE establishes a robust baseline for open-world classification, surpassing its VLM counterparts and highlighting the potential of LMMs to serve as unified classifiers and as a flexible alternative to specialized models.
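
The abstract does not say how CIRCLE builds its context or how many refinement rounds it runs, so the following is only a minimal sketch of the general idea of iterative pseudo-label refinement, not the authors' implementation. The names `refine_pseudo_labels`, `toy_predict`, and `num_rounds`, the leave-one-out context, and the text strings standing in for images are all assumptions made for illustration. A real setup would replace `toy_predict` with a call to an LMM.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for an LMM call: takes (item, label) context pairs
# and a query item, and returns a free-form label string.
Predictor = Callable[[Sequence[Tuple[str, str]], str], str]


def refine_pseudo_labels(
    items: Sequence[str], predict: Predictor, num_rounds: int = 2
) -> List[str]:
    """Assign pseudo-labels to unlabeled context items, then refine them.

    Assumed procedure (not taken from the paper): start from zero-shot
    guesses, then re-predict each item using all *other* items and their
    current pseudo-labels as context, stopping early once labels stabilize.
    """
    # Round 0: zero-shot guess for every item (empty context).
    labels = [predict([], item) for item in items]
    for _ in range(num_rounds):
        new_labels = []
        for i, item in enumerate(items):
            # Leave-one-out context built from the current pseudo-labels.
            context = [
                (other, labels[j]) for j, other in enumerate(items) if j != i
            ]
            new_labels.append(predict(context, item))
        if new_labels == labels:  # labels no longer change
            break
        labels = new_labels
    return labels


def toy_predict(context: Sequence[Tuple[str, str]], query: str) -> str:
    """Toy predictor, purely illustrative: items are 'group guess' strings.

    With no usable context it returns the item's own noisy guess (last word);
    otherwise it takes a majority vote over context items in the same group.
    """
    group, guess = query.split()[0], query.split()[-1]
    votes = Counter(label for item, label in context if item.split()[0] == group)
    return votes.most_common(1)[0][0] if votes else guess


if __name__ == "__main__":
    items = ["cat kitten", "cat kitten", "cat kitten", "cat feline", "dog puppy"]
    # The inconsistent zero-shot guess "feline" is pulled toward "kitten".
    print(refine_pseudo_labels(items, toy_predict))
```

In this toy run the single inconsistent guess is overwritten by the majority label of similar items after one round, which is the kind of self-correction that refinement "with the available context itself" suggests. Whether CIRCLE uses a leave-one-out context or a fixed number of rounds is not stated in the abstract.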
