Large Multimodal Models as General In-Context Classifiers

Abstract

Which multimodal model should we use for classification? Previous studies suggest that the answer lies in CLIP-like contrastive Vision-Language Models (VLMs), due to their remarkable performance in zero-shot classification. Large Multimodal Models (LMMs), in contrast, are considered better suited to complex tasks. In this work, we argue that this answer overlooks an important capability of LMMs: in-context learning. We benchmark state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance is lower than CLIP's, LMMs given a few in-context examples can match or even surpass contrastive VLMs equipped with cache-based adapters, their "in-context" equivalent. We extend this analysis to the open-world setting, where the generative nature of LMMs makes them better suited to the task. In this challenging scenario, LMMs struggle when the context information they are given is imperfect. To address this issue, we propose CIRCLE, a simple training-free method that assigns pseudo-labels to in-context examples and iteratively refines them using the available context itself. Through extensive experiments, we show that CIRCLE establishes a robust baseline for open-world classification, surpassing VLM counterparts and highlighting the potential of LMMs to serve as unified classifiers and as a flexible alternative to specialized models.
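As a rough illustration of what iteratively refining pseudo-labels with the available context could look like, the sketch below replaces the LMM with a toy nearest-neighbor predictor over scalar features. Everything in it is an assumption made for illustration and is not taken from the paper: the names `toy_predict` and `refine_pseudo_labels`, the leave-one-out context, the synchronous updates, and the fixed number of rounds. It should not be read as the CIRCLE procedure itself.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# A context is a list of (feature, pseudo-label) pairs; the scalar feature is a
# hypothetical stand-in for an image shown to the model as an in-context example.
Context = List[Tuple[float, str]]


def toy_predict(x: float, context: Context, k: int = 3) -> str:
    """Hypothetical stand-in for an LMM call (assumption, not a real API)."""
    if not context:
        # "Zero-shot" guess with a deliberately poor decision threshold.
        return "low" if x < 3.0 else "high"
    # With context: majority vote over the k nearest labeled examples.
    nearest = sorted(context, key=lambda pair: abs(pair[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def refine_pseudo_labels(
    items: Sequence[float],
    predict: Callable[[float, Context], str],
    rounds: int = 3,
) -> List[str]:
    """Assign pseudo-labels, then re-predict each item from the others."""
    # Round 0: label every item without any context.
    labels = [predict(x, []) for x in items]
    for _ in range(rounds):
        new_labels = []
        for i, x in enumerate(items):
            # Leave-one-out context: all other items with their current labels.
            context = [(items[j], labels[j]) for j in range(len(items)) if j != i]
            new_labels.append(predict(x, context))
        if new_labels == labels:
            break  # labels stopped changing
        labels = new_labels
    return labels


items = [1.0, 1.5, 2.0, 3.5, 9.0, 9.5, 10.0]
print(refine_pseudo_labels(items, toy_predict))
# ['low', 'low', 'low', 'low', 'high', 'high', 'high']
```

In this toy run the item 3.5 is first labeled "high" by the context-free guess and is relabeled "low" once its neighbors' pseudo-labels are used as context. A real system would replace `toy_predict` with a call to a multimodal model.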

