ABC: Achieving Better Control of Multimodal Embeddings using VLMs

Abstract

Visual embedding models excel at zero-shot tasks like visual retrieval and classification. However, these models cannot be used for tasks that contain ambiguity or require user instruction. These tasks necessitate a multimodal embedding model, which outputs embeddings that combine visual and natural language input. Existing CLIP-based approaches embed images and text independently and fuse the result. We find that this results in weak interactions between modalities and poor user control over the representation. We introduce ABC, an open-source multimodal embedding model that uses a vision-language model backbone to deeply integrate image features with natural language instructions. ABC achieves best-for-size performance on MSCOCO image-to-text retrieval and is the top-performing model on classification and VQA tasks in the Massive Multimodal Embedding Benchmark. With a strongly unified vision-language representation, ABC can use natural language to solve subtle and potentially ambiguous visual retrieval problems. To evaluate this capability, we design CtrlBench, a benchmark that requires interleaving textual instructions with image content for correct retrieval. ABC advances the state of multimodal embeddings by offering high-quality representations and flexible natural language control. Our model and datasets are available on our project page.
