LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

Abstract

With the recent significant advances in large multimodal models (LMMs), the importance of their grounding capability in visual chat is increasingly recognized. Despite recent efforts to enable LMMs to support grounding, their grounding and chat capabilities are usually separate, and their chat performance drops dramatically when they are asked to ground. The underlying problem is the lack of a dataset for grounded visual chat (GVC): existing grounding datasets contain only short captions. To address this issue, we have created GVC data that allows grounding and chat capabilities to be combined. To better evaluate GVC capabilities, we have introduced a benchmark called Grounding-Bench. Additionally, we have proposed a model design that supports GVC and various types of visual prompts by connecting segmentation models with language models. Experimental results demonstrate that our model outperforms other LMMs on Grounding-Bench. Furthermore, our model achieves competitive performance on classic grounding benchmarks such as RefCOCO/+/g and Flickr30K Entities. Our code will be released at https://github.com/UX-Decoder/LLaVA-Grounding.
