LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

December 5, 2023
Authors: Hao Zhang, Hongyang Li, Feng Li, Tianhe Ren, Xueyan Zou, Shilong Liu, Shijia Huang, Jianfeng Gao, Lei Zhang, Chunyuan Li, Jianwei Yang
cs.AI

Abstract

With the recent significant advancements in large multimodal models (LMMs), the importance of their grounding capability in visual chat is increasingly recognized. Despite recent efforts to enable LMMs to support grounding, their grounding and chat capabilities are usually separate, and their chat performance drops dramatically when they are asked to ground. The problem is the lack of a dataset for grounded visual chat (GVC): existing grounding datasets contain only short captions. To address this issue, we have created GVC data that combines grounding and chat capabilities. To better evaluate GVC capabilities, we have introduced a benchmark called Grounding-Bench. Additionally, we have proposed a model design that supports GVC and various types of visual prompts by connecting segmentation models with language models. Experimental results demonstrate that our model outperforms other LMMs on Grounding-Bench. Furthermore, our model achieves competitive performance on classic grounding benchmarks such as RefCOCO/+/g and Flickr30K Entities. Our code will be released at https://github.com/UX-Decoder/LLaVA-Grounding.
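The abstract's central design idea is connecting a segmentation model to a language model. As a rough illustration of how such a bridge can work, below is a minimal PyTorch sketch in which the language model's hidden states at special grounding-token positions are projected into segmentation queries and decoded into masks by similarity against pixel features. Every name, shape, and decoding choice here (GroundedChatSketch, ground_proj, the dot-product mask decoding) is an assumption for illustration only, not the paper's actual implementation; see the released code for the real design.

```python
import torch
import torch.nn as nn


class GroundedChatSketch(nn.Module):
    """Toy bridge between a language model and a segmentation head.

    Hidden states at grounding-token positions are projected into query
    embeddings, then decoded into masks via dot products against pixel
    features (a Mask2Former-style decoding). All names and shapes are
    illustrative assumptions, not the paper's actual architecture.
    """

    def __init__(self, lm_hidden: int = 4096, pixel_dim: int = 256):
        super().__init__()
        # Learned projection from the LM's embedding space into the
        # segmentation head's query space.
        self.ground_proj = nn.Linear(lm_hidden, pixel_dim)

    def forward(self, lm_hidden_states, ground_token_mask, pixel_features):
        # lm_hidden_states:  (T, lm_hidden) LM outputs for one reply
        # ground_token_mask: (T,) bool, True at grounded-phrase tokens
        # pixel_features:    (pixel_dim, H, W) from the vision backbone
        ground_states = lm_hidden_states[ground_token_mask]  # (P, lm_hidden)
        seg_queries = self.ground_proj(ground_states)        # (P, pixel_dim)
        # One mask per grounded phrase, via query-pixel similarity.
        mask_logits = torch.einsum("pc,chw->phw", seg_queries, pixel_features)
        return mask_logits.sigmoid()                         # (P, H, W)


# Usage with random stand-in tensors: a 20-token reply grounding two phrases.
hidden = torch.randn(20, 4096)
is_grounded = torch.zeros(20, dtype=torch.bool)
is_grounded[3] = is_grounded[11] = True
pixels = torch.randn(256, 64, 64)
masks = GroundedChatSketch()(hidden, is_grounded, pixels)    # (2, 64, 64)
```

The point of the sketch is the separation of concerns the abstract describes: the language model decides *what* to ground while chatting, and a promptable segmentation model decides *where*, with only a small learned projection tying the two embedding spaces together.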