LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models
December 5, 2023
Authors: Hao Zhang, Hongyang Li, Feng Li, Tianhe Ren, Xueyan Zou, Shilong Liu, Shijia Huang, Jianfeng Gao, Lei Zhang, Chunyuan Li, Jianwei Yang
cs.AI
Abstract
With the recent significant advancements in large multi-modal models (LMMs),
the importance of their grounding capability in visual chat is increasingly
recognized. Despite recent efforts to enable LMMs to support grounding, their
capabilities for grounding and chat are usually separate, and their chat
performance drops dramatically when asked to ground. The problem is the lack of
a dataset for grounded visual chat (GVC). Existing grounding datasets only
contain short captions. To address this issue, we have created GVC data that
allows for the combination of grounding and chat capabilities. To better
evaluate the GVC capabilities, we have introduced a benchmark called
Grounding-Bench. Additionally, we have proposed a model design that can support
GVC and various types of visual prompts by connecting segmentation models with
language models. Experimental results demonstrate that our model outperforms
other LMMs on Grounding-Bench. Furthermore, our model achieves competitive
performance on classic grounding benchmarks like RefCOCO/+/g and Flickr30K
Entities. Our code will be released at
https://github.com/UX-Decoder/LLaVA-Grounding .
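
The abstract describes an architecture that connects a segmentation model to a language model so that phrases in the chat response can be grounded to regions. As a rough illustration of one way such a bridge could be wired, the minimal PyTorch sketch below projects a language model's hidden states for grounded phrases into a segmentation-query space; the module name GroundingProjector, the dimensions, and the wiring are assumptions made for illustration, not the paper's actual implementation.

```python
# Illustrative sketch only: hypothetical bridge from a language model's hidden
# states to segmentation queries for grounded chat. Names and dimensions are
# assumptions, not taken from the LLaVA-Grounding codebase.
import torch
import torch.nn as nn


class GroundingProjector(nn.Module):
    """Maps LMM hidden states of grounded phrases into segmentation queries."""

    def __init__(self, lm_dim: int = 4096, seg_query_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(lm_dim, seg_query_dim),
            nn.GELU(),
            nn.Linear(seg_query_dim, seg_query_dim),
        )

    def forward(self, grounding_hidden_states: torch.Tensor) -> torch.Tensor:
        # grounding_hidden_states: (num_grounded_phrases, lm_dim)
        return self.proj(grounding_hidden_states)


if __name__ == "__main__":
    # Toy usage: three grounded phrases from a chat response become three
    # segmentation queries; a real system would pass these queries to a
    # promptable segmentation model to decode masks or boxes.
    projector = GroundingProjector()
    hidden = torch.randn(3, 4096)   # placeholder LMM hidden states
    seg_queries = projector(hidden)
    print(seg_queries.shape)        # torch.Size([3, 256])
```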