

BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs

July 17, 2023
Authors: Yang Zhao, Zhijie Lin, Daquan Zhou, Zilong Huang, Jiashi Feng, Bingyi Kang
cs.AI

Abstract

LLMs have demonstrated remarkable abilities at interacting with humans through language, especially with the use of instruction-following data. Recent advancements in LLMs, such as MiniGPT-4, LLaVA, and X-LLM, further enlarge their abilities by incorporating multi-modal inputs, including image, video, and speech. Despite their effectiveness at generating precise and detailed language understanding of the given modality signal, these LLMs give up the ability to ground specific parts of the inputs, thus only constructing a coarse-grained mapping. However, explicit and informative correspondence between text and other modalities will not only improve the user experience but also help to expand the application scenarios of multi-modal LLMs. Therefore, we propose BuboGPT, a multi-modal LLM with visual grounding that can perform cross-modal interaction between vision, audio and language, providing fine-grained understanding of visual objects and other given modalities. As a result, BuboGPT is able to point out the specific location of an object in the image when it is generating a response or description for that object. Our contributions are two-fold: 1) an off-the-shelf visual grounding module based on SAM that extracts entities in a sentence and finds the corresponding masks in the image; 2) a two-stage training scheme and an instruction dataset to endow joint text-image-audio understanding. Our experiments show that BuboGPT achieves impressive multi-modality understanding and visual grounding abilities during interaction with humans. It performs consistently well when provided with arbitrary modality combinations (either aligned or unaligned). Our code, model and dataset are available at https://bubo-gpt.github.io.
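To make the first contribution more concrete, the sketch below shows one way such a SAM-based grounding module could be wired up: entity phrases are extracted from the generated sentence, each phrase is localized to a bounding box by some text-conditioned detector, and SAM converts the box into a pixel mask. This is a minimal sketch under stated assumptions, not BuboGPT's actual implementation: the spaCy noun-chunking step, the detect_box placeholder, and the checkpoint path are illustrative stand-ins, since the abstract does not specify the exact entity extractor or detector used.

```python
# Hypothetical sketch of a SAM-based grounding step: take entity phrases from the
# model's generated text, obtain a box per phrase from some open-vocabulary
# detector (placeholder here), and let SAM turn each box into a mask.
import numpy as np
import spacy
from segment_anything import sam_model_registry, SamPredictor


def extract_entities(caption: str) -> list[str]:
    """Pull candidate entity phrases (noun chunks) out of a generated caption."""
    nlp = spacy.load("en_core_web_sm")
    return [chunk.text for chunk in nlp(caption).noun_chunks]


def detect_box(image: np.ndarray, phrase: str):
    """Placeholder: return an XYXY box (length-4 array) for `phrase`, e.g. from a
    text-conditioned detector; return None if the phrase cannot be localized."""
    raise NotImplementedError("plug in an open-vocabulary detector here")


def ground_caption(image: np.ndarray, caption: str, sam_checkpoint: str):
    """Return {entity phrase: binary mask} for entities found in the image."""
    sam = sam_model_registry["vit_h"](checkpoint=sam_checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image)  # HxWx3 uint8 RGB array

    grounded = {}
    for phrase in extract_entities(caption):
        box = detect_box(image, phrase)
        if box is None:
            continue  # entity mentioned in the text but not localizable in the image
        masks, scores, _ = predictor.predict(box=box, multimask_output=False)
        grounded[phrase] = masks[0]  # best mask for this box prompt
    return grounded
```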