

BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs

July 17, 2023
作者: Yang Zhao, Zhijie Lin, Daquan Zhou, Zilong Huang, Jiashi Feng, Bingyi Kang
cs.AI

Abstract

LLMs have demonstrated remarkable abilities in interacting with humans through language, especially with the use of instruction-following data. Recent advancements in LLMs, such as MiniGPT-4, LLaVA, and X-LLM, further extend their capabilities by incorporating multi-modal inputs, including images, video, and speech. Despite their effectiveness at generating precise and detailed language understanding of the given modality signal, these LLMs give up the ability to ground specific parts of the input, constructing only a coarse-grained mapping. However, explicit and informative correspondence between text and other modalities will not only improve the user experience but also help to expand the application scenarios of multi-modal LLMs. Therefore, we propose BuboGPT, a multi-modal LLM with visual grounding that can perform cross-modal interaction between vision, audio, and language, providing fine-grained understanding of visual objects and other given modalities. As a result, BuboGPT is able to point out the specific location of an object in the image when generating a response or description for that object. Our contributions are two-fold: 1) An off-the-shelf visual grounding module based on SAM that extracts entities in a sentence and finds corresponding masks in the image. 2) A two-stage training scheme and instruction dataset to endow the model with joint text-image-audio understanding. Our experiments show that BuboGPT achieves impressive multi-modality understanding and visual grounding abilities during interaction with humans. It performs consistently well when provided with arbitrary modality combinations (whether aligned or unaligned). Our code, model, and dataset are available at https://bubo-gpt.github.io .
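
To illustrate the plug-and-play grounding idea described in the abstract, here is a minimal sketch of how entities mentioned in a generated response could be mapped to segmentation masks with the segment-anything package. The entity-to-box input format and the upstream tagger/detector are simplified stand-ins assumed for this sketch, not BuboGPT's actual modules.

```python
# Minimal sketch: pair entities extracted from an LLM response with SAM masks.
# Assumes the `segment_anything` package and a downloaded SAM checkpoint.
import numpy as np
from segment_anything import SamPredictor, sam_model_registry


def ground_entities(image_rgb, entity_boxes, checkpoint="sam_vit_h_4b8939.pth"):
    """Return one binary mask per entity mentioned in the response.

    image_rgb:    HxWx3 uint8 RGB array.
    entity_boxes: dict like {"dog": [x0, y0, x1, y1]} produced by an upstream
                  entity tagger + detector (hypothetical format for this sketch).
    """
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image_rgb)  # compute the image embedding once

    masks = {}
    for entity, box in entity_boxes.items():
        # SAM takes a box prompt in XYXY format; keep the highest-scoring mask.
        candidates, scores, _ = predictor.predict(
            box=np.array(box), multimask_output=True
        )
        masks[entity] = candidates[int(np.argmax(scores))]
    return masks
```

In such a pipeline, the language model's output text only needs to be tagged with entity phrases; the grounding step runs independently of the LLM, which is what makes the module "off-the-shelf" in spirit.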