BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs

Abstract

Large language models (LLMs) have demonstrated remarkable abilities at interacting with humans through language, especially through the use of instruction-following data. Recent multi-modal LLMs, such as MiniGPT-4, LLaVA, and X-LLM, further extend these abilities by incorporating multi-modal inputs, including image, video, and speech. Although they are effective at generating precise and detailed language understanding of the given modality signal, these LLMs give up the ability to ground specific parts of the input and thus construct only a coarse-grained mapping. However, an explicit and informative correspondence between text and other modalities will not only improve the user experience but also help expand the application scenarios of multi-modal LLMs. We therefore propose BuboGPT, a multi-modal LLM with visual grounding that can perform cross-modal interaction between vision, audio, and language, providing fine-grained understanding of visual objects and other given modalities. As a result, BuboGPT can point out the specific location of an object in the image while generating a response or description for that object. Our contributions are two-fold: 1) an off-the-shelf visual grounding module based on SAM that extracts entities in a sentence and finds the corresponding masks in the image; 2) a two-stage training scheme and instruction dataset that endow the model with joint text-image-audio understanding. Our experiments show that BuboGPT achieves impressive multi-modal understanding and visual grounding abilities during interaction with humans, and that it performs consistently well when provided with arbitrary modality combinations (either aligned or unaligned). Our code, model, and dataset are available at https://bubo-gpt.github.io.
