LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

Abstract

With the recent significant advances in large multimodal models (LMMs), the importance of their grounding capability in visual chat is increasingly recognized. Despite recent efforts to enable LMMs to support grounding, their grounding and chat capabilities are usually separate, and their chat performance drops dramatically when they are asked to ground. The underlying problem is the lack of a dataset for grounded visual chat (GVC); existing grounding datasets contain only short captions. To address this issue, we have created GVC data that allows grounding and chat capabilities to be combined. To better evaluate GVC capabilities, we have introduced a benchmark called Grounding-Bench. Additionally, we have proposed a model design that supports GVC and various types of visual prompts by connecting segmentation models with language models. Experimental results demonstrate that our model outperforms other LMMs on Grounding-Bench. Furthermore, our model achieves competitive performance on classic grounding benchmarks such as RefCOCO/+/g and Flickr30K Entities. Our code will be released at https://github.com/UX-Decoder/LLaVA-Grounding.
