List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs

Abstract

Set-of-Mark (SoM) Prompting unleashes the visual grounding capability of GPT-4V by enabling the model to associate visual objects with tags inserted on the image. These tags, marked with alphanumerics, can be indexed via text tokens for easy reference. Despite the extraordinary performance of GPT-4V, we observe that other Multimodal Large Language Models (MLLMs) struggle to understand these visual tags. To promote the learning of SoM prompting for open-source models, we propose a new learning paradigm, "list items one by one," which asks the model to enumerate and describe all visual tags placed on the image, following the alphanumeric order of the tags. By integrating our curated dataset with other visual instruction tuning datasets, we are able to equip existing MLLMs with the SoM prompting ability. Furthermore, we evaluate our finetuned SoM models on five MLLM benchmarks. We find that this new dataset, even at a relatively small size (10k-30k images with tags), significantly enhances visual reasoning capabilities and reduces hallucinations for MLLMs. Perhaps surprisingly, these improvements persist even when the visual tags are omitted from input images during inference. This suggests the potential of "list items one by one" as a new paradigm for training MLLMs, one that strengthens object-text alignment through the use of visual tags in the training stage. Finally, we conduct analyses by probing trained models to understand the working mechanism of SoM. Our code and data are available at https://github.com/zzxslp/SoM-LLaVA.
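
To make the "list items one by one" target concrete, the following is a minimal sketch of how a listing response could be assembled from per-tag descriptions. It is not taken from the released code: the function names, the "tag. description" line format, and the ordering rule (numeric tags by value, then alphabetic tags) are all assumptions made for illustration.

```python
def tag_sort_key(tag: str):
    # Assumed ordering: numeric tags sorted by integer value,
    # followed by non-numeric tags in plain string order.
    if tag.isdigit():
        return (0, int(tag), "")
    return (1, 0, tag)


def build_listing_target(tagged_items: dict[str, str]) -> str:
    # tagged_items maps a visual tag (e.g. "1", "2", "a") to a short
    # description of the object it marks. The output format is hypothetical.
    ordered_tags = sorted(tagged_items, key=tag_sort_key)
    return "\n".join(f"{tag}. {tagged_items[tag]}" for tag in ordered_tags)


if __name__ == "__main__":
    example = {"10": "a red backpack", "2": "a wooden chair", "1": "a laptop"}
    print(build_listing_target(example))
```

Sorting numerically rather than lexicographically keeps tag "10" after tag "2", which is one plausible reading of "alphanumeric order" when an image carries more than nine tags.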
