Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning

Abstract

Large Multimodal Models (LMMs) have made significant breakthroughs with the advancement of instruction tuning. However, while existing models can understand images and videos at a holistic level, they still struggle with instance-level understanding, which requires more nuanced comprehension and alignment. Instance-level understanding is crucial because it focuses on the specific elements we are most interested in. Encouragingly, existing work finds that state-of-the-art LMMs exhibit strong instance understanding capabilities when provided with explicit visual cues. Motivated by this, we introduce an automated annotation pipeline, assisted by GPT-4o, that extracts instance-level information from images and videos through explicit visual prompting for instance guidance. Building on this pipeline, we propose Inst-IT, a solution to enhance LMMs in Instance understanding via explicit visual prompt Instruction Tuning. Inst-IT consists of a benchmark to diagnose multimodal instance-level understanding, a large-scale instruction-tuning dataset, and a continuous instruction-tuning training paradigm that effectively enhances the spatial-temporal instance understanding capabilities of existing LMMs. Experimental results show that, with the boost of Inst-IT, our models not only achieve outstanding performance on Inst-IT Bench but also demonstrate significant improvements across various generic image and video understanding benchmarks. This highlights that our dataset not only boosts instance-level understanding but also strengthens the overall capabilities of generic image and video comprehension.
