OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

Abstract

Current universal segmentation methods demonstrate strong capabilities in pixel-level image and video understanding. However, they lack reasoning abilities and cannot be controlled via text instructions. In contrast, large vision-language multimodal models exhibit powerful vision-based conversation and reasoning capabilities, but they lack pixel-level understanding and struggle to accept visual prompts for flexible user interaction. This paper proposes OMG-LLaVA, a new and elegant framework that combines powerful pixel-level vision understanding with reasoning abilities. It accepts various visual and text prompts for flexible user interaction. Specifically, we use a universal segmentation method as the visual encoder, integrating image information, perception priors, and visual prompts into visual tokens provided to the LLM. The LLM is responsible for understanding the user's text instructions and for providing text responses and pixel-level segmentation results based on the visual information. We propose perception prior embedding to better integrate perception priors with image features. OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model, matching or surpassing the performance of specialized methods on multiple benchmarks. Rather than using an LLM to connect separate specialist models, our work aims at end-to-end training of one encoder, one decoder, and one LLM. The code and model have been released for further research.
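
The abstract names perception prior embedding without defining it. As a rough illustration only, the sketch below assumes one plausible reading: each pixel receives a softmax-weighted mix of per-object query vectors, with weights taken from the segmentation mask scores, and that mix is added to the pixel's image feature. The function name, the tensor shapes, the softmax normalization, and the additive fusion are all assumptions made for this example, not details stated in the abstract.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def embed_perception_prior(pixel_features, object_queries, mask_logits):
    """Fuse per-object priors into per-pixel features (hypothetical sketch).

    pixel_features: list of P vectors (one per pixel), each of length C
    object_queries: list of N vectors (one per object), each of length C
    mask_logits:    list of N lists of length P; mask_logits[n][p] scores
                    how strongly object n claims pixel p
    """
    fused = []
    for p, feature in enumerate(pixel_features):
        # Assumption: normalize the object scores at this pixel with a softmax.
        weights = softmax([mask[p] for mask in mask_logits])
        # Weighted mix of the object query vectors for this pixel.
        prior = [
            sum(w * query[c] for w, query in zip(weights, object_queries))
            for c in range(len(feature))
        ]
        # Assumption: combine the prior and the image feature by addition.
        fused.append([f + q for f, q in zip(feature, prior)])
    return fused


# Toy input: two pixels, two objects, two channels.
pixel_features = [[0.0, 0.0], [1.0, 1.0]]
object_queries = [[1.0, 0.0], [0.0, 1.0]]
mask_logits = [[10.0, -10.0], [-10.0, 10.0]]  # object 0 owns pixel 0, object 1 owns pixel 1
tokens = embed_perception_prior(pixel_features, object_queries, mask_logits)
```

In this toy input each pixel is claimed almost entirely by one object, so each fused token is close to the pixel feature plus that object's query vector. In a real system the resulting tokens would still need to be projected into the LLM's embedding space, which this sketch leaves out.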
