UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning

Abstract

Recent advances in Large Multi-modal Models (LMMs) have demonstrated their remarkable success as general-purpose multi-modal assistants, with a particular focus on holistic image- and video-language understanding. By contrast, less attention has been given to scaling fine-grained pixel-level understanding capabilities, where models are expected to achieve pixel-level alignment between visual signals and language semantics. Some previous studies have applied LMMs to related tasks such as region-level captioning and referring expression segmentation. However, these models are limited to performing either referring or segmentation tasks independently and fail to integrate these fine-grained perception capabilities into visual reasoning. To bridge this gap, we propose UniPixel, a large multi-modal model capable of flexibly comprehending visual prompt inputs and generating mask-grounded responses. Our model distinguishes itself by seamlessly integrating pixel-level perception with general visual understanding capabilities. Specifically, UniPixel processes visual prompts and generates relevant masks on demand, then performs subsequent reasoning conditioned on these intermediate pointers during inference, thereby enabling fine-grained pixel-level reasoning. The effectiveness of our approach has been verified on 10 benchmarks across a diverse set of tasks, including pixel-level referring/segmentation and object-centric understanding in images/videos. A novel PixelQA task that jointly requires referring, segmentation, and question answering is also designed to verify the flexibility of our method.
