Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding

Abstract

Multimodal Large Language Models (MLLMs) achieve remarkable performance on fine-grained, pixel-level understanding tasks. However, existing works all rely heavily on extra components, such as a vision encoder (CLIP) and segmentation experts, which increases system complexity and limits model scaling. In this work, our goal is to explore a highly simplified MLLM that introduces no extra components. Our work is motivated by recent work on the Single trAnsformer as a unified vIsion-Language Model (SAIL) design, which jointly learns vision tokens and text tokens within a single transformer. We present Pixel-SAIL, a single transformer for pixel-wise MLLM tasks. In particular, we present three technical improvements over the plain baseline. First, we design a learnable upsampling module to refine visual token features. Second, we propose a novel visual prompt injection strategy that enables the single transformer to understand visual prompt inputs and to benefit from the early fusion of visual prompt embeddings and vision tokens. Third, we introduce a vision expert distillation strategy to efficiently enhance the single transformer's fine-grained feature extraction capability. In addition, we have collected a comprehensive, manually checked pixel understanding benchmark (PerBench). It includes three tasks: detailed object description, visual prompt-based question answering, and visual-text referring segmentation. Extensive experiments on four referring segmentation benchmarks, one visual prompt benchmark, and our PerBench show that Pixel-SAIL achieves comparable or even better results with a much simpler pipeline. Code and model will be released at https://github.com/magic-research/Sa2VA.
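
The abstract does not spell out how visual prompt embeddings are fused with vision tokens, so the following is only a minimal sketch of one plausible reading of "early fusion": a prompt embedding is added to the vision tokens that fall inside the prompted region before any transformer layer sees them. The function name `inject_visual_prompt`, the per-token binary mask, and the additive fusion rule are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List

Vector = List[float]


def inject_visual_prompt(
    vision_tokens: List[Vector],
    prompt_mask: List[int],
    prompt_embedding: Vector,
) -> List[Vector]:
    """Add a prompt embedding to every vision token covered by the prompt.

    Hypothetical helper: additive fusion and a per-token 0/1 mask are
    assumptions for illustration; the abstract does not specify them.
    """
    if len(vision_tokens) != len(prompt_mask):
        raise ValueError("one mask entry is required per vision token")

    fused: List[Vector] = []
    for token, inside in zip(vision_tokens, prompt_mask):
        if len(token) != len(prompt_embedding):
            raise ValueError("token and prompt embedding sizes differ")
        if inside:
            # Token lies inside the prompted region: fuse before the transformer.
            fused.append([t + p for t, p in zip(token, prompt_embedding)])
        else:
            # Token lies outside the prompted region: pass through unchanged.
            fused.append(list(token))
    return fused


# Four 2-dimensional vision tokens; the prompt covers the middle two.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
mask = [0, 1, 1, 0]
prompt = [0.5, -0.5]
fused = inject_visual_prompt(tokens, mask, prompt)
```

In this sketch the fused tokens would simply replace the original vision tokens in the input sequence, so the transformer receives the prompt information from its first layer onward rather than through a separate prompt encoder.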
