Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs

Abstract

Multimodal large language models (MLLMs) have advanced rapidly in recent years. However, existing approaches to vision tasks often rely on indirect representations, such as generating coordinates as text for detection, which limits performance and prevents dense prediction tasks such as segmentation. To overcome these challenges, we introduce Patch-as-Decodable Token (PaDT), a unified paradigm that enables MLLMs to directly generate both textual and diverse visual outputs. Central to PaDT are Visual Reference Tokens (VRTs), which are derived from the visual patch embeddings of query images and interleaved seamlessly with the LLM's output textual tokens. A lightweight decoder then transforms the LLM's outputs into detection, segmentation, and grounding predictions. Unlike prior methods, PaDT processes VRTs independently at each forward pass and dynamically expands the embedding table, thus improving localization and differentiation among similar objects. We further tailor a training strategy for PaDT by randomly selecting VRTs for supervised fine-tuning and introducing a robust per-token cross-entropy loss. Our empirical studies across four visual perception and understanding tasks suggest that PaDT consistently achieves state-of-the-art performance, even compared with significantly larger MLLMs. The code is available at https://github.com/Gorilla-Lab-SCUT/PaDT.
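
The idea of dynamically expanding the embedding table with patch-derived tokens can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`expand_embedding_table`, `next_token_distribution`, `decode_token`), the dot-product scoring, the unit normalization, and the toy sizes are all assumptions made for illustration. The abstract only states that VRTs come from the query image's patch embeddings and are added to the table at each forward pass.

```python
import math
import random


def unit(vec):
    """Scale a vector to unit length so dot products act as cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def expand_embedding_table(text_table, patch_embeddings):
    """Build a per-image table: text rows first, then one row per visual patch."""
    return [row[:] for row in text_table] + [row[:] for row in patch_embeddings]


def next_token_distribution(hidden, table):
    """Softmax over dot-product scores between a hidden state and every row."""
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in table]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode_token(token_id, vocab_size, grid_width):
    """Map a token id to ('text', id) or ('patch', (row, col)) on the patch grid."""
    if token_id < vocab_size:
        return ("text", token_id)
    return ("patch", divmod(token_id - vocab_size, grid_width))


# Toy sizes (hypothetical): 5 text tokens, a 2 x 3 grid of patches, 8-dim embeddings.
rng = random.Random(0)
dim, vocab_size, grid_h, grid_w = 8, 5, 2, 3
text_table = [unit([rng.gauss(0, 1) for _ in range(dim)]) for _ in range(vocab_size)]
patches = [unit([rng.gauss(0, 1) for _ in range(dim)]) for _ in range(grid_h * grid_w)]

# The table is rebuilt for each image, so patch tokens always refer to this image.
table = expand_embedding_table(text_table, patches)

# Stand-in for an LLM hidden state that points at patch 4.
hidden = patches[4][:]
probs = next_token_distribution(hidden, table)
best = max(range(len(probs)), key=probs.__getitem__)
print(decode_token(best, vocab_size, grid_w))  # ('patch', (1, 1))
```

Because the patch rows are appended after the text vocabulary, any predicted id at or above `vocab_size` can be read back as a location on the image grid. In the paper's setting, a lightweight decoder would then turn such tokens into boxes or masks.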
