TokenPacker: Efficient Visual Projector for Multimodal LLM

Abstract

The visual projector serves as an essential bridge between the visual encoder and the Large Language Model (LLM) in a Multimodal LLM (MLLM). Typically, MLLMs adopt a simple MLP that preserves all visual contexts via a one-to-one transformation. However, the visual tokens are redundant and grow considerably in number when dealing with high-resolution images, which significantly impairs the efficiency of MLLMs. Some recent works have introduced a resampler or an abstractor to reduce the number of resulting visual tokens. Unfortunately, these approaches fail to capture finer details and undermine the visual reasoning capabilities of MLLMs. In this work, we propose a novel visual projector that adopts a coarse-to-fine scheme to inject enriched characteristics into the condensed visual tokens it generates. Specifically, we first interpolate the visual features into a low-resolution point query, which provides the overall visual representation as the foundation. Then, we introduce a region-to-point injection module that uses high-resolution, multi-level region-based cues as fine-grained reference keys and values, allowing them to be fully absorbed within the corresponding local context region. This step effectively updates the coarse point query, transforming it into an enriched one for the subsequent LLM reasoning. Extensive experiments demonstrate that our approach compresses the visual tokens by 75% to 89% while achieving comparable or even better performance across diverse benchmarks with significantly higher efficiency. The source code is available at https://github.com/CircleRadon/TokenPacker.
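The coarse-to-fine scheme described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; it only shows the general shape of the idea under several simplifying assumptions: average pooling stands in for the interpolation step, there are no learned query/key/value projections, only a single feature level is used instead of multi-level cues, and the coarse query is updated with a plain residual addition. The function name `region_to_point_injection` and the 24x24 input grid are hypothetical choices made for illustration.

```python
import numpy as np


def region_to_point_injection(features: np.ndarray, scale: int) -> np.ndarray:
    """Condense an (H, W, D) feature grid into (H//scale * W//scale, D) tokens.

    Illustrative only: no learned weights, single feature level (assumptions).
    """
    H, W, D = features.shape
    if H % scale or W % scale:
        raise ValueError("grid size must be divisible by scale")
    h, w = H // scale, W // scale

    # Split the grid into h*w local regions, each holding scale*scale features.
    regions = (
        features.reshape(h, scale, w, scale, D)
        .transpose(0, 2, 1, 3, 4)
        .reshape(h * w, scale * scale, D)
    )

    # Coarse point queries: average pooling as a stand-in for interpolation.
    queries = regions.mean(axis=1)  # (h*w, D)

    # Local cross-attention: each query attends only to its own region,
    # whose features act as both keys and values.
    scores = np.einsum("nd,nkd->nk", queries, regions) / np.sqrt(D)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    injected = np.einsum("nk,nkd->nd", weights, regions)

    # Residual update of the coarse queries (assumed form of the update).
    return queries + injected


if __name__ == "__main__":
    grid = np.random.default_rng(0).standard_normal((24, 24, 8))
    for s in (2, 3):
        tokens = region_to_point_injection(grid, s)
        reduction = 1 - tokens.shape[0] / (grid.shape[0] * grid.shape[1])
        print(s, tokens.shape, f"{reduction:.1%}")
```

Assuming a 24x24 grid (576 tokens), a scale of 2 yields 144 tokens, a 75% reduction, and a scale of 3 yields 64 tokens, roughly an 89% reduction, which is consistent with the compression range quoted in the abstract. Restricting each query to its own region keeps the attention cost proportional to the number of input features rather than quadratic in it.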
