HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments

Abstract

High-resolution Vision-Language Models (VLMs) are widely used in multimodal tasks because preserving detailed image information improves accuracy. However, these models often generate an excessive number of visual tokens because they encode multiple partitions of the input image. Processing these tokens is computationally expensive, especially in resource-constrained environments with commodity GPUs. To support high-resolution images while meeting resource constraints, we propose High-Resolution Early Dropping (HiRED), a token-dropping scheme that operates within a fixed token budget before the Large Language Model (LLM) stage. HiRED can be integrated with existing high-resolution VLMs in a plug-and-play manner, as it requires no additional training while still maintaining superior accuracy. We strategically use the vision encoder's attention in the initial layers to assess the visual content of each image partition and allocate the token budget accordingly. Then, using the attention in the final layer, we select the most important visual tokens from each partition within the allocated budget and drop the rest. Empirically, when applied to LLaVA-Next-7B on an NVIDIA Tesla P40 GPU, HiRED with a 20% token budget increases token generation throughput by 4.7×, reduces first-token generation latency by 15 seconds, and saves 2.3 GB of GPU memory for a single inference.
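The two-stage idea described in the abstract (allocate a per-partition budget, then keep the top tokens within it) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each partition's visual content has already been reduced to one scalar score from early-layer attention, and that each token has one importance value from final-layer attention. The function names, the proportional split, and the largest-remainder rounding are all hypothetical choices made for illustration.

```python
from typing import List


def allocate_budget(content_scores: List[float], total_budget: int) -> List[int]:
    """Split total_budget across partitions in proportion to content_scores.

    Assumption: proportional allocation with largest-remainder rounding.
    The abstract only says the budget is allocated "accordingly".
    """
    if not content_scores:
        return []
    total = sum(content_scores)
    if total <= 0:
        # No usable signal: fall back to an even split (assumption).
        shares = [total_budget / len(content_scores)] * len(content_scores)
    else:
        shares = [total_budget * s / total for s in content_scores]
    budgets = [int(x) for x in shares]  # floor of each share
    leftover = total_budget - sum(budgets)
    # Give the remaining tokens to the partitions with the largest fractions.
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - budgets[i], reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


def select_tokens(importance: List[float], budget: int) -> List[int]:
    """Return indices of the `budget` most important tokens, in original order."""
    k = min(budget, len(importance))
    top = sorted(range(len(importance)),
                 key=lambda i: importance[i], reverse=True)[:k]
    return sorted(top)


def drop_tokens(early_scores: List[float],
                final_importance: List[List[float]],
                budget_ratio: float) -> List[List[int]]:
    """Keep roughly budget_ratio of all visual tokens; return kept indices per partition."""
    total_tokens = sum(len(p) for p in final_importance)
    total_budget = round(total_tokens * budget_ratio)
    budgets = allocate_budget(early_scores, total_budget)
    return [select_tokens(p, b) for p, b in zip(final_importance, budgets)]


# Toy example: 3 partitions of 5 tokens each, 20% budget -> 3 tokens kept in total.
early_scores = [6.0, 3.0, 1.0]           # hypothetical per-partition content scores
final_importance = [
    [0.1, 0.5, 0.2, 0.9, 0.3],
    [0.4, 0.1, 0.8, 0.2, 0.3],
    [0.2, 0.2, 0.2, 0.2, 0.2],
]
kept = drop_tokens(early_scores, final_importance, budget_ratio=0.2)
print(kept)  # [[1, 3], [2], []]
```

In this toy run, the partition with the highest content score receives two of the three available slots and the lowest-scoring partition receives none. In a real VLM the kept indices would be used to gather visual token embeddings before they are passed to the LLM stage. How the scores are derived from attention maps is left out here because the abstract does not specify it.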
