Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features

Abstract

Visual token reduction lowers the inference cost caused by extensive image features in large vision-language models (LVLMs). Unlike related studies that prune tokens in self-attention-only LVLMs, our work uniquely addresses cross-attention-based models, which achieve superior performance. We identify that the key-value (KV) cache size for image tokens in cross-attention layers significantly exceeds that of text tokens in self-attention layers, posing a major compute bottleneck. To mitigate this issue, we exploit the sparsity of cross-attention maps to selectively prune redundant visual features. Our Trimmed Llama effectively reduces KV cache demands without requiring additional training. With visual features reduced by 50%, our model can reduce inference latency and memory usage while achieving benchmark parity.
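The pruning idea can be illustrated with a minimal sketch. The abstract does not state the selection criterion, so everything below is an assumption made for illustration: the function name `trim_visual_features` is hypothetical, importance is taken to be the total cross-attention weight an image token receives across all heads and text queries, and the top half of the tokens by that score is kept. It is not the authors' implementation.

```python
def trim_visual_features(attn, keys, values, keep_ratio=0.5):
    """Keep the most-attended image tokens and drop the rest.

    attn:   cross-attention weights as nested lists, indexed
            [head][text_token][image_token].
    keys, values: per-image-token cached entries (one item per token).
    keep_ratio:   fraction of image tokens to retain (assumed 0.5 here).
    """
    num_image = len(keys)

    # Assumed importance score: attention mass each image token
    # receives, summed over every head and every text query.
    scores = [0.0] * num_image
    for head in attn:
        for query_row in head:
            for j, weight in enumerate(query_row):
                scores[j] += weight

    num_keep = max(1, int(num_image * keep_ratio))

    # Pick the highest-scoring tokens, then restore original order so
    # the surviving features keep their relative positions.
    top = sorted(range(num_image), key=lambda j: scores[j], reverse=True)
    kept = sorted(top[:num_keep])

    return kept, [keys[j] for j in kept], [values[j] for j in kept]


# Toy example: 1 head, 2 text tokens, 4 image tokens.
attn = [[[0.70, 0.10, 0.15, 0.05],
         [0.60, 0.05, 0.30, 0.05]]]
keys = ["k0", "k1", "k2", "k3"]
values = ["v0", "v1", "v2", "v3"]

kept, trimmed_keys, trimmed_values = trim_visual_features(attn, keys, values)
print(kept)          # indices of the image tokens that survive
print(trimmed_keys)  # KV cache entries now cover half the image tokens
```

In this toy case the cross-attention KV cache shrinks from four entries to two, which is the mechanism by which a 50% reduction in visual features would lower memory use. Because the sketch only reads existing attention weights, it involves no additional training.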
