When Less is Enough: Adaptive Token Reduction for Efficient Image Representation

Abstract

Vision encoders typically generate a large number of visual tokens, providing information-rich representations but significantly increasing computational demands. This raises the question of whether all generated tokens are equally valuable, or whether some of them can be discarded to reduce computational costs without compromising quality. In this paper, we introduce a new method for determining feature utility, based on the idea that less valuable features can be reconstructed from more valuable ones. We implement this concept by integrating an autoencoder with a Gumbel-Softmax selection mechanism, which makes it possible to identify and retain only the most informative visual tokens. To validate our approach, we compared the performance of the LLaVA-NeXT model when using features selected by our method against its performance when using randomly selected features. We found that on OCR-based tasks, more than 50% of the visual context can be removed with minimal performance loss, whereas randomly discarding the same proportion of features significantly degrades the model's capabilities. Furthermore, on general-domain tasks, even randomly retaining only 30% of the tokens achieves performance comparable to using the full set of visual tokens. Our results highlight a promising direction towards adaptive and efficient multimodal pruning that facilitates scalable, low-overhead inference without compromising performance.
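The selection step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each token has a single "keep" logit (the "drop" logit is fixed at 0), it omits the autoencoder and its reconstruction loss entirely, and the names `gumbel_softmax_keep_probs` and `select_tokens` are hypothetical. It only shows how Gumbel noise and a temperature turn per-token logits into a relaxed keep probability, which is then thresholded into a hard mask.

```python
import math
import random


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def _gumbel(rng):
    # Gumbel(0, 1) sample: -log(-log(U)), with U kept away from 0 and 1.
    u = rng.uniform(1e-9, 1.0 - 1e-9)
    return -math.log(-math.log(u))


def gumbel_softmax_keep_probs(keep_logits, temperature, rng):
    """Relaxed 'keep' probability per token (hypothetical helper).

    With two classes (keep, drop) and the drop logit fixed at 0, the
    Gumbel-Softmax reduces to a sigmoid of the noisy logit gap.
    """
    probs = []
    for logit in keep_logits:
        gap = logit + _gumbel(rng) - _gumbel(rng)
        probs.append(_sigmoid(gap / temperature))
    return probs


def select_tokens(tokens, keep_logits, temperature=0.5, seed=0):
    """Return the kept tokens and the hard 0/1 mask (hypothetical helper)."""
    rng = random.Random(seed)
    probs = gumbel_softmax_keep_probs(keep_logits, temperature, rng)
    # Hard decision; during training a straight-through estimator would
    # typically pass gradients through the soft probabilities instead.
    mask = [1 if p > 0.5 else 0 for p in probs]
    kept = [tok for tok, m in zip(tokens, mask) if m]
    return kept, mask


# Illustrative logits only: strongly positive means "keep".
tokens = [f"tok{i}" for i in range(6)]
logits = [100.0, -100.0, 100.0, -100.0, 100.0, -100.0]
kept, mask = select_tokens(tokens, logits)
print(kept, mask)
```

With logits this extreme the noise cannot flip a decision, so the mask simply follows the sign of each logit. With moderate logits the mask becomes stochastic, and lowering the temperature pushes the relaxed probabilities closer to 0 or 1.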

