VisionZip: Longer is Better but Not Necessary in Vision Language Models

Abstract

Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. The proposed VisionZip can be widely applied to image and video understanding tasks and is well-suited for multi-turn dialogues in real-world scenarios, where previous methods tend to underperform. Experimental results show that VisionZip outperforms the previous state-of-the-art method by at least 5% across nearly all settings. Moreover, our method significantly enhances model inference speed, speeding up prefilling by 8x and enabling the LLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while achieving better results. Furthermore, we analyze the causes of this redundancy and encourage the community to focus on extracting better visual features rather than merely increasing token length. Our code is available at https://github.com/dvlab-research/VisionZip.
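
As a rough illustration of the token-selection idea, the following Python sketch keeps only the highest-scoring visual tokens before they would be passed to the language model. The abstract does not say how informativeness is measured, so the per-token `scores`, the function name `select_informative_tokens`, and the `keep` budget are all assumptions made for illustration; this is not VisionZip's actual implementation, which is available in the linked repository.

```python
from typing import List, Sequence


def select_informative_tokens(scores: Sequence[float], keep: int) -> List[int]:
    """Return the indices of the `keep` highest-scoring tokens.

    `scores` is a hypothetical per-token importance score (one value per
    visual token); how such a score is obtained is NOT specified in the
    abstract and is assumed to be given here.
    The returned indices are sorted so the kept tokens stay in their
    original order.
    """
    if not 0 <= keep <= len(scores):
        raise ValueError("keep must be between 0 and the number of tokens")
    # Rank token indices from highest to lowest score (ties keep input order).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the top `keep` indices and restore the original token order.
    return sorted(ranked[:keep])


# Toy example: six visual tokens with made-up importance scores.
toy_scores = [0.05, 0.40, 0.10, 0.30, 0.05, 0.10]
kept_indices = select_informative_tokens(toy_scores, keep=2)
print(kept_indices)  # [1, 3]
```

In this toy case the sequence handed to the language model shrinks from six visual tokens to two. Reducing the number of visual tokens in this way is the kind of saving the abstract credits for the faster prefilling.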
