PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

Abstract

In large vision-language models (LVLMs), images serve as inputs that carry a wealth of information. As the idiom "a picture is worth a thousand words" suggests, representing a single image in current LVLMs can require hundreds or even thousands of tokens. The resulting computational cost grows quadratically as input image resolution increases, severely impacting the efficiency of both training and inference. Previous approaches have attempted to reduce the number of image tokens either before or within the early layers of LVLMs. However, these strategies inevitably lose crucial image information, which ultimately diminishes model performance. To address this challenge, we conduct an empirical study showing that all visual tokens are necessary for LVLMs in the shallow layers, and that token redundancy progressively increases in the deeper layers of the model. Based on this observation, we propose PyramidDrop, a visual redundancy reduction strategy that boosts the efficiency of LVLMs in both training and inference with negligible performance loss. Specifically, we partition the LVLM into several stages and drop part of the image tokens at the end of each stage according to a pre-defined ratio, so that the visual tokens form a pyramid across model layers. The dropping is based on a lightweight similarity calculation with negligible time overhead. Extensive experiments demonstrate that PyramidDrop accelerates LLaVA-NeXT by 40% in training time and 55% in inference FLOPs with comparable performance. PyramidDrop can also serve as a plug-and-play strategy for inference acceleration without training, with better performance and lower inference cost than its counterparts. We hope that the insights and approach introduced by PyramidDrop will inspire future research to further investigate the role of image tokens in LVLMs.
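The stage-wise dropping described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the equal-sized stages, and the numbers used (576 image tokens, 32 layers, 4 stages, keep ratio 0.5) are hypothetical. The abstract also does not say which vectors the similarity calculation compares, so the sketch assumes a plain dot product between each image token and a single query vector.

```python
def pyramid_schedule(num_image_tokens, num_layers, num_stages, keep_ratio):
    """Number of image tokens seen by each layer.

    Assumption: layers are split into equal-sized stages, and each stage
    keeps `keep_ratio` of the image tokens kept by the stage before it.
    """
    layers_per_stage = num_layers // num_stages
    counts = []
    for layer in range(num_layers):
        # Leftover layers (if num_layers is not divisible) join the last stage.
        stage = min(layer // layers_per_stage, num_stages - 1)
        counts.append(int(num_image_tokens * keep_ratio ** stage))
    return counts


def drop_tokens(image_tokens, query, keep_ratio):
    """Keep the image tokens most similar to `query`, preserving their order.

    Assumption: similarity is a dot product against one query vector; the
    source only states that a lightweight similarity calculation is used.
    """
    scores = [sum(q * t for q, t in zip(query, token)) for token in image_tokens]
    num_keep = max(1, int(len(image_tokens) * keep_ratio))
    ranked = sorted(range(len(image_tokens)), key=lambda i: scores[i], reverse=True)
    kept_indices = sorted(ranked[:num_keep])  # restore original token order
    return [image_tokens[i] for i in kept_indices]


if __name__ == "__main__":
    # Hypothetical configuration, chosen only for illustration.
    counts = pyramid_schedule(num_image_tokens=576, num_layers=32,
                              num_stages=4, keep_ratio=0.5)
    print(counts[0], counts[8], counts[16], counts[24])  # 576 288 144 72

    # Fraction of image-token positions processed versus keeping all tokens
    # in every layer. This counts tokens only; it is not a FLOPs estimate.
    print(sum(counts) / (576 * 32))  # 0.46875

    tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 3.0]]
    print(drop_tokens(tokens, query=[1.0, 0.0], keep_ratio=0.5))
    # [[1.0, 0.0], [2.0, 0.0]]
```

Under these assumed settings, the per-layer counts shrink in steps from 576 to 72, which is the pyramid shape the abstract refers to. Every token is still available in the first stage, consistent with the observation that shallow layers need all visual tokens.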

