PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

October 22, 2024
作者: Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, Dahua Lin
cs.AI

Abstract

In large vision-language models (LVLMs), images serve as inputs that carry a wealth of information. As the idiom "A picture is worth a thousand words" suggests, representing a single image in current LVLMs can require hundreds or even thousands of tokens. This results in significant computational costs, which grow quadratically as input image resolution increases, severely impacting the efficiency of both training and inference. Previous approaches have attempted to reduce the number of image tokens either before or within the early layers of LVLMs. However, these strategies inevitably lose crucial image information, ultimately diminishing model performance. To address this challenge, we conduct an empirical study which reveals that all visual tokens are necessary to LVLMs in the shallow layers, while token redundancy progressively increases in the deeper layers of the model. Motivated by this observation, we propose PyramidDrop, a visual redundancy reduction strategy for LVLMs that boosts their efficiency in both training and inference with negligible performance loss. Specifically, we partition the LVLM into several stages and drop part of the image tokens at the end of each stage with a pre-defined ratio, creating pyramid-like visual token counts across model layers. The dropping is based on a lightweight similarity calculation with negligible time overhead. Extensive experiments demonstrate that PyramidDrop reduces the training time of LLaVA-NeXT by 40% and its inference FLOPs by 55% while maintaining comparable performance. Moreover, PyramidDrop can also serve as a plug-and-play strategy for training-free inference acceleration, with better performance and lower inference cost than comparable methods. We hope that the insights and approach introduced by PyramidDrop will inspire future research into the role of image tokens in LVLMs.
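To make the stage-wise schedule concrete, below is a minimal PyTorch sketch of the dropping procedure the abstract describes. It is illustrative only: the names rank_image_tokens and pyramid_drop, and the stage_fn stand-in for a stage's transformer layers, are hypothetical rather than the authors' released API, and the paper's lightweight similarity criterion is approximated here by cosine similarity between each image token and the last instruction token.

```python
# A sketch of pyramid-style visual token dropping, assuming a
# cosine-similarity ranking against the last instruction token.
# Function names and stage_fn are hypothetical stand-ins.
import torch
import torch.nn.functional as F

def rank_image_tokens(image_hidden: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
    """Rank image tokens by relevance to the text query.

    image_hidden: (num_image_tokens, dim) hidden states of the image tokens
    query:        (dim,) hidden state of the last instruction token
    Returns token indices sorted from most to least relevant.
    """
    scores = F.cosine_similarity(image_hidden, query.unsqueeze(0), dim=-1)
    return torch.argsort(scores, descending=True)

def pyramid_drop(image_tokens: torch.Tensor, query: torch.Tensor,
                 num_stages: int = 4, keep_ratio: float = 0.5,
                 stage_fn=None) -> torch.Tensor:
    """Drop a fixed fraction of image tokens at the end of each stage,
    yielding a pyramid-shaped token count across the model's depth."""
    for _ in range(num_stages):
        if stage_fn is not None:
            # Run the transformer layers belonging to this stage (omitted here).
            image_tokens = stage_fn(image_tokens)
        order = rank_image_tokens(image_tokens, query)
        keep = max(1, int(image_tokens.shape[0] * keep_ratio))
        # Retain the highest-scoring tokens, preserving their original order.
        kept = order[:keep].sort().values
        image_tokens = image_tokens[kept]
    return image_tokens

# Toy example: 576 image tokens (a 24x24 patch grid) with hidden size 32.
tokens = torch.randn(576, 32)
query = torch.randn(32)
print(pyramid_drop(tokens, query).shape)  # 576 -> 288 -> 144 -> 72 -> 36 tokens
```

With a keep ratio of 0.5 over four stages, 576 image tokens shrink to 36 by the final stage, which is where the savings come from: the deeper, more redundant layers attend over far fewer tokens, while the ranking itself is a single similarity pass whose cost is negligible next to the attention layers it prunes.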
