When Less is Enough: Adaptive Token Reduction for Efficient Image Representation
March 20, 2025
Authors: Eduard Allakhverdov, Elizaveta Goncharova, Andrey Kuznetsov
cs.AI
Abstract
Vision encoders typically generate a large number of visual tokens, providing
information-rich representations but significantly increasing computational
demands. This raises the question of whether all generated tokens are equally
valuable or if some of them can be discarded to reduce computational costs
without compromising quality. In this paper, we introduce a new method for
determining feature utility based on the idea that less valuable features can
be reconstructed from more valuable ones. We implement this concept by
integrating an autoencoder with a Gumbel-Softmax selection mechanism that
identifies and retains only the most informative visual tokens. To
validate our approach, we compared the performance of the LLaVA-NeXT model
when using features selected by our method versus randomly selected features. We found
that on OCR-based tasks, more than 50% of the visual context can be removed
with minimal performance loss, whereas randomly discarding the same proportion
of features significantly degrades the model's capabilities. Furthermore, in
general-domain tasks, even randomly retaining only 30% of tokens achieves
performance comparable to using the full set of visual tokens. Our results
highlight a promising direction towards adaptive and efficient multimodal
pruning that facilitates scalable and low-overhead inference without
compromising performance.
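
The abstract does not include code, but the core mechanism it describes, a per-token keep/drop decision sampled via Gumbel-Softmax, trained jointly with a decoder that must reconstruct the full token set from the kept tokens, can be sketched compactly. The following is a minimal illustrative sketch in PyTorch, not the authors' implementation; the module name `GumbelTokenSelector`, the scorer/decoder sizes, and the sparsity weight `lambda_sparsity` are all assumptions made for the example.

```python
# Illustrative sketch (not the paper's code): an autoencoder whose scorer
# rates each visual token, a Gumbel-Softmax draw decides keep/drop per
# token, and a decoder must reconstruct ALL tokens from the kept ones,
# operationalizing "less valuable features can be reconstructed from
# more valuable ones."
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelTokenSelector(nn.Module):
    def __init__(self, dim: int, lambda_sparsity: float = 0.1):
        super().__init__()
        # Per-token scorer producing two logits: (drop, keep).
        self.scorer = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 2)
        )
        # Lightweight decoder that reconstructs the full token set
        # from the masked (kept-only) tokens. Depth/width are arbitrary here.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lambda_sparsity = lambda_sparsity

    def forward(self, tokens: torch.Tensor, tau: float = 1.0):
        # tokens: (batch, num_tokens, dim) features from a vision encoder.
        logits = self.scorer(tokens)                           # (B, N, 2)
        # Hard one-hot sample with straight-through gradients.
        sample = F.gumbel_softmax(logits, tau=tau, hard=True)  # (B, N, 2)
        keep_mask = sample[..., 1:2]                           # (B, N, 1)
        # Zero out dropped tokens; the decoder must recover them.
        # (Zeroing keeps tensor shapes fixed during training; at inference
        # the dropped tokens can simply be removed from the sequence.)
        recon = self.decoder(tokens * keep_mask)
        recon_loss = F.mse_loss(recon, tokens)
        sparsity_loss = keep_mask.mean()                       # fraction kept
        loss = recon_loss + self.lambda_sparsity * sparsity_loss
        return loss, keep_mask.squeeze(-1)

# Usage sketch: train the selector, then at inference keep only tokens
# whose "keep" logit wins, and pass that reduced set to the LLM.
model = GumbelTokenSelector(dim=1024)
feats = torch.randn(2, 576, 1024)   # e.g., a ViT-style visual token grid
loss, mask = model(feats)
loss.backward()
```

The straight-through Gumbel-Softmax trick is what makes the discrete keep/drop choice trainable end to end: the forward pass uses a hard binary mask, while gradients flow through the soft relaxation, letting the reconstruction loss teach the scorer which tokens carry information the others can be rebuilt from.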