When Less is Enough: Adaptive Token Reduction for Efficient Image Representation

March 20, 2025
Authors: Eduard Allakhverdov, Elizaveta Goncharova, Andrey Kuznetsov
cs.AI

Abstract

Vision encoders typically generate a large number of visual tokens, providing information-rich representations but significantly increasing computational demands. This raises the question of whether all generated tokens are equally valuable or whether some can be discarded to reduce computational cost without compromising quality. In this paper, we introduce a new method for determining feature utility based on the idea that less valuable features can be reconstructed from more valuable ones. We implement this concept by integrating an autoencoder with a Gumbel-Softmax selection mechanism that identifies and retains only the most informative visual tokens. To validate our approach, we compared the performance of the LLaVA-NeXT model when using features selected by our method versus randomly selected features. We found that on OCR-based tasks, more than 50% of the visual context can be removed with minimal performance loss, whereas randomly discarding the same proportion of features significantly degrades the model's capabilities. Furthermore, on general-domain tasks, even randomly retaining only 30% of tokens achieves performance comparable to using the full set of visual tokens. Our results highlight a promising direction towards adaptive and efficient multimodal pruning that facilitates scalable and low-overhead inference without compromising performance.
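
The abstract's core mechanism, a Gumbel-Softmax gate trained jointly with an autoencoder so that discarded tokens remain reconstructable from the retained ones, can be sketched compactly. The following is a minimal illustrative PyTorch sketch, not the authors' implementation: the module names, the two-layer Transformer decoder, the sparsity penalty, and the dimensions in the usage example are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelTokenSelector(nn.Module):
    """Illustrative sketch: a per-token keep/drop gate sampled with
    Gumbel-Softmax, plus a small decoder that must reconstruct the
    full token set from the kept tokens alone."""

    def __init__(self, dim: int, tau: float = 1.0):
        super().__init__()
        self.scorer = nn.Linear(dim, 2)  # per-token logits for (drop, keep)
        # Assumed decoder: two Transformer layers; the paper's actual
        # autoencoder architecture may differ.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.tau = tau

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, num_tokens, dim) features from the vision encoder
        logits = self.scorer(tokens)  # (B, N, 2)
        # hard=True: one-hot keep/drop decisions in the forward pass,
        # soft gradients via the straight-through estimator.
        gate = F.gumbel_softmax(logits, tau=self.tau, hard=True)[..., 1:]  # (B, N, 1)
        masked = tokens * gate        # zero out the dropped tokens
        recon = self.decoder(masked)  # reconstruct all tokens from the kept ones
        return recon, gate

def selector_loss(tokens, recon, gate, sparsity_weight=0.1):
    # Reconstruction term: dropped tokens should be recoverable from kept ones.
    # Sparsity term: penalize the fraction of tokens retained.
    return F.mse_loss(recon, tokens) + sparsity_weight * gate.mean()

# Usage with dimensions typical of a LLaVA-style visual context (assumed):
selector = GumbelTokenSelector(dim=1024)
feats = torch.randn(2, 576, 1024)  # e.g. a 24x24 grid of visual tokens
recon, gate = selector(feats)
selector_loss(feats, recon, gate).backward()
```

At inference time, the gate would act as a hard mask: tokens scored as redundant are dropped before the visual context reaches the language model, which is where the savings reported in the abstract (e.g. removing over 50% of tokens on OCR tasks) would come from.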
