
MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs

August 25, 2025
Authors: Sixun Dong, Juhua Hu, Mian Zhang, Ming Yin, Yanjie Fu, Qi Qian
cs.AI

Abstract

Vision-Language Models (VLMs) demonstrate impressive performance in understanding visual content under language instructions by converting visual input into vision tokens. However, redundancy among vision tokens degrades the inference efficiency of VLMs. While many algorithms have been proposed to reduce the number of vision tokens, most rely on unimodal information alone (i.e., vision or text) for pruning and ignore the inherently multimodal nature of vision-language tasks. Moreover, a generic criterion that applies across modalities is lacking. To mitigate this limitation, we propose to leverage both vision and text tokens to select informative vision tokens under a coverage criterion. We first formulate the subset selection problem as a maximum coverage problem. A subset of vision tokens is then optimized to simultaneously cover the text tokens and the original set of vision tokens. Finally, a VLM agent can be adopted to further improve the quality of the text tokens that guide vision pruning. The proposed method, MMTok, is extensively evaluated on benchmark datasets with different VLMs. The comparison shows that vision and text information are complementary, and that combining multimodal information surpasses unimodal baselines by a clear margin. Moreover, under the maximum coverage criterion on the POPE dataset, our method achieves a 1.87x speedup while maintaining 98.7% of the original performance of LLaVA-NeXT-13B. Furthermore, with only four vision tokens, it still preserves 87.7% of the original performance of LLaVA-1.5-7B. These results highlight the effectiveness of coverage in token selection.
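
The abstract does not spell out the selection procedure, but the maximum coverage objective it names admits a standard greedy approximation. Below is a minimal sketch of such a greedy selection, assuming cosine similarity as the coverage measure and a weighted mix of text coverage and vision self-coverage; the function name select_tokens, the weight alpha, and the similarity choice are illustrative assumptions, not the authors' implementation.

```python
# Minimal greedy maximum-coverage sketch for vision-token selection.
# Assumptions (not from the paper): cosine similarity as the coverage score,
# a fixed weight `alpha` mixing text coverage and vision self-coverage,
# and a plain greedy approximation of the maximum coverage objective.
import torch
import torch.nn.functional as F


def select_tokens(vision_tokens: torch.Tensor,
                  text_tokens: torch.Tensor,
                  k: int,
                  alpha: float = 0.5):
    """Greedily pick k vision tokens that cover both the text tokens
    and the full vision-token set.

    vision_tokens: (N, d) vision token embeddings
    text_tokens:   (M, d) text token embeddings
    k:             number of vision tokens to keep
    alpha:         weight between text coverage and vision self-coverage
    """
    v = F.normalize(vision_tokens, dim=-1)
    t = F.normalize(text_tokens, dim=-1)

    sim_vt = v @ t.T  # (N, M): how well each candidate covers each text token
    sim_vv = v @ v.T  # (N, N): how well each candidate covers each vision token

    # Best coverage of every text / vision token by the selected subset so far.
    cov_t = torch.zeros(t.size(0), device=v.device)
    cov_v = torch.zeros(v.size(0), device=v.device)
    selected = []

    for _ in range(min(k, v.size(0))):
        # Marginal gain of each candidate: how much it would raise the
        # per-element coverage (max similarity to the selected set).
        gain_t = (sim_vt - cov_t).clamp(min=0).sum(dim=1)
        gain_v = (sim_vv - cov_v).clamp(min=0).sum(dim=1)
        gain = alpha * gain_t + (1.0 - alpha) * gain_v
        if selected:
            gain[selected] = float("-inf")  # do not pick the same token twice

        best = int(gain.argmax())
        selected.append(best)
        cov_t = torch.maximum(cov_t, sim_vt[best])
        cov_v = torch.maximum(cov_v, sim_vv[best])

    return vision_tokens[selected], selected


# Example with random embeddings: keep 4 of 576 vision tokens.
if __name__ == "__main__":
    kept, idx = select_tokens(torch.randn(576, 1024), torch.randn(32, 1024), k=4)
    print(kept.shape, idx)
```

A facility-location-style coverage objective of this form is, for non-negative similarities, monotone submodular, so a greedy loop like the one above carries the standard (1 - 1/e) approximation guarantee; this is one reason a coverage formulation is attractive for token pruning.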