MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs
August 25, 2025
Authors: Sixun Dong, Juhua Hu, Mian Zhang, Ming Yin, Yanjie Fu, Qi Qian
cs.AI
Abstract
Vision-Language Models (VLMs) demonstrate impressive performance in understanding visual content with language instructions by converting visual inputs into vision tokens. However, redundancy among vision tokens degrades the inference efficiency of VLMs. While many algorithms have been proposed to reduce the number of vision tokens, most of them rely on unimodal information only (i.e., vision or text) for pruning and ignore the inherently multimodal nature of vision-language tasks. Moreover, a generic criterion that can be applied across different modalities is lacking. To mitigate this limitation, in this work we propose to leverage both vision and text tokens to select informative vision tokens under a coverage criterion. We first formulate the subset selection problem as a maximum coverage problem. A subset of vision tokens is then optimized to simultaneously cover the text tokens and the original set of vision tokens. Finally, a VLM agent can be adopted to further improve the quality of the text tokens that guide vision pruning. The proposed method, MMTok, is extensively evaluated on benchmark datasets with different VLMs. The comparison shows that vision and text information are complementary, and that combining multimodal information surpasses unimodal baselines by a clear margin. Moreover, under the maximum coverage criterion on the POPE dataset, our method achieves a 1.87x speedup while maintaining 98.7% of the original performance with LLaVA-NeXT-13B. Furthermore, with only four vision tokens, it still preserves 87.7% of the original performance with LLaVA-1.5-7B. These results highlight the effectiveness of the coverage criterion for token selection.