Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods
October 8, 2025
Authors: Chenfei Liao, Wensong Wang, Zichen Wen, Xu Zheng, Yiyu Wang, Haocong He, Yuanhuiyi Lyu, Lutao Jiang, Xin Zou, Yuqian Fu, Bin Ren, Linfeng Zhang, Xuming Hu
cs.AI
Abstract
Recent endeavors to accelerate inference in Multimodal Large Language Models
(MLLMs) have primarily focused on visual token compression. The effectiveness
of these methods is typically assessed by measuring the accuracy drop on
established benchmarks, comparing model performance before and after
compression. However, these benchmarks were originally designed to assess the
perception and reasoning capabilities of MLLMs, rather than to evaluate
compression techniques. As a result, directly applying them to visual token
compression introduces a task mismatch. Strikingly, our investigation reveals
that simple image downsampling consistently outperforms many advanced
compression methods across multiple widely used benchmarks. Through extensive
experiments, we make the following observations: (i) Current benchmarks are
noisy for the visual token compression task. (ii) Downsampling can
serve as a data filter to evaluate the difficulty of samples in the visual
token compression task. Motivated by these findings, we introduce VTC-Bench, an
evaluation framework that incorporates a data filtering mechanism to denoise
existing benchmarks, thereby enabling fairer and more accurate assessment of
visual token compression methods. All data and code are available at
https://github.com/Chenfei-Liao/VTC-Bench.
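
The abstract does not spell out how the data filter works, so the following is only a minimal sketch of one plausible reading, not the VTC-Bench implementation. It assumes that a sample counts as "simple" when the model still answers correctly from a downsampled image and "difficult" otherwise, and that a compression method is then scored separately on each group alongside the usual accuracy drop. The names `Sample`, `split_by_downsampling`, and `evaluate` are hypothetical and do not come from the released code.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Sample:
    """Per-sample correctness flags (hypothetical record layout)."""
    sample_id: str
    original_correct: bool    # uncompressed model answered correctly
    downsample_correct: bool  # model fed a downsampled image answered correctly
    method_correct: bool      # model using the compression method answered correctly


def accuracy(flags: Iterable[bool]) -> float:
    """Fraction of True flags; NaN for an empty group."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else float("nan")


def split_by_downsampling(samples: List[Sample]) -> Tuple[List[Sample], List[Sample]]:
    """Assumed filter: downsampling success marks a sample as simple."""
    simple = [s for s in samples if s.downsample_correct]
    difficult = [s for s in samples if not s.downsample_correct]
    return simple, difficult


def evaluate(samples: List[Sample]) -> dict:
    """Report the conventional accuracy drop plus per-group method accuracy."""
    simple, difficult = split_by_downsampling(samples)
    return {
        # Conventional metric: accuracy before compression minus accuracy after.
        "accuracy_drop_all": accuracy(s.original_correct for s in samples)
        - accuracy(s.method_correct for s in samples),
        # Filtered view: how the method fares on each difficulty group.
        "method_acc_simple": accuracy(s.method_correct for s in simple),
        "method_acc_difficult": accuracy(s.method_correct for s in difficult),
        "n_simple": len(simple),
        "n_difficult": len(difficult),
    }


if __name__ == "__main__":
    toy = [
        Sample("a", True, True, True),
        Sample("b", True, False, True),
        Sample("c", True, False, False),
        Sample("d", True, True, True),
    ]
    print(evaluate(toy))
```

Under this reading, the difficult group is the part of a benchmark where a compression method can show an advantage over plain downsampling, while the simple group contributes little signal for comparing methods.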