Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods

Abstract

Recent efforts to accelerate inference in Multimodal Large Language Models (MLLMs) have focused primarily on visual token compression. The effectiveness of these methods is typically assessed by measuring the accuracy drop on established benchmarks, comparing model performance before and after compression. However, these benchmarks were originally designed to assess the perception and reasoning capabilities of MLLMs, not to evaluate compression techniques. Applying them directly to visual token compression therefore introduces a task mismatch. Strikingly, our investigation reveals that simple image downsampling consistently outperforms many advanced compression methods across multiple widely used benchmarks. Through extensive experiments, we make the following observations: (i) current benchmarks are noisy for the visual token compression task; (ii) downsampling can serve as a data filter for evaluating the difficulty of samples in the visual token compression task. Motivated by these findings, we introduce VTC-Bench, an evaluation framework that incorporates a data filtering mechanism to denoise existing benchmarks, thereby enabling fairer and more accurate assessment of visual token compression methods. All data and code are available at https://github.com/Chenfei-Liao/VTC-Bench.
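One way to picture observation (ii) is as a per-sample partition of a benchmark. The sketch below is an illustration under stated assumptions, not the VTC-Bench implementation: the names `SampleResult`, `split_by_difficulty`, and `accuracy` are hypothetical, and the rule "a sample counts as difficult when the downsampling baseline answers it incorrectly" is an assumed reading of the abstract rather than a detail it states.

```python
from dataclasses import dataclass


@dataclass
class SampleResult:
    sample_id: str
    downsampled_correct: bool  # baseline: model answered correctly on a downsampled image
    compressed_correct: bool   # method under test answered correctly


def split_by_difficulty(results):
    """Partition samples by whether the downsampling baseline answers them.

    Assumption: samples the baseline still gets right are treated as
    'simple'; samples it gets wrong are treated as 'difficult'.
    """
    simple = [r for r in results if r.downsampled_correct]
    difficult = [r for r in results if not r.downsampled_correct]
    return simple, difficult


def accuracy(results):
    """Fraction of samples the compression method answers correctly."""
    if not results:
        return float("nan")
    return sum(r.compressed_correct for r in results) / len(results)


# Toy per-sample outcomes (made up for illustration, not real benchmark data).
results = [
    SampleResult("a", True, True),
    SampleResult("b", True, False),
    SampleResult("c", False, True),
    SampleResult("d", False, False),
    SampleResult("e", False, True),
]

simple, difficult = split_by_difficulty(results)
print(f"simple: {len(simple)}, difficult: {len(difficult)}")
print(f"accuracy on difficult subset: {accuracy(difficult):.3f}")
```

Reporting accuracy on the difficult subset separately keeps samples that plain downsampling already solves from dominating the comparison between compression methods.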
