Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods

Abstract

Recent efforts to accelerate inference in Multimodal Large Language Models (MLLMs) have focused primarily on visual token compression. The effectiveness of these methods is typically assessed by measuring the accuracy drop on established benchmarks, comparing model performance before and after compression. However, these benchmarks were originally designed to assess the perception and reasoning capabilities of MLLMs, not to evaluate compression techniques, so applying them directly to visual token compression introduces a task mismatch. Strikingly, our investigation reveals that simple image downsampling consistently outperforms many advanced compression methods across multiple widely used benchmarks. Through extensive experiments, we make the following observations: (i) current benchmarks are noisy for the visual token compression task; (ii) downsampling can serve as a data filter for evaluating the difficulty of samples in the visual token compression task. Motivated by these findings, we introduce VTC-Bench, an evaluation framework that incorporates a data filtering mechanism to denoise existing benchmarks, thereby enabling fairer and more accurate assessment of visual token compression methods. All data and code are available at https://github.com/Chenfei-Liao/VTC-Bench.
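
The idea of using downsampling as a data filter can be illustrated with a minimal sketch. It is not the VTC-Bench implementation: the `SampleResult` record, the `filtered_accuracy` helper, and the rule that a sample counts as "hard" when the downsampling baseline answers it incorrectly are all assumptions made for illustration, and the results are toy values rather than measurements.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class SampleResult:
    """Hypothetical per-sample record; field names are illustrative only."""
    sample_id: str
    downsample_correct: bool  # did the downsampling baseline answer correctly?
    method_correct: bool      # did the compression method under test answer correctly?


def accuracy(flags: Iterable[bool]) -> float:
    """Fraction of True values; NaN for an empty group."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else float("nan")


def filtered_accuracy(results: list[SampleResult]) -> dict:
    """Split samples by the downsampling baseline and score the method per group.

    Assumption: samples the downsampling baseline gets right are treated as
    'easy', the rest as 'hard'.
    """
    easy = [r for r in results if r.downsample_correct]
    hard = [r for r in results if not r.downsample_correct]
    return {
        "all": accuracy(r.method_correct for r in results),
        "easy": accuracy(r.method_correct for r in easy),
        "hard": accuracy(r.method_correct for r in hard),
        "n_easy": len(easy),
        "n_hard": len(hard),
    }


# Toy data, not real benchmark results.
results = [
    SampleResult("q1", True, True),
    SampleResult("q2", True, True),
    SampleResult("q3", False, True),
    SampleResult("q4", False, False),
]
report = filtered_accuracy(results)
print(report)
```

Under this reading, accuracy on the "hard" group is the more informative number, since those are the samples that plain downsampling cannot already solve.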
