Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods

Abstract

Recent efforts to accelerate inference in Multimodal Large Language Models (MLLMs) have focused primarily on visual token compression. The effectiveness of these methods is typically assessed by measuring the accuracy drop on established benchmarks, comparing model performance before and after compression. However, these benchmarks were originally designed to assess the perception and reasoning capabilities of MLLMs, not to evaluate compression techniques. As a result, applying them directly to visual token compression introduces a task mismatch. Strikingly, our investigation reveals that simple image downsampling consistently outperforms many advanced compression methods across multiple widely used benchmarks. Through extensive experiments, we make the following observations: (i) current benchmarks are noisy for the visual token compression task; (ii) downsampling can serve as a data filter for evaluating the difficulty of samples in the visual token compression task. Motivated by these findings, we introduce VTC-Bench, an evaluation framework that incorporates a data filtering mechanism to denoise existing benchmarks, thereby enabling fairer and more accurate assessment of visual token compression methods. All data and code are available at https://github.com/Chenfei-Liao/VTC-Bench.
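The filtering idea can be illustrated with a minimal sketch. This is one possible reading of "downsampling as a data filter", not the VTC-Bench implementation: it assumes that samples a downsampled-image baseline already answers correctly are treated as easy, and that a compression method is then scored on the remaining, harder samples. The names `Sample`, `split_by_downsampling`, and `accuracy` are hypothetical and do not come from the paper or its repository.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Sample:
    """Hypothetical benchmark item: an identifier and its ground-truth answer."""
    sample_id: str
    answer: str


# A predictor maps a sample to the model's answer string (assumed interface).
Predictor = Callable[[Sample], str]


def split_by_downsampling(
    samples: Sequence[Sample], downsampled_predict: Predictor
) -> Tuple[List[Sample], List[Sample]]:
    """Split samples into (hard, easy) using a downsampling baseline.

    Assumption: a sample is "easy" if the model still answers it correctly
    when the input image is simply downsampled, and "hard" otherwise.
    """
    hard: List[Sample] = []
    easy: List[Sample] = []
    for sample in samples:
        if downsampled_predict(sample) == sample.answer:
            easy.append(sample)
        else:
            hard.append(sample)
    return hard, easy


def accuracy(samples: Sequence[Sample], predict: Predictor) -> float:
    """Fraction of samples answered correctly; 0.0 for an empty list."""
    if not samples:
        return 0.0
    correct = sum(predict(sample) == sample.answer for sample in samples)
    return correct / len(samples)
```

Under these assumptions, a compression method would be compared by calling `accuracy(hard, method_predict)` on the filtered subset rather than on the full benchmark, so that samples solvable by plain downsampling do not dominate the score.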

