Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models
November 4, 2025
Authors: Tianfan Peng, Yuntao Du, Pengzhou Ji, Shijie Dong, Kailin Jiang, Mingchuan Ma, Yijun Tian, Jinhe Bi, Qian Li, Wei Du, Feng Xiao, Lizhen Cui
cs.AI
Abstract
Large multimodal models (LMMs) often suffer from severe inference
inefficiency due to the large number of visual tokens introduced by image
encoders. While recent token compression methods, such as pruning and merging,
have shown promise in reducing redundancy, their evaluation remains fragmented
and inconsistent. In this work, we present UniPruneBench, a unified and
extensible benchmark for visual token pruning in multimodal LLMs. UniPruneBench
provides standardized protocols across six ability dimensions and ten datasets,
covering ten representative compression algorithms and three families of LMMs
(LLaVA-v1.5, Intern-VL3, and Qwen2.5-VL). Beyond task accuracy, it incorporates
system-level metrics such as runtime and prefilling latency to provide a
holistic view. Our experiments uncover several key findings: (1) random pruning
is a surprisingly strong baseline, (2) no single method consistently
outperforms others across scenarios, (3) pruning sensitivity varies
significantly across tasks, with OCR being most vulnerable, and (4) pruning
ratio is the dominant factor governing performance degradation. We believe
UniPruneBench will serve as a reliable foundation for future research on
efficient multimodal modeling.
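
The random pruning baseline mentioned in finding (1) can be illustrated with a minimal sketch. This is not UniPruneBench's implementation: the function name `random_prune`, its parameters, and the 576-token input are assumptions chosen for illustration. The sketch keeps a uniformly random subset of visual tokens at a given pruning ratio and preserves their original order.

```python
import random


def random_prune(tokens, prune_ratio, seed=0):
    """Keep a uniformly random subset of visual tokens (hypothetical helper).

    tokens      -- sequence of visual tokens (any objects, e.g. embedding rows)
    prune_ratio -- fraction of tokens to drop, in [0, 1)
    seed        -- seed for reproducible sampling
    Returns (kept_tokens, kept_indices), both in original order.
    """
    if not 0.0 <= prune_ratio < 1.0:
        raise ValueError("prune_ratio must be in [0, 1)")
    if not tokens:
        return [], []

    # Number of tokens to keep; always keep at least one.
    n_keep = max(1, round(len(tokens) * (1.0 - prune_ratio)))

    # Sample indices without replacement, then sort so the kept
    # tokens retain their original positional order.
    rng = random.Random(seed)
    keep_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in keep_idx], keep_idx


if __name__ == "__main__":
    # Assumed example: 576 placeholder tokens, 75% pruning ratio.
    visual_tokens = list(range(576))
    kept, idx = random_prune(visual_tokens, prune_ratio=0.75)
    print(len(kept))
```

Because the selection ignores token content entirely, a baseline of this kind costs almost nothing to compute, which is part of why it makes a useful point of comparison for learned or attention-based pruning methods.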