ChatPaper.ai


Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models

November 4, 2025
作者: Tianfan Peng, Yuntao Du, Pengzhou Ji, Shijie Dong, Kailin Jiang, Mingchuan Ma, Yijun Tian, Jinhe Bi, Qian Li, Wei Du, Feng Xiao, Lizhen Cui
cs.AI

Abstract

Large multimodal models (LMMs) often suffer from severe inference inefficiency due to the large number of visual tokens introduced by image encoders. While recent token compression methods, such as pruning and merging, have shown promise in reducing redundancy, their evaluation remains fragmented and inconsistent. In this work, we present UniPruneBench, a unified and extensible benchmark for visual token pruning in multimodal LLMs. UniPruneBench provides standardized protocols across six ability dimensions and ten datasets, covering ten representative compression algorithms and three families of LMMs (LLaVA-v1.5, Intern-VL3, and Qwen2.5-VL). Beyond task accuracy, it incorporates system-level metrics such as runtime and prefilling latency to provide a holistic view. Our experiments uncover several key findings: (1) random pruning is a surprisingly strong baseline, (2) no single method consistently outperforms others across scenarios, (3) pruning sensitivity varies significantly across tasks, with OCR being most vulnerable, and (4) pruning ratio is the dominant factor governing performance degradation. We believe UniPruneBench will serve as a reliable foundation for future research on efficient multimodal modeling.
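The surprisingly strong random-pruning baseline can be sketched in a few lines: uniformly sample which visual tokens to keep at a given pruning ratio, preserving their original order. This is an illustrative sketch, not code from UniPruneBench; the function name, the plain-list token representation, and the fixed seed are assumptions (real LMMs prune tensor-shaped token embeddings inside the vision-language pipeline).

```python
import random

def random_prune(tokens, keep_ratio, seed=0):
    """Keep a random `keep_ratio` fraction of visual tokens, in original order.

    `tokens` is any sequence standing in for per-patch embeddings
    (hypothetical representation for illustration).
    """
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    n_keep = max(1, round(len(tokens) * keep_ratio))
    kept_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept_idx]

# e.g. 576 visual tokens (a typical LLaVA-v1.5 count) at a 75% pruning ratio
tokens = list(range(576))
kept = random_prune(tokens, keep_ratio=0.25)  # 144 tokens survive
```

Despite its simplicity, the abstract reports that this baseline is competitive with learned importance scores across many tasks, which is why the benchmark treats the pruning ratio, not the selection rule, as the dominant factor in performance degradation.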
PDF · December 2, 2025