VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration
January 30, 2026
Authors: Hanxun Yu, Wentong Li, Xuan Qu, Song Wang, Junbo Chen, Jianke Zhu
cs.AI
Abstract
Multimodal large language models (MLLMs) suffer from high computational costs due to excessive visual tokens, particularly in high-resolution and video-based scenarios. Existing token reduction methods typically focus on isolated pipeline components and often neglect textual alignment, leading to performance degradation. In this paper, we propose VisionTrim, a unified framework for training-free MLLM acceleration that integrates two effective plug-and-play modules: 1) the Dominant Vision Token Selection (DVTS) module, which preserves essential visual tokens via a global-local view, and 2) the Text-Guided Vision Complement (TGVC) module, which facilitates context-aware token merging guided by textual cues. Extensive experiments across diverse image and video multimodal benchmarks demonstrate the superior performance of VisionTrim, advancing practical MLLM deployment in real-world applications. The code is available at: https://github.com/hanxunyu/VisionTrim.
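The abstract describes a select-then-merge pipeline but does not detail either module. The sketch below is a rough illustration of one plausible reading in PyTorch, not the paper's actual method: a DVTS-like stage that keeps the top-scoring visual tokens under a simple global saliency proxy, followed by a TGVC-like stage that merges the dropped tokens into their nearest kept tokens with weights derived from text similarity. The function name `visiontrim_prune`, the `keep_ratio` parameter, and both scoring heuristics are assumptions for illustration only.

```python
import torch

def visiontrim_prune(vision_tokens: torch.Tensor,
                     text_tokens: torch.Tensor,
                     keep_ratio: float = 0.25) -> torch.Tensor:
    """Hypothetical select-then-merge visual token reduction.

    vision_tokens: (N, D) visual token embeddings
    text_tokens:   (T, D) text token embeddings (same width D)
    Returns roughly N * keep_ratio compressed visual tokens.
    """
    n, d = vision_tokens.shape
    k = max(1, int(n * keep_ratio))

    # DVTS-like stage (assumed): score each visual token by the
    # attention of a global query (the mean token) and keep the top-k.
    global_query = vision_tokens.mean(dim=0, keepdim=True)              # (1, D)
    saliency = (global_query @ vision_tokens.T).squeeze(0) / d ** 0.5   # (N,)
    keep_mask = torch.zeros(n, dtype=torch.bool)
    keep_mask[saliency.topk(k).indices] = True
    kept, dropped = vision_tokens[keep_mask], vision_tokens[~keep_mask]

    # TGVC-like stage (assumed): rather than discarding the remaining
    # tokens, fold each one into its most similar kept token, weighted
    # by how strongly it matches any text token, so that text-relevant
    # content survives the compression.
    if dropped.numel() > 0:
        text_rel = (dropped @ text_tokens.T).max(dim=-1).values.sigmoid()  # (N-k,)
        nearest = (dropped @ kept.T).argmax(dim=-1)                        # (N-k,)
        kept = kept.index_add(0, nearest, text_rel.unsqueeze(-1) * dropped)
    return kept
```

As a usage example, with 576 visual tokens from a ViT-style encoder and keep_ratio=0.25, this sketch would hand 144 tokens to the language model, cutting the visual portion of the prefill cost by roughly 4x.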