Script: Graph-Structured and Query-Conditioned Semantic Token Pruning for Multimodal Large Language Models
December 1, 2025
Authors: Zhongyu Yang, Dannong Xu, Wei Pang, Yingfang Yuan
cs.AI
Abstract
The rapid growth of visual tokens in multimodal large language models (MLLMs) leads to excessive memory consumption and inference latency, especially when handling high-resolution images and videos. Token pruning mitigates this issue by removing redundancy, but existing methods often ignore relevance to the user query or are constrained by the limitations of attention mechanisms, reducing their adaptability and effectiveness. To address these challenges, we propose Script, a plug-and-play pruning method that requires no retraining and generalizes across diverse MLLMs. Script comprises two modules: a graph-structured pruning module that removes visually redundant tokens, and a query-conditioned semantic pruning module that preserves query-relevant visual information. Together, they enhance performance on multimodal tasks. Experiments on fourteen benchmarks across image and video understanding tasks show that Script consistently achieves higher model efficiency and predictive accuracy than existing pruning methods. On LLaVA-NeXT-7B, it achieves up to 6.8x prefill speedup and a 10x FLOP reduction while retaining 96.88% of the original model's performance.
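The two-stage idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the similarity graph is built or how query relevance is scored, so the greedy near-duplicate removal, the cosine-similarity scoring, and the `sim_threshold`/`keep_k` parameters below are all illustrative assumptions.

```python
# Hypothetical sketch of Script-style two-stage token pruning, based only on
# the abstract: (1) a graph-structured step drops visually redundant tokens,
# (2) a query-conditioned step keeps the tokens most relevant to the query.
# Token and query embeddings are plain lists of floats for simplicity.
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def graph_prune(tokens, sim_threshold=0.9):
    """Greedy redundancy removal: treat high pairwise similarity as an edge
    in a token graph and drop any token that is a near-duplicate of one
    already kept. (Illustrative stand-in for the graph-structured module.)"""
    kept = []
    for t in tokens:
        if all(cosine(t, k) < sim_threshold for k in kept):
            kept.append(t)
    return kept

def query_prune(tokens, query, keep_k):
    """Keep the keep_k tokens with the highest similarity to the query
    embedding. (Illustrative stand-in for the query-conditioned module.)"""
    ranked = sorted(tokens, key=lambda t: cosine(t, query), reverse=True)
    return ranked[:keep_k]

def script_prune(tokens, query, sim_threshold=0.9, keep_k=8):
    """Apply visual-redundancy pruning, then query-conditioned selection."""
    return query_prune(graph_prune(tokens, sim_threshold), query, keep_k)
```

In this toy form, the first stage shrinks the token set independently of the query, and the second stage conditions the surviving budget on the user's question; a real MLLM would operate on high-dimensional visual-token embeddings from the vision encoder.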