Script: Graph-Structured and Query-Conditioned Semantic Token Pruning for Multimodal Large Language Models

December 1, 2025
Authors: Zhongyu Yang, Dannong Xu, Wei Pang, Yingfang Yuan
cs.AI

Abstract

The rapid growth of visual tokens in multimodal large language models (MLLMs) leads to excessive memory consumption and inference latency, especially when handling high-resolution images and videos. Token pruning mitigates this issue by removing redundant tokens, but existing methods often ignore relevance to the user query or inherit the limitations of attention mechanisms, reducing their adaptability and effectiveness. To address these challenges, we propose Script, a plug-and-play pruning method that requires no retraining and generalizes across diverse MLLMs. Script comprises two modules: a graph-structured pruning module that removes visually redundant tokens, and a query-conditioned semantic pruning module that preserves query-relevant visual information. Together, they enhance performance on multimodal tasks. Experiments on fourteen benchmarks spanning image and video understanding show that Script consistently achieves higher model efficiency and predictive accuracy than existing pruning methods. On LLaVA-NeXT-7B, it achieves up to a 6.8x prefill speedup and a 10x FLOP reduction while retaining 96.88% of the original performance.
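The abstract does not specify how either module scores tokens, so the sketch below only illustrates the general two-stage idea: first drop visual tokens that a cosine-similarity graph marks as redundant, then keep the survivors most relevant to the text query. All function names (graph_redundancy_prune, query_conditioned_prune, script_like_prune), the degree-based redundancy heuristic, and the keep ratios are assumptions for illustration, not the paper's actual algorithm.

```python
# Illustrative two-stage visual-token pruning in the spirit of the abstract.
# NOTE: the similarity graph, degree heuristic, and keep ratios are assumptions,
# not the published Script method.
import torch
import torch.nn.functional as F


def graph_redundancy_prune(visual_tokens: torch.Tensor, keep_ratio: float = 0.6) -> torch.Tensor:
    """Drop visually redundant tokens via a similarity-graph heuristic.

    visual_tokens: (N, D) visual token embeddings.
    Returns indices of the tokens kept.
    """
    feats = F.normalize(visual_tokens, dim=-1)
    sim = feats @ feats.T                      # (N, N) cosine-similarity "edges"
    sim.fill_diagonal_(0.0)
    redundancy = sim.sum(dim=-1)               # high total similarity = redundant
    n_keep = max(1, int(visual_tokens.size(0) * keep_ratio))
    return torch.topk(-redundancy, n_keep).indices  # keep the least redundant


def query_conditioned_prune(visual_tokens: torch.Tensor,
                            query_tokens: torch.Tensor,
                            keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the visual tokens most semantically relevant to the text query."""
    v = F.normalize(visual_tokens, dim=-1)     # (N, D)
    q = F.normalize(query_tokens, dim=-1)      # (M, D)
    relevance = (v @ q.T).max(dim=-1).values   # best match against any query token
    n_keep = max(1, int(visual_tokens.size(0) * keep_ratio))
    return torch.topk(relevance, n_keep).indices


def script_like_prune(visual_tokens, query_tokens, graph_ratio=0.6, query_ratio=0.5):
    """Chain the two stages: redundancy removal, then query-relevance filtering."""
    kept = graph_redundancy_prune(visual_tokens, graph_ratio)
    survivors = visual_tokens[kept]
    kept2 = query_conditioned_prune(survivors, query_tokens, query_ratio)
    return kept[kept2]                          # indices into the original token set


if __name__ == "__main__":
    vis = torch.randn(576, 4096)                # e.g., LLaVA-style visual tokens
    txt = torch.randn(32, 4096)                 # projected query/text tokens
    idx = script_like_prune(vis, txt)
    print(f"kept {idx.numel()} of {vis.size(0)} visual tokens")
```

Because both stages operate only on token embeddings and the query, a filter like this can sit in front of the language model's prefill step without retraining, which is consistent with the plug-and-play framing in the abstract.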