

Unified Spatio-Temporal Token Scoring for Efficient Video VLMs

March 18, 2026
Authors: Jianrui Zhang, Yue Yang, Rohun Tripathi, Winson Han, Ranjay Krishna, Christopher Clark, Yong Jae Lee, Sangho Lee
cs.AI

Abstract

Token pruning is essential for enhancing the computational efficiency of vision-language models (VLMs), particularly for video-based tasks where temporal redundancy is prevalent. Prior approaches typically prune tokens either (1) within the vision transformer (ViT) exclusively for unimodal perception tasks such as action recognition and object segmentation, without adapting to downstream vision-language tasks; or (2) only within the LLM while leaving the ViT output intact, often requiring complex text-conditioned token selection mechanisms. In this paper, we introduce Spatio-Temporal Token Scoring (STTS), a simple and lightweight module that prunes vision tokens across both the ViT and the LLM without text conditioning or token merging, and is fully compatible with end-to-end training. By learning how to score temporally via an auxiliary loss and spatially via LLM downstream gradients, aided by our efficient packing algorithm, STTS prunes 50% of vision tokens throughout the entire architecture, resulting in a 62% improvement in efficiency during both training and inference with only a 0.7% drop in average performance across 13 short and long video QA tasks. Efficiency gains increase with more sampled frames per video. Applying test-time scaling for long-video QA further yields performance gains of 0.5-1% compared to the baseline. Overall, STTS represents a novel, simple yet effective technique for unified, architecture-wide vision token pruning.
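The abstract describes scoring vision tokens and pruning the bottom 50% before they flow through the rest of the architecture. The paper's actual STTS module (auxiliary-loss temporal scoring, gradient-based spatial scoring, and the packing algorithm) is not specified here, so the following is only a minimal sketch of the generic score-and-keep-top-k step, with illustrative names and shapes:

```python
import numpy as np

def prune_tokens(tokens: np.ndarray, scores: np.ndarray, keep_ratio: float = 0.5):
    """Keep the top-scoring fraction of vision tokens.

    tokens: (num_tokens, dim) array of vision-token embeddings.
    scores: (num_tokens,) per-token relevance scores. In STTS these would
            be learned (temporally via an auxiliary loss, spatially via
            LLM downstream gradients); here they are given directly.
    """
    k = max(1, int(len(scores) * keep_ratio))
    keep = np.argsort(scores)[-k:]  # indices of the k highest-scoring tokens
    keep.sort()                     # preserve the tokens' original order
    return tokens[keep], keep

# Toy example: 8 tokens of dimension 4, pruned to the top half.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 4))
scores = rng.normal(size=8)
kept, idx = prune_tokens(tokens, scores, keep_ratio=0.5)
print(kept.shape)
```

In a real pipeline the kept tokens from each frame would then be packed into dense batches (the paper's "efficient packing algorithm") so the ViT and LLM never compute over the pruned positions.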
March 20, 2026