Unified Spatio-Temporal Token Scoring for Efficient Video VLMs
March 18, 2026
Authors: Jianrui Zhang, Yue Yang, Rohun Tripathi, Winson Han, Ranjay Krishna, Christopher Clark, Yong Jae Lee, Sangho Lee
cs.AI
Abstract
Token pruning is essential for enhancing the computational efficiency of vision-language models (VLMs), particularly for video-based tasks where temporal redundancy is prevalent. Prior approaches typically prune tokens either (1) within the vision transformer (ViT) exclusively for unimodal perception tasks such as action recognition and object segmentation, without adapting to downstream vision-language tasks; or (2) only within the LLM while leaving the ViT output intact, often requiring complex text-conditioned token selection mechanisms. In this paper, we introduce Spatio-Temporal Token Scoring (STTS), a simple and lightweight module that prunes vision tokens across both the ViT and the LLM without text conditioning or token merging, and is fully compatible with end-to-end training. By learning how to score temporally via an auxiliary loss and spatially via LLM downstream gradients, aided by our efficient packing algorithm, STTS prunes 50% of vision tokens throughout the entire architecture, resulting in a 62% improvement in efficiency during both training and inference with only a 0.7% drop in average performance across 13 short and long video QA tasks. Efficiency gains increase with more sampled frames per video. Applying test-time scaling for long-video QA further yields performance gains of 0.5-1% compared to the baseline. Overall, STTS represents a novel, simple yet effective technique for unified, architecture-wide vision token pruning.
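The core idea — scoring tokens temporally (per frame) and spatially (per token), then keeping only a top fraction — can be sketched in a toy form. All names, shapes, and scoring heuristics below are illustrative stand-ins: the paper learns the temporal scores via an auxiliary loss and the spatial scores from LLM downstream gradients, neither of which is reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (shapes are illustrative, not from the paper):
# T frames, N vision tokens per frame, D-dimensional features.
T, N, D = 4, 8, 16
tokens = rng.normal(size=(T, N, D))

# Temporal score: one score per frame. STTS learns this via an
# auxiliary loss; here we fake it with the frame's mean token norm.
temporal_score = np.linalg.norm(tokens, axis=-1).mean(axis=1)   # (T,)

# Spatial score: one score per token. STTS derives this from LLM
# downstream gradients; here we fake it with per-token norms.
spatial_score = np.linalg.norm(tokens, axis=-1)                  # (T, N)

# Combine into a unified spatio-temporal score and keep the top 50%,
# mirroring the 50% pruning ratio reported in the abstract.
combined = spatial_score * temporal_score[:, None]               # (T, N)
keep_ratio = 0.5
k = int(T * N * keep_ratio)
keep_idx = np.argsort(combined.ravel())[::-1][:k]
kept = tokens.reshape(-1, D)[keep_idx]                           # (k, D)

print(kept.shape)  # (16, 16)
```

Pruning by a combined score rather than two separate passes is what lets a single ranking act on the whole token set; the paper's packing algorithm (not shown) then batches the variable-length surviving sequences efficiently.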