Unified Spatio-Temporal Token Scoring for Efficient Video VLMs

Abstract

Token pruning is essential for improving the computational efficiency of vision-language models (VLMs), particularly for video tasks, where temporal redundancy is prevalent. Prior approaches typically prune tokens either (1) only within the vision transformer (ViT), targeting unimodal perception tasks such as action recognition and object segmentation without adapting to downstream vision-language tasks; or (2) only within the LLM while leaving the ViT output intact, often requiring complex text-conditioned token selection mechanisms. In this paper, we introduce Spatio-Temporal Token Scoring (STTS), a simple and lightweight module that prunes vision tokens across both the ViT and the LLM without text conditioning or token merging, and is fully compatible with end-to-end training. By learning to score tokens temporally via an auxiliary loss and spatially via downstream gradients from the LLM, aided by our efficient packing algorithm, STTS prunes 50% of vision tokens throughout the entire architecture. This yields a 62% improvement in efficiency during both training and inference, with only a 0.7% drop in average performance across 13 short and long video QA tasks. Efficiency gains increase with more sampled frames per video. Applying test-time scaling for long-video QA further yields performance gains of 0.5-1% over the baseline. Overall, STTS represents a novel, simple yet effective technique for unified, architecture-wide vision token pruning.
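To make the idea of score-based pruning concrete, the following is a minimal sketch of keeping the top-scoring half of a video's vision tokens. It is a generic illustration, not the STTS module itself: the abstract does not specify how scores are computed or how the packing algorithm works, so the scores here are given as plain numbers, and the function names `prune_tokens` and `kept_per_frame`, the `keep_ratio` parameter, and the frame-major token layout are all assumptions made for this example.

```python
from typing import List, Sequence


def prune_tokens(scores: Sequence[float], keep_ratio: float = 0.5) -> List[int]:
    """Return indices of the highest-scoring tokens, in their original order.

    Hypothetical helper: `scores` stands in for whatever a learned scorer
    would produce; this sketch only shows the selection step.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    num_keep = max(1, int(len(scores) * keep_ratio))
    # Rank token indices by score, highest first (ties go to the lower index).
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Re-sort the survivors so the original token order is preserved.
    return sorted(ranked[:num_keep])


def kept_per_frame(kept: Sequence[int], patches_per_frame: int, num_frames: int) -> List[int]:
    """Count surviving tokens per frame, assuming a frame-major token layout."""
    counts = [0] * num_frames
    for idx in kept:
        counts[idx // patches_per_frame] += 1
    return counts


# Two frames with three patch tokens each, flattened frame by frame.
scores = [0.9, 0.1, 0.4, 0.8, 0.2, 0.7]
kept = prune_tokens(scores, keep_ratio=0.5)
print(kept)                        # [0, 3, 5]
print(kept_per_frame(kept, 3, 2))  # [1, 2]
```

Because selection in this sketch is global across the flattened sequence, frames can end up with different numbers of surviving tokens, which is one reason variable-length sequences need some form of packing before further processing.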

