Unified Spatio-Temporal Token Scoring for Efficient Video VLMs

Abstract

Token pruning is essential for improving the computational efficiency of vision-language models (VLMs), particularly for video-based tasks where temporal redundancy is prevalent. Prior approaches typically prune tokens either (1) exclusively within the vision transformer (ViT), for unimodal perception tasks such as action recognition and object segmentation, without adapting to downstream vision-language tasks; or (2) only within the LLM while leaving the ViT output intact, which often requires complex text-conditioned token selection mechanisms. In this paper, we introduce Spatio-Temporal Token Scoring (STTS), a simple and lightweight module that prunes vision tokens across both the ViT and the LLM without text conditioning or token merging, and is fully compatible with end-to-end training. By learning to score temporally via an auxiliary loss and spatially via LLM downstream gradients, aided by our efficient packing algorithm, STTS prunes 50% of vision tokens throughout the entire architecture, resulting in a 62% improvement in efficiency during both training and inference with only a 0.7% drop in average performance across 13 short and long video QA tasks. Efficiency gains increase with more sampled frames per video. Applying test-time scaling for long-video QA further yields performance gains of 0.5-1% compared to the baseline. Overall, STTS represents a novel, simple yet effective technique for unified, architecture-wide vision token pruning.
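As a rough illustration of score-based pruning in general, the sketch below keeps the highest-scoring half of a token sequence while preserving the original token order. It is not the STTS module itself: the function name `prune_tokens`, the `keep_ratio` parameter, and the example scores are hypothetical, and the learned scoring (auxiliary loss, LLM downstream gradients) and the packing algorithm mentioned above are not modeled here.

```python
from typing import List, Sequence


def prune_tokens(scores: Sequence[float], keep_ratio: float = 0.5) -> List[int]:
    """Return indices of the highest-scoring tokens, in original order.

    Hypothetical helper for illustration only; the scores are assumed to
    come from some upstream scoring module that is not shown here.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if len(scores) == 0:
        return []
    # Keep at least one token so the sequence is never empty.
    num_keep = max(1, int(len(scores) * keep_ratio))
    # Rank indices by descending score; sorted() is stable, so ties
    # favor the earlier token.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    # Restore the original order so positional structure is preserved.
    return sorted(ranked[:num_keep])


# Made-up scores for six vision tokens.
scores = [0.9, 0.1, 0.4, 0.8, 0.05, 0.7]
kept = prune_tokens(scores, keep_ratio=0.5)
print(kept)  # [0, 3, 5]
```

With `keep_ratio=0.5`, half of the tokens are dropped, which mirrors the 50% pruning rate quoted in the abstract; everything else about the example is an assumption.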
