Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models

Abstract

Video Large Language Models (VLLMs) demonstrate strong video understanding but suffer from inefficiency due to redundant visual tokens. Existing pruning methods primarily target intra-frame spatial redundancy, or prune inside the LLM and so still incur overhead in the shallow layers. Both yield suboptimal spatiotemporal reduction and underutilize long-context compressibility. They also often discard subtle yet informative context when tokens are merged or pruned. In this paper, we propose a new perspective that establishes token Anchors both within and across frames and comprehensively aggregates informative contexts into them via local-global Optimal Transport (AOT). Specifically, we first select local- and global-aware token anchors within each frame under attention guidance; optimal transport then aggregates the informative contexts of the pruned tokens into these anchors, constructing the intra-frame token anchors. Next, building on temporal frame clips, the first frame of each clip is treated as the keyframe anchor, and similar information from the consecutive frames is merged into it through optimal transport, while distinct tokens are kept to represent temporal dynamics. The result is efficient token reduction in a training-free manner. Extensive evaluations show that AOT achieves competitive performance across various short- and long-video benchmarks on leading video LLMs, with substantial computational savings while preserving temporal and visual fidelity. Project webpage: https://tyroneli.github.io/AOT
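
The intra-frame step described in the abstract (pick anchors by attention, then move context from pruned tokens into them with optimal transport) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the cosine cost, the uniform marginals, the Sinkhorn solver, and the mixing weight `alpha` are all assumptions chosen for illustration, and the abstract does not specify any of them.

```python
import numpy as np


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic-regularised transport plan with uniform marginals (assumed)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # mass on the source (pruned) tokens
    b = np.full(m, 1.0 / m)  # mass on the target (anchor) tokens
    kernel = np.exp(-cost / eps)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]


def aggregate_into_anchors(tokens, scores, n_anchors, eps=0.1, alpha=0.5):
    """Keep the top-scoring tokens as anchors and fold pruned tokens into them.

    tokens: (n_tokens, dim) visual token features of one frame.
    scores: (n_tokens,) attention-derived importance (hypothetical input).
    alpha:  hypothetical weight of the transported context in each anchor.
    """
    order = np.argsort(-scores)
    anchor_idx = np.sort(order[:n_anchors])
    pruned_idx = np.sort(order[n_anchors:])
    anchors = tokens[anchor_idx]
    pruned = tokens[pruned_idx]
    if pruned_idx.size == 0:
        return anchors.copy()

    # Cosine distance between every pruned token and every anchor.
    tiny = 1e-8
    a_unit = anchors / (np.linalg.norm(anchors, axis=1, keepdims=True) + tiny)
    p_unit = pruned / (np.linalg.norm(pruned, axis=1, keepdims=True) + tiny)
    cost = 1.0 - p_unit @ a_unit.T  # shape (n_pruned, n_anchors)

    plan = sinkhorn_plan(cost, eps=eps)
    # Normalise each column so every anchor receives a convex combination
    # of pruned tokens, then blend that context into the anchor itself.
    weights = plan / plan.sum(axis=0, keepdims=True)
    transported = weights.T @ pruned
    return (1.0 - alpha) * anchors + alpha * transported


rng = np.random.default_rng(0)
frame_tokens = rng.normal(size=(16, 8))
attention_scores = rng.random(16)
reduced = aggregate_into_anchors(frame_tokens, attention_scores, n_anchors=4)
print(reduced.shape)
```

Under the same assumptions, the inter-frame step could reuse `sinkhorn_plan` with the keyframe's tokens as the targets and the following frames' tokens as the sources. How distinct tokens are detected and retained is not stated in the abstract, so it is left out here.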
