Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs

Abstract

Video large language models (LLMs) achieve strong video understanding by leveraging a large number of spatio-temporal tokens, but their computational cost scales quadratically with the token count. To address this, we propose STTM (Spatio-Temporal Token Merging), a training-free spatio-temporal token merging method. Our key insight is to exploit the local spatial and temporal redundancy in video data that prior work has overlooked. STTM first transforms each frame into multi-granular spatial tokens using a coarse-to-fine search over a quadtree structure, then performs directed pairwise merging across the temporal dimension. This decomposed merging approach outperforms existing token reduction methods across six video QA benchmarks. Notably, STTM achieves a 2× speed-up with only a 0.5% accuracy drop under a 50% token budget, and a 3× speed-up with just a 2% drop under a 30% budget. Moreover, STTM is query-agnostic, allowing KV cache reuse across different questions for the same video. The project page is available at https://www.jshyun.me/projects/sttm.
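The two-stage idea described in the abstract (a coarse-to-fine quadtree pass within each frame, followed by directed merging across frames) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the similarity measure, the thresholds, or the merge rule, so the cosine similarity, the threshold `tau`, the mean pooling, the same-block matching between consecutive frames, and the function names `quadtree_merge` and `temporal_merge` are all assumptions made for illustration. The sketch also assumes a square token grid whose side is a power of two.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def quadtree_merge(grid, r0, c0, size, tau):
    """Coarse-to-fine spatial merging (assumed rule, not from the paper).

    A size x size block becomes one mean-pooled token if every token in it
    is at least tau-similar to the block mean; otherwise the block is split
    into four quadrants and each quadrant is processed recursively.
    Returns a list of (row, col, size, vector) tuples.
    """
    block = [grid[r][c] for r in range(r0, r0 + size) for c in range(c0, c0 + size)]
    center = mean(block)
    if size == 1 or all(cosine(t, center) >= tau for t in block):
        return [(r0, c0, size, center)]
    half = size // 2
    tokens = []
    for dr in (0, half):
        for dc in (0, half):
            tokens += quadtree_merge(grid, r0 + dr, c0 + dc, half, tau)
    return tokens


def temporal_merge(frames, tau):
    """Directed temporal merging (assumed rule, not from the paper).

    A token in frame t is merged into (here: dropped in favor of) the token
    covering the same block in frame t-1 when the two are at least
    tau-similar. The direction is always later frame -> earlier frame.
    """
    if not frames:
        return []
    kept = [list(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        prev_by_block = {tok[:3]: tok[3] for tok in prev}
        kept.append([
            tok for tok in cur
            if not (tok[:3] in prev_by_block
                    and cosine(tok[3], prev_by_block[tok[:3]]) >= tau)
        ])
    return kept


# Hypothetical 4x4 frame whose top-left quadrant differs from the rest.
frame = [[[0.0, 1.0] if r < 2 and c < 2 else [1.0, 0.0] for c in range(4)]
         for r in range(4)]
tokens = quadtree_merge(frame, 0, 0, 4, tau=0.99)
print(len(tokens))  # 4 tokens instead of 16

# Two identical frames: every token of the second frame merges into the first.
kept = temporal_merge([tokens, tokens], tau=0.99)
print([len(k) for k in kept])  # [4, 0]
```

Nothing in this sketch depends on a text query, which reflects the query-agnostic property the abstract mentions: the reduced token set for a video could be computed once and reused for several questions.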
