Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs

July 10, 2025
作者: Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim, Minho Shim
cs.AI

Abstract

Video large language models (LLMs) achieve strong video understanding by leveraging a large number of spatio-temporal tokens, but suffer from quadratic computational scaling with token count. To address this, we propose a training-free spatio-temporal token merging method, named STTM. Our key insight is to exploit local spatial and temporal redundancy in video data which has been overlooked in prior work. STTM first transforms each frame into multi-granular spatial tokens using a coarse-to-fine search over a quadtree structure, then performs directed pairwise merging across the temporal dimension. This decomposed merging approach outperforms existing token reduction methods across six video QA benchmarks. Notably, STTM achieves a 2× speed-up with only a 0.5% accuracy drop under a 50% token budget, and a 3× speed-up with just a 2% drop under a 30% budget. Moreover, STTM is query-agnostic, allowing KV cache reuse across different questions for the same video. The project page is available at https://www.jshyun.me/projects/sttm.
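The abstract gives only a high-level description of the two merging stages, so the following is a minimal illustrative sketch, not the authors' implementation. The function names (quadtree_merge, temporal_merge), the cosine-similarity stopping rule, and the SIM_THRESHOLD cutoff are all assumptions chosen for clarity; the paper's actual merging criteria may differ.

```python
# Sketch of quadtree spatial merging followed by directed temporal merging,
# under assumed cosine-similarity merge criteria (hypothetical, not from the paper).
import numpy as np

SIM_THRESHOLD = 0.9  # hypothetical similarity cutoff for merging


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def quadtree_merge(grid):
    """Coarse-to-fine search over a quadtree: a node whose patch tokens all
    agree with their mean is kept as one coarse token; otherwise it splits
    into four children. Returns (y, x, size, token) tuples."""
    def recurse(y, x, size):
        block = grid[y:y + size, x:x + size].reshape(-1, grid.shape[-1])
        mean_tok = block.mean(axis=0)
        if size == 1 or all(cosine(p, mean_tok) >= SIM_THRESHOLD for p in block):
            return [(y, x, size, mean_tok)]
        half = size // 2
        return [t for dy in (0, half) for dx in (0, half)
                for t in recurse(y + dy, x + dx, half)]
    return recurse(0, 0, grid.shape[0])  # assumes a square, power-of-two grid


def temporal_merge(prev_tokens, cur_tokens):
    """Directed pairwise merging: a current-frame token is dropped when a
    co-located token of the same granularity in the previous frame is
    similar enough, so the earlier token represents both frames."""
    prev_map = {(y, x, s): tok for (y, x, s, tok) in prev_tokens}
    return [(y, x, s, tok) for (y, x, s, tok) in cur_tokens
            if (y, x, s) not in prev_map
            or cosine(prev_map[(y, x, s)], tok) < SIM_THRESHOLD]


# Toy demo: an 8x8 grid of 16-dim patch tokens, uniform except one corner.
rng = np.random.default_rng(0)
frame = np.tile(rng.normal(size=16), (8, 8, 1))
frame[:4, :4] = rng.normal(size=(4, 4, 16))  # detailed top-left region
tokens = quadtree_merge(frame)
print(f"{len(tokens)} multi-granular tokens instead of 64")
next_tokens = temporal_merge(tokens, quadtree_merge(frame))  # identical frame
print(f"{len(next_tokens)} new tokens survive temporal merging")
```

On this toy frame, the three uniform quadrants each collapse to a single coarse token while the detailed corner keeps full granularity, and a repeated frame contributes no new tokens after temporal merging. Since neither stage looks at the question, the merged tokens (and hence the KV cache) can be reused across different queries on the same video, matching the query-agnostic property described above.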