Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs
July 10, 2025
Authors: Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim, Minho Shim
cs.AI
Abstract
Video large language models (LLMs) achieve strong video understanding by
leveraging a large number of spatio-temporal tokens, but suffer from quadratic
computational scaling with token count. To address this, we propose a
training-free spatio-temporal token merging method, named STTM. Our key insight
is to exploit local spatial and temporal redundancy in video data that has
been overlooked in prior work. STTM first transforms each frame into
multi-granular spatial tokens using a coarse-to-fine search over a quadtree
structure, then performs directed pairwise merging across the temporal
dimension. This decomposed merging approach outperforms existing token
reduction methods across six video QA benchmarks. Notably, STTM achieves a
2× speed-up with only a 0.5% accuracy drop under a 50% token budget, and
a 3× speed-up with just a 2% drop under a 30% budget. Moreover, STTM is
query-agnostic, allowing KV cache reuse across different questions for the same
video. The project page is available at https://www.jshyun.me/projects/sttm.
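As a rough illustration of the decomposed merging described above, the sketch below implements a quadtree-style coarse-to-fine spatial merge per frame followed by a directed pairwise temporal merge. It is a minimal sketch, not the authors' implementation: the function names, cosine-similarity criterion, thresholds, and grid size are all assumptions made for this example.

```python
# A minimal sketch (not the authors' code) of the two-stage idea: quadtree
# coarse-to-fine spatial merging per frame, then directed pairwise temporal
# merging. Similarity metric, thresholds, and grid size are assumptions.
import numpy as np


def cosine(a, b):
    """Cosine similarity along the last axis, with broadcasting."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return (a * b).sum(-1)


def quadtree_spatial_merge(frame_tokens, grid, threshold=0.9, min_size=2):
    """Coarse-to-fine search over a quadtree: if every token in a square cell
    is similar to the cell mean, emit one merged (coarse) token; otherwise
    recurse into the four quadrants. `grid` must be a power of two."""
    tokens = frame_tokens.reshape(grid, grid, -1)
    merged = []

    def cell_is_uniform(cell):
        flat = cell.reshape(-1, cell.shape[-1])
        return cosine(flat, flat.mean(0, keepdims=True)).min() >= threshold

    def recurse(y, x, size):
        cell = tokens[y:y + size, x:x + size]
        if size <= min_size or cell_is_uniform(cell):
            merged.append(cell.reshape(-1, cell.shape[-1]).mean(0))
            return
        half = size // 2
        for dy in (0, half):
            for dx in (0, half):
                recurse(y + dy, x + dx, half)

    recurse(0, 0, grid)
    return np.stack(merged)


def directed_temporal_merge(frames, threshold=0.9):
    """Directed pairwise merge over time: a token in frame t that closely
    matches a token kept from frame t-1 is merged into it (dropped here),
    so redundancy is removed in one temporal direction only."""
    kept = [frames[0]]
    for cur in frames[1:]:
        prev = kept[-1]
        if len(prev) == 0:
            kept.append(cur)
            continue
        sims = cosine(cur[:, None, :], prev[None, :, :]).max(axis=1)
        kept.append(cur[sims < threshold])
    return kept


# Toy usage: 4 frames, each a 16x16 grid of 64-dim patch tokens.
rng = np.random.default_rng(0)
frames = [quadtree_spatial_merge(rng.normal(size=(16 * 16, 64)), grid=16)
          for _ in range(4)]
reduced = directed_temporal_merge(frames)
print([len(f) for f in reduced])  # token count kept per frame
```

In this toy setup, each frame's token grid is collapsed wherever a quadtree cell is internally homogeneous, and tokens in later frames that closely match a token kept from the previous frame are merged away; both stages depend only on the video tokens, consistent with the query-agnostic, training-free behavior described in the abstract.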