Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs
July 10, 2025
Authors: Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim, Minho Shim
cs.AI
Abstract
Video large language models (LLMs) achieve strong video understanding by
leveraging a large number of spatio-temporal tokens, but suffer from quadratic
computational scaling with token count. To address this, we propose a
training-free spatio-temporal token merging method, named STTM. Our key insight
is to exploit local spatial and temporal redundancy in video data that has
been overlooked in prior work. STTM first transforms each frame into
multi-granular spatial tokens using a coarse-to-fine search over a quadtree
structure, then performs directed pairwise merging across the temporal
dimension. This decomposed merging approach outperforms existing token
reduction methods across six video QA benchmarks. Notably, STTM achieves a
2× speed-up with only a 0.5% accuracy drop under a 50% token budget, and
a 3× speed-up with just a 2% drop under a 30% budget. Moreover, STTM is
query-agnostic, allowing KV cache reuse across different questions for the same
video. The project page is available at https://www.jshyun.me/projects/sttm.
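As a rough illustration of the decomposed merging described above, the sketch below implements a quadtree-style coarse-to-fine spatial merge per frame followed by a directed pairwise temporal merge. It is a minimal sketch, not the authors' implementation: the function names, cosine-similarity criterion, thresholds, and grid size are all assumptions made for this example.

```python
# A minimal sketch (not the authors' code) of the two-stage idea: quadtree
# coarse-to-fine spatial merging per frame, then directed pairwise temporal
# merging. Similarity metric, thresholds, and grid size are assumptions.
import numpy as np


def cosine(a, b):
    """Cosine similarity along the last axis, with broadcasting."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return (a * b).sum(-1)


def quadtree_spatial_merge(frame_tokens, grid, threshold=0.9, min_size=2):
    """Coarse-to-fine search over a quadtree: if every token in a square cell
    is similar to the cell mean, emit one merged (coarse) token; otherwise
    recurse into the four quadrants. `grid` must be a power of two."""
    tokens = frame_tokens.reshape(grid, grid, -1)
    merged = []

    def cell_is_uniform(cell):
        flat = cell.reshape(-1, cell.shape[-1])
        return cosine(flat, flat.mean(0, keepdims=True)).min() >= threshold

    def recurse(y, x, size):
        cell = tokens[y:y + size, x:x + size]
        if size <= min_size or cell_is_uniform(cell):
            merged.append(cell.reshape(-1, cell.shape[-1]).mean(0))
            return
        half = size // 2
        for dy in (0, half):
            for dx in (0, half):
                recurse(y + dy, x + dx, half)

    recurse(0, 0, grid)
    return np.stack(merged)


def directed_temporal_merge(frames, threshold=0.9):
    """Directed pairwise merge over time: a token in frame t that closely
    matches a token kept from frame t-1 is merged into it (dropped here),
    so redundancy is removed in one temporal direction only."""
    kept = [frames[0]]
    for cur in frames[1:]:
        prev = kept[-1]
        if len(prev) == 0:
            kept.append(cur)
            continue
        sims = cosine(cur[:, None, :], prev[None, :, :]).max(axis=1)
        kept.append(cur[sims < threshold])
    return kept


# Toy usage: 4 frames, each a 16x16 grid of 64-dim patch tokens.
rng = np.random.default_rng(0)
frames = [quadtree_spatial_merge(rng.normal(size=(16 * 16, 64)), grid=16)
          for _ in range(4)]
reduced = directed_temporal_merge(frames)
print([len(f) for f in reduced])  # token count kept per frame
```

In this toy setup, each frame's token grid is collapsed wherever a quadtree cell is internally homogeneous, and tokens in later frames that closely match a token kept from the previous frame are merged away; both stages depend only on the video tokens, consistent with the query-agnostic, training-free behavior described in the abstract.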