

HoliTom: Holistic Token Merging for Fast Video Large Language Models

May 27, 2025
Authors: Kele Shao, Keda Tao, Can Qin, Haoxuan You, Yang Sui, Huan Wang
cs.AI

Abstract

Video large language models (video LLMs) excel at video comprehension but suffer significant computational inefficiency due to redundant video tokens. Existing token pruning methods offer solutions. However, approaches operating within the LLM (inner-LLM pruning), such as FastV, incur intrinsic computational overhead in shallow layers. In contrast, methods that prune tokens before the LLM (outer-LLM pruning) primarily address spatial redundancy within individual frames or limited temporal windows, neglecting the crucial global temporal dynamics and correlations across longer video sequences. This leads to sub-optimal spatio-temporal reduction and fails to fully exploit video compressibility. Crucially, the synergistic potential and mutual influence of combining these strategies remain unexplored. To further reduce redundancy, we introduce HoliTom, a novel training-free holistic token merging framework. HoliTom performs outer-LLM pruning through global redundancy-aware temporal segmentation, followed by spatio-temporal merging that reduces visual tokens by over 90%, significantly alleviating the LLM's computational burden. Complementing this, we introduce a robust inner-LLM merging approach based on token similarity, designed for superior performance and compatibility with outer-LLM pruning. Evaluations demonstrate our method's promising efficiency-performance trade-off on LLaVA-OneVision-7B, reducing computational cost to 6.9% of the original FLOPs while maintaining 99.1% of the original performance. Furthermore, we achieve a 2.28x reduction in Time-To-First-Token (TTFT) and a 1.32x speedup in decoding throughput, highlighting the practical benefits of our integrated pruning approach for efficient video LLM inference.
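
To make the core idea concrete, below is a minimal sketch of similarity-based token merging: tokens whose cosine similarity to the current merged group exceeds a threshold are folded into a running mean, shrinking the visual token sequence before it reaches the LLM. This is an illustration only, not the authors' released implementation; the function name `merge_similar_tokens`, the greedy sequential strategy, the tensor shapes, and the 0.9 threshold are all assumptions.

```python
# Minimal sketch of similarity-based token merging (illustration only;
# HoliTom's actual temporal segmentation and spatio-temporal merging
# procedure is more involved). Shapes and threshold are assumptions.
import torch
import torch.nn.functional as F

def merge_similar_tokens(tokens: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """Greedily fold each token into the preceding merged group when their
    cosine similarity exceeds `threshold`. tokens: (num_tokens, dim)."""
    merged = [tokens[0].clone()]
    counts = [1]
    for tok in tokens[1:]:
        sim = F.cosine_similarity(merged[-1], tok, dim=0).item()
        if sim > threshold:
            # Update the running mean of the current merged group.
            counts[-1] += 1
            merged[-1] += (tok - merged[-1]) / counts[-1]
        else:
            # Dissimilar token: start a new group.
            merged.append(tok.clone())
            counts.append(1)
    return torch.stack(merged)

# Example: near-static video content yields near-duplicate tokens that
# collapse into one group, while dissimilar tokens are kept.
base = torch.randn(768)
static = base + 0.01 * torch.randn(32, 768)   # 32 almost-identical tokens
moving = torch.randn(32, 768)                 # 32 mutually dissimilar tokens
out = merge_similar_tokens(torch.cat([static, moving]))
print(out.shape)  # roughly (33, 768): the static run collapses to one token
```

The running mean keeps each merged token an unweighted average of its group, so highly redundant spans (e.g., static scenes) compress aggressively while informative tokens survive, which matches the intuition behind the >90% token reduction reported above.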

