

Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models

March 2, 2026
Authors: Jinlong Li, Liyuan Jiang, Haonan Zhang, Nicu Sebe
cs.AI

Abstract

Video Large Language Models (VLLMs) demonstrate strong video understanding but suffer from inefficiency due to redundant visual tokens. Existing pruning methods primarily target intra-frame spatial redundancy or prune inside the LLM at shallow layers, yielding suboptimal spatiotemporal reduction and underutilizing long-context compressibility. Moreover, they often discard subtle yet informative context carried by the merged or pruned tokens. In this paper, we propose a new perspective that elaborates token Anchors within and across frames to comprehensively aggregate informative contexts via local-global Optimal Transport (AOT). Specifically, we first establish local- and global-aware token anchors within each frame under attention guidance, and then use optimal transport to aggregate informative contexts from the pruned tokens into these intra-frame anchors. Next, building on temporal frame clips, the first frame of each clip is treated as the keyframe anchor, into which similar information from consecutive frames is aggregated through optimal transport, while distinct tokens are kept to represent temporal dynamics, enabling efficient token reduction in a training-free manner. Extensive evaluations show that our proposed AOT achieves competitive performance across various short- and long-video benchmarks on leading video LLMs, delivering substantial computational efficiency while preserving temporal and visual fidelity. Project webpage: https://tyroneli.github.io/AOT.
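The per-frame step described above (keep attention-salient tokens as anchors, then move the content of pruned tokens into them via optimal transport) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `sinkhorn` and `aggregate_tokens`, the cosine-distance cost, the uniform marginals, and the keep ratio are all illustrative assumptions.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iters=50):
    """Entropic-regularized optimal transport (Sinkhorn iterations).

    Returns a transport plan T whose row sums approximate a and
    column sums approximate b."""
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def aggregate_tokens(tokens, attn, keep_ratio=0.25):
    """Keep the top-attention tokens of one frame as anchors, then
    aggregate the remaining (pruned) tokens into them, weighted by
    the optimal-transport plan, so their context is not discarded."""
    n, d = tokens.shape
    k = max(1, int(n * keep_ratio))
    order = np.argsort(-attn)                      # attention-guided ranking
    anchor_idx, pruned_idx = order[:k], order[k:]
    anchors, pruned = tokens[anchor_idx], tokens[pruned_idx]
    # Cost: cosine distance between pruned tokens and anchors.
    pn = pruned / np.linalg.norm(pruned, axis=1, keepdims=True)
    an = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    cost = 1.0 - pn @ an.T
    # Uniform marginals: every pruned token's mass must land on some anchor.
    a = np.full(len(pruned_idx), 1.0 / len(pruned_idx))
    b = np.full(k, 1.0 / k)
    T = sinkhorn(cost, a, b)
    # Each anchor absorbs a plan-weighted average of the pruned tokens.
    merged = anchors + (T.T @ pruned) / (T.sum(axis=0, keepdims=True).T + 1e-9)
    return merged, anchor_idx
```

The same pattern would extend to the inter-frame stage by taking the first frame of each clip as the anchor set and the tokens of subsequent frames as the mass to be transported, while leaving low-similarity tokens unmerged to preserve temporal dynamics.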