LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs

June 27, 2025
作者: Boyuan Sun, Jiaxing Zhao, Xihan Wei, Qibin Hou
cs.AI

Abstract

In this paper, we present LLaVA-Scissor, a training-free token compression strategy designed for video multimodal large language models. Previous methods mostly compress tokens based on attention scores, but fail to capture all semantic regions and often leave token redundancy. In contrast, we leverage the Semantic Connected Components (SCC) approach, which assigns tokens to distinct semantic regions within the token set, ensuring comprehensive semantic coverage. The outcome is a two-step spatio-temporal token compression strategy that applies SCC in both the spatial and temporal domains. This strategy effectively compresses tokens by representing the entire video with a set of non-overlapping semantic tokens. We conduct extensive evaluations of the token compression capabilities of LLaVA-Scissor across diverse video understanding benchmarks, including video question answering, long video understanding, and comprehensive multiple-choice benchmarks. Experimental results show that LLaVA-Scissor outperforms other token compression methods across these benchmarks, particularly at low token retention ratios. Project page: https://github.com/HumanMLLM/LLaVA-Scissor.
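The abstract describes the core idea only at a high level: build a semantic graph over tokens, find its connected components, and keep one representative token per component. The sketch below is a minimal, hypothetical illustration of that idea using cosine similarity and a fixed threshold `tau`; the paper's actual similarity measure, thresholding scheme, and two-step spatio-temporal procedure are not specified in the abstract, so all concrete choices here are assumptions.

```python
import numpy as np

def scc_compress(tokens: np.ndarray, tau: float = 0.8) -> np.ndarray:
    """Illustrative SCC-style compression (not the paper's exact algorithm).

    tokens: (N, D) array of token embeddings.
    tau: cosine-similarity threshold for drawing a semantic edge
         (hypothetical parameter, chosen here for illustration).
    Returns one mean token per connected component, so the output
    is a set of non-overlapping semantic representatives.
    """
    # Pairwise cosine similarity between all tokens.
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    adj = (normed @ normed.T) >= tau  # adjacency of the semantic graph

    # Label connected components with an iterative depth-first search.
    n = len(tokens)
    labels = -np.ones(n, dtype=int)
    n_comp = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        stack = [i]
        labels[i] = n_comp
        while stack:
            u = stack.pop()
            for v in np.where(adj[u] & (labels == -1))[0]:
                labels[v] = n_comp
                stack.append(int(v))
        n_comp += 1

    # Collapse each component to its mean token.
    return np.stack([tokens[labels == c].mean(axis=0) for c in range(n_comp)])
```

For example, four tokens forming two tight semantic clusters compress to two representative tokens, one per component.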