

LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs

June 27, 2025
作者: Boyuan Sun, Jiaxing Zhao, Xihan Wei, Qibin Hou
cs.AI

Abstract

In this paper, we present LLaVA-Scissor, a training-free token compression strategy designed for video multimodal large language models. Previous methods mostly compress tokens based on attention scores, but they fail to capture all semantic regions and often leave token redundancy. In contrast, we propose to leverage the Semantic Connected Components (SCC) approach, which assigns tokens to distinct semantic regions within the token set, ensuring comprehensive semantic coverage. The result is a two-step spatio-temporal token compression strategy that applies SCC in both the spatial and temporal domains. This strategy compresses tokens effectively by representing the entire video with a set of non-overlapping semantic tokens. We conduct extensive evaluations of the token compression capabilities of LLaVA-Scissor across diverse video understanding benchmarks, including video question answering, long video understanding, and comprehensive multiple-choice benchmarks. Experimental results show that LLaVA-Scissor outperforms other token compression methods across these benchmarks, particularly at low token retention ratios. Project page: https://github.com/HumanMLLM/LLaVA-Scissor.
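The abstract only describes the SCC-based compression at a high level. The sketch below is a minimal, illustrative Python example of how such a step could look, not the authors' implementation: the cosine-similarity threshold `tau`, the depth-first component search, and the choice to mean-pool each component into a single token are assumptions made here purely for concreteness.

```python
import numpy as np


def semantic_connected_components(tokens: np.ndarray, tau: float = 0.9) -> list[np.ndarray]:
    """Group tokens into connected components of a similarity graph.

    An edge links two tokens whose cosine similarity exceeds `tau`;
    each connected component is treated as one semantic region.
    """
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    adj = (normed @ normed.T) > tau  # boolean adjacency matrix

    n = tokens.shape[0]
    visited = np.zeros(n, dtype=bool)
    components = []
    for seed in range(n):
        if visited[seed]:
            continue
        stack, comp = [seed], []
        visited[seed] = True
        while stack:  # depth-first search over the similarity graph
            i = stack.pop()
            comp.append(i)
            for j in np.nonzero(adj[i] & ~visited)[0]:
                visited[j] = True
                stack.append(int(j))
        components.append(np.array(comp))
    return components


def compress_tokens(tokens: np.ndarray, tau: float = 0.9) -> np.ndarray:
    """Keep one token per semantic region (here: the mean of each component)."""
    comps = semantic_connected_components(tokens, tau)
    return np.stack([tokens[c].mean(axis=0) for c in comps])


def spatio_temporal_compress(frames: list[np.ndarray],
                             tau_spatial: float = 0.9,
                             tau_temporal: float = 0.85) -> np.ndarray:
    """Two-step compression: per-frame (spatial) SCC, then SCC over the pooled tokens (temporal)."""
    spatial = [compress_tokens(f, tau_spatial) for f in frames]
    return compress_tokens(np.concatenate(spatial, axis=0), tau_temporal)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = [rng.standard_normal((196, 64)).astype(np.float32) for _ in range(8)]  # 8 frames x 196 tokens
    compressed = spatio_temporal_compress(video)
    print(compressed.shape)  # number of retained tokens depends on tau and the data
```

In this toy form, `tau` plays the role of the token retention knob the abstract refers to: lowering it merges more tokens into each component and yields a lower retention ratio. For actual usage, thresholds, distance metrics, and the merging rule should follow the released code at the project page above.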