LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs

Abstract

In this paper, we present LLaVA-Scissor, a training-free token compression strategy designed for video multimodal large language models. Previous methods mostly compress tokens based on attention scores, but they fail to capture all semantic regions effectively and often leave redundant tokens. In contrast, we leverage the Semantic Connected Components (SCC) approach, which assigns tokens to distinct semantic regions within the token set and thereby ensures comprehensive semantic coverage. The result is a two-step spatio-temporal token compression strategy that applies SCC in both the spatial and temporal domains. This strategy compresses tokens effectively by representing the entire video with a set of non-overlapping semantic tokens. We extensively evaluate the token compression capabilities of LLaVA-Scissor across diverse video understanding benchmarks, including video question answering, long video understanding, and comprehensive multiple-choice benchmarks. Experimental results show that LLaVA-Scissor outperforms other token compression methods across these benchmarks, particularly at low token retention ratios. Project page: https://github.com/HumanMLLM/LLaVA-Scissor
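
To make the idea of connected-component token compression concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not specify how the components are built or how each one is summarized, so everything here is an assumption: cosine similarity as the measure, a threshold `tau` that defines graph edges, mean pooling as the component representative, and the function names `component_labels`, `scc_compress`, and `compress_video`.

```python
import math
from typing import Dict, List

Token = List[float]


def cosine(a: Token, b: Token) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def component_labels(tokens: List[Token], tau: float) -> List[int]:
    """Union-find over a graph with an edge wherever similarity > tau (assumed rule)."""
    parent = list(range(len(tokens)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            if cosine(tokens[i], tokens[j]) > tau:
                parent[find(j)] = find(i)  # merge the two components

    return [find(i) for i in range(len(tokens))]


def scc_compress(tokens: List[Token], tau: float) -> List[Token]:
    """Replace each connected component by the mean of its members (assumed pooling)."""
    groups: Dict[int, List[Token]] = {}
    for token, root in zip(tokens, component_labels(tokens, tau)):
        groups.setdefault(root, []).append(token)
    return [[sum(col) / len(col) for col in zip(*members)]
            for members in groups.values()]


def compress_video(frames: List[List[Token]], tau: float) -> List[Token]:
    """Two-step sketch: compress within each frame, then across all frames."""
    spatial = [t for frame in frames for t in scc_compress(frame, tau)]
    return scc_compress(spatial, tau)
```

For example, two identical frames that each hold four tokens clustered around two directions reduce to two tokens in total under `compress_video(frames, tau=0.9)`. Because connectivity is transitive, two tokens can land in the same component without being directly similar to each other, which is what lets every token be assigned to exactly one region.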
