LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs

Abstract

In this paper, we present LLaVA-Scissor, a training-free token compression strategy designed for video multimodal large language models. Previous methods mostly compress tokens based on attention scores, but they fail to capture all semantic regions effectively and often leave redundant tokens. Instead, we propose the Semantic Connected Components (SCC) approach, which assigns tokens to distinct semantic regions within the token set and thereby ensures comprehensive semantic coverage. The outcome is a two-step spatio-temporal token compression strategy that applies SCC in both the spatial and temporal domains. This strategy compresses tokens effectively by representing the entire video with a set of non-overlapping semantic tokens. We conduct extensive evaluations of the token compression capabilities of LLaVA-Scissor across diverse video understanding benchmarks, including video question answering, long video understanding, and comprehensive multiple-choice benchmarks. Experimental results show that LLaVA-Scissor outperforms other token compression methods across these video understanding benchmarks, particularly at low token retention ratios. Project page: https://github.com/HumanMLLM/LLaVA-Scissor.
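The two-step idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the similarity measure, the threshold, or how the tokens of a component are combined, so the cosine similarity, the threshold `tau`, the mean-pooling of each component, and all function names below are assumptions made purely for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def connected_components(tokens, tau):
    """Group token indices into connected components of the graph whose
    edges link tokens with cosine similarity above tau (assumed parameter)."""
    parent = list(range(len(tokens)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            if cosine(tokens[i], tokens[j]) > tau:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(tokens)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def compress(tokens, tau):
    """Replace each component by the mean of its member tokens
    (mean-pooling is an assumption, not taken from the abstract)."""
    out = []
    for members in connected_components(tokens, tau):
        dim = len(tokens[0])
        out.append(
            [sum(tokens[m][d] for m in members) / len(members) for d in range(dim)]
        )
    return out


def two_step_compress(frames, tau):
    """Spatial step within each frame, then a temporal step over all
    tokens that survive the spatial step."""
    spatial = [tok for frame in frames for tok in compress(frame, tau)]
    return compress(spatial, tau)
```

Because every token belongs to exactly one component, the compressed set is non-overlapping and no region of the token set is dropped, which is the property the abstract contrasts with attention-score-based selection. In this sketch a higher `tau` yields more components and therefore a higher token retention ratio.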
