LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs

Abstract

In this paper, we present LLaVA-Scissor, a training-free token compression strategy designed for video multimodal large language models. Previous methods mostly compress tokens based on attention scores, but they fail to capture all semantic regions and often leave redundant tokens. In contrast, we leverage the Semantic Connected Components (SCC) approach, which assigns tokens to distinct semantic regions within the token set and thereby ensures comprehensive semantic coverage. Building on SCC, we propose a two-step spatio-temporal token compression strategy that applies SCC in both the spatial and temporal domains. This strategy compresses tokens effectively by representing the entire video with a set of non-overlapping semantic tokens. We extensively evaluate the token compression capabilities of LLaVA-Scissor across diverse video understanding benchmarks, including video question answering, long video understanding, and comprehensive multiple-choice benchmarks. Experimental results show that LLaVA-Scissor outperforms other token compression methods across these benchmarks, particularly at low token retention ratios. Project page: https://github.com/HumanMLLM/LLaVA-Scissor.
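The two-step idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how token similarity is measured, how components are formed, or how a component is reduced to one token. The sketch below assumes cosine similarity, a hypothetical threshold `tau` shared by both steps, union-find grouping, and mean pooling within each component; all function names are illustrative.

```python
import math
from typing import Dict, List

Token = List[float]


def cosine(a: Token, b: Token) -> float:
    """Cosine similarity between two token vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def connected_components(tokens: List[Token], tau: float) -> List[List[int]]:
    """Group token indices: two tokens are linked when similarity >= tau."""
    parent = list(range(len(tokens)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            if cosine(tokens[i], tokens[j]) >= tau:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    groups: Dict[int, List[int]] = {}
    for i in range(len(tokens)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def compress(tokens: List[Token], tau: float) -> List[Token]:
    """Replace each component by the mean of its tokens (assumed pooling rule)."""
    out: List[Token] = []
    for comp in connected_components(tokens, tau):
        dim = len(tokens[comp[0]])
        out.append([sum(tokens[i][d] for i in comp) / len(comp) for d in range(dim)])
    return out


def compress_video(frames: List[List[Token]], tau: float) -> List[Token]:
    """Step 1: compress within each frame. Step 2: compress across frames."""
    spatial = [tok for frame in frames for tok in compress(frame, tau)]
    return compress(spatial, tau)


# Toy input: two frames of 2-D tokens, five tokens in total.
frames = [
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    [[1.0, 0.05], [0.0, 1.0]],
]
compressed = compress_video(frames, tau=0.9)
```

In this toy input the five tokens fall into two semantic groups, so the spatial step keeps two tokens per frame and the temporal step merges the matching groups across frames. Because each token belongs to exactly one component, the retained tokens do not overlap, which mirrors the coverage property the abstract emphasizes.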
