Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs

Abstract

Video large language models (LLMs) achieve strong video understanding by leveraging a large number of spatio-temporal tokens, but their computational cost scales quadratically with the token count. To address this, we propose STTM, a training-free spatio-temporal token merging method. Our key insight is to exploit the local spatial and temporal redundancy in video data, which prior work has overlooked. STTM first transforms each frame into multi-granular spatial tokens using a coarse-to-fine search over a quadtree structure, then performs directed pairwise merging across the temporal dimension. This decomposed merging approach outperforms existing token reduction methods across six video QA benchmarks. Notably, STTM achieves a 2× speed-up with only a 0.5% accuracy drop under a 50% token budget, and a 3× speed-up with just a 2% drop under a 30% budget. Moreover, STTM is query-agnostic, allowing KV cache reuse across different questions about the same video. The project page is available at https://www.jshyun.me/projects/sttm.
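To make the two-stage idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a square frame grid whose side is a power of two, cosine similarity against the mean-pooled region as the merge criterion, and a temporal rule that absorbs a token into the previous frame when a sufficiently similar token exists at the same position and granularity. The function names (`quadtree_merge`, `temporal_merge`), the threshold value, and the toy 2-D embeddings are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_vec(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def quadtree_merge(grid, threshold, row=0, col=0, size=None):
    """Coarse-to-fine search: try to represent a square region with one
    pooled token; split into four quadrants only when that fails.
    Returns a list of (row, col, size, vector) leaves.
    Assumed criterion: every token in the region must be similar
    enough to the region's mean."""
    if size is None:
        size = len(grid)
    cells = [grid[r][c]
             for r in range(row, row + size)
             for c in range(col, col + size)]
    pooled = mean_vec(cells)
    if size == 1 or all(cosine(pooled, v) >= threshold for v in cells):
        return [(row, col, size, pooled)]
    half = size // 2
    leaves = []
    for dr in (0, half):
        for dc in (0, half):
            leaves += quadtree_merge(grid, threshold, row + dr, col + dc, half)
    return leaves


def temporal_merge(frames, threshold):
    """Directed merging from later frames into earlier ones (assumed rule):
    a token is absorbed, here simply dropped, when the preceding frame has a
    leaf at the same position and size with similarity >= threshold."""
    kept = []
    previous = {}  # (row, col, size) -> vector from the preceding frame
    for leaves in frames:
        survivors = []
        for r, c, s, vec in leaves:
            earlier = previous.get((r, c, s))
            if earlier is None or cosine(earlier, vec) < threshold:
                survivors.append((r, c, s, vec))
        kept.append(survivors)
        previous = {(r, c, s): v for r, c, s, v in leaves}
    return kept


# Toy 4x4 frame: three uniform quadrants and one mixed quadrant.
A, B = [1.0, 0.0], [0.0, 1.0]
grid = [
    [A, A, B, B],
    [A, A, B, B],
    [A, A, A, B],
    [A, A, B, A],
]
leaves = quadtree_merge(grid, threshold=0.9)   # 16 tokens become 7
kept = temporal_merge([leaves, leaves], threshold=0.9)
```

In this toy case the three uniform quadrants each collapse into a single coarse token, while the mixed quadrant stays at full resolution. Because the second frame is identical to the first, all of its tokens are absorbed. Nothing in the sketch depends on a question, which mirrors the query-agnostic property described in the abstract.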
