Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models

Abstract

Video Large Language Models (VLLMs) demonstrate strong video understanding but suffer from inefficiency due to redundant visual tokens. Existing pruning methods primarily target intra-frame spatial redundancy, or prune inside the LLM and thus still incur shallow-layer overhead, which yields suboptimal spatiotemporal reduction and underutilizes long-context compressibility. These methods also often discard subtle yet informative context from merged or pruned tokens. In this paper, we propose a new perspective that establishes token Anchors within and across frames to comprehensively aggregate informative context via local-global Optimal Transport (AOT). Specifically, we first establish local- and global-aware token anchors within each frame under attention guidance, and then use optimal transport to aggregate the informative context of the pruned tokens into them, constructing the intra-frame token anchors. Next, building on temporal frame clips, we treat the first frame of each clip as the keyframe anchor, which absorbs similar information from the consecutive frames through optimal transport while distinct tokens are kept to represent temporal dynamics. This leads to efficient token reduction in a training-free manner. Extensive evaluations show that AOT achieves competitive performance across various short- and long-video benchmarks on leading video LLMs, delivering substantial computational efficiency while preserving temporal and visual fidelity. Project webpage: https://tyroneli.github.io/AOT.
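To make the intra-frame step more concrete, the following is a minimal sketch of anchor selection followed by optimal-transport aggregation. It is not the authors' implementation: the top-k attention rule for choosing anchors, the cosine cost, the uniform marginals, the entropic (Sinkhorn) solver, and the names `sinkhorn_plan`, `aggregate_into_anchors`, `eps`, and `mix` are all assumptions introduced here for illustration.

```python
import math


def cosine_cost(a, b):
    """Transport cost between two token features: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb + 1e-12)


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic OT plan with uniform marginals (assumed).

    Rows index pruned tokens, columns index anchors.
    """
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    row_mass, col_mass = 1.0 / n, 1.0 / m
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_mass / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def aggregate_into_anchors(tokens, attn, num_anchors, eps=0.1, mix=0.5):
    """Keep the top-`num_anchors` tokens by attention score (assumed rule) and
    fold the remaining tokens into them through the transport plan."""
    order = sorted(range(len(tokens)), key=lambda i: attn[i], reverse=True)
    anchor_idx = sorted(order[:num_anchors])
    pruned_idx = sorted(order[num_anchors:])
    anchors = [list(tokens[i]) for i in anchor_idx]
    pruned = [tokens[i] for i in pruned_idx]
    if not pruned:
        return anchors
    cost = [[cosine_cost(p, a) for a in anchors] for p in pruned]
    plan = sinkhorn_plan(cost, eps=eps)
    updated = []
    for j, anchor in enumerate(anchors):
        mass = sum(plan[i][j] for i in range(len(pruned)))
        # Barycentric projection: plan-weighted mean of the pruned tokens.
        context = [
            sum(plan[i][j] * pruned[i][d] for i in range(len(pruned))) / mass
            for d in range(len(anchor))
        ]
        # `mix` is a hypothetical blending weight between anchor and context.
        updated.append([(1 - mix) * x + mix * y for x, y in zip(anchor, context)])
    return updated


if __name__ == "__main__":
    frame_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    attention = [0.9, 0.2, 0.8, 0.1]
    print(aggregate_into_anchors(frame_tokens, attention, num_anchors=2))
```

Under the same assumptions, the inter-frame step could reuse `sinkhorn_plan` with the tokens of a clip's first frame as the columns and the tokens of the following frames as the rows, keeping aside any token whose cost to every keyframe anchor stays high so that temporal dynamics are preserved.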


