EarlyTom: Achieving Fast Video Understanding with Early Token Compression

Abstract

Video large language models (Video-LLMs) have demonstrated strong capabilities in video understanding tasks. However, their practical deployment is still hindered by the cost of processing massive numbers of visual tokens. Recent approaches achieve extremely low token retention ratios while maintaining accuracy comparable to full-token baselines, but most of them compress only at a late stage of prefilling, which leaves the vision encoder itself unoptimized. In this paper, we first show that vision encoding accounts for a large portion of the time-to-first-token (TTFT). Compressing visual tokens inside the vision encoder, rather than only after it, therefore remains a largely unexplored opportunity. Based on this insight, we propose EarlyTom, a training-free token compression framework that performs early-stage visual token compression inside the vision encoder, yielding substantially larger TTFT reductions and higher throughput. In addition, we introduce a decoupled spatial token selection strategy that improves overall compression effectiveness. On a single NVIDIA A100 GPU with the LLaVA-OneVision-7B model, EarlyTom reduces TTFT by up to 2.65x and FLOPs by up to 61% while maintaining accuracy comparable to the full-token baseline. These improvements make Video-LLMs substantially more practical to deploy in real-world production scenarios.