Linearly Scaling Video VLMs for Long-Video Understanding

Abstract

Video vision-language models (VLMs) are increasingly used in long-horizon and streaming settings, yet most video encoders still rely on spatiotemporal self-attention, so compute and latency grow quadratically with the number of frames. Existing efficiency methods improve scalability but often lose accuracy relative to full self-attention, for example through aggressive frame/token dropping or coarse attention approximations. We introduce StateKV, an inference-time method that adapts pretrained long-video VLMs to linear-time video prefill. It carries cross-frame context in a fixed-capacity, importance-based recurrent state and pairs that state with a second, full per-frame cache used for decoding. Across three long-video benchmarks and seven models spanning three families and multiple scales, StateKV stays close to full self-attention and consistently outperforms the dominant sliding-window/recency-based streaming approximations, without fine-tuning or architectural changes. StateKV also reduces video-prefill cost, measured in FLOPs, which makes it possible to run a larger model within a fixed compute budget and thereby reach higher accuracy. These results suggest a practical step toward scalable long-video understanding.
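To make the scaling claim concrete, the following is a minimal sketch, not the StateKV implementation. It counts query-key pairs during video prefill under two simplifying assumptions: bidirectional full attention over all video tokens, and a frame-by-frame scheme in which each frame attends to its own tokens plus a bounded state. The function names, the `state_capacity` parameter, and the top-scoring retention rule in `update_state` are hypothetical illustrations; the abstract does not specify how importance is computed.

```python
from typing import List, Tuple


def full_attention_pairs(num_frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs when every video token attends to every video token."""
    total_tokens = num_frames * tokens_per_frame
    return total_tokens * total_tokens


def state_prefill_pairs(num_frames: int, tokens_per_frame: int,
                        state_capacity: int) -> int:
    """Query-key pairs when each frame attends to itself plus a bounded state.

    Assumption: the state holds at most `state_capacity` tokens, so the
    per-frame cost is bounded and the total grows linearly in `num_frames`.
    """
    pairs = 0
    state_size = 0
    for _ in range(num_frames):
        pairs += tokens_per_frame * (tokens_per_frame + state_size)
        state_size = min(state_capacity, state_size + tokens_per_frame)
    return pairs


def update_state(state: List[Tuple[float, str]],
                 frame_tokens: List[Tuple[float, str]],
                 capacity: int) -> List[Tuple[float, str]]:
    """Keep the `capacity` highest-scoring (score, token_id) entries.

    Hypothetical retention rule: rank by a precomputed importance score.
    """
    merged = state + frame_tokens  # new list; inputs are not mutated
    merged.sort(key=lambda entry: entry[0], reverse=True)
    return merged[:capacity]


if __name__ == "__main__":
    for frames in (4, 8, 16):
        print(frames,
              full_attention_pairs(frames, 10),
              state_prefill_pairs(frames, 10, 5))
```

Under these assumptions, doubling the number of frames quadruples the full-attention count, while the state-based count grows by a fixed amount per added frame once the state is full.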