Long-Form Video Understanding with Linear-Scaling Video VLMs

Abstract

Video vision-language models (VLMs) are increasingly used in long-horizon and streaming settings, yet most video encoders still rely on spatiotemporal self-attention, causing compute and latency to grow quadratically with the number of frames. Existing efficiency methods improve scalability but often lose accuracy relative to full self-attention, for example through aggressive frame/token dropping or coarse attention approximations. We introduce StateKV, an inference-time method that adapts pretrained long-video VLMs to linear-time video prefill by carrying cross-frame context in a fixed-capacity, importance-based recurrent state, paired with a second full per-frame cache used for decoding. Across three long-video benchmarks and seven models spanning three families and multiple scales, StateKV remains close to full self-attention and consistently outperforms the dominant sliding-window / recency-based streaming approximations, without fine-tuning or architectural changes. StateKV also reduces video-prefill cost, measured in FLOPs, enabling stronger accuracy at a fixed compute budget by running larger models. These results suggest a practical step toward scalable long-video understanding.
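To make the scaling argument concrete, the following is a minimal sketch, not the StateKV implementation. It counts query-key pairs for full spatiotemporal self-attention against a prefill in which each frame attends only to itself and a fixed-capacity state, then runs a toy prefill loop that keeps the highest-scoring tokens in the state while storing every frame in a separate full cache. All function names, the scalar importance score, and the top-k update rule are illustrative assumptions; the abstract does not specify how importance is computed or how the state is updated.

```python
from typing import List, Tuple

# Hypothetical token record: (frame index, importance score).
# How importance is actually computed is NOT given in the abstract.
Token = Tuple[int, float]


def full_attention_pairs(num_frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs when every video token attends to every video token."""
    n = num_frames * tokens_per_frame
    return n * n


def state_prefill_pairs(num_frames: int, tokens_per_frame: int,
                        state_capacity: int) -> int:
    """Query-key pairs when each frame attends to itself plus a fixed-size state."""
    return num_frames * tokens_per_frame * (tokens_per_frame + state_capacity)


def update_state(state: List[Token], frame_tokens: List[Token],
                 capacity: int) -> List[Token]:
    """Assumed update rule: keep the `capacity` highest-scoring tokens overall."""
    merged = sorted(state + frame_tokens, key=lambda t: t[1], reverse=True)
    return merged[:capacity]


def prefill(frames: List[List[Token]],
            capacity: int) -> Tuple[List[Token], List[List[Token]]]:
    """Toy prefill: one pass over frames, bounded state, unbounded full cache."""
    state: List[Token] = []
    full_cache: List[List[Token]] = []
    for frame_tokens in frames:
        # A real model would run attention over (state + frame_tokens) here.
        full_cache.append(list(frame_tokens))  # kept in full for decoding
        state = update_state(state, frame_tokens, capacity)
    return state, full_cache


if __name__ == "__main__":
    # Doubling the frame count quadruples full-attention pairs
    # but only doubles the state-based count.
    for frames in (100, 200):
        print(frames,
              full_attention_pairs(frames, 10),
              state_prefill_pairs(frames, 10, 20))
```

Under these assumptions the state-based count is linear in the number of frames because the per-frame cost depends only on the tokens per frame and the state capacity, not on how many frames came before.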