StreamingVLM: Real-Time Understanding for Infinite Video Streams

Abstract

Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Simple sliding window methods are also flawed, as they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference. During inference, we maintain a compact KV cache by reusing the states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. This streaming ability is instilled via a simple supervised fine-tuning (SFT) strategy that applies full attention on short, overlapped video chunks, which effectively mimics the inference-time attention pattern without training on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a new benchmark with videos averaging over two hours that requires dense, per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4o mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy also enhances general VQA abilities without any VQA-specific fine-tuning, improving performance on LongVideoBench by +4.30 and on OVOBench Realtime by +5.96. Code is available at https://github.com/mit-han-lab/streaming-vlm.
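The cache policy described in the abstract (attention sinks, a short window of recent vision tokens, a long window of recent text tokens) can be illustrated with a small sketch. This is not the authors' implementation: the function name `evict`, the `(position, kind)` token representation, and the window sizes are all hypothetical, chosen only to show which entries such a policy would retain. A real system would operate on per-layer key/value tensors rather than on token labels.

```python
from typing import List, Tuple

# Hypothetical token record: (position in the stream, "vision" or "text").
Token = Tuple[int, str]


def _last(items: List[Token], n: int) -> List[Token]:
    """Return the last n items; an explicit n <= 0 check avoids items[-0:]."""
    return items[-n:] if n > 0 else []


def evict(cache: List[Token], n_sink: int,
          vision_window: int, text_window: int) -> List[Token]:
    """Keep sink tokens plus recent vision and text tokens, in stream order.

    Assumption: the first n_sink cached tokens act as attention sinks and
    are always retained, regardless of their kind.
    """
    sinks = cache[:n_sink]
    rest = cache[n_sink:]
    recent_vision = _last([t for t in rest if t[1] == "vision"], vision_window)
    recent_text = _last([t for t in rest if t[1] == "text"], text_window)
    keep = set(sinks) | set(recent_vision) | set(recent_text)
    # Preserve the original ordering of the retained tokens.
    return [t for t in cache if t in keep]


# Toy stream: two leading text tokens, then interleaved frames and text.
kinds = ["text", "text", "vision", "vision", "text", "vision",
         "vision", "text", "vision", "vision", "text", "text"]
cache = list(enumerate(kinds))
kept = evict(cache, n_sink=2, vision_window=2, text_window=3)
print([pos for pos, _ in kept])
```

Under this policy the retained set never exceeds `n_sink + vision_window + text_window` entries, which is what keeps memory and per-step latency bounded however long the stream runs. Using a longer text window than vision window mirrors the asymmetry stated in the abstract.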
