A Simple Baseline for Streaming Video Understanding

Abstract

Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. We challenge this trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf VLM already matches or surpasses published streaming models. We formalize this baseline as SimpleStream and evaluate it against 13 major offline and online video LLM baselines on OVO-Bench and StreamingBench. Despite its simplicity, SimpleStream delivers consistently strong performance. With only 4 recent frames, it reaches 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench. Controlled ablations further show that the value of longer context is backbone-dependent rather than uniformly increasing with model scale, and reveal a consistent perception-memory trade-off: adding more historical context can improve recall, but often weakens real-time perception. This suggests that stronger memory, retrieval, or compression modules should not be taken as evidence of progress unless they clearly outperform SimpleStream under the same protocol. We therefore argue that future streaming benchmarks should separate recent-scene perception from long-range memory, so that performance improvements from added complexity can be evaluated more clearly.
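The sliding-window idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `SlidingWindowStream`, its methods, and the `fake_vlm` stand-in are all hypothetical, and a real setup would pass the retained frames to an actual off-the-shelf VLM instead of a stub. Frames are represented here as integers purely for demonstration.

```python
from collections import deque
from typing import Any, Callable, Deque, List


class SlidingWindowStream:
    """Hypothetical helper: keeps only the most recent `num_frames` frames."""

    def __init__(self, num_frames: int = 4) -> None:
        if num_frames < 1:
            raise ValueError("num_frames must be at least 1")
        # A deque with maxlen drops the oldest frame automatically on append,
        # so no explicit memory, retrieval, or compression module is needed.
        self._window: Deque[Any] = deque(maxlen=num_frames)

    def push(self, frame: Any) -> None:
        """Add the newest frame from the stream."""
        self._window.append(frame)

    def recent_frames(self) -> List[Any]:
        """Return the retained frames, oldest first."""
        return list(self._window)

    def answer(self, question: str, vlm: Callable[[List[Any], str], str]) -> str:
        """Query a VLM-like callable with only the retained frames.

        `vlm` is an assumed interface: (frames, question) -> answer text.
        """
        return vlm(self.recent_frames(), question)


def fake_vlm(frames: List[Any], question: str) -> str:
    """Stub standing in for a real VLM; it just echoes its inputs."""
    return f"{question} | saw frames {frames}"


# Simulate a 10-frame stream; only the last 4 frames reach the model.
stream = SlidingWindowStream(num_frames=4)
for t in range(10):
    stream.push(t)

result = stream.answer("What is happening now?", fake_vlm)
print(result)
```

The window size plays the role of N in the abstract. Varying `num_frames` is one way to probe the perception-memory trade-off described there, since a larger window supplies more history at the cost of a longer input to the model.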
