A Simple Baseline for Streaming Video Understanding

April 2, 2026
Authors: Yujiao Shen, Shulin Tian, Jingkang Yang, Ziwei Liu
cs.AI

Abstract

Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. We challenge this trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf VLM already matches or surpasses published streaming models. We formalize this baseline as SimpleStream and evaluate it against 13 major offline and online video LLM baselines on OVO-Bench and StreamingBench. Despite its simplicity, SimpleStream delivers consistently strong performance. With only 4 recent frames, it reaches 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench. Controlled ablations further show that the value of longer context is backbone-dependent rather than uniformly increasing with model scale, and reveal a consistent perception-memory trade-off: adding more historical context can improve recall, but often weakens real-time perception. This suggests that stronger memory, retrieval, or compression modules should not be taken as evidence of progress unless they clearly outperform SimpleStream under the same protocol. We therefore argue that future streaming benchmarks should separate recent-scene perception from long-range memory, so that performance improvements from added complexity can be evaluated more clearly.
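The sliding-window baseline described above can be sketched in a few lines. This is a minimal illustration, not the authors' released code: `vlm_answer` is a hypothetical stand-in for whatever off-the-shelf VLM inference call is used, and the window size of 4 mirrors the 4-frame setting reported in the abstract.

```python
from collections import deque

class SlidingWindowBaseline:
    """Minimal sketch of a SimpleStream-style baseline: keep only the
    most recent N frames and pass them to an off-the-shelf VLM.
    `vlm_answer` is a placeholder callable, not the paper's actual API."""

    def __init__(self, vlm_answer, window_size=4):
        # vlm_answer: callable(frames, question) -> answer string
        self.vlm_answer = vlm_answer
        # deque with maxlen discards the oldest frame automatically,
        # so no explicit memory, retrieval, or compression module is needed.
        self.window = deque(maxlen=window_size)

    def observe(self, frame):
        # Called once per incoming frame of the video stream.
        self.window.append(frame)

    def query(self, question):
        # Answer using only the most recent N frames.
        return self.vlm_answer(list(self.window), question)
```

The design choice is the whole point: because `deque(maxlen=N)` evicts old frames on append, the baseline's cost per query is constant regardless of stream length, which is what makes it a fair lower bar for more complex memory mechanisms.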
April 7, 2026