A Simple Baseline for Streaming Video Understanding
April 2, 2026
Authors: Yujiao Shen, Shulin Tian, Jingkang Yang, Ziwei Liu
cs.AI
Abstract
Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. We challenge this trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf vision-language model (VLM) already matches or surpasses published streaming models. We formalize this baseline as SimpleStream and evaluate it against 13 major offline and online video LLM baselines on OVO-Bench and StreamingBench. Despite its simplicity, SimpleStream delivers consistently strong performance: with only the 4 most recent frames, it reaches 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench. Controlled ablations further show that the value of longer context is backbone-dependent rather than increasing uniformly with model scale, and reveal a consistent perception-memory trade-off: adding more historical context can improve recall, but often weakens real-time perception. This suggests that stronger memory, retrieval, or compression modules should not be taken as evidence of progress unless they clearly outperform SimpleStream under the same evaluation protocol. We therefore argue that future streaming benchmarks should separate recent-scene perception from long-range memory, so that performance gains attributable to added complexity can be measured more clearly.
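The sliding-window baseline the abstract describes can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: the `vlm` callable and its signature are assumptions standing in for any off-the-shelf vision-language model interface, and the default of 4 frames mirrors the setting reported in the abstract.

```python
from collections import deque


class SimpleStream:
    """Minimal sketch of the sliding-window baseline (not the paper's code).

    Keeps only the most recent `num_frames` frames of the stream and passes
    them, together with the query, to an off-the-shelf VLM.
    """

    def __init__(self, vlm, num_frames=4):
        # `vlm` is a hypothetical callable: vlm(frames, question) -> answer.
        self.vlm = vlm
        # deque with maxlen silently drops the oldest frame on overflow,
        # which is exactly the sliding-window behavior we want.
        self.window = deque(maxlen=num_frames)

    def ingest(self, frame):
        # Append the newest frame; older frames beyond the window are discarded.
        self.window.append(frame)

    def query(self, question):
        # Answer using only the frames currently inside the window --
        # no long-term memory, retrieval, or compression modules.
        return self.vlm(list(self.window), question)
```

In use, frames are ingested as the stream arrives and any query sees at most `num_frames` frames of history, which is what makes the baseline's strong benchmark numbers notable.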