StreamingVLM: Real-Time Understanding for Infinite Video Streams
October 10, 2025
Authors: Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, Song Han
cs.AI
Abstract
Vision-language models (VLMs) could power real-time assistants and autonomous
agents, but they face a critical challenge: understanding near-infinite video
streams without escalating latency and memory usage. Processing entire videos
with full attention leads to quadratic computational costs and poor performance
on long videos. Meanwhile, simple sliding window methods are also flawed, as
they either break coherence or suffer from high latency due to redundant
recomputation. In this paper, we introduce StreamingVLM, a model designed for
real-time, stable understanding of infinite visual input. Our approach is a
unified framework that aligns training with streaming inference. During
inference, we maintain a compact KV cache by reusing states of attention sinks,
a short window of recent vision tokens, and a long window of recent text
tokens. This streaming ability is instilled via a simple supervised fine-tuning
(SFT) strategy that applies full attention on short, overlapped video chunks,
which effectively mimics the inference-time attention pattern without training
on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a
new benchmark with videos averaging over two hours that requires dense,
per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM
achieves a 66.18% win rate against GPT-4o mini and maintains stable, real-time
performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy
also enhances general VQA abilities without any VQA-specific fine-tuning,
improving performance on LongVideoBench by +4.30 and OVOBench Realtime by
+5.96. Code is available at https://github.com/mit-han-lab/streaming-vlm.
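To make the cache policy concrete, below is a minimal Python sketch of the eviction rule the abstract describes: attention-sink states are pinned, recent vision tokens fill a short window, and recent text tokens fill a longer window. The StreamingKVCache class, the token-type tags, and the window sizes (4 sinks, 512 vision tokens, 2048 text tokens) are illustrative assumptions, not the released implementation.

```python
# Minimal sketch (not the released implementation) of the compact KV-cache policy
# described above: pin attention-sink tokens, keep a short window of recent vision
# tokens and a longer window of recent text tokens, and evict everything else.
from dataclasses import dataclass, field
from typing import Any, List, Tuple

VISION, TEXT = "vision", "text"


@dataclass
class StreamingKVCache:
    num_sink: int = 4           # assumed number of attention-sink tokens to pin
    vision_window: int = 512    # assumed short window of recent vision tokens
    text_window: int = 2048     # assumed long window of recent text tokens
    # each entry: (token_type, kv_state); kv_state stands in for the real K/V tensors
    entries: List[Tuple[str, Any]] = field(default_factory=list)

    def append(self, token_type: str, kv_state: Any) -> None:
        """Add the KV state of a newly processed token, then evict to stay compact."""
        self.entries.append((token_type, kv_state))
        self._evict()

    def _evict(self) -> None:
        sinks = self.entries[: self.num_sink]        # sink states are always reused
        rest = self.entries[self.num_sink:]
        vis_idx = [i for i, (t, _) in enumerate(rest) if t == VISION]
        txt_idx = [i for i, (t, _) in enumerate(rest) if t == TEXT]
        keep = set(vis_idx[-self.vision_window:]) | set(txt_idx[-self.text_window:])
        # keep only the most recent vision/text tokens, preserving stream order
        self.entries = sinks + [e for i, e in enumerate(rest) if i in keep]

    def __len__(self) -> int:
        return len(self.entries)


# Usage: per second of video, append the KV states of that frame's vision tokens
# and then of the text tokens decoded for it; the cache size stays bounded.
cache = StreamingKVCache()
for step in range(3_000):
    cache.append(VISION, kv_state=None)   # placeholder for a real K/V tensor
    cache.append(TEXT, kv_state=None)
assert len(cache) <= 4 + 512 + 2048
```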