StreamingVLM: Real-Time Understanding for Infinite Video Streams
October 10, 2025
Authors: Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, Song Han
cs.AI
Abstract
Vision-language models (VLMs) could power real-time assistants and autonomous
agents, but they face a critical challenge: understanding near-infinite video
streams without escalating latency and memory usage. Processing entire videos
with full attention leads to quadratic computational costs and poor performance
on long videos. Meanwhile, simple sliding window methods are also flawed, as
they either break coherence or suffer from high latency due to redundant
recomputation. In this paper, we introduce StreamingVLM, a model designed for
real-time, stable understanding of infinite visual input. Our approach is a
unified framework that aligns training with streaming inference. During
inference, we maintain a compact KV cache by reusing states of attention sinks,
a short window of recent vision tokens, and a long window of recent text
tokens. This streaming ability is instilled via a simple supervised fine-tuning
(SFT) strategy that applies full attention on short, overlapped video chunks,
which effectively mimics the inference-time attention pattern without training
on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a
new benchmark with videos averaging over two hours that requires dense,
per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM
achieves a 66.18% win rate against GPT-4o mini and maintains stable, real-time
performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy
also enhances general VQA abilities without any VQA-specific fine-tuning,
improving performance on LongVideoBench by +4.30 and OVOBench Realtime by
+5.96. Code is available at https://github.com/mit-han-lab/streaming-vlm.
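To make the cache policy in the abstract concrete, here is a minimal sketch of the idea of keeping attention-sink states plus a short window of recent vision tokens and a long window of recent text tokens, so the per-step attention cost stays bounded. This is not the authors' implementation: the class name, field names, and window sizes (StreamingKVCache, num_sinks, vision_window, text_window) are illustrative assumptions, as is the overlapped_chunks helper that mimics the SFT layout of short, overlapping video chunks.

```python
# Minimal sketch (assumed names and sizes, not code from the StreamingVLM repo)
# of a bounded KV cache: pinned attention sinks + a short recent-vision window
# + a long recent-text window, as described in the abstract.
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Tuple


@dataclass
class StreamingKVCache:
    num_sinks: int = 4          # first tokens of the stream, pinned as attention sinks (assumed)
    vision_window: int = 1024   # short window of recent vision tokens (assumed size)
    text_window: int = 4096     # long window of recent text tokens (assumed size)
    sinks: List[Tuple[str, object]] = field(default_factory=list)
    vision: Deque[object] = field(default_factory=deque)
    text: Deque[object] = field(default_factory=deque)

    def append(self, kind: str, kv_state: object) -> None:
        """Add one token's cached key/value state and evict by the sliding policy."""
        # The very first tokens of the stream are kept forever as attention sinks.
        if len(self.sinks) < self.num_sinks:
            self.sinks.append((kind, kv_state))
            return
        if kind == "vision":
            self.vision.append(kv_state)
            while len(self.vision) > self.vision_window:
                self.vision.popleft()   # drop the oldest vision token's KV state
        else:
            self.text.append(kv_state)
            while len(self.text) > self.text_window:
                self.text.popleft()     # drop the oldest text token's KV state

    def __len__(self) -> int:
        # Total cache size is bounded, so attention cost per step stays constant
        # no matter how long the stream runs.
        return len(self.sinks) + len(self.vision) + len(self.text)


def overlapped_chunks(num_seconds: int, chunk: int = 60, overlap: int = 30) -> List[range]:
    """Sketch of the SFT data layout: short, overlapping spans of the stream,
    so full attention within each chunk approximates the inference-time pattern
    without ever training on the full, prohibitively long context."""
    step = chunk - overlap
    return [range(s, min(s + chunk, num_seconds)) for s in range(0, num_seconds, step)]


if __name__ == "__main__":
    cache = StreamingKVCache()
    for t in range(10_000):                      # simulate a long stream
        cache.append("vision", f"kv_frame_{t}")
        cache.append("text", f"kv_token_{t}")
    assert len(cache) <= 4 + 1024 + 4096         # memory stays bounded
    print(len(cache), overlapped_chunks(150)[:3])
```

The point of the sketch is the invariant, not the data structures: old vision states are evicted aggressively while text states persist longer, and the pinned sink states keep the attention distribution stable, which is what allows constant-latency decoding on an unbounded stream.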