Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously
March 12, 2026
Authors: Yiran Guan, Liang Yin, Dingkang Liang, Jianzhong Ju, Zhenbo Luo, Jian Luan, Yuliang Liu, Xiang Bai
cs.AI
Abstract
Online Video Large Language Models (VideoLLMs) play a critical role in supporting responsive, real-time interaction. Existing methods focus mainly on streaming perception and lack a synchronized logical reasoning stream, yet directly applying test-time scaling methods incurs unacceptable response latency. To address this trade-off, we propose Video Streaming Thinking (VST), a novel paradigm for streaming video understanding. VST supports a thinking-while-watching mechanism that activates reasoning over incoming video clips during streaming. By amortizing LLM reasoning latency over video playback, this design improves timely comprehension and coherent cognition while preserving real-time responsiveness.

Furthermore, we introduce a comprehensive post-training pipeline that integrates VST-SFT, which structurally adapts an offline VideoLLM to causal streaming reasoning, and VST-RL, which provides end-to-end improvement through self-exploration in a multi-turn video interaction environment. Additionally, we devise an automated training-data synthesis pipeline that uses video knowledge graphs to generate high-quality streaming QA pairs, with an entity-relation-grounded streaming Chain-of-Thought that enforces multi-evidence reasoning and sustained attention to the video stream.

Extensive evaluations show that VST-7B performs strongly on online benchmarks, e.g., 79.5% on StreamingBench and 59.3% on OVO-Bench, while remaining competitive on offline long-video and reasoning benchmarks. Compared with Video-R1, VST responds 15.7 times faster and achieves a +5.4% improvement on VideoHolmes, demonstrating higher efficiency and strong generalization across diverse video understanding tasks. Code, data, and models will be released at https://github.com/1ranGuan/VST.
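The core idea of amortizing reasoning over playback can be illustrated with a toy sketch. This is not the authors' implementation; the class and method names (`StreamingThinker`, `ingest_clip`, `answer`) are hypothetical, and clip summaries stand in for actual video features. The point is only the control flow: per-clip reasoning happens during streaming, so answering a query at any moment requires no long deliberation.

```python
from dataclasses import dataclass, field

@dataclass
class StreamingThinker:
    """Toy sketch of 'thinking while watching': reasoning notes accumulate
    clip by clip during playback, instead of in one long pass at query time."""
    notes: list = field(default_factory=list)

    def ingest_clip(self, clip_summary: str) -> None:
        # Reasoning latency is amortized here, one incoming clip at a time.
        self.notes.append(f"thought about: {clip_summary}")

    def answer(self, question: str) -> str:
        # At query time only a short synthesis over cached thoughts remains,
        # which is what keeps the response latency low.
        context = "; ".join(self.notes)
        return f"Q: {question} | evidence so far: {context}"

thinker = StreamingThinker()
for clip in ["person enters room", "picks up keys", "leaves"]:
    thinker.ingest_clip(clip)
print(thinker.answer("Where are the keys?"))
```

In the actual system the per-clip step would run an LLM reasoning pass over video tokens, but the latency structure is the same: work is spread across the stream rather than concentrated at the moment of the question.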