Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models
March 12, 2026
Authors: Lu Wang, Zhuoran Jin, Yupu Hao, Yubo Chen, Kang Liu, Yulong Ao, Jun Zhao
cs.AI
Abstract
Multimodal large language models (MLLMs) have shown strong performance on offline video understanding, but most are limited to offline inference or exhibit only weak online reasoning, making multi-turn interaction over continuously arriving video streams difficult. Existing streaming methods typically adopt an interleaved perception-generation paradigm, which prevents concurrent perception and generation and leads to early-memory decay as the stream grows, hurting long-range dependency modeling. We propose Think While Watching, a memory-anchored streaming video reasoning framework that preserves continuous segment-level memory across multi-turn interaction. We build a three-stage, multi-round chain-of-thought dataset and adopt a stage-matched training strategy, while enforcing strict causality through a segment-level streaming causal mask and streaming positional encoding. During inference, we introduce an efficient pipeline that overlaps watching and thinking and adaptively selects the best attention backend. Under both single-round and multi-round streaming input protocols, our method achieves strong results. Built on Qwen3-VL, it improves single-round accuracy by 2.6% on StreamingBench and by 3.79% on OVO-Bench. In the multi-round setting, it maintains performance while reducing output tokens by 56%. Code is available at: https://github.com/wl666hhh/Think_While_Watching/
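The segment-level streaming causal mask described above can be illustrated with a minimal sketch. This is an assumption-laden reconstruction, not the paper's implementation: the function name is invented, and we assume full attention within a segment with strictly causal visibility across segments; the authors' exact masking pattern may differ.

```python
import numpy as np

def segment_streaming_causal_mask(segment_lengths):
    """Build a block-causal attention mask over a stream of video segments.

    Hypothetical sketch: tokens in segment i may attend to every token in
    segments 0..i (full attention inside a segment, strictly causal across
    segment boundaries), so later segments never leak into earlier ones.
    Returns a boolean matrix where mask[q, k] is True if query token q may
    attend to key token k.
    """
    # Assign each token its segment index, e.g. [2, 3] -> [0, 0, 1, 1, 1].
    seg_ids = np.repeat(np.arange(len(segment_lengths)), segment_lengths)
    # Query q sees key k iff k's segment does not come after q's segment.
    return seg_ids[:, None] >= seg_ids[None, :]

mask = segment_streaming_causal_mask([2, 3])
```

In this sketch, tokens of the first segment (indices 0-1) cannot see any token of the second segment (indices 2-4), while the second segment sees the whole prefix, which is the causality constraint the abstract refers to.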