Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models
March 12, 2026
Authors: Lu Wang, Zhuoran Jin, Yupu Hao, Yubo Chen, Kang Liu, Yulong Ao, Jun Zhao
cs.AI
Abstract
Multimodal large language models (MLLMs) have shown strong performance on offline video understanding, but most are limited to offline inference or exhibit only weak online reasoning, making multi-turn interaction over continuously arriving video streams difficult. Existing streaming methods typically adopt an interleaved perception-generation paradigm, which prevents concurrent perception and generation and leads to early-memory decay as the stream grows, hurting long-range dependency modeling. We propose Think While Watching, a memory-anchored streaming video reasoning framework that preserves continuous segment-level memory across multi-turn interaction. We build a three-stage, multi-round chain-of-thought dataset and adopt a stage-matched training strategy, while enforcing strict causality through a segment-level streaming causal mask and streaming positional encoding. During inference, we introduce an efficient pipeline that overlaps watching and thinking and adaptively selects the best attention backend. Under both single-round and multi-round streaming input protocols, our method achieves strong results. Built on Qwen3-VL, it improves single-round accuracy by 2.6% on StreamingBench and by 3.79% on OVO-Bench. In the multi-round setting, it maintains performance while reducing output tokens by 56%. Code is available at: https://github.com/wl666hhh/Think_While_Watching/
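The segment-level streaming causal mask described above can be illustrated with a minimal sketch. This is an assumption-laden reconstruction, not the paper's implementation: the function name is invented, and we assume full attention within a segment with strictly causal visibility across segments; the authors' exact masking pattern may differ.

```python
import numpy as np

def segment_streaming_causal_mask(segment_lengths):
    """Build a block-causal attention mask over a stream of video segments.

    Hypothetical sketch: tokens in segment i may attend to every token in
    segments 0..i (full attention inside a segment, strictly causal across
    segment boundaries), so later segments never leak into earlier ones.
    Returns a boolean matrix where mask[q, k] is True if query token q may
    attend to key token k.
    """
    # Assign each token its segment index, e.g. [2, 3] -> [0, 0, 1, 1, 1].
    seg_ids = np.repeat(np.arange(len(segment_lengths)), segment_lengths)
    # Query q sees key k iff k's segment does not come after q's segment.
    return seg_ids[:, None] >= seg_ids[None, :]

mask = segment_streaming_causal_mask([2, 3])
```

In this sketch, tokens of the first segment (indices 0-1) cannot see any token of the second segment (indices 2-4), while the second segment sees the whole prefix, which is the causality constraint the abstract refers to.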