

STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding

March 29, 2026
作者: Junho Kim, Hosu Lee, James M. Rehg, Minsu Kim, Yong Man Ro
cs.AI

Abstract

Recent progress in video large language models (Video-LLMs) has enabled strong offline reasoning over long and complex videos. However, real-world deployments increasingly require streaming perception and proactive interaction, where video frames arrive online and the system must decide not only what to respond with, but also when to respond. In this work, we revisit proactive activation in streaming video as a structured sequence modeling problem, motivated by the observation that temporal transitions in streaming video naturally form span-structured activation patterns. To capture this span-level structure, we model activation signals jointly over a sliding temporal window and update them iteratively as new frames arrive. We propose STRIDE (Structured Temporal Refinement with Iterative DEnoising), which employs a lightweight masked diffusion module at the activation interface to jointly predict and progressively refine activation signals across the window. Extensive experiments on diverse streaming benchmarks and downstream models demonstrate that STRIDE yields more reliable and temporally coherent proactive responses, significantly improving when-to-speak decision quality in online streaming scenarios.
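To make the windowed-refinement idea concrete, here is a minimal sketch, assuming per-frame activation confidences as a stand-in for the model's outputs. It is not the paper's implementation: the `refine_window` and `stream` functions, the confidence threshold, and the commit schedule are all illustrative assumptions. The sketch mimics masked-diffusion-style refinement by starting with every position in the window undecided ("masked") and committing the most confident positions step by step, while a sliding window re-refines each frame's decision until it leaves the window.

```python
from collections import deque


def refine_window(scores, num_steps=3):
    """Progressively 'denoise' binary activation decisions over one window.

    scores: per-frame activation confidences in [0, 1] (illustrative
    stand-ins for model outputs). All positions start masked (None);
    each step commits the most confident remaining positions.
    """
    n = len(scores)
    decisions = [None] * n             # None = still masked
    per_step = max(1, n // num_steps)  # positions committed per step
    for _ in range(num_steps):
        masked = [i for i in range(n) if decisions[i] is None]
        if not masked:
            break
        # Rank masked positions by confidence (distance from 0.5).
        masked.sort(key=lambda i: abs(scores[i] - 0.5), reverse=True)
        for i in masked[:per_step]:
            decisions[i] = scores[i] >= 0.5
    # Commit any positions still masked after the refinement steps.
    for i in range(n):
        if decisions[i] is None:
            decisions[i] = scores[i] >= 0.5
    return decisions


def stream(frame_scores, window=4):
    """Slide a window over streaming scores, re-refining on each new frame.

    A frame's decision is finalized when it leaves the window, so later
    frames can revise it while it is still in view.
    """
    buf = deque(maxlen=window)
    finals = []
    for s in frame_scores:
        if len(buf) == window:
            # Oldest frame is about to be evicted: finalize its decision.
            finals.append(refine_window(list(buf))[0])
        buf.append(s)
    # Flush: finalize whatever remains in the window.
    finals.extend(refine_window(list(buf)))
    return finals
```

Under this toy scheme, clear-cut frames (confidence far from 0.5) are committed early, while ambiguous frames stay masked longer and benefit from additional context before a when-to-speak decision is locked in.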