

STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding

March 29, 2026
Authors: Junho Kim, Hosu Lee, James M. Rehg, Minsu Kim, Yong Man Ro
cs.AI

Abstract

Recent progress in video large language models (Video-LLMs) has enabled strong offline reasoning over long and complex videos. However, real-world deployments increasingly require streaming perception and proactive interaction, where video frames arrive online and the system must decide not only what to respond, but also when to respond. In this work, we revisit proactive activation in streaming video as a structured sequence modeling problem, motivated by the observation that temporal transitions in streaming video naturally form span-structured activation patterns. To capture this span-level structure, we model activation signals jointly over a sliding temporal window and update them iteratively as new frames arrive. We propose STRIDE (Structured Temporal Refinement with Iterative DEnoising), which employs a lightweight masked diffusion module at the activation interface to jointly predict and progressively refine activation signals across the window. Extensive experiments on diverse streaming benchmarks and downstream models demonstrate that STRIDE shows more reliable and temporally coherent proactive responses, significantly improving when-to-speak decision quality in online streaming scenarios.
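The abstract describes a sliding-window scheme in which per-frame "when to speak" activation signals are jointly predicted and then iteratively refined by a masked denoising module as new frames arrive. A minimal sketch of that control flow is below; it is an illustration only, not the paper's implementation, and `predict_activation` is a hypothetical stand-in for STRIDE's masked diffusion module (here it just emits random confidences and decisions).

```python
import random
from collections import deque

WINDOW = 8          # sliding window length (hypothetical choice)
DENOISE_STEPS = 3   # iterative refinement passes per update
CONF_THRESH = 0.7   # re-mask predictions below this confidence

def predict_activation(window_frames, mask):
    """Hypothetical stand-in for the masked diffusion module: for each
    masked slot, return a (confidence, speak_decision) pair; unmasked
    slots return (None, None) and are left untouched."""
    return [(random.random(), random.random() > 0.5) if m else (None, None)
            for m in mask]

def stride_step(window_frames, decisions, confs):
    """One window update: mask uncertain slots, then iteratively
    re-predict them (the masked-denoising loop)."""
    for _ in range(DENOISE_STEPS):
        mask = [c is None or c < CONF_THRESH for c in confs]
        if not any(mask):
            break  # every slot in the window is already confident
        for i, (p, d) in enumerate(predict_activation(window_frames, mask)):
            if p is not None:
                confs[i], decisions[i] = p, d
    return decisions, confs

# Streaming loop: as each frame arrives, slide the window and refine
# the joint activation signals over all frames still inside it.
frames, decisions, confs = (deque(maxlen=WINDOW) for _ in range(3))
for t in range(20):
    frames.append(f"frame_{t}")   # placeholder for online frame features
    decisions.append(None)        # newest slot starts fully masked
    confs.append(None)
    d, c = stride_step(list(frames), list(decisions), list(confs))
    decisions = deque(d, maxlen=WINDOW)
    confs = deque(c, maxlen=WINDOW)
    speak_now = decisions[-1]     # "when to speak" for the latest frame
```

The key point the sketch captures is that a decision for frame `t` is not final on arrival: it can be revised on later steps while `t` remains inside the window, which is what yields span-consistent activations rather than independent per-frame triggers.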