STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding

Abstract

Recent progress in video large language models (Video-LLMs) has enabled strong offline reasoning over long and complex videos. However, real-world deployments increasingly require streaming perception and proactive interaction: video frames arrive online, and the system must decide not only what to respond but also when to respond. In this work, we revisit proactive activation in streaming video as a structured sequence modeling problem, motivated by the observation that temporal transitions in streaming video naturally form span-structured activation patterns. To capture this span-level structure, we model activation signals jointly over a sliding temporal window and update them iteratively as new frames arrive. We propose STRIDE (Structured Temporal Refinement with Iterative DEnoising), which employs a lightweight masked diffusion module at the activation interface to jointly predict and progressively refine activation signals across the window. Extensive experiments on diverse streaming benchmarks and downstream models demonstrate that STRIDE produces more reliable and temporally coherent proactive responses, significantly improving when-to-speak decision quality in online streaming scenarios.

