OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

Abstract

Omni-proactive streaming video understanding, i.e., autonomously deciding when to speak and what to say from a continuous audio-visual stream, is an emerging capability of omni-modal large language models. Existing benchmarks fall short in three respects: they rely primarily on visual signals, they adopt polling or fixed-timestamp protocols rather than truly proactive evaluation, and they cover only a limited range of tasks. These gaps prevent reliable assessment and differentiation of omni-proactive streaming models. We present OmniPro, the first benchmark to jointly evaluate omni-modal perception, proactive responding, and diverse video understanding tasks. It comprises 2,700 human-verified samples spanning 9 sub-tasks and 3 cognitive levels, covering 6 basic video understanding capabilities. Notably, 84% of samples require audio signals (speech or non-speech), and each sample is annotated with modality-isolation labels to enable fine-grained multimodal analysis. We further introduce a dual-mode evaluation protocol. Probe mode assesses content understanding by querying the model before and after each ground-truth trigger. Online mode evaluates full proactive ability by requiring the model to decide autonomously when to respond to streaming input. Evaluating 11 representative models yields three key findings: (1) audio provides consistent gains, but how well it is utilized varies widely across models; (2) performance degrades significantly over time, indicating limited long-horizon robustness; and (3) non-speech audio perception remains the weakest dimension.
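The dual-mode protocol can be illustrated with a minimal Python sketch. It is not the benchmark's actual implementation: the `Sample` fields, the symmetric probe `offset`, the `tolerance` window, and the greedy matching rule are all assumptions introduced here for illustration, since the abstract does not specify them.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical sample layout; field names are assumptions."""
    duration: float        # stream length in seconds
    triggers: list[float]  # ground-truth trigger times in seconds


def probe_times(sample: Sample, offset: float = 1.0) -> list[tuple[float, float]]:
    """Probe mode: one query before and one after each ground-truth trigger.

    The symmetric `offset` (seconds) is an assumed parameter.
    Query times are clamped to the stream boundaries.
    """
    times = []
    for t in sample.triggers:
        before = max(0.0, t - offset)
        after = min(sample.duration, t + offset)
        times.append((before, after))
    return times


def match_online(sample: Sample, responses: list[float],
                 tolerance: float = 2.0) -> dict[str, int]:
    """Online mode: the model chooses its own response times.

    Assumed scoring rule: each trigger is matched to the earliest unused
    response inside [trigger, trigger + tolerance]. Unmatched triggers are
    misses; unmatched responses are false alarms.
    """
    unused = sorted(responses)
    hits = 0
    for t in sorted(sample.triggers):
        for r in unused:
            if t <= r <= t + tolerance:
                unused.remove(r)  # safe: we stop iterating right after
                hits += 1
                break
    return {
        "hits": hits,
        "missed": len(sample.triggers) - hits,
        "false_alarms": len(unused),
    }
```

Under these assumptions, a 30-second sample with triggers at 5 s and 20 s would be probed at (4 s, 6 s) and (19 s, 21 s) in Probe mode. In Online mode, a model that responds at 5.5 s, 12 s, and 26 s would score one hit, one miss, and two false alarms.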