OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding
May 18, 2026
Authors: Ruixiang Zhao, Jie Yang, Zijie Xin, Tianyi Wang, Fengyun Rao, Jing LYU, Xirong Li
cs.AI
Abstract
Omni-proactive streaming video understanding, i.e., autonomously deciding when to speak and what to say from continuous audio-visual streams, is an emerging capability of omni-modal large language models. Existing benchmarks fall short in three key aspects: they rely primarily on visual signals, adopt polling or fixed-timestamp protocols instead of true proactive evaluation, and cover only a limited range of tasks, preventing reliable assessment and differentiation of omni-proactive streaming models. We present OmniPro, the first benchmark to jointly evaluate omni-modal perception, proactive responding, and diverse video understanding tasks. It comprises 2,700 human-verified samples spanning 9 sub-tasks and 3 cognitive levels, covering 6 basic video understanding capabilities. Notably, 84% of samples require audio signals (speech or non-speech), and each sample is annotated with modality-isolation labels to enable fine-grained multimodal analysis. We further introduce a dual-mode evaluation protocol: Probe mode assesses content understanding by querying the model before and after each ground-truth trigger, while Online mode evaluates full proactive ability by requiring models to autonomously decide when to respond in streaming input. Evaluating 11 representative models reveals three key findings: (1) audio provides consistent gains but with highly variable utilization across models, (2) performance degrades significantly over time, indicating limited long-horizon robustness, and (3) non-speech audio perception remains the weakest dimension.
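To make the dual-mode protocol concrete, the following is a minimal sketch of how the two modes could be scored. It is not the benchmark's actual implementation: the names `Trigger`, `probe_score`, and `online_score`, the `offset` and `window` parameters, and the use of exact string matching in place of a real answer judge are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Trigger:
    """Hypothetical ground-truth annotation for one expected response."""
    time: float   # moment (seconds) at which a response becomes warranted
    answer: str   # reference answer


def probe_score(ask: Callable[[float], Optional[str]],
                triggers: List[Trigger],
                offset: float = 1.0) -> float:
    """Probe mode (assumed form): query the model just before and just after
    each trigger. `ask(t)` returns the model's answer given the stream up to
    time t, or None if it declines. A trigger counts as a hit only when the
    model does not give the answer early and does give it afterwards."""
    if not triggers:
        return 0.0
    hits = 0
    for trig in triggers:
        before = ask(trig.time - offset)
        after = ask(trig.time + offset)
        if before != trig.answer and after == trig.answer:
            hits += 1
    return hits / len(triggers)


def online_score(responses: List[Tuple[float, str]],
                 triggers: List[Trigger],
                 window: float = 2.0) -> Tuple[float, float]:
    """Online mode (assumed form): the model chose its own response times.
    A response matches a trigger if it arrives within `window` seconds after
    the trigger and carries the reference answer. Each response is matched
    at most once. Returns (precision, recall)."""
    used = set()
    hits = 0
    for trig in triggers:
        for i, (t, text) in enumerate(responses):
            if i in used:
                continue
            if trig.time <= t <= trig.time + window and text == trig.answer:
                used.add(i)
                hits += 1
                break
    precision = hits / len(responses) if responses else 0.0
    recall = hits / len(triggers) if triggers else 0.0
    return precision, recall
```

In this sketch, Probe mode isolates content understanding because the query times are supplied by the evaluator, whereas Online mode also penalizes mistimed or spurious responses through the precision term.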