OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

Abstract

Omni-proactive streaming video understanding, i.e., autonomously deciding when to speak and what to say while watching a continuous audio-visual stream, is an emerging capability of omni-modal large language models. Existing benchmarks fall short in three respects: they rely primarily on visual signals, they adopt polling or fixed-timestamp protocols rather than truly proactive evaluation, and they cover only a narrow range of tasks. As a result, they cannot reliably assess or differentiate omni-proactive streaming models. We present OmniPro, the first benchmark to jointly evaluate omni-modal perception, proactive responding, and diverse video understanding tasks. It comprises 2,700 human-verified samples spanning 9 sub-tasks and 3 cognitive levels, covering 6 basic video understanding capabilities. Notably, 84% of samples require audio signals (speech or non-speech), and each sample is annotated with modality-isolation labels to enable fine-grained multimodal analysis. We further introduce a dual-mode evaluation protocol: Probe mode assesses content understanding by querying the model before and after each ground-truth trigger, while Online mode evaluates full proactive ability by requiring the model to decide autonomously when to respond to the streaming input. Evaluating 11 representative models yields three key findings: (1) audio provides consistent gains, but how effectively models exploit it varies widely; (2) performance degrades markedly over time, indicating limited long-horizon robustness; and (3) non-speech audio perception remains the weakest dimension.
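
To make the difference between the two evaluation modes concrete, the following is a minimal Python sketch. It is not the OmniPro scoring code. The abstract does not specify the scoring rules, so everything here is an assumption for illustration: the names `Trigger`, `probe_score`, and `online_score`, the `margin` and `tolerance` parameters, the rule that a probe passes only if the model does not yet give the reference answer before the trigger and does give it afterwards, and the use of exact string matching in place of a real answer-quality judge.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Trigger:
    time: float   # ground-truth trigger time in seconds
    answer: str   # reference answer expected once the trigger has occurred


def probe_score(triggers: List[Trigger],
                ask: Callable[[float], Optional[str]],
                margin: float = 1.0) -> float:
    """Probe mode (assumed rule): query shortly before and shortly after each trigger.

    `ask(t)` stands in for the model: it returns an answer given the stream up
    to time t, or None if the model declines to answer.
    """
    if not triggers:
        return 0.0
    passed = 0
    for trig in triggers:
        before = ask(max(0.0, trig.time - margin))
        after = ask(trig.time + margin)
        # Assumption: the evidence is not available before the trigger, so a
        # pass requires no reference answer before and the reference answer after.
        if before != trig.answer and after == trig.answer:
            passed += 1
    return passed / len(triggers)


def online_score(triggers: List[Trigger],
                 responses: List[Tuple[float, str]],
                 tolerance: float = 2.0) -> Tuple[float, float]:
    """Online mode (assumed rule): the model chooses its own response times.

    Each trigger is greedily matched to the earliest unused response that falls
    in [trigger.time, trigger.time + tolerance] and has the reference answer.
    Returns (precision, recall) over responses and triggers respectively.
    """
    ordered = sorted(responses)
    used = set()
    hits = 0
    for trig in sorted(triggers, key=lambda t: t.time):
        for i, (t, text) in enumerate(ordered):
            if i in used:
                continue
            if trig.time <= t <= trig.time + tolerance and text == trig.answer:
                used.add(i)
                hits += 1
                break
    precision = hits / len(ordered) if ordered else 0.0
    recall = hits / len(triggers) if triggers else 0.0
    return precision, recall
```

In this sketch, Probe mode removes the timing decision entirely (the evaluator chooses when to ask), so it isolates content understanding. Online mode penalizes both missed triggers (lower recall) and mistimed or spurious responses (lower precision), which is why it reflects the full proactive ability.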