OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding
May 18, 2026
Authors: Ruixiang Zhao, Jie Yang, Zijie Xin, Tianyi Wang, Fengyun Rao, Jing LYU, Xirong Li
cs.AI
Abstract
Omni-proactive streaming video understanding, i.e., autonomously deciding when to speak and what to say from continuous audio-visual streams, is an emerging capability of omni-modal large language models. Existing benchmarks fall short in three key aspects: they rely primarily on visual signals, adopt polling or fixed-timestamp protocols instead of true proactive evaluation, and cover only a limited range of tasks, preventing reliable assessment and differentiation of omni-proactive streaming models. We present OmniPro, the first benchmark to jointly evaluate omni-modal perception, proactive responding, and diverse video understanding tasks. It comprises 2,700 human-verified samples spanning 9 sub-tasks and 3 cognitive levels, covering 6 basic video understanding capabilities. Notably, 84% of samples require audio signals (speech or non-speech), and each sample is annotated with modality-isolation labels to enable fine-grained multimodal analysis. We further introduce a dual-mode evaluation protocol: Probe mode assesses content understanding by querying the model before and after each ground-truth trigger, while Online mode evaluates full proactive ability by requiring models to autonomously decide when to respond in streaming input. Evaluating 11 representative models reveals three key findings: (1) audio provides consistent gains but with highly variable utilization across models, (2) performance degrades significantly over time, indicating limited long-horizon robustness, and (3) non-speech audio perception remains the weakest dimension.
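The dual-mode protocol can be illustrated with a minimal Python sketch. The abstract does not specify probe offsets, timing tolerances, or scoring metrics, so everything below is an assumption made for illustration: the names `probe_times` and `match_online`, the `delta` offset, the `window` tolerance, and the precision/recall scoring are hypothetical and are not taken from OmniPro.

```python
def probe_times(triggers, delta=1.0):
    """Probe mode (sketch): one query shortly before and one shortly after
    each ground-truth trigger time. `delta` (seconds) is an assumed offset."""
    return [(max(0.0, t - delta), t + delta) for t in triggers]


def match_online(responses, triggers, window=2.0):
    """Online mode (sketch): the model chooses its own response times.
    A response counts as a hit if it falls within `window` seconds after a
    ground-truth trigger; each response matches at most one trigger.
    `window` and the precision/recall scoring are assumptions."""
    ordered = sorted(responses)
    used = set()
    hits = 0
    for t in sorted(triggers):
        for i, r in enumerate(ordered):
            if i not in used and t <= r <= t + window:
                used.add(i)
                hits += 1
                break
    precision = hits / len(responses) if responses else 0.0
    recall = hits / len(triggers) if triggers else 0.0
    return hits, precision, recall


if __name__ == "__main__":
    gt = [5.0, 19.5]                          # hypothetical trigger times (s)
    print(probe_times(gt))                    # query points for Probe mode
    print(match_online([5.5, 9.0, 20.0], gt)) # timing score for Online mode
```

In this sketch, Probe mode fixes the query times from the annotations, so it isolates content understanding, whereas Online mode scores only the timing the model chose for itself; a fuller evaluation would also have to judge the content of each matched response.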