Video-BrowseComp: Benchmarking Agentic Video Research on Open Web
December 28, 2025
Authors: Zhengyang Liang, Yan Shu, Xiangrui Liu, Minghao Qin, Kaixin Liang, Paolo Rota, Nicu Sebe, Zheng Liu, Lizi Liao
cs.AI
Abstract
The evolution of autonomous agents is redefining information seeking, transitioning from passive retrieval to proactive, open-ended web research. However, while textual and static multimodal agents have seen rapid progress, a significant gap remains in processing the web's most dynamic modality: video. Existing video benchmarks predominantly focus on passive perception, feeding curated clips to models without requiring external retrieval. They fail to evaluate agentic video research, which necessitates actively interrogating video timelines, cross-referencing dispersed evidence, and verifying claims against the open web. To bridge this gap, we present Video-BrowseComp, a challenging benchmark comprising 210 questions tailored for open-web agentic video reasoning. Unlike prior benchmarks, Video-BrowseComp enforces a mandatory dependency on temporal visual evidence, ensuring that answers cannot be derived through text search alone but must be obtained by navigating video timelines to verify external claims. Our evaluation of state-of-the-art models reveals a critical bottleneck: even advanced search-augmented models like GPT-5.1 (w/ Search) achieve only 15.24% accuracy. Our analysis shows that these models rely largely on textual proxies, excelling in metadata-rich domains (e.g., TV shows with plot summaries) but collapsing in metadata-sparse, dynamic environments (e.g., sports, gameplay) where visual grounding is essential. As the first open-web video research benchmark, Video-BrowseComp advances the field beyond passive perception toward proactive video reasoning.