

Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning

January 11, 2026
Authors: Chengwen Liu, Xiaomin Yu, Zhuoyue Chang, Zhe Huang, Shuo Zhang, Heng Lian, Kunyi Wang, Rui Xu, Sen Hu, Jianheng Hou, Hao Peng, Chengwei Qin, Xiaobin Hu, Hong Peng, Ronghao Chen, Huacan Wang
cs.AI

Abstract

In real-world video question answering, videos often provide only localized visual cues, while the verifiable answers are distributed across the open web; models therefore need to jointly perform cross-frame clue extraction, iterative retrieval, and verification via multi-hop reasoning. To bridge this gap, we construct VideoDR, the first video deep research benchmark. VideoDR centers on video-conditioned, open-domain video question answering, requiring cross-frame visual anchor extraction, interactive web retrieval, and multi-hop reasoning over joint video-web evidence; through rigorous human annotation and quality control, we obtain high-quality video deep research samples spanning six semantic domains. We evaluate multiple closed-source and open-source multimodal large language models under both the Workflow and Agentic paradigms, and the results show that the Agentic paradigm is not consistently superior to the Workflow paradigm: its gains depend on a model's ability to maintain the initial video anchors over long retrieval chains. Further analysis identifies goal drift and long-horizon consistency as the core bottlenecks. Overall, VideoDR provides a systematic benchmark for studying video agents in open-web settings and reveals key challenges for next-generation video deep research agents.
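
To make the contrast between the two evaluation paradigms concrete, here is a minimal sketch of how a Workflow-style pipeline and an Agentic-style loop might be wired around a multimodal model and a web-search tool. All names (`VideoQAExample`, `model.extract_anchors`, `model.decide_next_action`, `web_search`) are hypothetical placeholders for illustration, not the benchmark's actual API.

```python
from dataclasses import dataclass


@dataclass
class VideoQAExample:
    frames: list   # sampled video frames (e.g., decoded images)
    question: str  # open-domain question conditioned on the video


def workflow_paradigm(model, example, web_search, max_results=5):
    """Fixed pipeline: extract visual anchors once, search once, then answer."""
    anchors = model.extract_anchors(example.frames, example.question)
    evidence = web_search(" ".join(anchors))[:max_results]
    return model.answer(example.question, anchors, evidence)


def agentic_paradigm(model, example, web_search, max_steps=8):
    """Interleaved loop: the model decides at each step whether to search again
    or to answer, and must keep the initial video anchors consistent across steps
    (the failure mode the abstract calls goal drift)."""
    anchors = model.extract_anchors(example.frames, example.question)
    evidence = []
    for _ in range(max_steps):
        action = model.decide_next_action(example.question, anchors, evidence)
        if action["type"] == "search":
            evidence.extend(web_search(action["query"]))
        else:  # action["type"] == "answer"
            return action["answer"]
    # Step budget exhausted: answer with whatever evidence was gathered.
    return model.answer(example.question, anchors, evidence)
```

Under this framing, the abstract's finding is that the extra flexibility of `agentic_paradigm` only pays off when the model's intermediate search queries stay grounded in the anchors extracted from the video rather than drifting toward web-only context.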