

Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning

January 11, 2026
Authors: Chengwen Liu, Xiaomin Yu, Zhuoyue Chang, Zhe Huang, Shuo Zhang, Heng Lian, Kunyi Wang, Rui Xu, Sen Hu, Jianheng Hou, Hao Peng, Chengwei Qin, Xiaobin Hu, Hong Peng, Ronghao Chen, Huacan Wang
cs.AI

Abstract

In real-world video question answering scenarios, videos often provide only localized visual cues, while verifiable answers are distributed across the open web; models therefore need to jointly perform cross-frame clue extraction, iterative retrieval, and verification via multi-hop reasoning. To bridge this gap, we construct the first video deep research benchmark, VideoDR. VideoDR centers on video-conditioned open-domain video question answering, requiring cross-frame visual anchor extraction, interactive web retrieval, and multi-hop reasoning over joint video-web evidence; through rigorous human annotation and quality control, we obtain high-quality video deep research samples spanning six semantic domains. We evaluate multiple closed-source and open-source multimodal large language models under both the Workflow and Agentic paradigms, and the results show that the Agentic paradigm is not consistently superior to the Workflow paradigm: its gains depend on a model's ability to maintain the initial video anchors over long retrieval chains. Further analysis indicates that goal drift and long-horizon consistency are the core bottlenecks. In sum, VideoDR provides a systematic benchmark for studying video agents in open-web settings and reveals the key challenges for next-generation video deep research agents.
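To make the "watch, reason, search" loop described above concrete, the following is a minimal, hypothetical sketch of an agentic pipeline: extract visual anchors from frames, iteratively query the open web, and stop once multi-hop reasoning over the joint evidence yields an answer. All function names, the stopping rule, and the step budget are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of an agentic video deep research loop.
# Placeholders stand in for the multimodal model and the web retrieval tool.
from dataclasses import dataclass, field


@dataclass
class AgentState:
    anchors: list[str]                                   # visual anchors from video frames
    evidence: list[str] = field(default_factory=list)    # web snippets gathered so far
    answer: str | None = None


def extract_anchors(frames: list[str]) -> list[str]:
    """Placeholder: a multimodal model would turn key frames into textual anchors."""
    return [f"anchor from {f}" for f in frames]


def web_search(query: str) -> str:
    """Placeholder: an open-web retrieval call (search API, browser tool, etc.)."""
    return f"snippet for '{query}'"


def reason(state: AgentState) -> tuple[str | None, str | None]:
    """Placeholder: multi-hop reasoning over joint video + web evidence.

    Returns (answer, next_query); answer stays None until enough evidence is gathered.
    """
    if len(state.evidence) >= 3:                          # toy stopping rule
        return "final answer grounded in joint evidence", None
    # Queries stay tied to the original video anchors; losing this grounding is
    # the goal-drift failure mode the abstract highlights.
    return None, f"{state.anchors[0]} + hop {len(state.evidence) + 1}"


def run_agent(frames: list[str], max_steps: int = 5) -> str | None:
    state = AgentState(anchors=extract_anchors(frames))
    for _ in range(max_steps):
        answer, query = reason(state)
        if answer is not None:                            # agent is confident; stop searching
            state.answer = answer
            break
        state.evidence.append(web_search(query))
    return state.answer


if __name__ == "__main__":
    print(run_agent(["frame_001.jpg", "frame_010.jpg"]))
```

The Workflow paradigm mentioned in the abstract would correspond, under these same assumptions, to running the extract-search-reason stages once in a fixed order rather than letting the model decide when to search again and when to stop.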