WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
August 7, 2025
Authors: Xinyu Geng, Peng Xia, Zhen Zhang, Xinyu Wang, Qiuchen Wang, Ruixue Ding, Chenxi Wang, Jialong Wu, Yida Zhao, Kuan Li, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
cs.AI
Abstract
Web agents such as Deep Research have demonstrated superhuman cognitive
abilities, capable of solving highly challenging information-seeking problems.
However, most research remains primarily text-centric, overlooking visual
information in the real world. This makes multimodal Deep Research highly
challenging, as such agents require much stronger reasoning abilities in
perception, logic, knowledge, and the use of more sophisticated tools compared
to text-based agents. To address this limitation, we introduce WebWatcher, a
multimodal deep-research agent equipped with enhanced vision-language
reasoning capabilities. It leverages high-quality synthetic multimodal
trajectories for efficient cold-start training, utilizes various tools for deep
reasoning, and further enhances generalization through reinforcement learning.
To better evaluate the capabilities of multimodal agents, we propose
BrowseComp-VL, a BrowseComp-style benchmark that requires complex
information retrieval over both visual and textual information.
Experimental results show that WebWatcher significantly outperforms proprietary
baselines, RAG workflows, and open-source agents on four challenging VQA
benchmarks, paving the way for solving complex multimodal
information-seeking tasks.