WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
August 7, 2025
Authors: Xinyu Geng, Peng Xia, Zhen Zhang, Xinyu Wang, Qiuchen Wang, Ruixue Ding, Chenxi Wang, Jialong Wu, Yida Zhao, Kuan Li, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
cs.AI
Abstract
Web agents such as Deep Research have demonstrated superhuman cognitive abilities, capable of solving highly challenging information-seeking problems. However, most research remains primarily text-centric, overlooking visual information in the real world. This makes multimodal Deep Research highly challenging, as such agents require much stronger reasoning abilities in perception, logic, knowledge, and the use of more sophisticated tools than text-based agents. To address this limitation, we introduce WebWatcher, a multimodal agent for Deep Research equipped with enhanced vision-language reasoning capabilities. It leverages high-quality synthetic multimodal trajectories for efficient cold-start training, utilizes various tools for deep reasoning, and further enhances generalization through reinforcement learning. To better evaluate the capabilities of multimodal agents, we propose BrowseComp-VL, a BrowseComp-style benchmark that requires complex information retrieval involving both visual and textual information. Experimental results show that WebWatcher significantly outperforms proprietary baselines, RAG workflows, and open-source agents on four challenging VQA benchmarks, paving the way for solving complex multimodal information-seeking tasks.