WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
August 7, 2025
Authors: Xinyu Geng, Peng Xia, Zhen Zhang, Xinyu Wang, Qiuchen Wang, Ruixue Ding, Chenxi Wang, Jialong Wu, Yida Zhao, Kuan Li, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
cs.AI
Abstract
Web agents such as Deep Research have demonstrated superhuman cognitive
abilities, capable of solving highly challenging information-seeking problems.
However, most research remains primarily text-centric, overlooking visual
information in the real world. This makes multimodal Deep Research highly
challenging, as such agents require much stronger reasoning abilities in
perception, logic, knowledge, and the use of more sophisticated tools compared
to text-based agents. To address this limitation, we introduce WebWatcher, a
multimodal deep-research agent equipped with enhanced vision-language
reasoning capabilities. It leverages high-quality synthetic multimodal
trajectories for efficient cold-start training, utilizes various tools for deep
reasoning, and further enhances generalization through reinforcement learning.
To better evaluate the capabilities of multimodal agents, we propose
BrowseComp-VL, a BrowseComp-style benchmark that requires complex
information retrieval over both visual and textual information.
Experimental results show that WebWatcher significantly outperforms proprietary
baselines, RAG workflows, and open-source agents on four challenging VQA
benchmarks, paving the way for solving complex multimodal
information-seeking tasks.