WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
August 7, 2025
Authors: Xinyu Geng, Peng Xia, Zhen Zhang, Xinyu Wang, Qiuchen Wang, Ruixue Ding, Chenxi Wang, Jialong Wu, Yida Zhao, Kuan Li, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
cs.AI
Abstract
Web agents such as Deep Research have demonstrated superhuman cognitive abilities, capable of solving highly challenging information-seeking problems. However, most research remains primarily text-centric, overlooking visual information in the real world. This makes multimodal Deep Research highly challenging, as such agents require much stronger reasoning abilities in perception, logic, knowledge, and the use of more sophisticated tools than text-based agents. To address this limitation, we introduce WebWatcher, a multimodal agent for Deep Research equipped with enhanced vision-language reasoning capabilities. It leverages high-quality synthetic multimodal trajectories for efficient cold-start training, utilizes various tools for deep reasoning, and further enhances generalization through reinforcement learning. To better evaluate the capabilities of multimodal agents, we propose BrowseComp-VL, a BrowseComp-style benchmark that requires complex information retrieval involving both visual and textual information. Experimental results show that WebWatcher significantly outperforms proprietary baselines, RAG workflows, and open-source agents on four challenging VQA benchmarks, paving the way for solving complex multimodal information-seeking tasks.