WebSailor: Navigating Super-human Reasoning for Web Agent
July 3, 2025
Authors: Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, Weizhou Shen, Junkai Zhang, Dingchu Zhang, Xixi Wu, Yong Jiang, Ming Yan, Pengjun Xie, Fei Huang, Jingren Zhou
cs.AI
Abstract
Transcending human cognitive limitations represents a critical frontier in
LLM training. Proprietary agentic systems like DeepResearch have demonstrated
superhuman capabilities on extremely complex information-seeking benchmarks
such as BrowseComp, a feat previously unattainable. We posit that their success
hinges on a sophisticated reasoning pattern absent in open-source models: the
ability to systematically reduce extreme uncertainty when navigating vast
information landscapes. Based on this insight, we introduce WebSailor, a
complete post-training methodology designed to instill this crucial capability.
Our approach combines the generation of novel, high-uncertainty tasks through
structured sampling and information obfuscation, an RFT cold start, and an
efficient agentic RL training algorithm, Duplicating Sampling Policy
Optimization (DUPO). With this integrated pipeline, WebSailor significantly
outperforms all open-source agents in complex information-seeking tasks,
matching proprietary agents' performance and closing the capability gap.
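
The abstract names "information obfuscation" as one ingredient of its high-uncertainty task generation but does not spell out the mechanics. The sketch below is a purely illustrative, hypothetical example (not the authors' pipeline) of what such obfuscation could look like: a precise seed fact is rewritten into deliberately vague clues, so the resulting question cannot be answered with a single lookup and instead forces the agent to narrow down candidates. The seed_fact, the obfuscate function, and the rewrite rules are all assumptions made for illustration.

```python
import random

# Hypothetical seed fact, standing in for an item sampled from a structured
# source such as a knowledge graph. All values are made up for illustration.
seed_fact = {
    "entity": "ExampleCorp",
    "founded": 2004,
    "founder": "A. Person",
    "headquarters": "Springfield",
}

def obfuscate(fact: dict) -> str:
    """Rewrite precise attributes into vague, hard-to-search clues.

    The intent is to raise the initial uncertainty of the question: each clue
    alone admits many candidates, and only their intersection identifies the
    answer. The specific rewrite rules here are illustrative, not the paper's.
    """
    decade_start = fact["founded"] // 10 * 10
    half = "early" if fact["founded"] % 10 < 5 else "late"
    clues = [
        f"a company founded in the {half} {decade_start}s",
        f"whose founder's surname begins with '{fact['founder'].split()[-1][0]}'",
        f"headquartered in a city whose name begins with '{fact['headquarters'][0]}'",
    ]
    random.shuffle(clues)  # vary which clue leads the question
    return "Which organization is " + ", ".join(clues) + "?"

if __name__ == "__main__":
    # Example output: "Which organization is a company founded in the early
    # 2000s, whose founder's surname begins with 'P', ...?"
    print(obfuscate(seed_fact))
```

The point of the sketch is only the shape of the transformation: exact names, dates, and places are replaced by partial descriptions whose intersection is still unique, which is one plausible way to create the kind of high-uncertainty queries the abstract describes.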