WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning
September 16, 2025
Authors: Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Yida Zhao, Liwen Zhang, Litu Ou, Dingchu Zhang, Xixi Wu, Jialong Wu, Xinyu Wang, Zile Qiao, Zhen Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
cs.AI
Abstract
Transcending human cognitive limitations represents a critical frontier in
LLM training. Proprietary agentic systems like DeepResearch have demonstrated
superhuman capabilities on extremely complex information-seeking benchmarks
such as BrowseComp, a feat previously unattainable. We posit that their success
hinges on a sophisticated reasoning pattern absent in open-source models: the
ability to systematically reduce extreme uncertainty when navigating vast
information landscapes. Based on this insight, we introduce WebSailor, a
complete post-training methodology designed to instill this crucial capability.
Our approach combines three components: generating novel, high-uncertainty
tasks through structured sampling and information obfuscation; a rejection
sampling fine-tuning (RFT) cold start; and an efficient agentic RL training
algorithm, Duplicating Sampling Policy Optimization (DUPO). With this
integrated pipeline, WebSailor significantly
outperforms all open-source agents in complex information-seeking tasks,
matching proprietary agents' performance and closing the capability gap.
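
The abstract names DUPO only in passing. As an assumption based on that one-line description (not the authors' released code), the "duplicating sampling" idea can be read as filling an RL batch by duplicating informative rollout groups rather than re-rolling discarded ones; under a group-relative baseline, a task whose rollouts all receive the same reward yields zero advantage and contributes no gradient. A minimal sketch of that batch construction, with `dupo_fill_batch` being a hypothetical helper name:

```python
import random


def dupo_fill_batch(groups, batch_size, rng=random):
    """Sketch of a DUPO-style batch construction (an assumption, not the
    authors' implementation).

    Each element of `groups` is a list of per-rollout rewards for one task.
    Groups whose rewards are all identical carry zero advantage under a
    group-relative baseline, so they are dropped; the remaining informative
    groups are duplicated (sampled with replacement) to fill the batch,
    avoiding the cost of generating fresh rollouts.
    """
    # Keep only groups with reward variance (non-zero advantage signal).
    informative = [g for g in groups if len(set(g)) > 1]
    if not informative:
        return []
    # Duplicate informative groups until the batch is full.
    batch = list(informative)
    while len(batch) < batch_size:
        batch.append(rng.choice(informative))
    return batch[:batch_size]
```

For example, given three tasks where only one produced mixed rewards, `dupo_fill_batch([[1.0, 1.0], [0.0, 1.0], [0.0, 0.0]], 4)` returns a batch of four copies drawn from the single informative group, so every gradient step is computed on trajectories that actually differentiate success from failure.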