WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning
September 16, 2025
Authors: Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Yida Zhao, Liwen Zhang, Litu Ou, Dingchu Zhang, Xixi Wu, Jialong Wu, Xinyu Wang, Zile Qiao, Zhen Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
cs.AI
Abstract
Transcending human cognitive limitations represents a critical frontier in
LLM training. Proprietary agentic systems like DeepResearch have demonstrated
superhuman capabilities on extremely complex information-seeking benchmarks
such as BrowseComp, a feat previously unattainable. We posit that their success
hinges on a sophisticated reasoning pattern absent in open-source models: the
ability to systematically reduce extreme uncertainty when navigating vast
information landscapes. Based on this insight, we introduce WebSailor, a
complete post-training methodology designed to instill this crucial capability.
Our approach combines three components: generating novel, high-uncertainty
tasks through structured sampling and information obfuscation; a rejection
sampling fine-tuning (RFT) cold start; and an efficient agentic RL training
algorithm, Duplicating Sampling Policy Optimization (DUPO). With this
integrated pipeline, WebSailor significantly
outperforms all open-source agents in complex information-seeking tasks,
matching proprietary agents' performance and closing the capability gap.
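
The abstract names DUPO only in passing. As an assumption based on that one-line description (not the authors' released code), the "duplicating sampling" idea can be read as filling an RL batch by duplicating informative rollout groups rather than re-rolling discarded ones; under a group-relative baseline, a task whose rollouts all receive the same reward yields zero advantage and contributes no gradient. A minimal sketch of that batch construction, with `dupo_fill_batch` being a hypothetical helper name:

```python
import random


def dupo_fill_batch(groups, batch_size, rng=random):
    """Sketch of a DUPO-style batch construction (an assumption, not the
    authors' implementation).

    Each element of `groups` is a list of per-rollout rewards for one task.
    Groups whose rewards are all identical carry zero advantage under a
    group-relative baseline, so they are dropped; the remaining informative
    groups are duplicated (sampled with replacement) to fill the batch,
    avoiding the cost of generating fresh rollouts.
    """
    # Keep only groups with reward variance (non-zero advantage signal).
    informative = [g for g in groups if len(set(g)) > 1]
    if not informative:
        return []
    # Duplicate informative groups until the batch is full.
    batch = list(informative)
    while len(batch) < batch_size:
        batch.append(rng.choice(informative))
    return batch[:batch_size]
```

For example, given three tasks where only one produced mixed rewards, `dupo_fill_batch([[1.0, 1.0], [0.0, 1.0], [0.0, 0.0]], 4)` returns a batch of four copies drawn from the single informative group, so every gradient step is computed on trajectories that actually differentiate success from failure.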