WebSailor: Navigating Super-human Reasoning for Web Agent
July 3, 2025
Authors: Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, Weizhou Shen, Junkai Zhang, Dingchu Zhang, Xixi Wu, Yong Jiang, Ming Yan, Pengjun Xie, Fei Huang, Jingren Zhou
cs.AI
Abstract
Transcending human cognitive limitations represents a critical frontier in
LLM training. Proprietary agentic systems like DeepResearch have demonstrated
superhuman capabilities on extremely complex information-seeking benchmarks
such as BrowseComp, a feat previously unattainable. We posit that their success
hinges on a sophisticated reasoning pattern absent in open-source models: the
ability to systematically reduce extreme uncertainty when navigating vast
information landscapes. Based on this insight, we introduce WebSailor, a
complete post-training methodology designed to instill this crucial capability.
Our approach combines the generation of novel, high-uncertainty tasks through
structured sampling and information obfuscation, an RFT cold start, and an
efficient agentic RL training algorithm, Duplicating Sampling Policy
Optimization (DUPO). With this integrated pipeline, WebSailor significantly
outperforms all open-source agents in complex information-seeking tasks,
matching proprietary agents' performance and closing the capability gap.
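
The abstract names "information obfuscation" as one ingredient of its high-uncertainty task generation but does not spell out the mechanics. The sketch below is a purely illustrative, hypothetical example (not the authors' pipeline) of what such obfuscation could look like: a precise seed fact is rewritten into deliberately vague clues, so the resulting question cannot be answered with a single lookup and instead forces the agent to narrow down candidates. The seed_fact, the obfuscate function, and the rewrite rules are all assumptions made for illustration.

```python
import random

# Hypothetical seed fact, standing in for an item sampled from a structured
# source such as a knowledge graph. All values are made up for illustration.
seed_fact = {
    "entity": "ExampleCorp",
    "founded": 2004,
    "founder": "A. Person",
    "headquarters": "Springfield",
}

def obfuscate(fact: dict) -> str:
    """Rewrite precise attributes into vague, hard-to-search clues.

    The intent is to raise the initial uncertainty of the question: each clue
    alone admits many candidates, and only their intersection identifies the
    answer. The specific rewrite rules here are illustrative, not the paper's.
    """
    decade_start = fact["founded"] // 10 * 10
    half = "early" if fact["founded"] % 10 < 5 else "late"
    clues = [
        f"a company founded in the {half} {decade_start}s",
        f"whose founder's surname begins with '{fact['founder'].split()[-1][0]}'",
        f"headquartered in a city whose name begins with '{fact['headquarters'][0]}'",
    ]
    random.shuffle(clues)  # vary which clue leads the question
    return "Which organization is " + ", ".join(clues) + "?"

if __name__ == "__main__":
    # Example output: "Which organization is a company founded in the early
    # 2000s, whose founder's surname begins with 'P', ...?"
    print(obfuscate(seed_fact))
```

The point of the sketch is only the shape of the transformation: exact names, dates, and places are replaced by partial descriptions whose intersection is still unique, which is one plausible way to create the kind of high-uncertainty queries the abstract describes.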