WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents

Abstract

The paradigm of Large Language Models (LLMs) has increasingly shifted toward agentic applications, where web browsing capabilities are fundamental for retrieving information from diverse online sources. However, existing open-source web agents either demonstrate limited information-seeking abilities on complex tasks or lack transparent implementations. In this work, we identify the scarcity of challenging information-seeking data as the key obstacle. To address this limitation, we introduce WebExplorer: a systematic data generation approach using model-based exploration and iterative, long-to-short query evolution. This method creates challenging query-answer pairs that require multi-step reasoning and complex web navigation. Using our curated high-quality dataset, we develop an advanced web agent, WebExplorer-8B, through supervised fine-tuning followed by reinforcement learning. Our model supports a 128K context length and up to 100 tool-calling turns, enabling long-horizon problem solving. Across diverse information-seeking benchmarks, WebExplorer-8B achieves state-of-the-art performance at its scale. Notably, as an 8B-sized model, WebExplorer-8B searches effectively over an average of 16 turns after RL training, achieving higher accuracy than WebSailor-72B on BrowseComp-en/zh and attaining the best performance among models up to 100B parameters on WebWalkerQA and FRAMES. Beyond these information-seeking tasks, our model also generalizes well to the HLE benchmark even though it is trained only on knowledge-intensive QA data. These results highlight our approach as a practical path toward long-horizon web agents.
