WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents

Abstract

The paradigm of Large Language Models (LLMs) has increasingly shifted toward agentic applications, where web browsing capabilities are fundamental for retrieving information from diverse online sources. However, existing open-source web agents either demonstrate limited information-seeking abilities on complex tasks or lack transparent implementations. In this work, we identify the scarcity of challenging information-seeking data as the key obstacle. To address this limitation, we introduce WebExplorer: a systematic data generation approach using model-based exploration and iterative, long-to-short query evolution. This method creates challenging query-answer pairs that require multi-step reasoning and complex web navigation. Using our curated high-quality dataset, we develop an advanced web agent, WebExplorer-8B, through supervised fine-tuning followed by reinforcement learning. Our model supports a 128K context length and up to 100 tool-calling turns, enabling long-horizon problem solving. Across diverse information-seeking benchmarks, WebExplorer-8B achieves state-of-the-art performance at its scale. Notably, as an 8B-sized model, WebExplorer-8B is able to search effectively over an average of 16 turns after RL training, achieving higher accuracy than WebSailor-72B on BrowseComp-en/zh and attaining the best performance among models up to 100B parameters on WebWalkerQA and FRAMES. Beyond these information-seeking tasks, our model also generalizes strongly to the HLE benchmark, even though it is trained only on knowledge-intensive QA data. These results highlight our approach as a practical path toward long-horizon web agents.
