WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents

Abstract

Large language models (LLMs) are increasingly shifting toward agentic applications, where web browsing capabilities are fundamental for retrieving information from diverse online sources. However, existing open-source web agents either demonstrate limited information-seeking abilities on complex tasks or lack transparent implementations. In this work, we identify the key challenge as the scarcity of challenging data for information seeking. To address this limitation, we introduce WebExplorer, a systematic data generation approach using model-based exploration and iterative, long-to-short query evolution. This method creates challenging query-answer pairs that require multi-step reasoning and complex web navigation. Leveraging our curated high-quality dataset, we develop an advanced web agent, WebExplorer-8B, through supervised fine-tuning followed by reinforcement learning. Our model supports a 128K context length and up to 100 tool-calling turns, enabling long-horizon problem solving. Across diverse information-seeking benchmarks, WebExplorer-8B achieves state-of-the-art performance at its scale. Notably, as an 8B-sized model, WebExplorer-8B searches effectively over an average of 16 turns after RL training, achieving higher accuracy than WebSailor-72B on BrowseComp-en/zh and attaining the best performance among models up to 100B parameters on WebWalkerQA and FRAMES. Beyond these information-seeking tasks, our model also generalizes strongly to the HLE benchmark even though it is trained only on knowledge-intensive QA data. These results highlight our approach as a practical path toward long-horizon web agents.
