WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization

Abstract

The advent of Large Language Model (LLM)-powered agents has revolutionized artificial intelligence by enabling solutions to complex, open-ended tasks through web-based information-seeking (IS) capabilities. However, the scarcity of high-quality training data has limited the development of IS agents. Existing approaches typically adopt an information-driven paradigm that first collects web data and then generates questions based on what was retrieved. This can lead to inconsistency between the information structure and the reasoning structure, and between question and answer. To mitigate this, we propose WebShaper, a formalization-driven IS data synthesis framework for constructing a dataset. WebShaper systematically formalizes IS tasks through set theory. Central to the formalization is the concept of Knowledge Projections (KP), which enables precise control over reasoning structure through compositions of KP operations. During synthesis, we begin by creating seed tasks and then apply a multi-step expansion process. At each step, an agentic Expander makes the current formal question more complex, using retrieval and validation tools based on our formalization. We train our model on the synthesized dataset. Experimental results demonstrate that WebShaper achieves state-of-the-art performance among open-source IS agents on the GAIA and WebWalkerQA benchmarks.
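
The abstract does not spell out how a Knowledge Projection is defined, so the following is only a minimal sketch of one possible reading: a projection takes a relation and a set of target entities and returns the entities related to any of those targets, and a question is built by composing projections with set operations. The names `project` and `intersect`, the pair-based relation encoding, and the toy data are all assumptions made for illustration, not the paper's implementation.

```python
from functools import reduce
from typing import FrozenSet, Iterable, Set, Tuple

Entity = str
Relation = Set[Tuple[Entity, Entity]]  # (subject, object) pairs


def project(relation: Relation, targets: Iterable[Entity]) -> FrozenSet[Entity]:
    """Hypothetical KP: subjects linked by `relation` to any entity in `targets`."""
    wanted = set(targets)
    return frozenset(subj for subj, obj in relation if obj in wanted)


def intersect(*candidate_sets: FrozenSet[Entity]) -> FrozenSet[Entity]:
    """Entities that satisfy every constraint at once (needs at least one set)."""
    return reduce(lambda a, b: a & b, candidate_sets)


# Toy knowledge base; every name and value here is made up for illustration.
PLAYED_FOR = {("player_a", "team_x"), ("player_b", "team_x"), ("player_c", "team_y")}
FOUNDED_IN = {("team_x", "1900"), ("team_y", "1950")}
BORN_IN = {("player_a", "1990"), ("player_b", "1985"), ("player_c", "1990")}

# "Who played for a team founded in 1900 and was born in 1990?"
teams = project(FOUNDED_IN, {"1900"})  # inner projection
answer = intersect(
    project(PLAYED_FOR, teams),        # projection applied to another projection's result
    project(BORN_IN, {"1990"}),        # independent constraint on the same answer set
)
print(sorted(answer))
```

Under this reading, an expansion step would replace a constant target such as `{"1900"}` with a further projection, which deepens the reasoning chain while keeping the answer set well defined.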
