WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization

Abstract

The advent of Large Language Model (LLM)-powered agents has revolutionized artificial intelligence by enabling solutions to complex, open-ended tasks through web-based information-seeking (IS) capabilities. However, the scarcity of high-quality training data has limited the development of IS agents. Existing approaches typically adopt an information-driven paradigm that first collects web data and then generates questions based on the retrieved results. This can lead to inconsistencies between the information structure and the reasoning structure, and between the question and the answer. To mitigate this, we propose WebShaper, a formalization-driven IS data synthesis framework for constructing datasets. WebShaper systematically formalizes IS tasks through set theory. Central to the formalization is the concept of Knowledge Projections (KP), which enables precise control over the reasoning structure through compositions of KP operations. During synthesis, we begin by creating seed tasks and then apply a multi-step expansion process. At each step, an agentic Expander uses retrieval and validation tools, guided by our formalization, to expand the current formal question into a more complex one. We train our model on the synthesized dataset. Experimental results demonstrate that WebShaper achieves state-of-the-art performance among open-source IS agents on the GAIA and WebWalkerQA benchmarks.
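As a rough illustration of what a set-theoretic view of an IS question could look like, the sketch below treats a projection as the set of entities related to at least one value in a given set, and combines projections by union and intersection. This is only one plausible reading of the description above, not the paper's exact definition: the function name `project`, the relation names, and the toy facts are all hypothetical.

```python
from typing import Set, Tuple

# Hypothetical representation: a relation is a set of (entity, value) pairs.
Relation = Set[Tuple[str, str]]


def project(relation: Relation, values: Set[str]) -> Set[str]:
    """Return every entity related to at least one value in `values`."""
    return {entity for (entity, value) in relation if value in values}


# Toy facts, invented purely for illustration.
plays_for: Relation = {("alice", "team_a"), ("bob", "team_a"), ("carol", "team_b")}
born_in: Relation = {("alice", "1990"), ("bob", "1985"), ("carol", "1990")}

# Union inside one projection: entities born in 1990 or in 1985.
born_either_year = project(born_in, {"1990", "1985"})

# Intersection of two projections: plays for team_a AND was born in 1990.
answer = project(plays_for, {"team_a"}) & project(born_in, {"1990"})
```

Under this reading, making a question harder amounts to replacing a known value set with another projection, so the nesting depth of `project` calls reflects the number of reasoning steps.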
