Synthesizing Agentic Data for Web Agents with Progressive Difficulty Enhancement Mechanisms
October 15, 2025
Authors: Shrey Pandit, Xuan-Phi Nguyen, Yifei Ming, Austin Xu, Jiayu Wang, Caiming Xiong, Shafiq Joty
cs.AI
Abstract
Web-based 'deep research' agents aim to solve complex question-answering
tasks through long-horizon interactions with online tools. These tasks remain
challenging, as the underlying language models are often not optimized for
long-horizon reasoning and exploration. Prior work has proposed workflows for
constructing instruction-tuning datasets, often leveraging knowledge graphs.
However, such methods typically lack fine-grained control over difficulty and
quality, yielding synthetic data that falls short of capturing the complexity
required for long-horizon reasoning. Furthermore, many studies conflate data
and training effects by comparing models trained under different optimization
recipes, making it difficult to isolate and evaluate the effectiveness of the
data itself. We introduce a two-pronged data synthesis pipeline that generates
question-answer pairs by progressively increasing task complexity until a
frontier baseline web agent fails. The baseline agent plays multiple roles in
this process: attempting the questions, validating factuality, checking for
alternative answers, and enforcing filtering. To evaluate the effectiveness of
our synthesis methods, we adopt a controlled training setup based on
distillation from strong web agents. Experiments across multiple web-based
benchmarks show that our dataset, despite being smaller, enables the training
of more effective web agents than existing datasets. In particular, our data
exhibits twice the diversity in tool-use actions, allowing models trained on it
to achieve stronger performance while avoiding repetitive tool-calling
behaviors.
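The core synthesis loop described above can be sketched in a few lines. This is a minimal illustrative mock, not the authors' implementation: `agent_attempt` and `complicate` are hypothetical stand-ins for the frontier baseline agent and for one difficulty-enhancement step (e.g. composing an extra retrieval hop), and the hop-counting failure threshold is an assumption made purely so the example runs.

```python
def agent_attempt(question, answer):
    """Stand-in for the frontier baseline web agent: returns True if it
    answers the question correctly. Mocked here as failing once the
    question composes more than a fixed number of hops (an assumed
    capability limit, for illustration only)."""
    return question.count("->") < 3

def complicate(question, answer):
    """Stand-in for one difficulty-enhancement step, e.g. chaining an
    additional retrieval hop onto the existing question."""
    return question + " -> extra-hop", answer

def synthesize(seed_question, seed_answer, max_steps=6):
    """Progressively harden a QA pair until the baseline agent fails,
    then keep that hardest version; return None if the agent never
    fails within the step budget (the pair would be filtered out)."""
    q, a = seed_question, seed_answer
    for _ in range(max_steps):
        if not agent_attempt(q, a):  # agent fails -> hard enough, keep it
            return q, a
        q, a = complicate(q, a)      # agent succeeded -> raise difficulty
    return None

print(synthesize("Who founded X?", "Alice"))
```

In the paper's actual pipeline the same baseline agent also validates factuality, checks for alternative answers, and enforces filtering; those roles would slot in as additional checks before a pair is accepted.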