

Synthesizing Agentic Data for Web Agents with Progressive Difficulty Enhancement Mechanisms

October 15, 2025
Authors: Shrey Pandit, Xuan-Phi Nguyen, Yifei Ming, Austin Xu, Jiayu Wang, Caiming Xiong, Shafiq Joty
cs.AI

Abstract

Web-based 'deep research' agents aim to solve complex question-answering tasks through long-horizon interactions with online tools. These tasks remain challenging, as the underlying language models are often not optimized for long-horizon reasoning and exploration. Prior work has proposed workflows for constructing instruction-tuning datasets, often leveraging knowledge graphs. However, such methods typically lack fine-grained control over difficulty and quality, yielding synthetic data that falls short of capturing the complexity required for long-horizon reasoning. Furthermore, many studies conflate data and training effects by comparing models trained under different optimization recipes, making it difficult to isolate and evaluate the effectiveness of the data itself. We introduce a two-pronged data synthesis pipeline that generates question-answer pairs by progressively increasing task complexity until a frontier baseline web agent fails. The baseline agent plays multiple roles in this process: attempting the questions, validating factuality, checking for alternative answers, and enforcing filtering. To evaluate the effectiveness of our synthesis methods, we adopt a controlled training setup based on distillation from strong web agents. Experiments across multiple web-based benchmarks show that our dataset, despite being smaller, enables the training of more effective web agents than existing datasets. In particular, our data exhibits twice the diversity in tool-use actions, allowing models trained on it to achieve stronger performance while avoiding repetitive tool-calling behaviors.
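The core synthesis loop described in the abstract — harden a question-answer pair step by step until the baseline agent can no longer solve it, then keep that hardest failing variant — could be sketched roughly as below. This is a minimal illustration, not the paper's actual pipeline: the function names (`synthesize_until_failure`, the `agent` and `increase_difficulty` callables) and the integer "difficulty" stand-in are all hypothetical, and the real system additionally uses the agent for factuality validation, alternative-answer checking, and filtering.

```python
def synthesize_until_failure(seed_qa, agent, increase_difficulty, max_rounds=5):
    """Progressively harden a QA pair until the baseline agent fails.

    Returns the first variant the agent could NOT solve, or None if the
    agent solved every variant within max_rounds (hypothetical sketch).
    """
    qa = seed_qa
    for _ in range(max_rounds):
        if not agent(qa):                 # baseline agent fails: keep this variant
            return qa
        qa = increase_difficulty(qa)      # otherwise, make the task harder
    return None                           # agent never failed; discard the seed

# Toy demo: "difficulty" is an integer and the mock agent solves levels <= 2.
mock_agent = lambda qa: qa["difficulty"] <= 2
harder = lambda qa: {**qa, "difficulty": qa["difficulty"] + 1}

kept = synthesize_until_failure({"question": "seed", "difficulty": 0},
                                mock_agent, harder)
```

In this toy run the seed is hardened from difficulty 0 up to 3, the first level the mock agent fails on, and that variant is the one retained for the training set.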