

Explore to Evolve: Scaling Evolved Aggregation Logic via Proactive Online Exploration for Deep Research Agents

October 16, 2025
Authors: Rui Wang, Ce Zhang, Jun-Yu Ma, Jianshu Zhang, Hongru Wang, Yi Chen, Boyang Xue, Tianqing Fang, Zhisong Zhang, Hongming Zhang, Haitao Mi, Dong Yu, Kam-Fai Wong
cs.AI

Abstract

Deep research web agents must not only retrieve information from diverse sources such as web environments, files, and multimodal inputs, but, more importantly, rigorously analyze and aggregate that knowledge into insightful research. However, existing open-source deep research agents focus predominantly on improving information seeking, i.e., locating specific pieces of information, while overlooking the essential need for information aggregation, which limits their ability to support in-depth research. We propose an Explore to Evolve paradigm to scalably construct verifiable training data for web agents. The paradigm begins with proactive online exploration: an agent sources grounded information by exploring the real web. Using the collected evidence, the agent then self-evolves an aggregation program by selecting, composing, and refining operations drawn from 12 high-level logical types, synthesizing a verifiable QA pair. This evolution from high-level guidance to concrete operations allows us to scalably produce WebAggregatorQA, a dataset of 10K samples spanning 50K websites and 11 domains. Building on the open-source agent framework SmolAgents, we collect supervised fine-tuning trajectories to develop a series of foundation models, WebAggregator. WebAggregator-8B matches the performance of GPT-4.1, while the 32B variant surpasses GPT-4.1 by more than 10% on GAIA-text and closely approaches Claude-3.7-sonnet. Moreover, given the limited availability of benchmarks that evaluate web agents' information aggregation abilities, we construct a human-annotated evaluation split of WebAggregatorQA as a challenging test set. On this benchmark, Claude-3.7-sonnet achieves only 28% and GPT-4.1 scores 25.8%. Even when agents manage to retrieve all of the references, they still struggle on WebAggregatorQA, highlighting the need to strengthen the information aggregation capabilities of web agent foundation models.
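The abstract describes the Explore to Evolve pipeline only at a high level. As a rough illustration, the minimal Python sketch below shows one plausible reading of the loop: explore the web for grounded evidence, evolve an aggregation program from a pool of high-level logical operation types, and emit a QA pair that can be verified against the collected sources. Everything here is an assumption for illustration, not the authors' implementation: the function names (`explore_web`, `evolve_aggregation_program`, `synthesize_qa`), the placeholder labels for the 12 logical types, and the random selection standing in for LLM-driven program evolution.

```python
from dataclasses import dataclass
import random

# Placeholder labels for the 12 high-level logical operation types; the abstract
# does not enumerate them, so these names are illustrative only.
LOGICAL_TYPES = [f"logic_type_{i}" for i in range(1, 13)]

@dataclass
class Evidence:
    url: str      # page visited during proactive online exploration
    snippet: str  # grounded text extracted from that page

@dataclass
class QAPair:
    question: str
    answer: str
    program: list[str]  # evolved aggregation program (sequence of operations)
    sources: list[str]  # URLs the answer can be verified against

def explore_web(seed_query: str, budget: int = 5) -> list[Evidence]:
    """Stand-in for proactive online exploration: a real agent would browse the
    live web; synthetic evidence is returned here so the sketch runs offline."""
    return [
        Evidence(url=f"https://example.org/{seed_query}/{i}",
                 snippet=f"fact {i} about {seed_query}")
        for i in range(budget)
    ]

def evolve_aggregation_program(evidence: list[Evidence], steps: int = 3) -> list[str]:
    """Select and compose operations from the high-level types. A real agent
    would use an LLM to propose and iteratively refine the program; random
    choice merely stands in for that step."""
    return [random.choice(LOGICAL_TYPES) for _ in range(steps)]

def synthesize_qa(seed_query: str) -> QAPair:
    """One Explore-to-Evolve iteration: explore, evolve a program, emit a QA pair."""
    evidence = explore_web(seed_query)
    program = evolve_aggregation_program(evidence)
    question = (f"Using the operations {' -> '.join(program)}, aggregate "
                f"{len(evidence)} facts about {seed_query}.")
    answer = "; ".join(e.snippet for e in evidence)  # checkable against the sources
    return QAPair(question, answer, program, [e.url for e in evidence])

if __name__ == "__main__":
    print(synthesize_qa("renewable-energy-capacity"))
```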