
Explore to Evolve: Scaling Evolved Aggregation Logic via Proactive Online Exploration for Deep Research Agents

October 16, 2025
Authors: Rui Wang, Ce Zhang, Jun-Yu Ma, Jianshu Zhang, Hongru Wang, Yi Chen, Boyang Xue, Tianqing Fang, Zhisong Zhang, Hongming Zhang, Haitao Mi, Dong Yu, Kam-Fai Wong
cs.AI

Abstract

Deep research web agents not only retrieve information from diverse sources such as web environments, files, and multimodal inputs, but, more importantly, they need to rigorously analyze and aggregate knowledge for insightful research. However, existing open-source deep research agents predominantly focus on enhancing the information-seeking capabilities of web agents to locate specific information, while overlooking the essential need for information aggregation, which limits their ability to support in-depth research. We propose an Explore to Evolve paradigm to scalably construct verifiable training data for web agents. Beginning with proactive online exploration, an agent sources grounded information by exploring the real web. Using the collected evidence, the agent then self-evolves an aggregation program by selecting, composing, and refining operations from 12 high-level logical types to synthesize a verifiable QA pair. This evolution from high-level guidance to concrete operations allows us to scalably produce WebAggregatorQA, a dataset of 10K samples spanning 50K websites and 11 domains. Based on an open-source agent framework, SmolAgents, we collect supervised fine-tuning trajectories to develop a series of foundation models, WebAggregator. WebAggregator-8B matches the performance of GPT-4.1, while the 32B variant surpasses GPT-4.1 by more than 10% on GAIA-text and closely approaches Claude-3.7-sonnet. Moreover, given the limited availability of benchmarks that evaluate web agents' information aggregation abilities, we construct a human-annotated evaluation split of WebAggregatorQA as a challenging test set. On this benchmark, Claude-3.7-sonnet achieves only 28%, and GPT-4.1 scores 25.8%. Even when agents manage to retrieve all references, they still struggle on WebAggregatorQA, highlighting the need to strengthen the information aggregation capabilities of web agent foundations.
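The synthesis loop described above (explore the web for grounded evidence, then compose aggregation operations over that evidence into a verifiable QA pair) can be sketched in miniature. This is a hypothetical illustration, not the paper's implementation: the operation names, the `explore` stub, and the canned facts are all assumptions; the paper's actual taxonomy comprises 12 high-level logical types and real web browsing.

```python
import random

# Hypothetical subset of the paper's 12 high-level logical operation types;
# the real taxonomy and implementations are not specified in the abstract.
OPERATIONS = {
    "count": lambda facts: len(facts),
    "max": lambda facts: max(f["value"] for f in facts),
    "sum": lambda facts: sum(f["value"] for f in facts),
}

def explore(seed_url):
    """Stand-in for proactive online exploration: in the paper, the agent
    browses the real web; here we return canned grounded facts."""
    return [
        {"entity": "site A", "value": 3, "source": seed_url + "/a"},
        {"entity": "site B", "value": 7, "source": seed_url + "/b"},
    ]

def evolve_aggregation(facts, rng):
    """Select an operation and apply it to the evidence, yielding a QA pair
    whose answer is computable from the facts and hence verifiable."""
    op_name = rng.choice(sorted(OPERATIONS))
    answer = OPERATIONS[op_name](facts)
    question = f"Apply '{op_name}' over the {len(facts)} collected facts."
    return {
        "question": question,
        "answer": answer,
        "sources": [f["source"] for f in facts],
    }

rng = random.Random(0)
facts = explore("https://example.com")
qa = evolve_aggregation(facts, rng)
print(qa)
```

Because the answer is recomputed from the stored evidence, any generated pair can be checked automatically, which is what makes the training data verifiable at scale in the paradigm the abstract describes.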
PDF | October 20, 2025