DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation
January 14, 2026
Authors: Yibo Wang, Lei Wang, Yue Deng, Keming Wu, Yao Xiao, Huanjin Yao, Liwei Kang, Hai Ye, Yongcheng Jing, Lidong Bing
cs.AI
Abstract
Deep research systems are widely used for multi-step web research, analysis, and cross-source synthesis, yet their evaluation remains challenging. Existing benchmarks often require annotation-intensive task construction, rely on static evaluation dimensions, or fail to reliably verify facts when citations are missing. To bridge these gaps, we introduce DeepResearchEval, an automated framework for deep research task construction and agentic evaluation. For task construction, we propose a persona-driven pipeline that generates realistic, complex research tasks anchored in diverse user profiles, applying a two-stage filter (Task Qualification followed by Search Necessity) to retain only tasks requiring multi-source evidence integration and external retrieval. For evaluation, we propose an agentic pipeline with two components: Adaptive Point-wise Quality Evaluation, which dynamically derives task-specific evaluation dimensions, criteria, and weights conditioned on each generated task; and Active Fact-Checking, which autonomously extracts and verifies report statements via web search, even when citations are missing.
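The pipeline described above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's actual implementation: the `Task` fields, the two filter stages, and the weighted scoring function are all hypothetical stand-ins, with the LLM judgments that the framework would produce stubbed out as precomputed booleans and floats.

```python
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    multi_source: bool     # stub for the Task Qualification judgment
    needs_retrieval: bool  # stub for the Search Necessity judgment

def two_stage_filter(tasks):
    """Keep only tasks that pass Task Qualification, then Search Necessity."""
    qualified = [t for t in tasks if t.multi_source]    # stage 1
    return [t for t in qualified if t.needs_retrieval]  # stage 2

def adaptive_score(dim_scores, weights):
    """Weighted point-wise quality score over task-specific dimensions."""
    total = sum(weights.values())
    return sum(dim_scores[d] * w for d, w in weights.items()) / total

tasks = [
    Task("Compare 2024 EV subsidy policies across three markets", True, True),
    Task("What year was Python released?", False, False),
]
kept = two_stage_filter(tasks)  # only the first task survives both stages

# Dimensions, criteria weights, and scores would be derived per task by the
# adaptive evaluator; the values here are placeholders.
score = adaptive_score({"coverage": 0.8, "factuality": 0.9},
                       {"coverage": 2.0, "factuality": 3.0})
```

In practice each stub would be an LLM call (or a web-search-backed verifier, for Active Fact-Checking), but the control flow, filtering then per-dimension weighted scoring, stays the same.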