

DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation

January 14, 2026
Authors: Yibo Wang, Lei Wang, Yue Deng, Keming Wu, Yao Xiao, Huanjin Yao, Liwei Kang, Hai Ye, Yongcheng Jing, Lidong Bing
cs.AI

Abstract

Deep research systems are widely used for multi-step web research, analysis, and cross-source synthesis, yet their evaluation remains challenging. Existing benchmarks often require annotation-intensive task construction, rely on static evaluation dimensions, or fail to reliably verify facts when citations are missing. To bridge these gaps, we introduce DeepResearchEval, an automated framework for deep research task construction and agentic evaluation. For task construction, we propose a persona-driven pipeline that generates realistic, complex research tasks anchored in diverse user profiles, applying a two-stage filter (Task Qualification and Search Necessity) to retain only tasks that require multi-source evidence integration and external retrieval. For evaluation, we propose an agentic pipeline with two components: an Adaptive Point-wise Quality Evaluation that dynamically derives task-specific evaluation dimensions, criteria, and weights conditioned on each generated task, and an Active Fact-Checking component that autonomously extracts and verifies report statements via web search, even when citations are missing.
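The abstract's Adaptive Point-wise Quality Evaluation scores a report against task-specific dimensions with task-specific weights. A minimal sketch of how such weighted point-wise scoring could aggregate into an overall quality score is given below; the dimension names, weights, and score scale are illustrative assumptions, not the paper's actual rubric.

```python
from dataclasses import dataclass

# Hypothetical sketch: each evaluation dimension carries a task-specific
# weight (derived by the evaluator for this task) and a point-wise score
# assigned to the report on that dimension (assumed 0-10 scale).

@dataclass
class Dimension:
    name: str
    weight: float  # task-specific weight
    score: float   # point-wise score for the report

def overall_quality(dimensions: list[Dimension]) -> float:
    """Weight-normalized average of per-dimension scores."""
    total_weight = sum(d.weight for d in dimensions)
    return sum(d.weight * d.score for d in dimensions) / total_weight

# Illustrative dimensions for one hypothetical research task.
dims = [
    Dimension("evidence coverage", weight=0.5, score=8.0),
    Dimension("analytical depth", weight=0.3, score=6.0),
    Dimension("citation accuracy", weight=0.2, score=9.0),
]
print(overall_quality(dims))  # 0.5*8 + 0.3*6 + 0.2*9 = 7.6
```

Because the weights are derived per task rather than fixed, two reports of equal raw quality can score differently depending on what the generated task actually demands.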