DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks
September 1, 2025
Authors: Haiyuan Wan, Chen Yang, Junchi Yu, Meiqi Tu, Jiaxuan Lu, Di Yu, Jianbao Cao, Ben Gao, Jiaqing Xie, Aoran Wang, Wenlong Zhang, Philip Torr, Dongzhan Zhou
cs.AI
Abstract
Deep research agents have attracted growing attention for their potential to
orchestrate multi-stage research workflows, spanning literature synthesis,
methodological design, and empirical verification. Despite these strides,
faithfully evaluating their research capability remains challenging because it
is difficult to collect frontier research questions that genuinely capture
researchers' attention and intellectual curiosity. To address this gap, we
introduce DeepResearch Arena, a benchmark grounded in academic seminars that
capture rich expert discourse and interaction, better reflecting real-world
research environments and reducing the risk of data leakage. To automatically
construct DeepResearch Arena, we propose a Multi-Agent Hierarchical Task
Generation (MAHTG) system that extracts research-worthy inspirations from
seminar transcripts. The MAHTG system then translates these inspirations into
high-quality research tasks, keeping task formulation traceable while filtering
noise. With the MAHTG system, we
curate DeepResearch Arena with over 10,000 high-quality research tasks from
over 200 academic seminars, spanning 12 disciplines, such as literature,
history, and science. Our extensive evaluation shows that DeepResearch Arena
presents substantial challenges for current state-of-the-art agents, with clear
performance gaps observed across different models.
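
The two-stage flow described above (extract inspirations from a transcript, then turn them into tasks that stay traceable while noise is filtered) can be illustrated with a toy sketch. This is not the MAHTG system itself: the abstract does not specify its agents, prompts, or filtering criteria. The class names, the cue-phrase list, and the word-count filter below are all hypothetical stand-ins, with simple keyword matching in place of LLM-based agents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inspiration:
    """A research-worthy remark found in a seminar transcript (hypothetical type)."""
    seminar_id: str
    line_no: int  # zero-based index of the transcript line it came from
    text: str


@dataclass(frozen=True)
class ResearchTask:
    """A task derived from an inspiration (hypothetical type)."""
    prompt: str
    source: Inspiration  # back-pointer that keeps the task traceable


# Assumption: cue phrases standing in for an LLM-based extraction agent.
CUES = ("open question", "unclear why", "nobody has", "future work")


def extract_inspirations(seminar_id: str, transcript: str) -> list[Inspiration]:
    """Stage 1: keep transcript lines that contain a cue phrase."""
    found = []
    for i, line in enumerate(transcript.splitlines()):
        stripped = line.strip()
        if any(cue in stripped.lower() for cue in CUES):
            found.append(Inspiration(seminar_id, i, stripped))
    return found


def generate_tasks(inspirations: list[Inspiration], min_words: int = 6) -> list[ResearchTask]:
    """Stage 2: turn inspirations into tasks, dropping very short ones as noise."""
    tasks = []
    for insp in inspirations:
        if len(insp.text.split()) < min_words:
            continue  # assumed noise filter: too short to be a real question
        tasks.append(ResearchTask(prompt=f"Investigate: {insp.text}", source=insp))
    return tasks


transcript = """Thanks everyone for coming today.
It is an open question whether seminar-grounded tasks reduce data leakage.
Future work, maybe.
"""
tasks = generate_tasks(extract_inspirations("seminar-001", transcript))
```

Each resulting task carries a reference to the seminar and transcript line it was derived from, which is one simple way to realize the traceability property the abstract mentions.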