DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks

Abstract

Deep research agents have attracted growing attention for their potential to orchestrate multi-stage research workflows spanning literature synthesis, methodological design, and empirical verification. Despite this progress, faithfully evaluating their research capability remains challenging, because it is difficult to collect frontier research questions that genuinely capture researchers' attention and intellectual curiosity. To address this gap, we introduce DeepResearch Arena, a benchmark grounded in academic seminars, which capture rich expert discourse and interaction; this grounding better reflects real-world research environments and reduces the risk of data leakage. To construct DeepResearch Arena automatically, we propose a Multi-Agent Hierarchical Task Generation (MAHTG) system that extracts research-worthy inspirations from seminar transcripts and then translates them into high-quality research tasks, ensuring that task formulation remains traceable while filtering out noise. With the MAHTG system, we curate DeepResearch Arena with over 10,000 high-quality research tasks drawn from over 200 academic seminars and spanning 12 disciplines, such as literature, history, and science. Our extensive evaluation shows that DeepResearch Arena presents substantial challenges for current state-of-the-art agents, with clear performance gaps across models.
