Wiki Live Challenge: Challenging Deep Research Agents with Expert-Level Wikipedia Articles

February 2, 2026
Authors: Shaohan Wang, Benfeng Xu, Licheng Zhang, Mingxuan Du, Chiwei Zhu, Xiaorui Wang, Zhendong Mao, Yongdong Zhang
cs.AI

Abstract

Deep Research Agents (DRAs) have demonstrated remarkable capabilities in autonomous information retrieval and report generation, showing great potential to assist humans in complex research tasks. Current evaluation frameworks primarily rely on LLM-generated references or LLM-derived evaluation dimensions. While these approaches scale well, they often lack the reliability of expert-verified content and struggle to provide objective, fine-grained assessments of critical dimensions. To bridge this gap, we introduce Wiki Live Challenge (WLC), a live benchmark that uses the newest Wikipedia Good Articles (GAs) as expert-level references. Wikipedia's strict standards for neutrality, comprehensiveness, and verifiability pose a significant challenge for DRAs, and GAs represent the pinnacle of these standards. We curate a dataset of 100 recent Good Articles and propose Wiki Eval, a comprehensive evaluation framework comprising a fine-grained evaluation method with 39 criteria for writing quality and rigorous metrics for factual verifiability. Extensive experiments on various DRA systems demonstrate a significant gap between current DRAs and human expert-level Wikipedia articles, validating the effectiveness of WLC in advancing agent research. We release our benchmark at https://github.com/WangShao2000/Wiki_Live_Challenge