Wiki Live Challenge: Challenging Deep Research Agents with Expert-Level Wikipedia Articles
February 2, 2026
Authors: Shaohan Wang, Benfeng Xu, Licheng Zhang, Mingxuan Du, Chiwei Zhu, Xiaorui Wang, Zhendong Mao, Yongdong Zhang
cs.AI
Abstract
Deep Research Agents (DRAs) have demonstrated remarkable capabilities in autonomous information retrieval and report generation, showing great potential to assist humans in complex research tasks. Current evaluation frameworks rely primarily on LLM-generated references or LLM-derived evaluation dimensions. While these approaches offer scalability, they often lack the reliability of expert-verified content and struggle to provide objective, fine-grained assessments along critical dimensions. To bridge this gap, we introduce Wiki Live Challenge (WLC), a live benchmark that leverages the newest Wikipedia Good Articles (GAs) as expert-level references. Wikipedia's strict standards for neutrality, comprehensiveness, and verifiability pose a substantial challenge for DRAs, and GAs represent the pinnacle of those standards. We curate a dataset of 100 recent Good Articles and propose Wiki Eval, a comprehensive evaluation framework comprising a fine-grained evaluation method with 39 criteria for writing quality and rigorous metrics for factual verifiability. Extensive experiments on a range of DRA systems reveal a significant gap between current DRAs and human expert-level Wikipedia articles, validating the effectiveness of WLC in advancing agent research. We release our benchmark at https://github.com/WangShao2000/Wiki_Live_Challenge