
LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild

October 16, 2025
Authors: Jiayu Wang, Yifei Ming, Riya Dulepet, Qinglin Chen, Austin Xu, Zixuan Ke, Frederic Sala, Aws Albarghouthi, Caiming Xiong, Shafiq Joty
cs.AI

Abstract

Deep research -- producing comprehensive, citation-grounded reports by searching and synthesizing information from hundreds of live web sources -- marks an important frontier for agentic systems. To rigorously evaluate this ability, four principles are essential: tasks should be (1) user-centric, reflecting realistic information needs, (2) dynamic, requiring up-to-date information beyond parametric knowledge, (3) unambiguous, ensuring consistent interpretation across users, and (4) multi-faceted and search-intensive, requiring search over numerous web sources and in-depth analysis. Existing benchmarks fall short of these principles, often focusing on narrow domains or posing ambiguous questions that hinder fair comparison. Guided by these principles, we introduce LiveResearchBench, a benchmark of 100 expert-curated tasks spanning daily life, enterprise, and academia, each requiring extensive, dynamic, real-time web search and synthesis. Built with over 1,500 hours of human labor, LiveResearchBench provides a rigorous basis for systematic evaluation. To evaluate citation-grounded long-form reports, we introduce DeepEval, a comprehensive suite covering both content- and report-level quality, including coverage, presentation, citation accuracy and association, consistency, and depth of analysis. DeepEval integrates four complementary evaluation protocols, each designed to ensure stable assessment and high agreement with human judgments. Using LiveResearchBench and DeepEval, we conduct a comprehensive evaluation of 17 frontier deep research systems, including single-agent web search, single-agent deep research, and multi-agent systems. Our analysis reveals current strengths, recurring failure modes, and key system components needed to advance reliable, insightful deep research.
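To make the multi-dimensional report scoring described above concrete, here is a minimal illustrative sketch of aggregating per-dimension rubric scores into a report-level score. The dimension names loosely mirror those listed in the abstract, but the function, the 0-1 scale, and the equal weighting are assumptions for illustration; they are not the actual DeepEval protocols.

```python
# Hypothetical rubric dimensions loosely mirroring those named in the
# abstract; the real DeepEval suite uses four complementary protocols
# whose details are not reproduced here.
DIMENSIONS = [
    "coverage",
    "presentation",
    "citation_accuracy",
    "citation_association",
    "consistency",
    "depth_of_analysis",
]

def aggregate_report_score(scores: dict[str, float]) -> float:
    """Average per-dimension scores (assumed on a 0-1 scale) into a
    single report-level score. Equal weighting is an illustrative
    assumption, not the paper's scheme."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
```

In practice a judge (human or model) would fill in each dimension per report, and stable rankings would require the per-dimension scores themselves to agree well with human judgments, which is what the paper's protocols are designed to ensure.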