LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild
October 16, 2025
Authors: Jiayu Wang, Yifei Ming, Riya Dulepet, Qinglin Chen, Austin Xu, Zixuan Ke, Frederic Sala, Aws Albarghouthi, Caiming Xiong, Shafiq Joty
cs.AI
Abstract
Deep research -- producing comprehensive, citation-grounded reports by
searching and synthesizing information from hundreds of live web sources --
marks an important frontier for agentic systems. To rigorously evaluate this
ability, four principles are essential: tasks should be (1) user-centric,
reflecting realistic information needs, (2) dynamic, requiring up-to-date
information beyond parametric knowledge, (3) unambiguous, ensuring consistent
interpretation across users, and (4) multi-faceted and search-intensive,
requiring search over numerous web sources and in-depth analysis. Existing
benchmarks fall short of these principles, often focusing on narrow domains or
posing ambiguous questions that hinder fair comparison. Guided by these
principles, we introduce LiveResearchBench, a benchmark of 100 expert-curated
tasks spanning daily life, enterprise, and academia, each requiring extensive,
dynamic, real-time web search and synthesis. Built with over 1,500 hours of
human labor, LiveResearchBench provides a rigorous basis for systematic
evaluation. To evaluate citation-grounded long-form reports, we introduce
DeepEval, a comprehensive suite covering both content- and report-level
quality, including coverage, presentation, citation accuracy and association,
consistency and depth of analysis. DeepEval integrates four complementary
evaluation protocols, each designed to ensure stable assessment and high
agreement with human judgments. Using LiveResearchBench and DeepEval, we
conduct a comprehensive evaluation of 17 frontier deep research systems,
including single-agent web search, single-agent deep research, and multi-agent
systems. Our analysis reveals current strengths, recurring failure modes, and
key system components needed to advance reliable, insightful deep research.
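
As a concrete mental model of the rubric described above, the following is a minimal Python sketch of a per-report score record. The dimension names follow the abstract's description of DeepEval, but the `ReportEvaluation` class, the unweighted-mean aggregation, and all identifiers are illustrative assumptions, not DeepEval's actual interface.

```python
from dataclasses import dataclass, field
from statistics import mean

# Rubric dimensions named in the abstract; how DeepEval's four protocols
# map onto these is not specified here, so this is only an assumption.
DIMENSIONS = [
    "coverage",
    "presentation",
    "citation_accuracy",
    "citation_association",
    "consistency",
    "depth_of_analysis",
]

@dataclass
class ReportEvaluation:
    """Scores for one generated report, one entry per rubric dimension."""
    system: str
    task_id: str
    scores: dict[str, float] = field(default_factory=dict)

    def overall(self) -> float:
        # Unweighted mean across dimensions -- an assumption; the paper
        # may aggregate the protocol outputs differently.
        return mean(self.scores[d] for d in DIMENSIONS)

# Usage: record scores for one report from a hypothetical system.
ev = ReportEvaluation(system="agent-x", task_id="task-042")
for dim in DIMENSIONS:
    ev.scores[dim] = 0.8  # placeholder; would come from a judge protocol
print(f"{ev.system} overall: {ev.overall():.2f}")
```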