DailyReport: An Open-Ended Benchmark for Evaluating Search Agents on Daily Search Tasks

Abstract

Search Agents (SAs) typically leverage large language models (LLMs) to support complex information-seeking tasks by autonomously exploring web sources and synthesizing information into comprehensive responses. For SA evaluation, prior benchmarks mainly focus on specialized tasks that rarely arise in real-world user scenarios. Moreover, their reliance on coarse task-level rubrics often limits the interpretability of the evaluation. To bridge these gaps, we introduce DailyReport, an open-ended benchmark for evaluating SA capabilities on daily search tasks. It contains 150 open-ended tasks with 3,546 associated rubrics, capturing widely discussed and timely information demands of real-world users. Each task is decomposed into subtasks and evaluated with cascade rubrics across disentangled dimensions. Through cascade performance attribution and user-centric aggregation, we derive highly interpretable scores for each dimension, along with a user preference score. Our results on 17 agentic systems show that current systems still fall short of users' expectations. To facilitate future research, our dataset and code are publicly available at https://github.com/AGI-Eval-Official/DailyReport.