DailyReport: An Open-Ended Benchmark for Evaluating Search Agents on Daily Search Tasks

Abstract

Search agents (SAs) typically leverage large language models (LLMs) to support complex information-seeking tasks by autonomously exploring web sources and synthesizing the information they find into comprehensive responses. Prior benchmarks for SA evaluation focus mainly on specialized tasks that are unlikely to arise in real-world user scenarios. Moreover, their reliance on coarse task-level rubrics often limits the interpretability of the evaluation. To bridge this gap, we introduce DailyReport, an open-ended benchmark for evaluating SA capabilities on daily search tasks. It contains 150 open-ended tasks with 3,546 associated rubrics, capturing widely discussed and timely information demands of real-world users. Each task is decomposed into subtasks and evaluated with cascade rubrics across disentangled dimensions. Through cascade performance attribution and user-centric aggregation, we derive highly interpretable scores for each dimension, along with a user preference score. Our results on 17 agentic systems show that current systems still fall short of users' expectations. To facilitate future research, our dataset and code are publicly available at https://github.com/AGI-Eval-Official/DailyReport.