DailyReport: An Open-Ended Benchmark for Evaluating Search Agents on Daily Search Tasks

Abstract

Search Agents (SAs) typically leverage large language models (LLMs) to support complex information-seeking tasks by autonomously exploring web sources and synthesizing the retrieved information into comprehensive responses. For SA evaluation, prior benchmarks mainly focus on specialized tasks that are unlikely to arise in real-world user scenarios. Moreover, their reliance on coarse task-level rubrics often limits the interpretability of the evaluation. To bridge this gap, we introduce DailyReport, an open-ended benchmark for evaluating SA capabilities on daily search tasks. It contains 150 open-ended tasks with 3,546 associated rubrics, capturing widely discussed and timely information demands of real-world users. Each task is decomposed into subtasks and evaluated with cascade rubrics across disentangled dimensions. Through cascade performance attribution and user-centric aggregation, we derive a highly interpretable score for each dimension, along with a user preference score. Our results on 17 agentic systems show that current systems still fall short of users' expectations. To facilitate future research, our dataset and code are publicly available at https://github.com/AGI-Eval-Official/DailyReport.
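The abstract does not define how cascade rubrics, cascade performance attribution, or user-centric aggregation are computed. As a rough illustration only, the sketch below assumes one possible reading: a rubric is credited only if every rubric it depends on also passed, per-dimension scores are pass rates over credited rubrics, and the user preference score is a weighted mean of dimension scores. The `Rubric` class, the `parent` field, the dimension names, and the weights are all hypothetical and are not taken from the DailyReport repository.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Rubric:
    """One rubric judgment (hypothetical structure, not the DailyReport schema)."""
    rubric_id: str
    dimension: str                 # e.g. "retrieval" or "accuracy" (assumed names)
    passed: bool                   # raw judgment for this rubric alone
    parent: Optional[str] = None   # assumed: rubric that must pass before this one counts


def cascade_scores(rubrics: List[Rubric]) -> Dict[str, float]:
    """Per-dimension pass rate, crediting a rubric only if its whole parent chain passed."""
    by_id = {r.rubric_id: r for r in rubrics}

    def credited(rubric: Rubric) -> bool:
        seen = set()
        current: Optional[Rubric] = rubric
        while current is not None:
            if current.rubric_id in seen:
                raise ValueError("cyclic rubric dependency")
            seen.add(current.rubric_id)
            if not current.passed:
                return False
            current = by_id[current.parent] if current.parent else None
        return True

    totals: Dict[str, int] = {}
    hits: Dict[str, int] = {}
    for r in rubrics:
        totals[r.dimension] = totals.get(r.dimension, 0) + 1
        hits[r.dimension] = hits.get(r.dimension, 0) + int(credited(r))
    return {dim: hits[dim] / totals[dim] for dim in totals}


def user_preference_score(dim_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of dimension scores; the weights stand in for user priorities (assumed)."""
    total_weight = sum(weights[d] for d in dim_scores)
    return sum(dim_scores[d] * weights[d] for d in dim_scores) / total_weight


# Toy example with made-up judgments.
example_rubrics = [
    Rubric("r1", "retrieval", True),
    Rubric("r2", "accuracy", True, parent="r1"),
    Rubric("r3", "retrieval", False),
    Rubric("r4", "accuracy", True, parent="r3"),  # not credited: its parent r3 failed
    Rubric("r5", "accuracy", True),
]
example_scores = cascade_scores(example_rubrics)
example_preference = user_preference_score(example_scores, {"retrieval": 1.0, "accuracy": 3.0})
```

Under this reading, a failure in a downstream rubric such as `r4` is attributed to the upstream rubric it depends on rather than counted as an independent success, which is one way a cascade could make per-dimension scores easier to interpret.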