Towards Personalized Deep Research: Benchmarks and Evaluations
September 29, 2025
Authors: Yuan Liang, Jiaxian Li, Yuqing Wang, Piaohong Wang, Motong Tian, Pai Liu, Shuofei Qiao, Runnan Fang, He Zhu, Ge Zhang, Minghao Liu, Yuchen Eleanor Jiang, Ningyu Zhang, Wangchunshu Zhou
cs.AI
Abstract
Deep Research Agents (DRAs) can autonomously conduct complex investigations
and generate comprehensive reports, demonstrating strong real-world potential.
However, existing evaluations mostly rely on closed-ended benchmarks, while
open-ended deep research benchmarks remain scarce and typically neglect
personalized scenarios. To bridge this gap, we introduce Personalized Deep
Research Bench, the first benchmark for evaluating personalization in DRAs. It
pairs 50 diverse research tasks across 10 domains with 25 authentic user
profiles that combine structured persona attributes with dynamic real-world
contexts, yielding 250 realistic user-task queries. To assess system
performance, we propose the PQR Evaluation Framework, which jointly measures
(P) Personalization Alignment, (Q) Content Quality, and (R) Factual
Reliability. Our experiments on a range of systems highlight current
capabilities and limitations in handling personalized deep research. This work
establishes a rigorous foundation for developing and evaluating the next
generation of truly personalized AI research assistants.