Towards Personalized Deep Research: Benchmarks and Evaluations
September 29, 2025
Authors: Yuan Liang, Jiaxian Li, Yuqing Wang, Piaohong Wang, Motong Tian, Pai Liu, Shuofei Qiao, Runnan Fang, He Zhu, Ge Zhang, Minghao Liu, Yuchen Eleanor Jiang, Ningyu Zhang, Wangchunshu Zhou
cs.AI
Abstract
Deep Research Agents (DRAs) can autonomously conduct complex investigations
and generate comprehensive reports, demonstrating strong real-world potential.
However, existing evaluations mostly rely on closed-ended benchmarks, while
open-ended deep research benchmarks remain scarce and typically neglect
personalized scenarios. To bridge this gap, we introduce Personalized Deep
Research Bench, the first benchmark for evaluating personalization in DRAs. It
pairs 50 diverse research tasks across 10 domains with 25 authentic user
profiles that combine structured persona attributes with dynamic real-world
contexts, yielding 250 realistic user-task queries. To assess system
performance, we propose the PQR Evaluation Framework, which jointly measures
(P) Personalization Alignment, (Q) Content Quality, and (R) Factual
Reliability. Our experiments on a range of systems highlight current
capabilities and limitations in handling personalized deep research. This work
establishes a rigorous foundation for developing and evaluating the next
generation of truly personalized AI research assistants.