

Towards Personalized Deep Research: Benchmarks and Evaluations

September 29, 2025
Authors: Yuan Liang, Jiaxian Li, Yuqing Wang, Piaohong Wang, Motong Tian, Pai Liu, Shuofei Qiao, Runnan Fang, He Zhu, Ge Zhang, Minghao Liu, Yuchen Eleanor Jiang, Ningyu Zhang, Wangchunshu Zhou
cs.AI

Abstract

Deep Research Agents (DRAs) can autonomously conduct complex investigations and generate comprehensive reports, demonstrating strong real-world potential. However, existing evaluations mostly rely on closed-ended benchmarks, while open-ended deep research benchmarks remain scarce and typically neglect personalized scenarios. To bridge this gap, we introduce Personalized Deep Research Bench, the first benchmark for evaluating personalization in DRAs. It pairs 50 diverse research tasks across 10 domains with 25 authentic user profiles that combine structured persona attributes with dynamic real-world contexts, yielding 250 realistic user-task queries. To assess system performance, we propose the PQR Evaluation Framework, which jointly measures (P) Personalization Alignment, (Q) Content Quality, and (R) Factual Reliability. Our experiments on a range of systems highlight current capabilities and limitations in handling personalized deep research. This work establishes a rigorous foundation for developing and evaluating the next generation of truly personalized AI research assistants.
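To make the benchmark's structure concrete, here is a minimal sketch in Python of the task-profile pairing arithmetic and a PQR score record. All field names, the round-robin pairing, and the equal-weight aggregate are illustrative assumptions; the abstract specifies only the counts (50 tasks, 10 domains, 25 profiles, 250 queries) and the three evaluation dimensions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class UserProfile:
    """One of the 25 authentic user profiles: structured persona
    attributes plus a dynamic real-world context (fields assumed)."""
    persona: Dict[str, str]
    context: str

@dataclass
class ResearchTask:
    """One of the 50 research tasks drawn from 10 domains."""
    domain: str
    prompt: str

@dataclass
class PQRScore:
    """The three PQR dimensions for one generated report."""
    personalization: float  # (P) Personalization Alignment
    quality: float          # (Q) Content Quality
    reliability: float      # (R) Factual Reliability

    def aggregate(self) -> float:
        # Equal weights are an assumption; the paper may weight the
        # dimensions differently or report them separately.
        return (self.personalization + self.quality + self.reliability) / 3

def build_queries(
    tasks: List[ResearchTask],
    profiles: List[UserProfile],
) -> List[Tuple[ResearchTask, UserProfile]]:
    """Pair each task with 5 profiles so that 50 tasks and 25 profiles
    yield the stated 250 user-task queries. The round-robin assignment
    below is illustrative; the actual pairing is not given here."""
    queries = []
    for i, task in enumerate(tasks):
        for j in range(5):
            queries.append((task, profiles[(i * 5 + j) % len(profiles)]))
    return queries
```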