Towards Personalized Deep Research: Benchmarks and Evaluations

Abstract

Deep Research Agents (DRAs) can autonomously conduct complex investigations and generate comprehensive reports, demonstrating strong real-world potential. However, existing evaluations mostly rely on close-ended benchmarks, while open-ended deep research benchmarks remain scarce and typically neglect personalized scenarios. To bridge this gap, we introduce Personalized Deep Research Bench, the first benchmark for evaluating personalization in DRAs. It pairs 50 diverse research tasks across 10 domains with 25 authentic user profiles that combine structured persona attributes with dynamic real-world contexts, yielding 250 realistic user-task queries. To assess system performance, we propose the PQR Evaluation Framework, which jointly measures (P) Personalization Alignment, (Q) Content Quality, and (R) Factual Reliability. Our experiments on a range of systems highlight current capabilities and limitations in handling personalized deep research. This work establishes a rigorous foundation for developing and evaluating the next generation of truly personalized AI research assistants.
