

Understanding DeepResearch via Reports

October 9, 2025
作者: Tianyu Fan, Xinyao Niu, Yuxiang Zheng, Fengji Zhang, Chengen Huang, Bei Chen, Junyang Lin, Chao Huang
cs.AI

Abstract

DeepResearch agents represent a transformative AI paradigm, conducting expert-level research through sophisticated reasoning and multi-tool integration. However, evaluating these systems remains critically challenging, because research scenarios are open-ended and existing benchmarks focus on isolated capabilities rather than holistic performance. Unlike traditional LLM tasks, DeepResearch systems must synthesize diverse sources, generate insights, and present coherent findings, capabilities that resist simple verification. To address this gap, we introduce DeepResearch-ReportEval, a comprehensive framework designed to assess DeepResearch systems through their most representative outputs: research reports. Our approach systematically measures three dimensions: quality, redundancy, and factuality, using an LLM-as-a-Judge methodology that achieves strong concordance with expert judgments. We contribute a standardized benchmark of 100 curated queries spanning 12 real-world categories, enabling systematic capability comparison. Our evaluation of four leading commercial systems reveals distinct design philosophies and performance trade-offs, establishing foundational insights as DeepResearch evolves from information assistants toward intelligent research partners. Source code and data are available at: https://github.com/HKUDS/DeepResearch-Eval.
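To make the evaluation setup concrete, the LLM-as-a-Judge idea described above can be sketched as a prompt builder plus a parser for the judge model's structured reply. This is a minimal illustration only: the dimension names come from the abstract (quality, redundancy, factuality), but the 1-10 scale, the prompt wording, and the JSON reply format are assumptions, not the paper's actual rubric.

```python
import json

# Dimensions named in the paper's abstract; scale and format are assumed.
DIMENSIONS = ("quality", "redundancy", "factuality")

def build_judge_prompt(query: str, report: str) -> str:
    """Compose a single evaluation prompt for an LLM judge (illustrative)."""
    rubric = ", ".join(DIMENSIONS)
    return (
        "You are an expert research reviewer. For the query below, rate the "
        f"report on {rubric}, each from 1 (worst) to 10 (best). Reply with a "
        "JSON object mapping each dimension to an integer.\n\n"
        f"Query: {query}\n\nReport:\n{report}"
    )

def parse_judge_reply(reply: str) -> dict:
    """Validate the judge's JSON reply and return integer scores per dimension."""
    scores = json.loads(reply)
    out = {}
    for dim in DIMENSIONS:
        score = int(scores[dim])
        if not 1 <= score <= 10:
            raise ValueError(f"{dim} score {score} out of range")
        out[dim] = score
    return out

# Example with a mock judge reply (no model call is made here).
mock_reply = '{"quality": 8, "redundancy": 6, "factuality": 9}'
print(parse_judge_reply(mock_reply))
```

In a real pipeline the prompt would be sent to a judge model and the reply fed to the parser; averaging parsed scores over the 100 benchmark queries would then yield per-system, per-dimension results.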