ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents
November 10, 2025
Authors: Manasi Sharma, Chen Bo Calvin Zhang, Chaithanya Bandi, Clinton Wang, Ankit Aich, Huy Nghiem, Tahseen Rabbani, Ye Htet, Brian Jang, Sumana Basu, Aishwarya Balwani, Denis Peskoff, Marcos Ayestaran, Sean M. Hendryx, Brad Kenstler, Bing Liu
cs.AI
Abstract
Deep Research (DR) is an emerging agent application that leverages large language models (LLMs) to address open-ended queries. It requires the integration of several capabilities, including multi-step reasoning, cross-document synthesis, and the generation of evidence-backed, long-form answers. Evaluating DR remains challenging because responses are lengthy and diverse, admit many valid solutions, and often depend on dynamic information sources. We introduce ResearchRubrics, a standardized benchmark for DR built with over 2,800 hours of human labor that pairs realistic, domain-diverse prompts with 2,500+ expert-written, fine-grained rubrics to assess factual grounding, reasoning soundness, and clarity. We also propose a new complexity framework for categorizing DR tasks along three axes: conceptual breadth, logical nesting, and exploration. In addition, we develop human and model-based evaluation protocols that measure rubric adherence for DR agents. We evaluate several state-of-the-art DR systems and find that even leading agents like Gemini's DR and OpenAI's DR achieve under 68% average compliance with our rubrics, primarily due to missed implicit context and inadequate reasoning about retrieved information. Our results highlight the need for robust, scalable assessment of deep research capabilities, to which end we release ResearchRubrics (including all prompts, rubrics, and evaluation code) to facilitate progress toward well-justified research assistants.
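To make the rubric-adherence metric concrete, below is a minimal illustrative sketch in Python of how per-response compliance and the benchmark-level average might be computed. The names (`RubricCriterion`, `compliance`, `average_compliance`) and the weighting/aggregation choices are assumptions for illustration only, not the authors' released evaluation code, which may score rubrics differently.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class RubricCriterion:
    """One fine-grained, expert-written check on a DR response
    (e.g. factual grounding, reasoning soundness, clarity).
    This schema is a hypothetical stand-in for the paper's rubrics."""
    criterion_id: str
    description: str
    weight: float = 1.0  # assumed; rubrics may be unweighted

def compliance(criteria: List[RubricCriterion],
               verdicts: Dict[str, bool]) -> float:
    """Weighted fraction of rubric criteria a single response satisfies.

    `verdicts` maps criterion_id -> pass/fail judgment from a human or
    model-based judge, as in the paper's adherence protocols. Unjudged
    criteria count as failures here, a simplifying assumption."""
    total = sum(c.weight for c in criteria)
    passed = sum(c.weight for c in criteria
                 if verdicts.get(c.criterion_id, False))
    return passed / total if total else 0.0

def average_compliance(per_prompt_scores: List[float]) -> float:
    """Benchmark-level score: mean compliance across all prompts."""
    return sum(per_prompt_scores) / len(per_prompt_scores)

# Example: a response satisfying 2 of 3 equally weighted criteria scores
# ~0.67, i.e. near the <68% average compliance reported for leading agents.
rubric = [
    RubricCriterion("grounding", "Claims are backed by cited sources"),
    RubricCriterion("reasoning", "Inferences from retrieved info are sound"),
    RubricCriterion("clarity", "Answer is clearly organized"),
]
print(compliance(rubric, {"grounding": True, "clarity": True}))  # 0.666...
```

Under this reading, the reported "under 68% average compliance" corresponds to `average_compliance` over all benchmark prompts; the released evaluation code is the authoritative definition.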