
ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents

November 10, 2025
作者: Manasi Sharma, Chen Bo Calvin Zhang, Chaithanya Bandi, Clinton Wang, Ankit Aich, Huy Nghiem, Tahseen Rabbani, Ye Htet, Brian Jang, Sumana Basu, Aishwarya Balwani, Denis Peskoff, Marcos Ayestaran, Sean M. Hendryx, Brad Kenstler, Bing Liu
cs.AI

Abstract

Deep Research (DR) is an emerging agent application that leverages large language models (LLMs) to address open-ended queries. It requires the integration of several capabilities, including multi-step reasoning, cross-document synthesis, and the generation of evidence-backed, long-form answers. Evaluating DR remains challenging because responses are lengthy and diverse, admit many valid solutions, and often depend on dynamic information sources. We introduce ResearchRubrics, a standardized benchmark for DR built with over 2,800 hours of human labor that pairs realistic, domain-diverse prompts with 2,500+ expert-written, fine-grained rubrics to assess factual grounding, reasoning soundness, and clarity. We also propose a new complexity framework for categorizing DR tasks along three axes: conceptual breadth, logical nesting, and exploration. In addition, we develop human and model-based evaluation protocols that measure rubric adherence for DR agents. We evaluate several state-of-the-art DR systems and find that even leading agents like Gemini's DR and OpenAI's DR achieve under 68% average compliance with our rubrics, primarily due to missed implicit context and inadequate reasoning about retrieved information. Our results highlight the need for robust, scalable assessment of deep research capabilities, to which end we release ResearchRubrics (including all prompts, rubrics, and evaluation code) to facilitate progress toward well-justified research assistants.
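As a rough illustration of what "average compliance with our rubrics" can mean (a minimal sketch, not the paper's actual evaluation code; the `Criterion` schema and example criteria below are hypothetical), a rubric can be modeled as a list of fine-grained criteria, each given a pass/fail verdict by a human or model judge, with per-prompt compliance averaged across the benchmark:

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    text: str        # one fine-grained rubric item (hypothetical wording)
    satisfied: bool  # verdict from a human or model-based judge

def compliance(rubric: list[Criterion]) -> float:
    """Fraction of rubric criteria the agent's answer satisfies."""
    return sum(c.satisfied for c in rubric) / len(rubric)

def average_compliance(rubrics: list[list[Criterion]]) -> float:
    """Mean per-prompt compliance across a benchmark of prompts."""
    return sum(compliance(r) for r in rubrics) / len(rubrics)

# Two toy prompts: one answer meets 2/3 criteria, the other 1/2.
r1 = [Criterion("grounded in retrieved sources", True),
      Criterion("reasoning is sound", True),
      Criterion("clearly written", False)]
r2 = [Criterion("grounded in retrieved sources", True),
      Criterion("reasoning is sound", False)]
score = average_compliance([r1, r2])
```

Under this scoring, an agent scoring "under 68%" satisfies, on average, fewer than about two-thirds of the expert-written criteria per prompt.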
December 1, 2025