
DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

November 24, 2025
Authors: Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G. Finlayson, David Sontag, Tyler Murray, Sewon Min, Pradeep Dasigi, Luca Soldaini, Faeze Brahman, Wen-tau Yih, Tongshuang Wu, Luke Zettlemoyer, Yoon Kim, Hannaneh Hajishirzi, Pang Wei Koh
cs.AI

Abstract

Deep research models perform multi-step research to produce long-form, well-attributed answers. However, most open deep research models are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards (RLVR), which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), in which we construct and maintain rubrics that co-evolve with the policy model during training; this allows the rubrics to incorporate information that the model has newly explored and to provide discriminative, on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare and general domains, DR Tulu substantially outperforms existing open deep research models, and matches or exceeds proprietary deep research systems, while being significantly smaller and cheaper per query. To facilitate future research, we release all data, models, and code, including our new MCP-based agent infrastructure for deep research systems.
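
The abstract does not spell out how rubrics are built or updated, so the following is only a toy sketch of the general idea of rubrics that co-evolve with a policy, not the authors' implementation. It assumes a rubric is a scoring function over an answer, that new candidate rubrics are proposed from the current batch of on-policy rollouts, and that a rubric counts as "discriminative" when its scores vary across that batch. The names `keyword_rubric`, `evolve_rubrics`, `reward`, and the variance threshold are all hypothetical.

```python
from statistics import pvariance
from typing import Callable, Dict, List

# Assumption: a rubric maps an answer string to a score in [0, 1].
Rubric = Callable[[str], float]


def keyword_rubric(keyword: str) -> Rubric:
    """Hypothetical rubric: 1.0 if the answer mentions `keyword`, else 0.0."""
    return lambda answer: 1.0 if keyword.lower() in answer.lower() else 0.0


def evolve_rubrics(
    rubrics: Dict[str, Rubric],
    rollouts: List[str],
    candidates: Dict[str, Rubric],
    min_variance: float = 1e-6,
) -> Dict[str, Rubric]:
    """Merge candidate rubrics into the pool, then keep only those whose
    scores differ across the current rollouts (an assumed proxy for
    'discriminative, on-policy feedback')."""
    pool = {**rubrics, **candidates}
    kept: Dict[str, Rubric] = {}
    for name, rubric in pool.items():
        scores = [rubric(answer) for answer in rollouts]
        if pvariance(scores) > min_variance:
            kept[name] = rubric
    return kept


def reward(rubrics: Dict[str, Rubric], answer: str) -> float:
    """Average rubric score; 0.0 when no rubric is active."""
    if not rubrics:
        return 0.0
    return sum(rubric(answer) for rubric in rubrics.values()) / len(rubrics)


# Two on-policy rollouts for the same prompt (made-up text).
rollouts = [
    "The answer cites a randomized trial.",
    "The answer cites a randomized trial and a meta-analysis.",
]
rubrics = {"trial": keyword_rubric("randomized trial")}
candidates = {"meta": keyword_rubric("meta-analysis")}

active = evolve_rubrics(rubrics, rollouts, candidates)
rewards = [reward(active, answer) for answer in rollouts]
```

In this toy run the "trial" rubric is satisfied by both rollouts and so no longer separates them, while the newly proposed "meta" rubric does. A real system would presumably generate and judge rubrics with a language model rather than keyword matching.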