DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

November 24, 2025
Authors: Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G. Finlayson, David Sontag, Tyler Murray, Sewon Min, Pradeep Dasigi, Luca Soldaini, Faeze Brahman, Wen-tau Yih, Tongshuang Wu, Luke Zettlemoyer, Yoon Kim, Hannaneh Hajishirzi, Pang Wei Koh
cs.AI

Abstract

Deep research models perform multi-step research to produce long-form, well-attributed answers. However, most open deep research models are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards (RLVR), which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), in which we construct and maintain rubrics that co-evolve with the policy model during training; this allows the rubrics to incorporate information that the model has newly explored and to provide discriminative, on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare and general domains, DR Tulu substantially outperforms existing open deep research models, and matches or exceeds proprietary deep research systems, while being significantly smaller and cheaper per query. To facilitate future research, we release all data, models, and code, including our new MCP-based agent infrastructure for deep research systems.