ChatPaper.ai


RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

May 11, 2026
作者: Gaotang Li, Bhavana Dalvi Mishra, Zifeng Wang, Jun Yan, Yanfei Chen, Chun-Liang Li, Long T. Le, Rujun Han, George Lee, Hanghang Tong, Chen-Yu Lee, Tomas Pfister
cs.AI

Abstract

Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training offers little mechanism for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not merely as final-answer evaluators, but as the shared interface that structures policy execution, judge feedback, and agent memory. Based on this view, we introduce RubricEM, a rubric-guided reinforcement learning framework that combines stagewise policy decomposition with reflection-based meta-policy evolution. RubricEM first makes research trajectories stage-aware by conditioning planning, evidence gathering, review, and synthesis on self-generated rubrics. It then assigns credit with Stage-Structured GRPO, which uses stagewise rubric judgments to provide denser semantic feedback for long-horizon optimization. In parallel, RubricEM trains a shared-backbone reflection meta-policy that distills judged trajectories into reusable rubric-grounded guidance for future attempts. The resulting RubricEM-8B achieves strong performance across four long-form research benchmarks, outperforming comparable open models and approaching proprietary deep-research systems. Beyond final performance, we perform thorough analyses to understand the key ingredients of RubricEM.
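The abstract's Stage-Structured GRPO assigns credit from stagewise rubric judgments rather than a single final reward. As a rough illustration of the idea (not the paper's actual algorithm; stage names, score ranges, and the normalization scheme below are all assumptions), a group-relative advantage can be computed per stage across a group of sampled trajectories:

```python
# Hypothetical sketch of stage-structured group-relative advantages in
# the spirit of GRPO: each sampled trajectory gets a rubric-judged score
# per stage, and advantages are normalized within the group stage by
# stage, instead of being derived from one final scalar reward.
from statistics import mean, pstdev

STAGES = ["plan", "gather", "review", "synthesize"]  # assumed stage names

def stagewise_advantages(group_scores):
    """group_scores: one dict per trajectory mapping stage -> rubric
    score in [0, 1]. Returns per-trajectory dicts of stage -> advantage,
    normalized within the group for each stage."""
    advantages = [{} for _ in group_scores]
    for stage in STAGES:
        scores = [traj[stage] for traj in group_scores]
        mu, sigma = mean(scores), pstdev(scores)
        for adv, traj in zip(advantages, group_scores):
            # Guard against zero variance when all trajectories tie.
            adv[stage] = (traj[stage] - mu) / (sigma or 1.0)
    return advantages
```

Under this sketch, a trajectory whose review stage scores above the group mean receives a positive advantage on that stage even if its final report scores poorly, which is the denser semantic feedback the abstract alludes to.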