RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards
May 11, 2026
Authors: Gaotang Li, Bhavana Dalvi Mishra, Zifeng Wang, Jun Yan, Yanfei Chen, Chun-Liang Li, Long T. Le, Rujun Han, George Lee, Hanghang Tong, Chen-Yu Lee, Tomas Pfister
cs.AI
Abstract
Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training offers little mechanism for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not merely as final-answer evaluators, but as the shared interface that structures policy execution, judge feedback, and agent memory. Based on this view, we introduce RubricEM, a rubric-guided reinforcement learning framework that combines stagewise policy decomposition with reflection-based meta-policy evolution. RubricEM first makes research trajectories stage-aware by conditioning planning, evidence gathering, review, and synthesis on self-generated rubrics. It then assigns credit with Stage-Structured GRPO, which uses stagewise rubric judgments to provide denser semantic feedback for long-horizon optimization. In parallel, RubricEM trains a shared-backbone reflection meta-policy that distills judged trajectories into reusable rubric-grounded guidance for future attempts. The resulting RubricEM-8B achieves strong performance across four long-form research benchmarks, outperforming comparable open models and approaching proprietary deep-research systems. Beyond final performance, we perform thorough analyses to understand the key ingredients of RubricEM.
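The abstract does not specify the exact form of Stage-Structured GRPO, but the core idea it names, group-relative credit assignment applied per stage using rubric scores, can be illustrated with a minimal sketch. Here the function name `stagewise_group_advantages`, the dict-of-stage-scores input format, and the stage names are illustrative assumptions, not the paper's actual implementation; the group normalization itself follows the standard GRPO recipe of centering and scaling rewards within a sampled group.

```python
import statistics

def stagewise_group_advantages(stage_scores):
    """Group-relative advantages computed per stage (illustrative sketch).

    stage_scores: one dict per trajectory in a sampled group, mapping a
    stage name (e.g. "plan", "gather", "review", "synthesize") to a
    rubric-judge score in [0, 1].

    Returns one dict per trajectory with the stage scores normalized
    within the group (GRPO-style: subtract group mean, divide by group
    std), so each stage receives its own credit signal instead of a
    single trajectory-level reward.
    """
    stages = stage_scores[0].keys()
    advantages = [{} for _ in stage_scores]
    for stage in stages:
        scores = [traj[stage] for traj in stage_scores]
        mu = statistics.mean(scores)
        # Guard against a degenerate group where all scores are equal.
        sigma = statistics.pstdev(scores) or 1.0
        for i, traj in enumerate(stage_scores):
            advantages[i][stage] = (traj[stage] - mu) / sigma
    return advantages
```

For example, a group of two trajectories whose "plan" stage scores are 0.2 and 0.8 receives advantages of roughly -1.0 and +1.0 on that stage, giving the policy a denser per-stage signal than one scalar reward over the whole long-horizon trajectory.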