ExpRL: Exploratory Reinforcement Learning for LLM Mid-Training

Abstract

Sparse-reward reinforcement learning (RL) has become a standard tool for improving LLM reasoning, but its success depends critically on the coverage present in the base model. In practice, models are often primed for RL through mid-training on curated reasoning traces that teach useful primitive skills such as decomposition, verification, or self-correction. Although effective, this strategy requires manually specifying what the model should learn, and it remains unclear whether such primitive coverage is enough for much harder problems, which require combining these skills into broader solution strategies. We study a more automated approach: RL-based mid-training using large corpora of human-written question-answer data. Rather than treating reference solutions as targets to imitate, our method, ExpRL, uses them as reward scaffolds: references are hidden from the policy and used only to construct problem-specific grading rubrics for judging on-policy reasoning traces. The policy samples from the original problem prompt, while an LLM judge compares the sampled reasoning trace against the reference solution and assigns outcome-level or process-level dense rewards. This lets ExpRL reinforce partial progress, useful intermediate reductions, and productive reasoning behaviors that sparse final-answer rewards often fail to upweight. On challenging math reasoning tasks, ExpRL yields stronger RL priming than SFT, sparse-reward GRPO, and self-distillation, and provides a better initialization for subsequent sparse-reward RL. Additional mixed-domain experiments further suggest that ExpRL can extend beyond the original math-only setting.
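
The contrast between a rubric-based dense reward and a sparse final-answer reward can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the ExpRL implementation: the names `RubricItem`, `dense_reward`, `sparse_reward`, and `toy_judge` are hypothetical, the rubric is written by hand rather than constructed from a reference solution, and a substring check stands in for the LLM judge. The abstract does not specify how rubric scores are aggregated, so the weighted fraction used here is only one plausible choice.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    """One grading criterion (hypothetical structure, not from the paper)."""
    description: str  # what the judge should look for in the trace
    weight: float     # relative credit assigned to this criterion


def dense_reward(
    trace: str,
    rubric: List[RubricItem],
    judge: Callable[[str, RubricItem], bool],
) -> float:
    """Return the weighted fraction of rubric items the judge marks as met."""
    total = sum(item.weight for item in rubric)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = sum(item.weight for item in rubric if judge(trace, item))
    return earned / total


def sparse_reward(final_answer: str, reference_answer: str) -> float:
    """Return 1.0 only when the final answer matches the reference exactly."""
    return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0


def toy_judge(trace: str, item: RubricItem) -> bool:
    """Stand-in for an LLM judge: case-insensitive substring match."""
    return item.description.lower() in trace.lower()


if __name__ == "__main__":
    # Hand-written rubric for "solve x^2 - 5x + 6 = 0". The policy only
    # sees the problem prompt, never the rubric or the reference solution.
    rubric = [
        RubricItem("factor", 1.0),
        RubricItem("(x - 2)(x - 3)", 2.0),
        RubricItem("x = 2 and x = 3", 1.0),
    ]
    # A sampled trace that makes real progress but ends with a wrong answer.
    trace = "I factor the quadratic as (x - 2)(x - 3) = 0, so x = 5."

    print(dense_reward(trace, rubric, toy_judge))     # partial credit
    print(sparse_reward("x = 5", "x = 2 and x = 3"))  # no credit
```

In this toy case the trace earns credit for the factoring step even though its final answer is wrong, while the sparse reward gives it nothing. That is the kind of partial progress the abstract says sparse final-answer rewards often fail to upweight.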