ExpRL: Exploratory Reinforcement Learning for LLM Mid-Training

Abstract

Sparse-reward reinforcement learning (RL) has become a standard tool for improving LLM reasoning, but its success depends critically on the coverage present in the base model. In practice, models are often primed for RL through mid-training on curated reasoning traces that teach useful primitive skills such as decomposition, verification, or self-correction. Although effective, this strategy requires manually specifying what the model should learn, and it remains unclear whether such primitive coverage is sufficient for much harder problems that require combining these skills into broader solution strategies. We study a more automated approach: RL-based mid-training using large corpora of human-written question-answer data. Rather than treating reference solutions as targets to imitate, our method, ExpRL, uses them as reward scaffolds: references are hidden from the policy and used only to construct problem-specific grading rubrics for judging on-policy reasoning traces. The policy samples from the original problem prompt, while an LLM judge compares the sampled reasoning trace against the reference solution and assigns outcome-level or process-level dense rewards. This lets ExpRL reinforce partial progress, useful intermediate reductions, and productive reasoning behaviors that sparse final-answer rewards often fail to upweight. On challenging math reasoning tasks, ExpRL yields stronger RL priming than SFT, sparse-reward GRPO, and self-distillation, and provides a better initialization for subsequent sparse-reward RL. Additional mixed-domain experiments further suggest that ExpRL can extend beyond the original math-only setting.
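
To make the contrast between dense rubric rewards and sparse final-answer rewards concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: `RubricItem`, `toy_judge`, and `sparse_reward` are hypothetical names, the rubric is hand-written here (ExpRL is described as deriving it from the hidden reference solution), and a plain substring check stands in for the LLM judge.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    check: str     # substring the toy judge looks for (stand-in for an LLM judgment)
    weight: float  # credit awarded when the check passes


def toy_judge(trace: str, rubric: list[RubricItem]) -> float:
    """Dense reward in [0, 1]: the weighted fraction of rubric items the trace meets."""
    total = sum(item.weight for item in rubric)
    earned = sum(item.weight for item in rubric if item.check in trace)
    return earned / total


def sparse_reward(trace: str, final_answer: str) -> float:
    """Sparse reward: 1.0 only if the last line of the trace ends with the exact answer."""
    lines = trace.strip().splitlines()
    return 1.0 if lines and lines[-1].strip().endswith(final_answer) else 0.0


# Hypothetical problem: solve x^2 - 5x + 6 = 0.
# The rubric is assumed to come from a reference solution the policy never sees.
rubric = [
    RubricItem(check="(x - 2)(x - 3)", weight=0.5),  # key intermediate reduction
    RubricItem(check="x = 2", weight=0.25),          # first root
    RubricItem(check="x = 3", weight=0.25),          # second root
]
final_answer = "x = 2 or x = 3"

# Three sampled traces: fully correct, partially correct, and a bare guess.
traces = [
    "Factor: (x - 2)(x - 3) = 0\nSo x = 2 or x = 3",
    "Factor: (x - 2)(x - 3) = 0\nSo x = 2",
    "I guess x = 7",
]

for trace in traces:
    print(toy_judge(trace, rubric), sparse_reward(trace, final_answer))
```

In this toy setting the second trace earns partial credit from the rubric for its correct factorization and one root, while the sparse reward treats it the same as the bare guess. That gap is the kind of partial progress the abstract says final-answer rewards fail to upweight.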