Learning to Reason as Action Abstractions with Scalable Mid-Training RL
September 30, 2025
Authors: Shenao Zhang, Donghan Yu, Yihao Feng, Bowen Jin, Zhaoran Wang, John Peebles, Zirui Wang
cs.AI
Abstract
Large language models excel with reinforcement learning (RL), but fully
unlocking this potential requires a mid-training stage. An effective
mid-training phase should identify a compact set of useful actions and enable
fast selection among them through online RL. We formalize this intuition by
presenting the first theoretical result on how mid-training shapes
post-training: it characterizes an action subspace that minimizes both the
value approximation error from pruning and the RL error during subsequent
planning. Our analysis reveals two key determinants of mid-training
effectiveness: pruning efficiency, which shapes the prior of the initial RL
policy, and its impact on RL convergence, which governs the extent to which
that policy can be improved via online interactions. These results suggest that
mid-training is most effective when the decision space is compact and the
effective horizon is short, highlighting the importance of operating in the
space of action abstractions rather than primitive actions. Building on these
insights, we propose Reasoning as Action Abstractions (RA3), a scalable
mid-training algorithm. Specifically, we derive a sequential variational lower
bound and optimize it by iteratively discovering temporally-consistent latent
structures via RL, followed by fine-tuning on the bootstrapped data.
Experiments on code generation tasks demonstrate the effectiveness of our
approach. Across multiple base models, RA3 improves the average performance on
HumanEval and MBPP by 8 and 4 points over the base model and the next-token
prediction baseline. Furthermore, RA3 achieves faster convergence and higher
asymptotic performance in RLVR on HumanEval+, MBPP+, LiveCodeBench, and
Codeforces.
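
To make the two error terms in the theoretical claim easier to picture, here is an illustrative two-term decomposition written in our own shorthand; the symbols and the exact form of the bound are assumptions for exposition, not the paper's stated result:

    V^{*}(s) - V^{\pi_{\mathrm{RL}}}(s)
        \le \varepsilon_{\mathrm{prune}}(\bar{\mathcal{A}})      % value approximation error from pruning to the subspace \bar{\mathcal{A}}
          + \varepsilon_{\mathrm{RL}}(|\bar{\mathcal{A}}|, H)    % RL error during planning over \bar{\mathcal{A}} with effective horizon H

Here \bar{\mathcal{A}} denotes the action (abstraction) subspace retained by mid-training and H the effective horizon. The first term shrinks when few useful actions are discarded, while the second typically grows with |\bar{\mathcal{A}}| and H, which is why the analysis favors a compact decision space and a short effective horizon, i.e., planning over action abstractions rather than primitive tokens.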
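
The RA3 training loop described in the abstract can be read as an EM-style alternation between discovery and imitation. The following is a minimal Python sketch of that loop; sample_with_latent, variational_lower_bound, and finetune are hypothetical placeholder helpers, not the authors' released code, and the sketch only mirrors the structure stated above.

    # Hedged, illustrative sketch of an RA3-style mid-training loop.
    # All helpers used here are hypothetical placeholders, not the authors' API.
    from typing import List, NamedTuple

    class Candidate(NamedTuple):
        latent: str      # discovered temporally consistent latent structure ("plan")
        completion: str  # primitive-action rollout, e.g. generated code

    def ra3_mid_training(model, prompts: List[str], rounds: int = 3, k: int = 8):
        """Alternate between (i) RL-style discovery of latent action abstractions
        scored by a sequential variational lower bound and (ii) fine-tuning on the
        bootstrapped data, following the loop sketched in the abstract."""
        for _ in range(rounds):
            bootstrapped = []
            for prompt in prompts:
                # (i) Discovery: sample k candidate (latent, completion) pairs and
                # keep the one that maximizes the variational lower bound.
                candidates = [sample_with_latent(model, prompt) for _ in range(k)]
                best = max(candidates,
                           key=lambda c: variational_lower_bound(model, prompt, c))
                bootstrapped.append((prompt, best.latent, best.completion))
            # (ii) Fine-tune on the bootstrapped (prompt, latent, completion) data
            # so the latent structure stays consistent across rounds.
            model = finetune(model, bootstrapped)
        return model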