Learning to Reason as Action Abstractions with Scalable Mid-Training RL
September 30, 2025
Authors: Shenao Zhang, Donghan Yu, Yihao Feng, Bowen Jin, Zhaoran Wang, John Peebles, Zirui Wang
cs.AI
Abstract
Large language models excel with reinforcement learning (RL), but fully
unlocking this potential requires a mid-training stage. An effective
mid-training phase should identify a compact set of useful actions and enable
fast selection among them through online RL. We formalize this intuition by
presenting the first theoretical result on how mid-training shapes
post-training: it characterizes an action subspace that minimizes both the
value approximation error from pruning and the RL error during subsequent
planning. Our analysis reveals two key determinants of mid-training
effectiveness: pruning efficiency, which shapes the prior of the initial RL
policy, and its impact on RL convergence, which governs the extent to which
that policy can be improved via online interactions. These results suggest that
mid-training is most effective when the decision space is compact and the
effective horizon is short, highlighting the importance of operating in the
space of action abstractions rather than primitive actions. Building on these
insights, we propose Reasoning as Action Abstractions (RA3), a scalable
mid-training algorithm. Specifically, we derive a sequential variational lower
bound and optimize it by iteratively discovering temporally-consistent latent
structures via RL, followed by fine-tuning on the bootstrapped data.
Experiments on code generation tasks demonstrate the effectiveness of our
approach. Across multiple base models, RA3 improves the average performance on
HumanEval and MBPP by 8 and 4 points over the base model and the next-token
prediction baseline. Furthermore, RA3 achieves faster convergence and higher
asymptotic performance in RLVR on HumanEval+, MBPP+, LiveCodeBench, and
Codeforces.
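
To make the two error terms in the theoretical claim easier to picture, here is an illustrative two-term decomposition written in our own shorthand; the symbols and the exact form of the bound are assumptions for exposition, not the paper's stated result:

    V^{*}(s) - V^{\pi_{\mathrm{RL}}}(s)
        \le \varepsilon_{\mathrm{prune}}(\bar{\mathcal{A}})      % value approximation error from pruning to the subspace \bar{\mathcal{A}}
          + \varepsilon_{\mathrm{RL}}(|\bar{\mathcal{A}}|, H)    % RL error during planning over \bar{\mathcal{A}} with effective horizon H

Here \bar{\mathcal{A}} denotes the action (abstraction) subspace retained by mid-training and H the effective horizon. The first term shrinks when few useful actions are discarded, while the second typically grows with |\bar{\mathcal{A}}| and H, which is why the analysis favors a compact decision space and a short effective horizon, i.e., planning over action abstractions rather than primitive tokens.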
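
The RA3 training loop described in the abstract can be read as an EM-style alternation between discovery and imitation. The following is a minimal Python sketch of that loop; sample_with_latent, variational_lower_bound, and finetune are hypothetical placeholder helpers, not the authors' released code, and the sketch only mirrors the structure stated above.

    # Hedged, illustrative sketch of an RA3-style mid-training loop.
    # All helpers used here are hypothetical placeholders, not the authors' API.
    from typing import List, NamedTuple

    class Candidate(NamedTuple):
        latent: str      # discovered temporally consistent latent structure ("plan")
        completion: str  # primitive-action rollout, e.g. generated code

    def ra3_mid_training(model, prompts: List[str], rounds: int = 3, k: int = 8):
        """Alternate between (i) RL-style discovery of latent action abstractions
        scored by a sequential variational lower bound and (ii) fine-tuning on the
        bootstrapped data, following the loop sketched in the abstract."""
        for _ in range(rounds):
            bootstrapped = []
            for prompt in prompts:
                # (i) Discovery: sample k candidate (latent, completion) pairs and
                # keep the one that maximizes the variational lower bound.
                candidates = [sample_with_latent(model, prompt) for _ in range(k)]
                best = max(candidates,
                           key=lambda c: variational_lower_bound(model, prompt, c))
                bootstrapped.append((prompt, best.latent, best.completion))
            # (ii) Fine-tune on the bootstrapped (prompt, latent, completion) data
            # so the latent structure stays consistent across rounds.
            model = finetune(model, bootstrapped)
        return model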