Learning to Reason as Action Abstractions with Scalable Mid-Training RL
September 30, 2025
Authors: Shenao Zhang, Donghan Yu, Yihao Feng, Bowen Jin, Zhaoran Wang, John Peebles, Zirui Wang
cs.AI
Abstract
Large language models excel with reinforcement learning (RL), but fully
unlocking this potential requires a mid-training stage. An effective
mid-training phase should identify a compact set of useful actions and enable
fast selection among them through online RL. We formalize this intuition by
presenting the first theoretical result on how mid-training shapes
post-training: it characterizes an action subspace that minimizes both the
value approximation error from pruning and the RL error during subsequent
planning. Our analysis reveals two key determinants of mid-training
effectiveness: pruning efficiency, which shapes the prior of the initial RL
policy, and its impact on RL convergence, which governs the extent to which
that policy can be improved via online interactions. These results suggest that
mid-training is most effective when the decision space is compact and the
effective horizon is short, highlighting the importance of operating in the
space of action abstractions rather than primitive actions. Building on these
insights, we propose Reasoning as Action Abstractions (RA3), a scalable
mid-training algorithm. Specifically, we derive a sequential variational lower
bound and optimize it by iteratively discovering temporally-consistent latent
structures via RL, followed by fine-tuning on the bootstrapped data.
Experiments on code generation tasks demonstrate the effectiveness of our
approach. Across multiple base models, RA3 improves the average performance on
HumanEval and MBPP by 8 points over the base model and 4 points over the
next-token prediction baseline. Furthermore, RA3 achieves faster convergence and higher
asymptotic performance in RLVR on HumanEval+, MBPP+, LiveCodeBench, and
Codeforces.
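
To make the phrase "sequential variational lower bound" more concrete, the following is a generic sketch of the kind of bound typically optimized when a sequence of latent action abstractions is introduced; the notation (x for the prompt, y for the response, z = z_{1:T} for the latent abstractions, q_\phi for the posterior, p_\theta for the model) is illustrative and not taken from the paper. By Jensen's inequality,

\log p_\theta(y \mid x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x, y)}\!\left[ \sum_{t} \log p_\theta(y_t \mid x, z, y_{<t}) \right] \;-\; \mathrm{KL}\!\left( q_\phi(z \mid x, y) \,\big\|\, p_\theta(z \mid x) \right).

Read this way, the iterative procedure described in the abstract can be seen as alternating between an RL-style search for temporally-consistent latent structures z that tighten the bound and fine-tuning p_\theta on the resulting bootstrapped (x, z, y) data; the paper's exact derivation may factorize z over time differently.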