Learning to Reason as Action Abstractions with Scalable Mid-Training RL

Abstract

Large language models excel with reinforcement learning (RL), but fully unlocking this potential requires a mid-training stage. An effective mid-training phase should identify a compact set of useful actions and enable fast selection among them through online RL. We formalize this intuition by presenting the first theoretical result on how mid-training shapes post-training: it characterizes an action subspace that minimizes both the value approximation error from pruning and the RL error during subsequent planning. Our analysis reveals two key determinants of mid-training effectiveness: pruning efficiency, which shapes the prior of the initial RL policy, and the effect of mid-training on RL convergence, which governs the extent to which that policy can be improved via online interactions. These results suggest that mid-training is most effective when the decision space is compact and the effective horizon is short, highlighting the importance of operating in the space of action abstractions rather than primitive actions. Building on these insights, we propose Reasoning as Action Abstractions (RA3), a scalable mid-training algorithm. Specifically, we derive a sequential variational lower bound and optimize it by iteratively discovering temporally consistent latent structures via RL, followed by fine-tuning on the bootstrapped data. Experiments on code generation tasks demonstrate the effectiveness of our approach. Across multiple base models, RA3 improves the average performance on HumanEval and MBPP by 8 points over the base model and by 4 points over the next-token prediction baseline. Furthermore, RA3 achieves faster convergence and higher asymptotic performance in RLVR on HumanEval+, MBPP+, LiveCodeBench, and Codeforces.

