From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning for Credit Assignment in LLM Reasoning

Abstract

Reinforcement learning from verifiable rewards (RLVR) has shown strong promise for LLM reasoning, but outcome-based RLVR remains inefficient on hard problems because correct final-answer rollouts are rare and sample-level credit assignment cannot use partial progress in failed attempts. We introduce SCRL (Subproblem Curriculum Reinforcement Learning), a curriculum RL framework that derives verifiable subproblems from reference reasoning chains and fixes the final subproblem as the original problem. This turns partial progress on hard problems into verifiable learning signals. Algorithmically, SCRL uses subproblem-level normalization, which normalizes rewards independently at each subproblem position and assigns the resulting advantages to the corresponding answer spans, enabling finer-grained credit assignment without external rubrics or reward models. Our analysis shows that subproblem curricula lift hard problems out of gradient dead zones, with larger relative gains as the original problem becomes harder. Across seven mathematical reasoning benchmarks, SCRL outperforms strong curriculum-learning baselines, improving average accuracy over GRPO by +4.1 points on Qwen3-4B-Base and +1.9 points on Qwen3-14B-Base. On AIME24, AIME25, and IMO-Bench, SCRL further improves pass@1 by +3.7 points and pass@64 by +4.6 points on Qwen3-4B-Base, indicating better exploration on hard reasoning problems.
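The subproblem-level normalization described above can be illustrated with a small sketch. The abstract states only that rewards are normalized independently at each subproblem position and that the resulting advantages are assigned to the corresponding answer spans. The GRPO-style mean and standard-deviation normalization, the function names `subproblem_advantages` and `spread_over_spans`, the `eps` constant, and the toy reward matrix are all assumptions made for illustration, not the authors' implementation.

```python
from statistics import mean, pstdev


def subproblem_advantages(rewards, eps=1e-6):
    """Normalize rewards independently at each subproblem position.

    rewards[i][k] is the verifiable reward of rollout i on subproblem k;
    the last column is the original problem. Mean/std normalization is an
    assumption borrowed from GRPO-style group baselines.
    """
    num_positions = len(rewards[0])
    advantages = [[0.0] * num_positions for _ in rewards]
    for k in range(num_positions):
        column = [row[k] for row in rewards]
        mu, sigma = mean(column), pstdev(column)
        for i, r in enumerate(column):
            advantages[i][k] = (r - mu) / (sigma + eps)
    return advantages


def spread_over_spans(position_advantages, spans, num_tokens):
    """Assign each position's advantage to the tokens of its answer span.

    spans[k] = (start, end) is a hypothetical half-open token range that
    holds the answer to subproblem k; tokens outside every span get 0.
    """
    token_adv = [0.0] * num_tokens
    for adv, (start, end) in zip(position_advantages, spans):
        for t in range(start, end):
            token_adv[t] = adv
    return token_adv


# Toy group of 4 rollouts on a hard problem with 3 subproblems.
# No rollout solves the final (original) problem, so its column is all zeros.
rewards = [
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
]
adv = subproblem_advantages(rewards)

# Token-level advantages for rollout 1, with made-up span boundaries.
token_adv = spread_over_spans(adv[1], [(0, 3), (3, 5), (5, 8)], num_tokens=8)
```

In this toy group the final column carries no signal, because every rollout fails the original problem and the normalized advantage is zero for all of them, which is the gradient dead zone that outcome-only rewards would hit. The first two columns still separate the rollouts, so partial progress produces nonzero advantages on the spans where it occurred.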