Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

Abstract

Reinforcement learning (RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments. We introduce ScaleLogic, a synthetic logical reasoning framework that offers independent control over two axes of difficulty: the depth of the required proof planning (i.e., the horizon) and the expressiveness of the underlying logic. The framework supports a wide range of logics, from simple implication-only logic ("if-then") to more expressive first-order reasoning with conjunction ("and"), disjunction ("or"), negation ("not"), and universal quantification ("for all"). Using this framework, we show that the RL training compute T follows a power law with respect to reasoning depth D (T ∝ D^γ, R² > 0.99), and that the scaling exponent γ increases monotonically with logical expressiveness, from 1.04 to 2.60. On downstream mathematics and general reasoning benchmarks, more expressive training settings yield both larger performance gains (up to +10.66 points) and more compute-efficient transfer than less expressive settings, demonstrating that what a model is trained on, not just how much it is trained, shapes downstream transfer. We further show that the power-law relationship holds across multiple RL methods, and that curriculum-based training substantially improves scaling efficiency.
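
As an illustration of the power-law relationship T ∝ D^γ, the sketch below estimates γ and R² by ordinary least squares in log-log space, where the power law becomes the line log T = log c + γ log D. This is a minimal sketch, not the fitting procedure used in this work: the function name `fit_power_law`, the constant c, and the data points are assumptions chosen for illustration, and the data are synthetic rather than measured.

```python
import math


def fit_power_law(depths, compute):
    """Fit T = c * D**gamma by least squares on (log D, log T).

    Hypothetical helper for illustration; returns (gamma, c, r_squared).
    """
    xs = [math.log(d) for d in depths]
    ys = [math.log(t) for t in compute]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Slope of the log-log regression line is the scaling exponent gamma.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx
    log_c = mean_y - gamma * mean_x

    # Coefficient of determination, computed in log space.
    ss_res = sum((y - (log_c + gamma * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return gamma, math.exp(log_c), r_squared


# Synthetic, noise-free data generated from T = 3 * D^2.
# These are NOT results from the paper; they only exercise the fit.
depths = [2, 4, 8, 16, 32]
compute = [3.0 * d ** 2.0 for d in depths]

gamma, c, r_squared = fit_power_law(depths, compute)
print(f"gamma={gamma:.3f}, c={c:.3f}, R^2={r_squared:.4f}")
```

Fitting in log space weights relative rather than absolute errors, which is usually appropriate when compute spans several orders of magnitude across depths.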
