Rethinking Reinforcement Learning for LLM Reasoning: Sparse Policy Selection, Not Capability Learning

Abstract

Reinforcement learning has become the standard approach for improving reasoning in large language models, yet evidence increasingly suggests that RL does not teach new strategies; it redistributes probability mass over solutions the base model already contains. In this work, we ask: if RL merely steers the model toward paths it already knows, is the RL optimization loop itself necessary? Through token-level analysis across multiple model families and RL algorithms, we find that RL's beneficial footprint is a sparse, predictable correction concentrated at high-entropy decision points where the model is uncertain which branch to take. Only 1–3% of token positions are affected, the promoted token always lies within the base model's top-5 alternatives, and targeted corrections at those few positions causally recover a large fraction of RL's accuracy gain, while random corrections fail. The base model's own entropy identifies these positions without any RL-trained model, and the entire correction is low-dimensional, representable in a tiny fraction of model parameters. These findings reframe reasoning improvement as sparse policy selection, not capability acquisition. We translate this insight into ReasonMaxxer, a minimal RL-free method that applies a contrastive loss only at entropy-gated decision points, using a few hundred base-model rollouts and no online generation. Across three model families, six scales, and six math reasoning benchmarks, ReasonMaxxer matches or exceeds full RL performance while requiring only tens of problems and minutes of single-GPU training, a reduction in training cost of roughly three orders of magnitude.
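
To make the idea of entropy-gated decision points concrete, the following is a minimal sketch and not the authors' implementation. It assumes the gate is the Shannon entropy (in nats) of the base model's next-token distribution and that the highest-entropy positions are kept; the abstract does not specify the gating rule or the form of the contrastive loss, so the loss is omitted. The function names and the `fraction` parameter are hypothetical and chosen only for illustration.

```python
import math


def token_entropy(logits):
    """Shannon entropy (nats) of softmax(logits) at one token position."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_gate(logits_per_position, fraction=0.02):
    """Return the indices of the highest-entropy positions, in sequence order.

    `fraction` is a hypothetical knob (an assumption of this sketch),
    set here to fall inside the 1-3% range quoted in the abstract.
    """
    entropies = [token_entropy(row) for row in logits_per_position]
    k = max(1, round(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)),
                    key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


def top_k_tokens(logits, k=5):
    """Token ids of the k largest logits: the candidate set at a gated position."""
    return sorted(range(len(logits)),
                  key=lambda t: logits[t], reverse=True)[:k]


# Toy sequence: 49 confident positions and one uncertain "decision point".
sequence = [[10.0, 0.0, 0.0, 0.0]] * 49 + [[1.0, 1.1, 0.9, 0.0]]
gated = entropy_gate(sequence, fraction=0.02)
candidates = top_k_tokens(sequence[gated[0]], k=3)
```

In this toy sequence only the final position is gated, and any correction applied there would choose among that position's top-ranked base-model candidates, mirroring the abstract's observation that the promoted token lies within the base model's top-5 alternatives.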