Nudging the Boundaries of LLM Reasoning

Abstract

Current online reinforcement learning (RL) algorithms such as GRPO share a key limitation in LLM reasoning: they cannot learn from problems that are "unsolvable" for the model. In other words, they can only improve performance on problems where the model is already able to reach a correct answer through exploration. Consequently, the model's "upper limit" remains unchanged after RL training, even though the likelihood of solving easier, solvable problems may increase. Hard samples cannot contribute to training: when no rollout earns a reward, no gradient is produced. To unlock learning from these hard samples, we propose NuRL, a "nudging" method that aims to push the upper bound of LLM reasoning using self-generated hints, i.e., abstract cues that reduce the difficulty of a problem for the model. Given a question and its gold answer, the model generates a chain of thought (CoT) and then produces a hint containing the core knowledge needed to solve the problem. During training, we generate G rollouts from the base policy and use the pass rate to decide whether the hint should be injected. For hard samples with a 0% pass rate, we inject the hint and regenerate a new batch of trajectories. This yields two benefits: (1) the hint boosts the pass rate (from 0% to non-zero), thereby introducing a training signal for previously unsolvable samples, and (2) the hints are self-generated, which avoids distributional shift and removes any reliance on external models. NuRL achieves consistent improvements across 6 benchmarks and 3 models, while remaining complementary to test-time scaling. Notably, NuRL can raise the model's upper limit, whereas GRPO leaves pass@1024 unchanged from the base model. Furthermore, we present a systematic study of what makes an effective hint and when hints are most useful. Interestingly, the best hints are abstract and high-level, and they are most beneficial when applied only when necessary and after GRPO has converged.
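The gating step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes binary correctness rewards and the standard GRPO group-normalized advantage (reward minus group mean, divided by group standard deviation). The names `group_relative_advantages`, `collect_rollouts`, `sample`, and `is_correct`, as well as the "Hint:" prompt format, are hypothetical and chosen only for illustration.

```python
import statistics
from typing import Callable, List, Optional, Tuple


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one rollout group (GRPO-style advantage)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # If every reward in the group is identical, every advantage is 0,
    # so the sample contributes no gradient.
    return [(r - mean) / (std + eps) for r in rewards]


def collect_rollouts(
    question: str,
    hint: Optional[str],
    sample: Callable[[str], str],       # hypothetical: prompt -> model response
    is_correct: Callable[[str], bool],  # hypothetical: response -> matches gold answer
    num_rollouts: int,
) -> Tuple[List[str], List[float], bool]:
    """Sample G rollouts; if the pass rate is 0%, inject the hint and resample."""
    rollouts = [sample(question) for _ in range(num_rollouts)]
    rewards = [1.0 if is_correct(r) else 0.0 for r in rollouts]
    used_hint = False

    if sum(rewards) == 0 and hint is not None:
        # Assumed prompt format; the abstract does not specify how hints are attached.
        hinted_prompt = f"{question}\n\nHint: {hint}"
        rollouts = [sample(hinted_prompt) for _ in range(num_rollouts)]
        rewards = [1.0 if is_correct(r) else 0.0 for r in rollouts]
        used_hint = True

    return rollouts, rewards, used_hint


if __name__ == "__main__":
    # A group in which every rollout fails carries no learning signal.
    print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
    # One success in the group gives a positive advantage to that rollout.
    print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))
```

Under these assumptions, a hinted group produces a non-zero signal only when its pass rate lies strictly between 0% and 100%. A group in which every hinted rollout succeeds has identical rewards and therefore zero advantages as well.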

