Learning to Hint for Reinforcement Learning

Abstract

Group Relative Policy Optimization (GRPO) is widely used for reinforcement learning with verifiable rewards, but it often suffers from advantage collapse: when all rollouts in a group receive the same reward, the group yields zero relative advantage and thus no learning signal. For example, if a question is too hard for the reasoner, all sampled rollouts can be incorrect and receive zero reward. Recent work addresses this issue by adding hints or auxiliary scaffolds to such hard questions so that the reasoner produces mixed outcomes and recovers a non-zero update. However, existing hints are usually fixed rather than adapted to the current reasoner, and a hint that creates a learning signal under the hinted input does not necessarily improve the no-hint policy used at test time. To address both limitations, we propose Hint Learning for Reinforcement Learning (HiLL), a framework that jointly trains a hinter policy and a reasoner policy during RL. For each hard question, the hinter generates hints online conditioned on the current reasoner's incorrect rollout, allowing hint generation to adapt to the reasoner's evolving errors. We further introduce hint reliance, which measures how strongly correct hinted trajectories depend on the hint. We derive a transferability result showing that lower hint reliance implies stronger transfer from hinted success to no-hint success, and we use this result to define a transfer-weighted reward for training the hinter. As a result, HiLL favors hints that not only recover informative GRPO groups but also produce signals that are more likely to improve the original no-hint policy. Experiments across multiple benchmarks show that HiLL consistently outperforms GRPO and prior hint-based baselines, demonstrating the value of adaptive and transfer-aware hint learning for RL. The code is available at https://github.com/Andree-9/HiLL.
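
The advantage collapse described in the abstract can be illustrated with a small sketch. It assumes the commonly used GRPO normalization, in which each rollout's reward is centered on the group mean and divided by the group standard deviation; the abstract does not spell out this formula, and the function names and the `eps` stabilizer below are hypothetical choices made for illustration only, not code from the HiLL repository.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Center each reward on the group mean and scale by the group std.

    Assumption: mean/std normalization as commonly used with GRPO.
    `eps` is an illustrative stabilizer that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def has_learning_signal(rewards):
    """A group contributes to the update only if some advantage is non-zero."""
    return any(a != 0.0 for a in group_relative_advantages(rewards))


# A question that is too hard: every rollout is incorrect (reward 0).
hard_group = [0.0, 0.0, 0.0, 0.0]

# The same question with a hint: outcomes are mixed.
hinted_group = [0.0, 1.0, 0.0, 1.0]

print(group_relative_advantages(hard_group))    # all advantages are zero
print(group_relative_advantages(hinted_group))  # positive for correct, negative for incorrect
print(has_learning_signal(hard_group), has_learning_signal(hinted_group))
```

Under this assumption, a group in which every rollout is correct collapses in the same way as one in which every rollout is incorrect, which is why a hint is useful for the update only when it produces mixed outcomes.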
