
Learning to Hint for Reinforcement Learning

April 1, 2026
作者: Yu Xia, Canwen Xu, Zhewei Yao, Julian McAuley, Yuxiong He
cs.AI

Abstract

Group Relative Policy Optimization (GRPO) is widely used for reinforcement learning with verifiable rewards, but it often suffers from advantage collapse: when all rollouts in a group receive the same reward, the group yields zero relative advantage and thus no learning signal. For example, if a question is too hard for the reasoner, all sampled rollouts can be incorrect and receive zero reward. Recent work addresses this issue by adding hints or auxiliary scaffolds to such hard questions so that the reasoner produces mixed outcomes and recovers a non-zero update. However, existing hints are usually fixed rather than adapted to the current reasoner, and a hint that creates a learning signal under the hinted input does not necessarily improve the no-hint policy used at test time. To this end, we propose Hint Learning for Reinforcement Learning (HiLL), a framework that jointly trains a hinter policy and a reasoner policy during RL. For each hard question, the hinter generates hints online conditioned on the current reasoner's incorrect rollouts, allowing hint generation to adapt to the reasoner's evolving errors. We further introduce hint reliance, a measure of how strongly correct hinted trajectories depend on the hint. We derive a transferability result showing that lower hint reliance implies stronger transfer from hinted success to no-hint success, and we use this result to define a transfer-weighted reward for training the hinter. HiLL therefore favors hints that not only recover informative GRPO groups, but also produce signals that are more likely to improve the original no-hint policy. Experiments across multiple benchmarks show that HiLL consistently outperforms GRPO and prior hint-based baselines, demonstrating the value of adaptive and transfer-aware hint learning for RL. The code is available at https://github.com/Andree-9/HiLL.
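The advantage-collapse problem described above can be illustrated with a minimal sketch. This is not the paper's code; it assumes the common GRPO convention of standardizing each rollout's reward within its group (a z-score), so a group of identical rewards yields all-zero advantages and hence no gradient signal:

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Group-relative advantage in the GRPO style: each rollout's
    reward is standardized against its group's mean and std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]

# A question too hard for the reasoner: every rollout fails.
collapsed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
# → [0.0, 0.0, 0.0, 0.0]: zero advantages, no learning signal.

# With a useful hint, outcomes become mixed and advantages are non-zero.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
```

In the collapsed case the numerator `(r - mean)` is zero for every rollout, so the update vanishes regardless of `eps`; in the mixed case the single success receives a positive advantage and the failures negative ones, which is the signal hinting is meant to restore.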