
Learning to Hint for Reinforcement Learning

April 1, 2026
Authors: Yu Xia, Canwen Xu, Zhewei Yao, Julian McAuley, Yuxiong He
cs.AI

Abstract

Group Relative Policy Optimization (GRPO) is widely used for reinforcement learning with verifiable rewards, but it often suffers from advantage collapse: when all rollouts in a group receive the same reward, the group yields zero relative advantage and thus no learning signal. For example, if a question is too hard for the reasoner, all sampled rollouts can be incorrect and receive zero reward. Recent work addresses this issue by adding hints or auxiliary scaffolds to such hard questions so that the reasoner produces mixed outcomes and recovers a non-zero update. However, existing hints are usually fixed rather than adapted to the current reasoner, and a hint that creates a learning signal under the hinted input does not necessarily improve the no-hint policy used at test time. To this end, we propose Hint Learning for Reinforcement Learning (HiLL), a framework that jointly trains a hinter policy and a reasoner policy during RL. For each hard question, the hinter generates hints online conditioned on the current reasoner's incorrect rollouts, allowing hint generation to adapt to the reasoner's evolving errors. We further introduce hint reliance, which measures how strongly correct hinted trajectories depend on the hint. We derive a transferability result showing that lower hint reliance implies stronger transfer from hinted success to no-hint success, and we use this result to define a transfer-weighted reward for training the hinter. Therefore, HiLL favors hints that not only recover informative GRPO groups, but also produce signals that are more likely to improve the original no-hint policy. Experiments across multiple benchmarks show that HiLL consistently outperforms GRPO and prior hint-based baselines, demonstrating the value of adaptive and transfer-aware hint learning for RL. The code is available at https://github.com/Andree-9/HiLL.
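The advantage collapse described above can be illustrated with a minimal sketch. The snippet below computes standard GRPO-style group-relative advantages (reward centered on the group mean and scaled by the group standard deviation) and shows that an all-incorrect group yields zero advantage everywhere. The `transfer_weighted_hinter_reward` function is purely illustrative: the paper's actual reward definition is not given in this abstract, so the `(1 - reliance)` weighting is our hypothetical stand-in for the transfer-weighted idea.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: center each rollout's reward on the group
    mean and normalize by the group standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # Advantage collapse: identical rewards give every rollout a zero
        # advantage, so this group contributes no gradient signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

def transfer_weighted_hinter_reward(solved, hint_reliance):
    """Hypothetical transfer-weighted hinter reward (not the paper's exact
    formula): down-weight hinted successes that depend heavily on the hint,
    since they transfer less to the no-hint policy used at test time."""
    return float(solved) * (1.0 - hint_reliance)

# A hard question where every rollout fails: the group collapses.
print(group_relative_advantages([0, 0, 0, 0]))  # all-zero advantages

# After a hint produces mixed outcomes, advantages become informative.
print(group_relative_advantages([0, 1, 0, 0]))
```

Under this sketch, a hinted success with low reliance (e.g. `transfer_weighted_hinter_reward(True, 0.1)`) earns the hinter more reward than one that leaned heavily on the hint (`hint_reliance` near 1), matching the abstract's claim that HiLL favors hints whose signal transfers to the no-hint policy.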