Self-Hinting Language Models Enhance Reinforcement Learning
February 3, 2026
Authors: Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, Jiang Bian
cs.AI
Abstract
Group Relative Policy Optimization (GRPO) has recently emerged as a practical recipe for aligning large language models with verifiable objectives. However, under sparse terminal rewards, GRPO often stalls because rollouts within a group frequently receive identical rewards, causing relative advantages to collapse and updates to vanish. We propose self-hint aligned GRPO with privileged supervision (SAGE), an on-policy reinforcement learning framework that injects privileged hints during training to reshape the rollout distribution under the same terminal verifier reward. For each prompt x, the model samples a compact hint h (e.g., a plan or decomposition) and then generates a solution τ conditioned on (x,h). Crucially, the task reward R(x,τ) is unchanged; hints only increase within-group outcome diversity under finite sampling, preventing GRPO advantages from collapsing under sparse rewards. At test time, we set h = ∅ and deploy the no-hint policy without any privileged information. Moreover, sampling diverse self-hints serves as an adaptive curriculum that tracks the learner's bottlenecks more effectively than fixed hints from an initial policy or a stronger external model. Experiments over 6 benchmarks with 3 LLMs show that SAGE consistently outperforms GRPO, on average +2.0 on Llama-3.2-3B-Instruct, +1.2 on Qwen2.5-7B-Instruct, and +1.3 on Qwen3-4B-Instruct. The code is available at https://github.com/BaohaoLiao/SAGE.
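To make the advantage-collapse mechanism concrete, here is a minimal Python sketch of a GRPO-style group-relative advantage computation (not the authors' released code; the function name and reward values are illustrative). With a sparse terminal verifier, a group of all-failing (or all-passing) rollouts yields identical rewards and therefore zero advantages; hint-conditioned rollouts that diversify outcomes restore a nonzero learning signal:

```python
import numpy as np

def group_relative_advantages(rewards):
    """Group-relative advantages in the GRPO style: subtract the group
    mean reward and normalize by the group standard deviation."""
    rewards = np.asarray(rewards, dtype=float)
    centered = rewards - rewards.mean()
    std = rewards.std()
    # If every rollout in the group receives the same terminal reward,
    # the advantages are all zero and the policy update vanishes.
    return centered / std if std > 0 else np.zeros_like(centered)

# Sparse terminal reward, no hints: all rollouts fail, rewards are identical.
print(group_relative_advantages([0, 0, 0, 0]))  # [0. 0. 0. 0.] -> no update

# SAGE-style rollouts, each conditioned on a different self-sampled hint h:
# the verifier reward R(x, τ) is unchanged, but outcomes now differ within
# the group, so advantages are nonzero and learning proceeds.
print(group_relative_advantages([0, 1, 0, 1]))  # [-1.  1. -1.  1.]
```

Note the design point the abstract emphasizes: SAGE never modifies the reward function itself, only the distribution of outcomes within each group under finite sampling.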