Self-Hinting Language Models Enhance Reinforcement Learning
February 3, 2026
Authors: Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, Jiang Bian
cs.AI
Abstract
Group Relative Policy Optimization (GRPO) has recently emerged as a practical recipe for aligning large language models with verifiable objectives. However, under sparse terminal rewards, GRPO often stalls: rollouts within a group frequently receive identical rewards, so relative advantages collapse and updates vanish. We propose self-hint aligned GRPO with privileged supervision (SAGE), an on-policy reinforcement learning framework that injects privileged hints during training to reshape the rollout distribution under the same terminal verifier reward. For each prompt x, the model samples a compact hint h (e.g., a plan or decomposition) and then generates a solution τ conditioned on (x,h). Crucially, the task reward R(x,τ) is unchanged; hints only increase within-group outcome diversity under finite sampling, preventing GRPO advantages from collapsing under sparse rewards. At test time, we set h=∅ and deploy the no-hint policy without any privileged information. Moreover, sampling diverse self-hints serves as an adaptive curriculum that tracks the learner's bottlenecks more effectively than fixed hints from an initial policy or a stronger external model. Experiments on 6 benchmarks with 3 LLMs show that SAGE consistently outperforms GRPO, with average gains of +2.0 on Llama-3.2-3B-Instruct, +1.2 on Qwen2.5-7B-Instruct, and +1.3 on Qwen3-4B-Instruct. The code is available at https://github.com/BaohaoLiao/SAGE.
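To make the mechanism concrete, below is a minimal Python sketch of the idea the abstract describes: each rollout in a GRPO group is conditioned on its own sampled self-hint, the terminal verifier reward ignores the hint, and advantages are standardized within the group. This is not the authors' released implementation; `policy.sample_hint`, `policy.sample_solution`, and `verify` are hypothetical stand-ins for hint generation, solution generation, and the verifier R(x,τ).

```python
# A minimal sketch (assumed interfaces, not the released SAGE code) of a
# self-hinted GRPO rollout group and its group-relative advantages.
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    hint: str       # privileged self-hint h (empty string plays the role of h = ∅)
    solution: str   # solution τ sampled conditioned on (x, h)
    reward: float   # terminal verifier reward R(x, τ); unchanged by the hint


def sage_group(policy, verify, x: str, group_size: int = 8) -> list[Rollout]:
    """Sample one GRPO group for prompt x, each rollout with its own self-hint."""
    group = []
    for _ in range(group_size):
        h = policy.sample_hint(x)            # e.g., a plan or decomposition
        tau = policy.sample_solution(x, h)   # solve conditioned on (x, h)
        group.append(Rollout(h, tau, verify(x, tau)))  # verifier never sees h
    return group


def group_relative_advantages(group: list[Rollout], eps: float = 1e-6) -> list[float]:
    """Standard GRPO advantage: reward standardized within the group.

    If every rollout receives the same reward (the sparse-reward failure
    mode), the standard deviation is zero and all advantages vanish;
    diverse self-hints make mixed-outcome groups more likely under
    finite sampling, keeping the update signal alive.
    """
    rewards = [r.reward for r in group]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

At deployment, the same policy would simply be queried with an empty hint, matching the abstract's h=∅ setting, so no privileged information is needed at test time.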