

No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping

September 26, 2025
Authors: Thanh-Long V. Le, Myeongho Jeon, Kim Vu, Viet Lai, Eunho Yang
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful framework for improving the reasoning abilities of Large Language Models (LLMs). However, current methods such as GRPO rely only on problems where the model's responses to the same input differ in correctness, while ignoring those where all responses receive the same reward, the so-called zero-variance prompts. In this work, we argue that such prompts are not useless but can, in fact, provide meaningful feedback for policy optimization. To this end, we introduce RL with Zero-Variance Prompts (RL-ZVP), a novel algorithm that extracts learning signals from zero-variance prompts. RL-ZVP directly rewards correctness and penalizes errors even without contrasting responses, modulating feedback with token-level characteristics to preserve informative, nuanced signals. Across six math reasoning benchmarks, RL-ZVP achieves significant improvements of up to 8.61 points in accuracy and 7.77 points in pass rate over GRPO, while consistently outperforming other baselines that filter out zero-variance prompts. These results highlight the untapped potential of learning from zero-variance prompts in RLVR.
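
To make the core observation concrete, the minimal Python sketch below shows why zero-variance prompts contribute no gradient signal under GRPO's group-normalized advantage, and how an entropy-guided shaping term could restore one. The abstract does not give the paper's exact formula, so the specific shaping used here (sign of the shared reward scaled by normalized per-token entropy) and the function names (`grpo_advantages`, `zvp_token_advantage`) are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: GRPO-style group advantages vanish when all responses to a
# prompt get the same reward; an entropy-modulated term (assumed form) keeps
# a nonzero, token-level signal in the spirit of RL-ZVP.
import math
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def token_entropy(probs: List[float]) -> float:
    """Shannon entropy of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def zvp_token_advantage(shared_reward: float, probs: List[float]) -> float:
    """Hypothetical entropy-guided shaping for a zero-variance prompt:
    reward an all-correct group (+1) or penalize an all-wrong group (-1),
    scaled by the token's normalized entropy so uncertain tokens receive
    larger updates. This exact form is an assumption for illustration."""
    max_entropy = math.log(len(probs))
    scale = token_entropy(probs) / max_entropy if max_entropy > 0 else 0.0
    return (1.0 if shared_reward > 0 else -1.0) * scale


if __name__ == "__main__":
    # Zero-variance prompt: every sampled response received the same reward.
    rewards = [1.0, 1.0, 1.0, 1.0]
    print(grpo_advantages(rewards))          # all ~0 -> no learning signal in GRPO
    probs = [0.5, 0.3, 0.15, 0.05]           # toy next-token distribution
    print(zvp_token_advantage(1.0, probs))   # nonzero, entropy-scaled signal
```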