No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping

September 26, 2025
作者: Thanh-Long V. Le, Myeongho Jeon, Kim Vu, Viet Lai, Eunho Yang
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful framework for improving the reasoning abilities of Large Language Models (LLMs). However, current methods such as GRPO rely only on problems where the model's responses to the same input differ in correctness, while ignoring those where all responses receive the same reward, the so-called zero-variance prompts. In this work, we argue that such prompts are not useless but can, in fact, provide meaningful feedback for policy optimization. To this end, we introduce RL with Zero-Variance Prompts (RL-ZVP), a novel algorithm that extracts learning signals from zero-variance prompts. RL-ZVP directly rewards correctness and penalizes errors even without contrasting responses, modulating feedback with token-level characteristics to preserve informative, nuanced signals. Across six math reasoning benchmarks, RL-ZVP achieves significant improvements of up to 8.61 points in accuracy and 7.77 points in pass rate over GRPO, while consistently outperforming other baselines that filter out zero-variance prompts. These results highlight the untapped potential of learning from zero-variance prompts in RLVR.
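
The abstract does not give the paper's exact advantage-shaping formula, but the idea it describes, assigning a correctness-signed, entropy-modulated token-level advantage to zero-variance prompts instead of discarding them, can be sketched as follows. This is a minimal illustration under assumptions: the function names (`grpo_advantages`, `zvp_advantages`) and the specific modulation (scaling the signed reward by normalized per-token entropy) are hypothetical, not the published method.

```python
import torch
import torch.nn.functional as F


def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-token predictive entropy of the policy, shape (seq_len,)."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)


def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantage: rewards standardized within the group.
    Collapses to all zeros on a zero-variance prompt (identical rewards)."""
    std = rewards.std()
    if std < 1e-6:
        return torch.zeros_like(rewards)
    return (rewards - rewards.mean()) / std


def zvp_advantages(rewards, logits_per_response, scale=1.0):
    """Illustrative entropy-guided shaping for zero-variance prompts
    (an assumption about the mechanism, not the paper's formula):
    the group-level reward sign (+1 all-correct, -1 all-wrong) is spread
    over tokens and weighted by each token's normalized entropy, so that
    uncertain tokens receive a stronger learning signal."""
    signed = 1.0 if rewards.mean() > 0.5 else -1.0
    shaped = []
    for logits in logits_per_response:
        ent = token_entropy(logits)              # (seq_len,)
        ent_norm = ent / (ent.max() + 1e-6)      # normalize to [0, 1]
        shaped.append(signed * scale * ent_norm)
    return shaped  # one per-token advantage tensor per response


# Usage sketch: fall back to entropy-guided shaping only when GRPO's
# group-relative advantage vanishes (all responses got the same reward).
rewards = torch.tensor([1.0, 1.0, 1.0, 1.0])          # all responses correct
logits = [torch.randn(12, 32_000) for _ in rewards]   # dummy policy logits
if grpo_advantages(rewards).abs().sum() == 0:
    advantages = zvp_advantages(rewards, logits)
```

In this reading, RL-ZVP keeps the standard GRPO update on prompts with mixed correctness and only substitutes the shaped signal where GRPO would otherwise provide no gradient at all.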