No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful framework for improving the reasoning abilities of Large Language Models (LLMs). However, current methods such as GRPO learn only from problems where the model's responses to the same input differ in correctness, and ignore those where all responses receive the same reward, the so-called zero-variance prompts. In this work, we argue that such prompts are not useless but can, in fact, provide meaningful feedback for policy optimization. To this end, we introduce RL with Zero-Variance Prompts (RL-ZVP), a novel algorithm that extracts learning signals from zero-variance prompts. RL-ZVP directly rewards correctness and penalizes errors even without contrasting responses, modulating feedback with token-level characteristics to preserve informative, nuanced signals. Across six math reasoning benchmarks, RL-ZVP achieves significant improvements of up to 8.61 points in accuracy and 7.77 points in pass rate over GRPO, while consistently outperforming other baselines that filter out zero-variance prompts. These results highlight the untapped potential of learning from zero-variance prompts in RLVR.
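
To illustrate why zero-variance prompts yield no learning signal under group-relative methods, the following is a minimal sketch of GRPO-style advantage normalization, assuming binary correctness rewards and a population standard deviation. The function names, the `eps` stabilizer, and the example reward groups are hypothetical and chosen only for illustration. The sketch does not reproduce RL-ZVP's own advantage formula, which the abstract does not specify.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its own group (GRPO-style).

    `eps` is an assumed stabilizer that avoids division by zero when
    every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_zero_variance(rewards):
    """A prompt is zero-variance when all sampled responses share one reward."""
    return len(set(rewards)) == 1


# Hypothetical reward groups: one list per prompt, one entry per sampled response.
mixed = [1.0, 0.0, 0.0, 1.0]        # responses disagree in correctness
all_correct = [1.0, 1.0, 1.0, 1.0]  # zero-variance, every response correct
all_wrong = [0.0, 0.0, 0.0, 0.0]    # zero-variance, every response wrong

for name, group in [("mixed", mixed), ("all_correct", all_correct), ("all_wrong", all_wrong)]:
    print(name, is_zero_variance(group), group_relative_advantages(group))
```

In this sketch both zero-variance groups produce all-zero advantages, so they contribute nothing to the policy gradient. That is the gap the abstract says RL-ZVP targets by rewarding correctness and penalizing errors directly.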
