Back to Basics: Revisiting Exploration in Reinforcement Learning for LLM Reasoning via Generative Probabilities
February 5, 2026
Authors: Pengyi Li, Elizaveta Goncharova, Andrey Kuznetsov, Ivan Oseledets
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an indispensable paradigm for enhancing reasoning in Large Language Models (LLMs). However, standard policy optimization methods, such as Group Relative Policy Optimization (GRPO), often converge to low-entropy policies, leading to severe mode collapse and limited output diversity. We analyze this issue from the perspective of sampling-probability dynamics, identifying that the standard objective disproportionately reinforces the highest-likelihood paths, thereby suppressing valid alternative reasoning chains. To address this, we propose a novel Advantage Re-weighting Mechanism (ARM) designed to equilibrate the confidence levels across all correct responses. By incorporating Prompt Perplexity and Answer Confidence into the advantage estimation, our method dynamically reshapes the reward signal to attenuate the gradient updates of over-confident reasoning paths while redistributing probability mass toward under-explored correct solutions. Empirical results demonstrate that our approach significantly enhances generative diversity and response entropy while maintaining competitive accuracy, achieving a superior trade-off between exploration and exploitation in reasoning tasks. Experiments on Qwen2.5 and DeepSeek models across mathematical and coding benchmarks show that the resulting method, ProGRPO, significantly mitigates entropy collapse. Specifically, on Qwen2.5-7B, our method outperforms GRPO by 5.7% in Pass@1 and, notably, by 13.9% in Pass@32, highlighting its superior capability in generating diverse correct reasoning paths.
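As a rough illustration of the idea described in the abstract (not the paper's exact formulation), the sketch below contrasts standard GRPO group-normalized advantages with a hypothetical confidence-based re-weighting. The function names, the attenuation factor `(1 - c) ** alpha`, and the `alpha` parameter are assumptions for illustration; the key point is that positive advantages of over-confident correct responses are damped, leaving relatively more reinforcement for under-explored correct solutions:

```python
import math

def grpo_advantages(rewards):
    """Standard GRPO-style advantage: z-score each reward within its
    sampled group of responses to the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # avoid division by zero for uniform groups
    return [(r - mean) / std for r in rewards]

def arm_advantages(rewards, answer_confidences, alpha=1.0):
    """Hypothetical ARM-style re-weighting (illustrative only):
    attenuate the positive advantage of a correct response in proportion
    to how confident the model already is in it, so gradient updates are
    shifted toward under-explored correct solutions."""
    out = []
    for a, c in zip(grpo_advantages(rewards), answer_confidences):
        if a > 0:  # only damp reinforcing (positive-advantage) paths
            a *= (1.0 - c) ** alpha
        out.append(a)
    return out

# Two correct responses (reward 1) and two incorrect ones (reward 0);
# the first correct response is sampled with much higher confidence.
advs = arm_advantages([1, 1, 0, 0], [0.9, 0.2, 0.5, 0.5])
# The over-confident correct path now gets a much smaller positive
# advantage than the rarer correct path; negative advantages are untouched.
print(advs)
```

Under these assumptions the plain GRPO advantages would be [1, 1, -1, -1]; the re-weighting breaks the tie between the two correct responses in favor of the lower-confidence one, which is the qualitative behavior the abstract attributes to ARM.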