Exploration vs. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward
December 18, 2025
Authors: Peter Chen, Xiaopeng Li, Ziniu Li, Wotao Yin, Xi Chen, Tianyi Lin
cs.AI
Abstract
This paper examines the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), a framework for improving the reasoning of Large Language Models (LLMs). Recent studies suggest that RLVR can elicit strong mathematical reasoning in LLMs through two seemingly paradoxical mechanisms: spurious rewards, which suppress exploitation by rewarding outcomes unrelated to the ground truth, and entropy minimization, which suppresses exploration by pushing the model toward more confident and deterministic outputs. This highlights a puzzling dynamic: both discouraging exploitation and discouraging exploration improve reasoning performance, yet the underlying principles that reconcile these effects remain poorly understood. We focus on two fundamental questions: (i) how policy entropy relates to performance, and (ii) whether spurious rewards yield gains through the interplay of clipping bias and model contamination. Our results show that clipping bias under spurious rewards reduces policy entropy, leading to more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement. We further propose a reward-misalignment model that explains why spurious rewards can enhance performance beyond contaminated settings. Our findings clarify the mechanisms behind spurious-reward benefits and provide principles for more effective RLVR training.
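As a rough illustration of the two quantities the abstract discusses, the sketch below (Python with NumPy; a minimal toy, not the authors' implementation) evaluates a PPO/GRPO-style clipped surrogate objective for single actions under a uniformly positive, ground-truth-independent advantage, and the Shannon entropy of a toy categorical policy before and after it sharpens. The probability values, the clipping range eps = 0.2, and the advantage value are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions only, not the paper's code):
# a PPO-style clipped surrogate and categorical policy entropy.
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped objective for a single token/action."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

def entropy(probs):
    """Shannon entropy (in nats) of a categorical policy."""
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log(probs + 1e-12))

# Toy numbers: a spurious reward assigns the same positive advantage to
# every sampled action, regardless of correctness.
old_probs = np.array([0.5, 0.3, 0.2])
new_probs = np.array([0.7, 0.2, 0.1])   # policy has sharpened
advantage = 1.0                          # spurious, uniformly positive

ratios = new_probs / old_probs
for a, r in enumerate(ratios):
    print(f"action {a}: ratio={r:.2f}, objective={clipped_surrogate(r, advantage):.3f}")

print(f"old entropy: {entropy(old_probs):.3f} nats")
print(f"new entropy: {entropy(new_probs):.3f} nats  (lower = more deterministic)")
```

In this toy run, the action whose importance ratio exceeds 1 + eps receives a flat (clipped) objective, so the surrogate no longer rewards further probability growth there, while the sharpened distribution has visibly lower entropy; this is only meant to make the abstract's vocabulary (clipping, policy entropy, spurious advantage) concrete, not to reproduce the paper's analysis.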