
The Invisible Leash: Why RLVR May Not Escape Its Origin

July 20, 2025
Authors: Fang Wu, Weihao Xuan, Ximing Lu, Zaid Harchaoui, Yejin Choi
cs.AI

Abstract

Recent advances in large reasoning models highlight Reinforcement Learning with Verifiable Rewards (RLVR) as a promising method for enhancing AI's capabilities, particularly in solving complex logical tasks. However, it remains unclear whether RLVR truly expands a model's reasoning boundary or merely amplifies high-reward outputs that the base model already knows, improving precision. This study presents a theoretical and empirical investigation that provides fresh insights into the potential limits of RLVR. First, we offer a new theoretical perspective: RLVR is constrained by the base model's support (it cannot sample solutions with zero initial probability) and operates as a conservative reweighting mechanism that may restrict the discovery of entirely original solutions. We also identify an entropy-reward tradeoff: while RLVR reliably enhances precision, it may progressively narrow exploration and potentially overlook correct yet underrepresented solutions. Extensive empirical experiments validate that while RLVR consistently improves pass@1, the shrinkage of empirical support generally outweighs its expansion under larger sampling budgets, failing to recover correct answers that were previously accessible to the base model. Interestingly, we also observe that while RLVR sometimes increases token-level entropy, resulting in greater uncertainty at each generation step, answer-level entropy declines, indicating that these seemingly more uncertain paths ultimately converge onto a smaller set of distinct answers. Taken together, these findings reveal potential limits of RLVR in extending reasoning horizons. Breaking this invisible leash may require future algorithmic innovations, such as explicit exploration mechanisms or hybrid strategies that seed probability mass into underrepresented solution regions.
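
To make the two central claims concrete, here is a minimal sketch in standard KL-regularized RL notation; the objective, the symbols (base model pi_0, verifiable reward r, KL strength beta), and the answer-mapping function ans(y) are illustrative assumptions and may differ from the paper's exact formulation.

```latex
% Assumed KL-regularized RLVR objective (illustrative; notation not taken from the paper):
% \pi_0 = base model, r = verifiable reward, \beta = KL penalty strength.
\[
\pi^{*} \;=\; \arg\max_{\pi}\;
  \mathbb{E}_{y \sim \pi(\cdot\mid x)}\!\big[r(x,y)\big]
  \;-\; \beta\,\mathrm{KL}\!\big(\pi(\cdot\mid x)\,\|\,\pi_{0}(\cdot\mid x)\big)
\;\Longrightarrow\;
\pi^{*}(y\mid x) \;=\; \frac{\pi_{0}(y\mid x)\,e^{\,r(x,y)/\beta}}{Z(x)} .
\]
% Support preservation: if \pi_0(y|x) = 0, then \pi^*(y|x) = 0 for any finite reward,
% so the trained policy reweights mass within the base model's support rather than
% creating probability mass on previously unreachable solutions.

% The two entropies contrasted in the abstract, where ans(y) maps a sampled trace y
% to its final answer a:
\[
H_{\text{token}} \;=\; \mathbb{E}_{y\sim\pi}\Big[\tfrac{1}{|y|}\sum_{t}
  H\big(\pi(\cdot \mid x, y_{<t})\big)\Big],
\qquad
H_{\text{answer}} \;=\; H\big(p_{\pi}(\cdot \mid x)\big),\;\;
p_{\pi}(a\mid x) \;=\; \Pr_{y\sim\pi}\big[\mathrm{ans}(y)=a\big].
\]
% The reported pattern is that H_token can rise (more per-step uncertainty) while
% H_answer falls (sampled traces collapse onto fewer distinct final answers).
```

Under this reading, RLVR sharpens the distribution over answers the base model could already reach, which is consistent with pass@1 gains coexisting with shrinking empirical support at large sampling budgets.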