
The Invisible Leash: Why RLVR May Not Escape Its Origin

July 20, 2025
Authors: Fang Wu, Weihao Xuan, Ximing Lu, Zaid Harchaoui, Yejin Choi
cs.AI

Abstract

Recent advances in large reasoning models highlight Reinforcement Learning with Verifiable Rewards (RLVR) as a promising method for enhancing AI's capabilities, particularly in solving complex logical tasks. However, it remains unclear whether RLVR truly expands a model's reasoning boundary or merely amplifies high-reward outputs that the base model already knows, improving precision alone. This study presents a theoretical and empirical investigation that provides fresh insights into the potential limits of RLVR. First, we offer a new theoretical perspective that RLVR is constrained by the base model's support (it cannot sample solutions with zero initial probability) and operates as a conservative reweighting mechanism that may restrict the discovery of entirely original solutions. We also identify an entropy-reward tradeoff: while RLVR reliably enhances precision, it may progressively narrow exploration and potentially overlook correct yet underrepresented solutions. Extensive empirical experiments confirm that while RLVR consistently improves pass@1, the shrinkage of empirical support generally outweighs its expansion under larger sampling budgets, failing to recover correct answers that were previously accessible to the base model. Interestingly, we also observe that while RLVR sometimes increases token-level entropy, producing greater uncertainty at each generation step, answer-level entropy declines, indicating that these seemingly more uncertain paths ultimately converge onto a smaller set of distinct answers. Taken together, these findings reveal potential limits of RLVR in extending reasoning horizons. Breaking this invisible leash may require future algorithmic innovations such as explicit exploration mechanisms or hybrid strategies that seed probability mass into underrepresented solution regions.
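
The support-constraint argument sketched in the abstract can be illustrated with the standard KL-regularized objective commonly used to analyze reward-based fine-tuning. The following is a minimal sketch under assumed notation (base policy π₀, verifiable reward R, regularization strength β); it is not necessarily the exact formulation used in the paper.

```latex
% Minimal sketch (assumed notation): KL-regularized fine-tuning objective and its optimum.
% \pi_0 = base model, R = verifiable reward, \beta = regularization strength.
\[
\pi^{*} \;=\; \arg\max_{\pi}\;
\mathbb{E}_{y \sim \pi(\cdot \mid x)}\!\big[R(x, y)\big]
\;-\; \beta\, \mathrm{KL}\!\big(\pi(\cdot \mid x)\,\big\|\,\pi_{0}(\cdot \mid x)\big)
\]
\[
\pi^{*}(y \mid x) \;=\; \frac{\pi_{0}(y \mid x)\,\exp\!\big(R(x, y)/\beta\big)}{Z(x)},
\qquad
Z(x) \;=\; \sum_{y'} \pi_{0}(y' \mid x)\,\exp\!\big(R(x, y')/\beta\big).
\]
% Consequence: if \pi_0(y|x) = 0, then \pi^*(y|x) = 0, so the trained policy can only
% reweight answers already inside the base model's support: the "invisible leash".
```

Under this view, training sharpens the distribution over answers the base model could already produce rather than placing probability mass on genuinely new ones, which is consistent with the conservative-reweighting interpretation described in the abstract.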
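The empirical quantities referenced in the abstract (pass@1 versus larger sampling budgets, and token-level versus answer-level entropy) can be estimated from repeated samples per problem. The sketch below is illustrative only and is not the authors' evaluation code; the function names and toy data are hypothetical.

```python
import math
from collections import Counter

def token_level_entropy(step_distributions):
    """Mean per-step entropy (nats) of one sampled generation.

    step_distributions: list of per-step next-token probability vectors.
    """
    per_step = [-sum(p * math.log(p) for p in dist if p > 0)
                for dist in step_distributions]
    return sum(per_step) / max(len(per_step), 1)

def answer_level_entropy(final_answers):
    """Entropy (nats) of the empirical distribution over distinct final answers."""
    counts = Counter(final_answers)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k),
    given n samples of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# Toy illustration: five sampled answers that collapse onto two distinct values.
answers = ["42", "42", "42", "41", "42"]
print(answer_level_entropy(answers))        # low answer-level entropy (answers collapse)
print(pass_at_k(n=5, c=4, k=1))             # precision at a single sample
print(pass_at_k(n=5, c=4, k=5))             # coverage at a larger sampling budget

toy_steps = [[0.7, 0.2, 0.1], [0.5, 0.5]]   # two decoding steps over tiny vocabularies
print(token_level_entropy(toy_steps))        # mean per-step (token-level) entropy
```

Measuring both entropies on the same sample set is what makes the reported divergence visible: per-step uncertainty can rise even while the set of distinct final answers shrinks.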