How Far Can Unsupervised RLVR Scale LLM Training?

Abstract

Unsupervised reinforcement learning with verifiable rewards (URLVR) offers a pathway to scale LLM training beyond the supervision bottleneck by deriving rewards without ground-truth labels. Recent work leverages signals intrinsic to the model and shows promising early gains, yet the potential and limitations of this approach remain unclear. In this work, we revisit URLVR and provide a comprehensive analysis spanning taxonomy, theory, and extensive experiments. We first classify URLVR methods as intrinsic or external according to their reward source, then establish a unified theoretical framework revealing that all intrinsic methods converge toward sharpening the model's initial distribution. This sharpening mechanism succeeds when initial confidence aligns with correctness but fails catastrophically when the two are misaligned. Through systematic experiments, we show that intrinsic rewards consistently follow a rise-then-fall pattern across methods, with the timing of collapse determined by the model prior rather than by engineering choices. Despite these scaling limits, we find that intrinsic rewards remain valuable for test-time training on small datasets, and we propose Model Collapse Step as a measure of the model prior that serves as a practical indicator of RL trainability. Finally, we explore external reward methods that ground verification in computational asymmetries, showing preliminary evidence that they may escape the confidence-correctness ceiling. Our findings chart the boundaries of intrinsic URLVR while motivating paths toward scalable alternatives.
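The sharpening claim can be illustrated with a toy sketch. This is not the paper's framework or any of its methods: it assumes, purely for illustration, that "sharpening" means repeatedly raising an answer distribution to a power and renormalizing. The names `sharpen` and `prob_correct` and the `temperature` parameter are hypothetical. The same prior is scored twice, once when its most confident answer is the correct one and once when it is not.

```python
def sharpen(probs, steps, temperature=0.5):
    """Toy sharpening (an assumption, not the paper's update rule):
    raise each probability to the power 1/temperature and renormalize,
    repeated `steps` times. With temperature < 1 this moves mass toward
    the answer that was most likely to begin with."""
    current = list(probs)
    for _ in range(steps):
        powered = [p ** (1.0 / temperature) for p in current]
        total = sum(powered)
        current = [p / total for p in powered]
    return current


def prob_correct(probs, correct_index):
    """Probability the distribution assigns to the correct answer."""
    return probs[correct_index]


# Hypothetical prior over three candidate answers.
prior = [0.5, 0.3, 0.2]
sharpened = sharpen(prior, steps=5)

# Aligned case: the most confident answer (index 0) is the correct one,
# so sharpening raises the probability of being correct.
aligned_before = prob_correct(prior, 0)
aligned_after = prob_correct(sharpened, 0)

# Misaligned case: the correct answer is index 1, which the prior ranks
# second, so the same sharpening drives its probability toward zero.
misaligned_before = prob_correct(prior, 1)
misaligned_after = prob_correct(sharpened, 1)

print("aligned:", aligned_before, "->", aligned_after)
print("misaligned:", misaligned_before, "->", misaligned_after)
```

In this toy setting the update never looks at correctness, so the outcome depends entirely on whether the prior's mode happens to be right, which mirrors the aligned versus misaligned distinction drawn in the abstract.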
