How Far Can Unsupervised RLVR Scale LLM Training?

Abstract

Unsupervised reinforcement learning with verifiable rewards (URLVR) offers a pathway to scale LLM training beyond the supervision bottleneck by deriving rewards without ground-truth labels. Recent work leverages signals intrinsic to the model and shows promising early gains, yet the potential and limitations of this approach remain unclear. In this work, we revisit URLVR and provide a comprehensive analysis spanning taxonomy, theory, and extensive experiments. We first classify URLVR methods as intrinsic or external according to their reward source, then establish a unified theoretical framework revealing that all intrinsic methods converge toward sharpening the model's initial distribution. This sharpening mechanism succeeds when initial confidence aligns with correctness but fails catastrophically when the two are misaligned. Through systematic experiments, we show that intrinsic rewards consistently follow a rise-then-fall pattern across methods, with the timing of collapse determined by the model prior rather than by engineering choices. Despite these scaling limits, we find that intrinsic rewards remain valuable for test-time training on small datasets, and we propose the Model Collapse Step as a measure of the model prior that serves as a practical indicator of RL trainability. Finally, we explore external reward methods that ground verification in computational asymmetries, and we show preliminary evidence that they may escape the confidence-correctness ceiling. Our findings chart the boundaries of intrinsic URLVR while motivating paths toward scalable alternatives.
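The sharpening claim can be illustrated with a toy sketch. The code below is not the paper's setup: it assumes a single prompt with a handful of candidate answers, a softmax policy over them, and a majority-vote pseudo-label as one example of an intrinsic reward. The function name `sharpen_with_majority_vote` and all hyperparameters are hypothetical choices made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sharpen_with_majority_vote(init_probs, steps=200, samples=32, lr=0.5, seed=0):
    """Toy REINFORCE loop whose only reward is agreement with the majority answer.

    Assumption: one prompt, a categorical policy over len(init_probs) answers,
    and no access to the correct answer at any point during training.
    """
    rng = random.Random(seed)
    logits = [math.log(p) for p in init_probs]
    num_answers = len(logits)
    for _ in range(steps):
        probs = softmax(logits)
        draws = rng.choices(range(num_answers), weights=probs, k=samples)
        counts = [draws.count(a) for a in range(num_answers)]
        majority = counts.index(max(counts))  # ties go to the lowest index
        rewards = [1.0 if a == majority else 0.0 for a in draws]
        baseline = sum(rewards) / samples
        grad = [0.0] * num_answers
        for a, r in zip(draws, rewards):
            advantage = r - baseline
            for j in range(num_answers):
                # d log pi(a) / d logit_j = 1[j == a] - pi(j)
                grad[j] += advantage * ((1.0 if j == a else 0.0) - probs[j])
        logits = [l + lr * g / samples for l, g in zip(logits, grad)]
    return softmax(logits)


if __name__ == "__main__":
    prior = [0.6, 0.25, 0.15]
    final = sharpen_with_majority_vote(prior)
    # Accuracy is the probability mass on whichever answer is actually correct.
    print("prior:", prior)
    print("after training:", [round(p, 3) for p in final])
    print("accuracy if answer 0 is correct:", round(final[0], 3))
    print("accuracy if answer 1 is correct:", round(final[1], 3))
```

In this sketch the reward never consults the correct answer, so training only concentrates mass on the answer the prior already favored. If that answer is correct, accuracy rises; if the correct answer is a minority mode of the prior, the same update drives its probability toward zero.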
