How Far Can Unsupervised RLVR Scale LLM Training?

Abstract

Unsupervised reinforcement learning with verifiable rewards (URLVR) offers a pathway to scale LLM training beyond the supervision bottleneck by deriving rewards without ground-truth labels. Recent works leverage model-intrinsic signals and show promising early gains, yet their potential and limitations remain unclear. In this work, we revisit URLVR and provide a comprehensive analysis spanning taxonomy, theory, and extensive experiments. We first classify URLVR methods as intrinsic or external based on their reward sources, then establish a unified theoretical framework revealing that all intrinsic methods converge toward sharpening the model's initial distribution. This sharpening mechanism succeeds when initial confidence aligns with correctness but fails catastrophically when the two are misaligned. Through systematic experiments, we show that intrinsic rewards consistently follow a rise-then-fall pattern across methods, with collapse timing determined by the model prior rather than by engineering choices. Despite these scaling limits, we find that intrinsic rewards remain valuable for test-time training on small datasets, and we propose the Model Collapse Step as a measure of the model prior that serves as a practical indicator of RL trainability. Finally, we explore external reward methods that ground verification in computational asymmetries, and present preliminary evidence that they may escape the confidence-correctness ceiling. Our findings chart the boundaries of intrinsic URLVR while motivating paths toward scalable alternatives.
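The sharpening behavior described above can be illustrated with a toy sketch. This is not the paper's method or code: it assumes a single question with a softmax policy over three candidate answers, and a hypothetical intrinsic reward that pays 1 when a sampled answer matches the policy's current most likely answer. The function names `softmax` and `sharpen`, the learning rate, and the step count are all illustrative assumptions.

```python
import math


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sharpen(logits, steps=200, lr=1.0):
    """Exact expected policy-gradient ascent on an assumed intrinsic reward.

    Assumption: reward is 1 for the policy's current mode and 0 otherwise.
    No ground-truth label is used anywhere in the update.
    """
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        mode = max(range(len(probs)), key=probs.__getitem__)
        rewards = [1.0 if k == mode else 0.0 for k in range(len(probs))]
        baseline = sum(p * r for p, r in zip(probs, rewards))
        # For a softmax policy, d E[r] / d logit_k = p_k * (r_k - E[r]).
        logits = [
            z + lr * p * (r - baseline)
            for z, p, r in zip(logits, probs, rewards)
        ]
    return softmax(logits)


initial_logits = [1.0, 0.5, 0.0]  # the model initially prefers answer 0
before = softmax(initial_logits)
after = sharpen(initial_logits)

# Aligned case: if answer 0 is correct, accuracy is after[0] and rises.
# Misaligned case: if answer 1 is correct, accuracy is after[1] and falls.
print("before:", [round(p, 3) for p in before])
print("after: ", [round(p, 3) for p in after])
```

In this toy setting the update only amplifies whatever the initial distribution already favors, so the outcome depends entirely on whether that initial preference happens to be correct.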
