How Far Can Unsupervised RLVR Scale LLM Training?
March 9, 2026
Authors: Bingxiang He, Yuxin Zuo, Zeyuan Liu, Shangziqi Zhao, Zixuan Fu, Junlin Yang, Cheng Qian, Kaiyan Zhang, Yuchen Fan, Ganqu Cui, Xiusi Chen, Youbang Sun, Xingtai Lv, Xuekai Zhu, Li Sheng, Ran Li, Huan-ang Gao, Yuchen Zhang, Bowen Zhou, Zhiyuan Liu, Ning Ding
cs.AI
Abstract
Unsupervised reinforcement learning with verifiable rewards (URLVR) offers a pathway to scale LLM training beyond the supervision bottleneck by deriving rewards without ground-truth labels. Recent works leverage model-intrinsic signals and show promising early gains, yet the potential and limitations of this paradigm remain unclear. In this work, we revisit URLVR and provide a comprehensive analysis spanning taxonomy, theory, and extensive experiments. We first classify URLVR methods as intrinsic or external according to their reward source, then establish a unified theoretical framework revealing that all intrinsic methods converge toward sharpening the model's initial distribution. This sharpening mechanism succeeds when initial confidence aligns with correctness but fails catastrophically when the two are misaligned. Through systematic experiments, we show that intrinsic rewards consistently follow a rise-then-fall pattern across methods, with collapse timing determined by the model prior rather than by engineering choices. Despite these scaling limits, we find that intrinsic rewards remain valuable for test-time training on small datasets, and we propose the Model Collapse Step as a measure of the model prior and a practical indicator of RL trainability. Finally, we explore external reward methods that ground verification in computational asymmetries, showing preliminary evidence that they may escape the confidence-correctness ceiling. Our findings chart the boundaries of intrinsic URLVR while motivating paths toward scalable alternatives.
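One way to see why intrinsic rewards can only sharpen the initial distribution is the closed-form optimum of KL-regularized RL. The derivation below is a minimal sketch, assuming a self-confidence reward $r(x,y)=\log \pi_0(y \mid x)$ as a stand-in for the intrinsic signals the abstract describes; this specific reward choice is our illustrative assumption, not a definition taken from the paper.

```latex
% Minimal sketch: KL-regularized RL with reference policy \pi_0 and an
% assumed self-confidence reward r(x,y) = \log \pi_0(y \mid x).
\[
\max_{\pi}\;
\mathbb{E}_{y \sim \pi}\bigl[r(x,y)\bigr]
\;-\;
\beta\,\mathrm{KL}\bigl(\pi \,\|\, \pi_0\bigr)
\quad\Longrightarrow\quad
\pi^{*}(y \mid x) \;\propto\; \pi_0(y \mid x)\, e^{\,r(x,y)/\beta}.
\]
% Substituting the assumed reward gives
\[
\pi^{*}(y \mid x) \;\propto\; \pi_0(y \mid x)^{\,1 + 1/\beta},
\]
% i.e., a temperature-sharpened copy of the initial distribution: probability
% mass concentrates on the prior mode, whether or not that mode is correct.
```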
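The rise-then-fall and collapse behavior can likewise be reproduced in a toy setting. The numpy sketch below is our illustrative construction, not the paper's experimental setup: a categorical "model" over five candidate answers is trained with REINFORCE against a label-free majority-vote reward, and it collapses onto whatever answer its prior already favors.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 5                                          # candidate answers; index 0 is "correct"
logits = np.array([1.0, 0.6, 0.2, 0.1, 0.0])   # prior mode agrees with correctness

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_step(logits, lr=0.5, n_samples=64):
    """One policy-gradient step with a label-free majority-vote reward."""
    p = softmax(logits)
    samples = rng.choice(K, size=n_samples, p=p)
    majority = np.bincount(samples, minlength=K).argmax()
    rewards = (samples == majority).astype(float)
    rewards -= rewards.mean()                  # mean baseline for variance reduction
    # For a categorical policy, grad of log p(s) w.r.t. logits is one_hot(s) - p.
    grad = ((np.eye(K)[samples] - p) * rewards[:, None]).mean(axis=0)
    return logits + lr * grad

for t in range(201):
    if t % 50 == 0:
        p = softmax(logits)
        print(f"step {t:3d}  p(correct)={p[0]:.3f}  entropy={-(p * np.log(p)).sum():.3f}")
    logits = reinforce_step(logits)

# The policy collapses onto its initial mode: p(correct) -> 1 here because the
# prior already favors answer 0. Start instead from a misaligned prior, e.g.
# logits = np.array([0.0, 1.0, 0.2, 0.1, 0.0]), and the same loop confidently
# collapses onto the wrong answer -- the failure mode described in the abstract.
```

Tracking the entropy printed above gives a crude stand-in for a collapse indicator: the step at which it bottoms out depends only on the starting logits, echoing the abstract's claim that collapse timing is set by the model prior rather than by training hyperparameters.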