

Learning to Reason without External Rewards

May 26, 2025
Authors: Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song
cs.AI

Abstract

Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor
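As a rough illustration of the core idea (a sketch, not the authors' exact implementation), the snippet below assumes self-certainty is measured as the average KL divergence from a uniform distribution over the vocabulary to the model's next-token distribution, and shows how such scores could stand in for external rewards in a GRPO-style group-relative advantage computation. Function names, tensor shapes, and the exact self-certainty formula are illustrative assumptions.

```python
import math
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """Assumed definition: average KL(Uniform || p) over positions, where p is the
    model's next-token distribution. Higher values = more confident generations.
    logits: (seq_len, vocab_size) for one generated response."""
    log_probs = F.log_softmax(logits, dim=-1)            # log p(token | prefix)
    vocab_size = logits.size(-1)
    # KL(U || p) at each position = -log|V| - (1/|V|) * sum_j log p_j
    kl_per_pos = -math.log(vocab_size) - log_probs.mean(dim=-1)
    return kl_per_pos.mean()

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantages: standardize rewards within the group of responses
    sampled for the same prompt (no learned critic, no external reward model)."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def intuitor_style_rewards(responses_logits: list[torch.Tensor]) -> torch.Tensor:
    """Score each sampled response by its own self-certainty, then convert the
    scores into group-relative advantages for the policy-gradient update."""
    scores = torch.stack([self_certainty(l) for l in responses_logits])
    return group_relative_advantages(scores)
```

In this sketch, the only change relative to a verifiable-reward GRPO pipeline is the reward source: the per-response score comes from the policy's own output distribution rather than from a verifier or gold answer, which is what makes the training loop fully unsupervised.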
