

Learning to Reason without External Rewards

May 26, 2025
Authors: Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song
cs.AI

Abstract

Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor
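As a rough illustration of the core idea (a sketch, not the authors' exact implementation), the snippet below assumes self-certainty is measured as the average KL divergence from a uniform distribution over the vocabulary to the model's next-token distribution, and shows how such scores could stand in for external rewards in a GRPO-style group-relative advantage computation. Function names, tensor shapes, and the exact self-certainty formula are illustrative assumptions.

```python
import math
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """Assumed definition: average KL(Uniform || p) over positions, where p is the
    model's next-token distribution. Higher values = more confident generations.
    logits: (seq_len, vocab_size) for one generated response."""
    log_probs = F.log_softmax(logits, dim=-1)            # log p(token | prefix)
    vocab_size = logits.size(-1)
    # KL(U || p) at each position = -log|V| - (1/|V|) * sum_j log p_j
    kl_per_pos = -math.log(vocab_size) - log_probs.mean(dim=-1)
    return kl_per_pos.mean()

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantages: standardize rewards within the group of responses
    sampled for the same prompt (no learned critic, no external reward model)."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def intuitor_style_rewards(responses_logits: list[torch.Tensor]) -> torch.Tensor:
    """Score each sampled response by its own self-certainty, then convert the
    scores into group-relative advantages for the policy-gradient update."""
    scores = torch.stack([self_certainty(l) for l in responses_logits])
    return group_relative_advantages(scores)
```

In this sketch, the only change relative to a verifiable-reward GRPO pipeline is the reward source: the per-response score comes from the policy's own output distribution rather than from a verifier or gold answer, which is what makes the training loop fully unsupervised.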
