

Learning to Reason without External Rewards

May 26, 2025
Authors: Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song
cs.AI

Abstract

Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor
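To make the mechanism concrete, below is a minimal PyTorch sketch of how a self-certainty-style confidence score could take the place of the external reward in GRPO's group-relative advantage computation. The function names (self_certainty, group_relative_advantages) and the exact confidence formula (average divergence of the per-token distribution from a uniform distribution) are illustrative assumptions rather than the authors' implementation; the official code is available at the linked repository.

```python
import math
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """Confidence score for one sampled response (hypothetical sketch).

    logits: (seq_len, vocab_size) next-token logits over the generated tokens.
    Assumes confidence is measured as the average divergence of the model's
    per-token distribution from a uniform distribution, so higher values mean
    a more "certain" model; see the paper/repo for the exact definition.
    """
    log_probs = F.log_softmax(logits, dim=-1)           # (T, V)
    vocab_size = logits.size(-1)
    # KL(uniform || p) per token = -log(V) - (1/V) * sum_i log p_i
    kl_per_token = -math.log(vocab_size) - log_probs.mean(dim=-1)
    return kl_per_token.mean()                          # scalar score

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: standardize scores within the group of
    responses sampled for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Usage sketch: score each of G responses sampled for one prompt with the
# intrinsic signal alone (no gold solutions, no test cases), then feed the
# standardized scores into the usual GRPO clipped policy-gradient update.
# rewards = torch.stack([self_certainty(lg) for lg in group_logits])
# advantages = group_relative_advantages(rewards)
```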
