

Learning to Reason without External Rewards

May 26, 2025
Authors: Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song
cs.AI

Abstract

Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor
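To make the mechanism concrete, below is a minimal PyTorch sketch of how a self-certainty-style confidence score could take the place of the external reward in GRPO's group-relative advantage computation. The function names (self_certainty, group_relative_advantages) and the exact confidence formula (average divergence of the per-token distribution from a uniform distribution) are illustrative assumptions rather than the authors' implementation; the official code is available at the linked repository.

```python
import math
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """Confidence score for one sampled response (hypothetical sketch).

    logits: (seq_len, vocab_size) next-token logits over the generated tokens.
    Assumes confidence is measured as the average divergence of the model's
    per-token distribution from a uniform distribution, so higher values mean
    a more "certain" model; see the paper/repo for the exact definition.
    """
    log_probs = F.log_softmax(logits, dim=-1)           # (T, V)
    vocab_size = logits.size(-1)
    # KL(uniform || p) per token = -log(V) - (1/V) * sum_i log p_i
    kl_per_token = -math.log(vocab_size) - log_probs.mean(dim=-1)
    return kl_per_token.mean()                          # scalar score

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: standardize scores within the group of
    responses sampled for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Usage sketch: score each of G responses sampled for one prompt with the
# intrinsic signal alone (no gold solutions, no test cases), then feed the
# standardized scores into the usual GRPO clipped policy-gradient update.
# rewards = torch.stack([self_certainty(lg) for lg in group_logits])
# advantages = group_relative_advantages(rewards)
```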
