Learning to Reason without External Rewards

Abstract

Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by its reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor
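The reward substitution described in the abstract can be illustrated with a small sketch. The abstract does not give a formula for self-certainty, so the code below assumes one plausible definition: the average KL divergence from a uniform distribution over the vocabulary to the model's next-token distribution. The function names `self_certainty` and `group_relative_advantages`, the toy logits, and the `eps` constant are hypothetical and for illustration only. This is not the released Intuitor implementation.

```python
import math
from statistics import mean, pstdev


def self_certainty(token_logits):
    """Average KL(U || p) over the tokens of one response.

    token_logits: one list of logits per generated token.
    ASSUMPTION: this formula is not stated in the abstract.
    """
    total = 0.0
    for logits in token_logits:
        vocab = len(logits)
        # Numerically stable log-softmax.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        log_probs = [x - log_z for x in logits]
        # KL(U || p) = -log(V) - (1/V) * sum_j log p_j
        total += -math.log(vocab) - sum(log_probs) / vocab
    return total / len(token_logits)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style step: standardise rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group: three sampled responses to one prompt, vocabulary size 4.
group_logits = [
    [[4.0, 0.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0]],  # sharply peaked
    [[1.0, 0.5, 0.0, 0.0], [0.0, 1.0, 0.5, 0.0]],  # mildly peaked
    [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],  # uniform
]

# The self-certainty score takes the place of an external, verifiable reward.
rewards = [self_certainty(resp) for resp in group_logits]
advantages = group_relative_advantages(rewards)
```

Under this assumed definition, a response whose token distributions are uniform scores zero, and more sharply peaked distributions score higher, so the group-relative advantages favour the responses the model is most confident about. No gold answer or test case enters the computation.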
