Learning to Reason without External Rewards

Abstract

Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by its reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces the external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while generalizing better to out-of-domain tasks such as code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor.
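To make the reward substitution concrete, here is a minimal sketch of how a self-certainty score could stand in for an external reward when computing group-relative advantages. The abstract does not give formulas, so two points are assumptions: self-certainty is taken to be the average KL divergence from a uniform distribution to the model's next-token distribution, and the advantage is the usual standardized reward within a group of sampled responses. The function names `self_certainty` and `group_relative_advantages` are hypothetical, and the toy distributions stand in for real model outputs.

```python
import math
from statistics import mean, pstdev


def self_certainty(token_dists):
    """Assumed definition: mean over tokens of KL(U || p), where U is uniform
    over the vocabulary and p is the next-token distribution at that step."""
    total = 0.0
    for p in token_dists:
        v = len(p)
        # KL(U || p) = sum_j (1/V) * log((1/V) / p_j); clamp p_j to avoid log(0).
        total += sum((1.0 / v) * math.log((1.0 / v) / max(pj, 1e-12)) for pj in p)
    return total / len(token_dists)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.
    Population standard deviation is an assumption of this sketch."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group: three responses to one prompt, two tokens each, vocabulary of size 4.
group = [
    [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]],  # maximally uncertain
    [[0.97, 0.01, 0.01, 0.01], [0.70, 0.10, 0.10, 0.10]],  # confident
    [[0.40, 0.30, 0.20, 0.10], [0.40, 0.30, 0.20, 0.10]],  # mildly confident
]

# No gold answers or test cases are consulted: the reward comes from the model itself.
rewards = [self_certainty(response) for response in group]
advantages = group_relative_advantages(rewards)
```

In this sketch the most confident response receives the largest advantage and the uniform one the smallest, so a policy-gradient update weighted by these values would push the model toward outputs it is more certain about.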
