Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) significantly enhances the reasoning of large language models (LLMs) but suffers from severe calibration degeneration, in which models become over-confident in incorrect answers. Previous studies have focused on directly incorporating a calibration objective into the existing optimization target. However, our theoretical analysis shows that a fundamental gradient conflict exists between maximizing policy accuracy and minimizing calibration error. Building on this insight, we propose DCPO, a simple yet effective framework that systematically decouples the reasoning and calibration objectives. Extensive experiments demonstrate that DCPO not only preserves accuracy on par with GRPO but also achieves the best calibration performance and substantially mitigates the over-confidence issue. Our study provides valuable insights and a practical solution for more reliable LLM deployment.
