Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) significantly enhances the reasoning ability of large language models (LLMs), but it suffers severely from calibration degeneration, in which models become excessively over-confident in incorrect answers. Previous studies have focused on directly incorporating a calibration objective into the existing optimization target. However, our theoretical analysis demonstrates that there is a fundamental gradient conflict between optimizing to maximize policy accuracy and optimizing to minimize calibration error. Building on this insight, we propose DCPO, a simple yet effective framework that systematically decouples the reasoning and calibration objectives. Extensive experiments demonstrate that DCPO not only preserves accuracy on par with GRPO but also achieves the best calibration performance and substantially mitigates the over-confidence issue. Our study provides valuable insights and a practical solution for more reliable LLM deployment.
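To make the notion of calibration error concrete, the sketch below computes an equal-width-bin expected calibration error (ECE) from per-answer confidences and correctness labels. This is an illustrative assumption: the abstract does not say which calibration metric the paper reports, and the function name `expected_calibration_error` and the choice of 10 bins are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE (assumed metric, not taken from the paper).

    confidences: floats in [0, 1], the model's stated confidence per answer.
    correct:     booleans, whether each answer passed the verifier.
    Returns the sample-weighted mean of |accuracy - mean confidence| per bin.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# An over-confident model: 95% confidence on every answer, half of them wrong.
overconfident = expected_calibration_error([0.95] * 4, [True, False, True, False])

# A calibrated model with the same accuracy: 50% confidence throughout.
calibrated = expected_calibration_error([0.5] * 4, [True, False, True, False])
```

Both hypothetical models have the same accuracy (50%), yet the first has an ECE of roughly 0.45 and the second an ECE of 0. Under this assumed metric, accuracy and calibration are separate quantities, which is the distinction the abstract draws.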

