The Illusion of Certainty: Decoupling Capability and Calibration in On-Policy Distillation

Abstract

On-policy distillation (OPD) is an increasingly important paradigm for post-training language models. However, we identify a pervasive Scaling Law of Miscalibration: while OPD effectively improves task accuracy, it systematically traps models in severe overconfidence. We trace this failure to an information mismatch: teacher supervision is formed under privileged context available only during training, whereas the deployed model must report confidence using deployment-time information alone. We formalize this perspective theoretically, showing that teacher-conditioned success is generally not a valid target for deployment-time confidence, and that helpful privileged context induces entropy collapse and a systematic optimism bias. To address this, we propose CaOPD, a calibration-aware OPD framework that estimates empirical confidence from model rollouts, replaces the self-reported confidence with this student-grounded target, and distills the revised response through the same self-distillation pipeline. Experiments across various models and domains show that CaOPD achieves Pareto-optimal calibration while maintaining competitive capability, and that it generalizes robustly to out-of-distribution settings and under continual learning. Our findings highlight that capability distillation does not imply calibrated confidence, and that confidence should be treated as an essential objective in post-training. Code: https://github.com/SalesforceAIResearch/CaOPD
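The confidence-revision step described above (estimate confidence from the student's own rollouts, then overwrite the self-reported value) can be illustrated with a minimal sketch. This is not the CaOPD implementation. It assumes a hypothetical response format whose final line reads `Confidence: <p>`, and a task-specific correctness checker `is_correct`; the names `empirical_confidence`, `replace_confidence`, and `CONF_LINE` are illustrative only. The point it shows is that the target is the student's empirical success rate without privileged context, rather than a teacher-conditioned success rate.

```python
import re
from typing import Callable, List

# Assumed (hypothetical) response format: the final line reads "Confidence: <p>",
# where <p> is a decimal probability in [0, 1].
CONF_LINE = re.compile(r"Confidence:\s*[0-9]*\.?[0-9]+\s*$")


def empirical_confidence(rollouts: List[str], is_correct: Callable[[str], bool]) -> float:
    """Student-grounded target: fraction of the student's own rollouts judged correct.

    The rollouts are assumed to be sampled from the student using only
    deployment-time input, i.e. without the teacher's privileged context.
    """
    if not rollouts:
        raise ValueError("need at least one rollout")
    n_correct = sum(1 for r in rollouts if is_correct(r))
    return n_correct / len(rollouts)


def replace_confidence(response: str, target: float) -> str:
    """Overwrite the final self-reported confidence line with the empirical target."""
    new_line = f"Confidence: {target:.2f}"
    if CONF_LINE.search(response):
        return CONF_LINE.sub(new_line, response, count=1)
    # No trailing confidence line found: append one instead.
    return response.rstrip() + "\n" + new_line


if __name__ == "__main__":
    # Four hypothetical student rollouts for a question whose reference answer is 42.
    rollouts = ["... Answer: 42", "... Answer: 41", "... Answer: 42", "... Answer: 7"]
    conf = empirical_confidence(rollouts, lambda r: r.endswith("Answer: 42"))
    print(conf)  # 0.5
    revised = replace_confidence("Reasoning...\nAnswer: 42\nConfidence: 0.99", conf)
    print(revised.splitlines()[-1])  # Confidence: 0.50
```

In this sketch, the revised response (now reporting 0.50 instead of an overconfident 0.99) would become the target consumed by the unchanged self-distillation step. How CaOPD actually samples rollouts, judges correctness, and formats confidence is not specified here.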
