The Illusion of Certainty: Decoupling Capability and Calibration in On-Policy Distillation
April 18, 2026
Authors: Jiaxin Zhang, Xiangyu Peng, Qinglin Chen, Qinyuan Ye, Caiming Xiong, Chien-Sheng Wu
cs.AI
Abstract
On-policy distillation (OPD) is an increasingly important paradigm for post-training language models. However, we identify a pervasive Scaling Law of Miscalibration: while OPD effectively improves task accuracy, it systematically traps models in severe overconfidence. We trace this failure to an information mismatch: teacher supervision is formed under privileged context available during training, whereas the deployed model must report confidence using only deployment-time information. We formalize this perspective theoretically, showing that teacher-conditioned success is generally not a valid target for deployment-time confidence and that helpful privileged context induces entropy collapse and a systematic optimism bias. To address this, we propose a calibration-aware OPD framework, CaOPD, that estimates empirical confidence from model rollouts, replaces self-reported confidence with this student-grounded target, and distills the revised response through the same self-distillation pipeline. Experiments across various models and domains show that CaOPD achieves Pareto-optimal calibration while maintaining competitive capability and generalizes robustly in out-of-distribution and continual-learning settings. Our findings highlight that capability distillation does not imply calibrated confidence, and that confidence should be treated as an essential objective in post-training. Code: https://github.com/SalesforceAIResearch/CaOPD
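The student-grounded confidence target described in the abstract — empirical confidence estimated from model rollouts — can be sketched as follows. This is a minimal illustration, not the paper's implementation: `rollout_fn`, `is_correct_fn`, and the toy stubs below are hypothetical stand-ins for the student model's sampler and an answer checker.

```python
import random

def empirical_confidence(rollout_fn, question, is_correct_fn, n_rollouts=16):
    # Student-grounded confidence: the fraction of sampled rollouts
    # from the student model that yield a correct answer. This empirical
    # success rate replaces the model's self-reported confidence.
    correct = sum(is_correct_fn(rollout_fn(question)) for _ in range(n_rollouts))
    return correct / n_rollouts

# Toy demonstration: a stochastic stand-in for the student that answers
# "4" about 70% of the time, so its estimated confidence should be ~0.7.
random.seed(0)
toy_rollout = lambda q: "4" if random.random() < 0.7 else "5"
conf = empirical_confidence(toy_rollout, "2+2?", lambda a: a == "4",
                            n_rollouts=1000)
```

Because the target is computed from the student's own sampling distribution rather than from teacher-conditioned success, it reflects only information available at deployment time, which is the mismatch the abstract identifies.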