The Illusion of Certainty: Decoupling Capability and Calibration in On-Policy Distillation

Abstract

On-policy distillation (OPD) is an increasingly important paradigm for post-training language models. However, we identify a pervasive Scaling Law of Miscalibration: while OPD effectively improves task accuracy, it systematically drives models into severe overconfidence. We trace this failure to an information mismatch: teacher supervision is formed under privileged context that is available only during training, whereas the deployed model must report its confidence using only deployment-time information. We formalize this perspective theoretically, showing that teacher-conditioned success is generally not a valid target for deployment-time confidence, and that helpful privileged context induces entropy collapse and a systematic optimism bias. To address this, we propose CaOPD, a calibration-aware OPD framework that estimates empirical confidence from model rollouts, replaces the self-reported confidence with this student-grounded target, and distills the revised response through the same self-distillation pipeline. Experiments across various models and domains show that CaOPD achieves Pareto-optimal calibration while maintaining competitive capability, and that it generalizes robustly to out-of-distribution and continual-learning settings. Our findings highlight that capability distillation does not imply calibrated confidence, and that confidence should be treated as an essential objective in post-training. Code: https://github.com/SalesforceAIResearch/CaOPD
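
As a rough illustration of the target-replacement idea described in the abstract, the sketch below estimates an empirical confidence as the fraction of correct student rollouts and writes it over the confidence the model reported for itself. This is not the CaOPD implementation: the function names, the "Confidence: 0.95" response format, and the example grades are all assumptions made for illustration, and the subsequent distillation step is omitted.

```python
import re
from typing import Sequence

# Assumption: the response carries a statement such as "Confidence: 0.95".
# The actual response format used by CaOPD is not given in the abstract.
CONFIDENCE_PATTERN = re.compile(r"Confidence:\s*([01](?:\.\d+)?)")


def empirical_confidence(rollout_correct: Sequence[bool]) -> float:
    """Fraction of the student's own rollouts that were graded correct."""
    if not rollout_correct:
        raise ValueError("need at least one rollout")
    return sum(rollout_correct) / len(rollout_correct)


def replace_reported_confidence(response: str, target: float) -> str:
    """Overwrite the self-reported confidence with the rollout-based target."""
    new_text, n_replaced = CONFIDENCE_PATTERN.subn(
        f"Confidence: {target:.2f}", response
    )
    if n_replaced == 0:
        raise ValueError("no confidence statement found in response")
    return new_text


# Illustrative grades for four rollouts on one prompt (not real data).
rollouts = [True, False, True, False]
revised = replace_reported_confidence(
    "The answer is 42.\nConfidence: 0.95", empirical_confidence(rollouts)
)
print(revised)
```

In this toy case the model claimed 0.95 but was correct in two of four rollouts, so the revised response reports 0.50. Under the abstract's description, a revised response of this kind would then serve as the distillation target.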
