On-Policy Self-Distillation for Reasoning Compression

Abstract

Reasoning models think out loud, but much of what they say is noise. We introduce OPSDC (On-Policy Self-Distillation for Reasoning Compression), a method that teaches models to reason more concisely by distilling their own concise behavior back into themselves. The entire approach reduces to one idea: condition the same model on a "be concise" instruction to obtain teacher logits, and minimize the per-token reverse KL divergence on the student's own rollouts. No ground-truth answers, no token budgets, no difficulty estimators. Just self-distillation. Yet this simplicity belies surprising sophistication: OPSDC automatically compresses easy problems aggressively while preserving the deliberation needed for hard ones. On Qwen3-8B and Qwen3-14B, we achieve a 57–59% token reduction on MATH-500 while improving accuracy by 9–16 absolute points. On AIME 2024, the 14B model gains 10 points at 41% compression. The secret? Much of what reasoning models produce is not just redundant: it is actively harmful, compounding errors with every unnecessary token.
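
To make the objective concrete, the following is a minimal sketch of a per-token reverse KL term, KL(student || teacher), computed on toy logits. It is an illustration under stated assumptions, not the authors' implementation: the function names (`log_softmax`, `reverse_kl_per_token`, `self_distillation_loss`) are hypothetical, the toy logits stand in for real model outputs, the teacher is treated as a fixed target, and averaging over token positions is an assumed reduction that the abstract does not specify.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl_per_token(student_logits, teacher_logits):
    """Return KL(student || teacher) for each token position.

    student_logits: logits of the model WITHOUT the conciseness instruction,
                    evaluated on a rollout sampled from the student itself.
    teacher_logits: logits of the SAME model on the same rollout, but
                    conditioned on a "be concise" instruction (assumed to be
                    treated as a fixed target).
    Both arguments are lists of per-position logit lists.
    """
    losses = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        s_logp = log_softmax(s_row)
        t_logp = log_softmax(t_row)
        # Reverse KL: the expectation is taken under the student distribution.
        kl = sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s_logp, t_logp))
        losses.append(kl)
    return losses


def self_distillation_loss(student_logits, teacher_logits):
    """Mean per-token reverse KL (the mean reduction is an assumption)."""
    per_token = reverse_kl_per_token(student_logits, teacher_logits)
    return sum(per_token) / len(per_token)


# Toy example: a 2-token rollout over a 3-word vocabulary.
student = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
teacher = [[2.0, 0.0, -1.0], [3.0, 0.0, 0.0]]

per_token = reverse_kl_per_token(student, teacher)
loss = self_distillation_loss(student, teacher)
```

In this toy case the first position contributes zero loss because student and teacher agree, while the second contributes a positive term because the teacher concentrates mass on one token and the student does not. In a real training loop the logits would come from two forward passes of the same network over the student's sampled tokens, differing only in the prompt.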