Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

Abstract

On-policy self-distillation, where a student is pulled toward a copy of itself conditioned on privileged context (e.g., a verified solution or feedback), offers a promising direction for advancing reasoning capability without a stronger external teacher. Yet in math reasoning, the gains are inconsistent, even when the same approach succeeds elsewhere. A pointwise mutual information (PMI) analysis traces the failure to the privileged context itself: it inflates the teacher's confidence on tokens already implied by the solution (structural connectives, verifiable claims) and deflates it on deliberation tokens ("Wait", "Let", "Maybe") that drive multi-step search. We propose Anti-Self-Distillation (AntiSD), which ascends a divergence between student and teacher rather than descending it: this reverses the per-token sign and yields a naturally bounded advantage in one step. An entropy-triggered gate disables the term once the teacher entropy collapses, completing a drop-in replacement for default self-distillation. Across five models from 4B to 30B parameters on math reasoning benchmarks, AntiSD reaches the GRPO baseline's accuracy in 2x to 10x fewer training steps and improves final accuracy by up to 11.5 points. AntiSD opens a path to scalable self-improvement, where a language model bootstraps its own reasoning through its training signal.
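
To make the mechanics concrete, here is a minimal sketch of the three ingredients the abstract names: a per-token PMI, a sign-reversed signal, and an entropy gate. It is not the paper's implementation. The abstract gives no formulas, so the function names (token_pmi, entropy, anti_sd_signal), the PMI taken as a plain log-probability difference, the batch-mean gating rule, and the entropy_floor value are all assumptions made for illustration. The bounded form of the advantage is not specified in the abstract and is left out.

```python
import math


def token_pmi(teacher_logprobs, student_logprobs):
    """Per-token PMI between each sampled token and the privileged context.

    Assumed form: log p(token | prefix, privileged context) - log p(token | prefix),
    where the first term comes from the teacher (the model with privileged
    context) and the second from the student (the same model without it).
    """
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return sum(-p * math.log(p) for p in probs if p > 0.0)


def anti_sd_signal(pmi, teacher_entropies, entropy_floor=0.1):
    """Sign-reversed per-token signal with a hypothetical entropy gate.

    Default self-distillation would push the student toward tokens the
    teacher prefers (positive PMI). Here the sign is flipped, so tokens the
    privileged context deflates receive a positive signal instead.
    The gate is an assumption: the term is zeroed when the mean teacher
    entropy falls below entropy_floor.
    """
    mean_entropy = sum(teacher_entropies) / len(teacher_entropies)
    if mean_entropy < entropy_floor:
        return [0.0] * len(pmi)
    return [-x for x in pmi]


# Illustrative (made-up) log-probabilities for two tokens:
# a structural connective the teacher finds likely, and a deliberation
# token such as "Wait" that the teacher finds unlikely.
teacher_logprobs = [-0.1, -3.0]
student_logprobs = [-1.0, -0.5]

pmi = token_pmi(teacher_logprobs, student_logprobs)
signal = anti_sd_signal(pmi, teacher_entropies=[0.6, 0.7])
```

With these made-up numbers, the connective has positive PMI and receives a negative signal, while the deliberation token has negative PMI and receives a positive one, which matches the reversal described above. If the teacher entropies passed in were near zero, the gate would return all zeros.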