Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
May 12, 2026
Authors: Guobin Shen, Xiang Cheng, Chenxiao Zhao, Lei Huang, Jindong Li, Dongcheng Zhao, Xing Yu
cs.AI
Abstract
On-policy self-distillation, where a student is pulled toward a copy of itself conditioned on privileged context (e.g., a verified solution or feedback), offers a promising direction for advancing reasoning capability without a stronger external teacher. Yet in math reasoning the gains are inconsistent, even when the same approach succeeds elsewhere. A pointwise mutual information analysis traces the failure to the privileged context itself: it inflates the teacher's confidence on tokens already implied by the solution (structural connectives, verifiable claims) and deflates it on deliberation tokens ("Wait", "Let", "Maybe") that drive multi-step search. We propose Anti-Self-Distillation (AntiSD), which ascends a divergence between student and teacher rather than descending it: this reverses the per-token sign and yields a naturally bounded advantage in one step. An entropy-triggered gate disables the term once the teacher entropy collapses, completing a drop-in replacement for default self-distillation. Across five models from 4B to 30B parameters on math reasoning benchmarks, AntiSD reaches the GRPO baseline's accuracy in 2 to 10x fewer training steps and improves final accuracy by up to 11.5 points. AntiSD opens a path to scalable self-improvement, where a language model bootstraps its own reasoning through its training signal.
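The per-token mechanics described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `entropy_floor` threshold, and the explicit `clip` bound are all hypothetical stand-ins chosen for illustration (the abstract says the bound arises naturally from the objective, and does not give its form). The sketch assumes the pointwise mutual information of a sampled token is the teacher's log-probability (conditioned on the privileged context) minus the student's log-probability, and that AntiSD uses the negated value as a per-token advantage, gated off when the teacher's entropy is very low.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def entropy(log_probs):
    """Shannon entropy (in nats) of a distribution given as log-probabilities."""
    return -sum(math.exp(lp) * lp for lp in log_probs)


def token_pmi(student_logits, teacher_logits, token_id):
    """Assumed PMI form: log p_teacher(token | prefix, privileged context)
    minus log p_student(token | prefix)."""
    teacher_lp = log_softmax(teacher_logits)[token_id]
    student_lp = log_softmax(student_logits)[token_id]
    return teacher_lp - student_lp


def anti_sd_advantage(student_logits, teacher_logits, token_id,
                      entropy_floor=0.1, clip=1.0):
    """Hypothetical AntiSD-style per-token advantage.

    Default self-distillation would reward tokens the teacher favors (+PMI).
    Here the sign is reversed (-PMI). The clip is an illustrative substitute
    for the paper's naturally bounded advantage, and entropy_floor is an
    assumed threshold for the entropy-triggered gate.
    """
    teacher_lp = log_softmax(teacher_logits)
    if entropy(teacher_lp) < entropy_floor:
        return 0.0  # gate: teacher has collapsed, so the term is disabled
    pmi = token_pmi(student_logits, teacher_logits, token_id)
    return max(-clip, min(clip, -pmi))


if __name__ == "__main__":
    student = [0.0, 0.0, 0.0]   # student is uniform over a toy 3-token vocabulary
    teacher = [2.0, 0.0, 0.0]   # privileged context inflates token 0
    for tok in range(3):
        print(tok, token_pmi(student, teacher, tok),
              anti_sd_advantage(student, teacher, tok))
```

In this toy setup, token 0 plays the role of a token already implied by the solution (positive PMI, so a negative advantage under the reversed sign), while tokens 1 and 2 play the role of deliberation tokens the teacher suppresses (negative PMI, so a positive advantage).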