Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

Abstract

On-policy self-distillation, where a student is pulled toward a copy of itself conditioned on privileged context (e.g., a verified solution or feedback), offers a promising direction for advancing reasoning capability without a stronger external teacher. Yet in math reasoning the gains are inconsistent, even when the same approach succeeds elsewhere. A pointwise mutual information analysis traces the failure to the privileged context itself: it inflates the teacher's confidence on tokens already implied by the solution (structural connectives, verifiable claims) and deflates it on deliberation tokens ("Wait", "Let", "Maybe") that drive multi-step search. We propose Anti-Self-Distillation (AntiSD), which ascends a divergence between student and teacher rather than descending it: this reverses the per-token sign and yields a naturally bounded advantage in one step. An entropy-triggered gate disables the term once the teacher entropy collapses, completing a drop-in replacement for default self-distillation. Across five models from 4B to 30B parameters on math reasoning benchmarks, AntiSD reaches the GRPO baseline's accuracy in 2x to 10x fewer training steps and improves final accuracy by up to 11.5 points. AntiSD opens a path to scalable self-improvement, where a language model bootstraps its own reasoning through its training signal.
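
The sign reversal and the entropy gate described above can be illustrated with a small toy sketch. It is not the paper's implementation: the abstract does not give the exact divergence or the form of the bounded advantage, so the sketch assumes the per-token signal is the log-ratio between the teacher (conditioned on privileged context) and the student, which is one common way to write a pointwise mutual information. The function names, the three-token vocabulary, the logits, and the `entropy_floor` threshold are all hypothetical, and the bounding step is omitted.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def entropy(log_probs):
    """Shannon entropy (in nats) of a distribution given as log-probabilities."""
    return -sum(math.exp(lp) * lp for lp in log_probs)


def token_pmi(student_logits, teacher_logits, token_id):
    """Assumed PMI form: log p_teacher(token) - log p_student(token).

    The teacher is taken to be the same model with privileged context,
    so a positive value means the context raised the token's probability.
    """
    teacher_lp = log_softmax(teacher_logits)[token_id]
    student_lp = log_softmax(student_logits)[token_id]
    return teacher_lp - student_lp


def sd_signal(student_logits, teacher_logits, token_id):
    """Default self-distillation direction (assumed): reward tokens the teacher prefers."""
    return token_pmi(student_logits, teacher_logits, token_id)


def antisd_signal(student_logits, teacher_logits, token_id, entropy_floor=0.1):
    """Reversed per-token sign, switched off when teacher entropy collapses.

    entropy_floor is a hypothetical threshold, not a value from the paper.
    """
    if entropy(log_softmax(teacher_logits)) < entropy_floor:
        return 0.0  # gate closed: the term contributes nothing
    return -token_pmi(student_logits, teacher_logits, token_id)


# Hypothetical toy vocabulary: a structural connective, a deliberation token, filler.
VOCAB = ["Therefore", "Wait", "x"]
student = [1.0, 1.0, 0.0]             # student is undecided between the first two
teacher = [2.0, 0.0, 0.0]             # privileged context favors "Therefore"
collapsed_teacher = [20.0, 0.0, 0.0]  # near-deterministic teacher

for token_id in (0, 1):
    print(
        VOCAB[token_id],
        round(sd_signal(student, teacher, token_id), 3),
        round(antisd_signal(student, teacher, token_id), 3),
    )
print("gated:", antisd_signal(student, collapsed_teacher, 1))
```

In this toy setup the deliberation token gets a negative signal under the default direction and a positive one once the sign is flipped, while the collapsed teacher zeroes the term. Whether the real method gates on the same quantity or per token is an assumption here.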