Embarrassingly Simple Self-Distillation Improves Code Generation

Abstract

Can a large language model (LLM) improve at code generation using only its own raw outputs, without a verifier, a teacher model, or reinforcement learning? We answer in the affirmative with simple self-distillation (SSD): sample solutions from the model under specific temperature and truncation settings, then fine-tune the model on those samples with standard supervised fine-tuning. SSD improves Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6, with the gains concentrated on harder problems, and it generalizes across Qwen and Llama models at 4B, 8B, and 30B scale, including both instruct and thinking variants. To understand why such a simple method works, we trace these gains to a precision-exploration conflict in LLM decoding and show that SSD reshapes token distributions in a context-dependent way: it suppresses distractor tails where precision matters while preserving useful diversity where exploration matters. Taken together, SSD offers a complementary post-training direction for improving LLM code generation.
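The sampling step described above can be illustrated with a toy, single-position sketch. It is not the authors' implementation: the abstract does not name the truncation scheme, so top-k is assumed here, and the names `sampling_distribution`, `self_distill_targets`, and `toy_logits` are hypothetical. The sketch only shows that samples drawn under temperature scaling and truncation put zero mass on the truncated tail, so cross-entropy fine-tuning on such samples pulls probability away from low-ranked distractor tokens.

```python
import math
import random
from collections import Counter


def sampling_distribution(logits, temperature=1.0, top_k=None):
    """Token distribution after temperature scaling and top-k truncation.

    Assumption: top-k stands in for the unspecified "truncation" setting.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    keep = set(order if top_k is None else order[:top_k])
    peak = max(scaled[i] for i in keep)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) if i in keep else 0.0 for i, s in enumerate(scaled)]
    total = sum(weights)
    return [w / total for w in weights]


def self_distill_targets(logits, n_samples, temperature, top_k, seed=0):
    """Empirical distribution of tokens sampled from the model itself.

    Supervised fine-tuning on these samples minimizes cross-entropy against
    this distribution (toy view: one position, no actual model update).
    """
    rng = random.Random(seed)
    probs = sampling_distribution(logits, temperature, top_k)
    support = [i for i, p in enumerate(probs) if p > 0.0]
    draws = rng.choices(support, weights=[probs[i] for i in support], k=n_samples)
    counts = Counter(draws)
    return [counts[i] / n_samples for i in range(len(logits))]


# Hypothetical next-token logits: two plausible tokens and a tail of distractors.
toy_logits = [2.0, 1.5, -1.0, -1.5, -2.0]
base = sampling_distribution(toy_logits)  # untruncated, temperature 1.0
targets = self_distill_targets(toy_logits, n_samples=10_000, temperature=1.2, top_k=2)

print("base   :", [round(p, 3) for p in base])
print("targets:", [round(p, 3) for p in targets])
```

In this toy setting the base distribution keeps a small amount of mass on the three tail tokens, while the self-distillation targets keep none, and the two plausible tokens both remain in play. A real run would sample full solutions from an LLM and apply standard supervised fine-tuning to them.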

