Embarrassingly Simple Self-Distillation Improves Code Generation
April 1, 2026
Authors: Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, Yizhe Zhang
cs.AI
Abstract
Can a large language model (LLM) improve at code generation using only its own raw outputs, without a verifier, a teacher model, or reinforcement learning? We answer in the affirmative with simple self-distillation (SSD): sample solutions from the model with certain temperature and truncation configurations, then fine-tune on those samples with standard supervised fine-tuning. SSD improves Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6, with gains concentrated on harder problems, and it generalizes across Qwen and Llama models at 4B, 8B, and 30B scales, including both instruct and thinking variants. To understand why such a simple method can work, we trace these gains to a precision-exploration conflict in LLM decoding and show that SSD reshapes token distributions in a context-dependent way, suppressing distractor tails where precision matters while preserving useful diversity where exploration matters. Taken together, SSD offers a complementary post-training direction for improving LLM code generation.
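The abstract attributes SSD's effect to sampling with particular temperature and truncation settings that suppress distractor tails of the token distribution. As a minimal sketch of that mechanism, the helper below applies temperature scaling followed by nucleus (top-p) truncation to a toy logit vector; the function name and the specific parameter values are illustrative assumptions, not the paper's reported configuration.

```python
import math

def reshape_logits(logits, temperature=0.6, top_p=0.95):
    """Temperature-scale logits, then apply nucleus (top-p) truncation.

    Hypothetical helper illustrating the kind of sampling configuration the
    abstract refers to; the paper's exact settings are not reproduced here.
    Tokens outside the nucleus get probability 0, and the rest renormalize.
    """
    # Sharpen (temperature < 1) or flatten (temperature > 1) the distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]

    # Keep the smallest high-probability prefix whose mass reaches top_p;
    # everything else is the "distractor tail" that gets truncated.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= top_p:
            break

    truncated = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    s = sum(truncated)
    return [p / s for p in truncated]
```

In a full SSD-style loop, one would sample solutions token by token from such a reshaped distribution and then run ordinary supervised fine-tuning on the sampled outputs.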