The Valley of Code Reasoning: Scaling Knowledge Distillation of Large Language Models
October 7, 2025
Authors: Muyu He, Muhammad Ali Shafique, Anand Kumar, Tsach Mackey, Nazneen Rajani
cs.AI
Abstract
Distilling the thinking traces of a Large Language Model (LLM) with reasoning capabilities into a smaller model has been proven effective. Yet, there is a scarcity of work on how model performance scales with the quantity of distillation data. In this work, we study the scaling trend of distilling competitive coding skills into two small non-reasoning LLMs. We validate the hypothesis that there is a valley of code reasoning: downstream performance on competitive coding first drops as data quantity increases, then steadily improves in a sharper-than-log-linear fashion. Having identified this trend, we further fine-tune the models at two different distillation stages on the same data to ground conclusions about their respective learning phases. We find that, across stages in the low and medium-low data regimes, small models benefit significantly more from easier coding questions than from harder ones. We also find that, surprisingly, the correctness of outputs in the training data makes no difference to distillation outcomes. Our work represents a step forward in understanding the training dynamics of code reasoning distillation beyond intuition.
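
For reference, one illustrative reading of the scaling claim (not the authors' own formulation; the symbols below are assumptions for exposition): a log-linear fit models downstream performance P as an affine function of the logarithm of the number of distillation examples n,

    P(n) ≈ a + b · log n,

where a and b are hypothetical fitted coefficients. "Sharper-than-log-linear" growth then means that, once past the valley, the observed performance curve rises faster than any such fit, i.e. its slope with respect to log n exceeds the fitted slope b.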