Training Language Models on Synthetic Edit Sequences Improves Code Synthesis
October 3, 2024
Authors: Ulyana Piterbarg, Lerrel Pinto, Rob Fergus
cs.AI
Abstract
Software engineers mainly write code by editing existing programs. In
contrast, large language models (LLMs) autoregressively synthesize programs in
a single pass. One explanation for this is the scarcity of open-sourced edit
data. While high-quality instruction data for code synthesis is already scarce,
high-quality edit data is even scarcer. To fill this gap, we develop a
synthetic data generation algorithm called LintSeq. This algorithm refactors
existing code into a sequence of code edits by using a linter to procedurally
sample across the error-free insertions that can be used to sequentially write
programs. It outputs edit sequences as text strings consisting of consecutive
program diffs. To test LintSeq, we use it to refactor a dataset of instruction
+ program pairs into instruction + program-diff-sequence tuples. Then, we
instruction finetune a series of smaller LLMs ranging from 2.6B to 14B
parameters on both the refactored and original versions of this dataset,
comparing zero-shot performance on code synthesis benchmarks. We show that
during repeated sampling, edit sequence finetuned models produce more diverse
programs than baselines. This results in better inference-time scaling for
benchmark coverage as a function of samples, i.e. the fraction of problems
solved by any attempt given "k" tries ("pass@k"). For example, on HumanEval
pass@50, small LLMs finetuned on synthetic edit sequences are competitive with
GPT-4 and outperform models finetuned on the baseline dataset by +20% (+/-3%)
in absolute score. Finally, we also pretrain our own tiny LMs for code
understanding. We show that finetuning tiny models on synthetic code edits
results in state-of-the-art code synthesis for the on-device model class. Our
150M parameter edit sequence LM matches or outperforms code models with twice
as many parameters, both with and without repeated sampling, including Codex
and AlphaCode.
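The abstract describes LintSeq only at a high level: it decomposes an existing program into a sequence of linter-verified insertions and renders the result as consecutive diffs. The sketch below is one plausible, minimal way to realize that description, not the authors' implementation. In particular, the contiguous-chunk deletions, the uniform sampling, and the use of Python's `compile` as a stand-in for a real linter are assumptions made here for illustration; a faithful version would use an actual linter (e.g. pyflakes) and the sampling procedure from the paper.

```python
"""Illustrative sketch of LintSeq-style edit-sequence generation (assumptions noted above)."""
import difflib
import random


def passes_lint(lines: list[str]) -> bool:
    """Stand-in 'linter': accept any program that at least parses.

    A real implementation would run an actual linter and also reject
    code with, e.g., undefined names, not just syntax errors.
    """
    try:
        compile("\n".join(lines), "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


def sample_edit_sequence(program: str, rng: random.Random) -> list[str]:
    """Decompose `program` into a sequence of unified diffs whose
    intermediate program states all pass the stand-in linter.

    Works backwards: repeatedly delete a random contiguous chunk of lines
    such that the remainder still lints, then replay the deletion
    trajectory in reverse so the edits read as error-free insertions.
    """
    states = [program.splitlines()]
    current = states[0]
    while current:
        # Enumerate contiguous deletions that leave a lint-clean program.
        candidates = []
        for i in range(len(current)):
            for j in range(i + 1, len(current) + 1):
                reduced = current[:i] + current[j:]
                if passes_lint(reduced):
                    candidates.append(reduced)
        if not candidates:
            break
        current = rng.choice(candidates)
        states.append(current)

    # Reverse the trajectory (empty file -> full program) and render each
    # consecutive pair of states as a unified diff text string.
    states.reverse()
    diffs = []
    for before, after in zip(states, states[1:]):
        diffs.append("\n".join(difflib.unified_diff(before, after, lineterm="")))
    return diffs


if __name__ == "__main__":
    src = "def add(a, b):\n    return a + b\n\nprint(add(1, 2))"
    for k, d in enumerate(sample_edit_sequence(src, random.Random(0))):
        print(f"--- edit {k} ---\n{d}\n")
```

Concatenating an instruction with such a diff sequence yields the kind of instruction + program-diff-sequence tuple the abstract says the refactored finetuning dataset consists of.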
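The abstract measures inference-time scaling via "pass@k", the fraction of benchmark problems solved by at least one of k sampled attempts. The abstract does not spell out an estimator; the snippet below assumes the standard unbiased estimator introduced with Codex (Chen et al., 2021), with illustrative function names.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem, given n samples of which
    c are correct: the probability that at least one of k samples drawn
    without replacement from the n is correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples exist, so some draw must be correct
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_coverage(results: list[tuple[int, int]], k: int) -> float:
    """Benchmark-level pass@k: average the per-problem estimates.
    `results` holds one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Under this metric, greater sample diversity raises the chance that at least one of the k attempts is correct, which is why the more diverse programs produced by edit-sequence finetuned models translate into better coverage at large k.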