Training Language Models on Synthetic Edit Sequences Improves Code Synthesis
October 3, 2024
Authors: Ulyana Piterbarg, Lerrel Pinto, Rob Fergus
cs.AI
Abstract
Software engineers mainly write code by editing existing programs. In
contrast, large language models (LLMs) autoregressively synthesize programs in
a single pass. One explanation for this is the scarcity of open-sourced edit
data. While high-quality instruction data for code synthesis is already scarce,
high-quality edit data is even scarcer. To fill this gap, we develop a
synthetic data generation algorithm called LintSeq. This algorithm refactors
existing code into a sequence of code edits by using a linter to procedurally
sample across the error-free insertions that can be used to sequentially write
programs. It outputs edit sequences as text strings consisting of consecutive
program diffs. To test LintSeq, we use it to refactor a dataset of instruction
+ program pairs into instruction + program-diff-sequence tuples. Then, we
instruction finetune a series of smaller LLMs ranging from 2.6B to 14B
parameters on both the refactored and original versions of this dataset,
comparing zero-shot performance on code synthesis benchmarks. We show that
during repeated sampling, edit sequence finetuned models produce more diverse
programs than baselines. This results in better inference-time scaling for
benchmark coverage as a function of samples, i.e. the fraction of problems
"pass@k" solved by any attempt given "k" tries. For example, on HumanEval
pass@50, small LLMs finetuned on synthetic edit sequences are competitive with
GPT-4 and outperform models finetuned on the baseline dataset by +20% (+/-3%)
in absolute score. Finally, we also pretrain our own tiny LMs for code
understanding. We show that finetuning tiny models on synthetic code edits
results in state-of-the-art code synthesis for the on-device model class. Our
150M parameter edit sequence LM matches or outperforms code models with twice
as many parameters, both with and without repeated sampling, including Codex
and AlphaCode.
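
To make the sampling idea in the abstract concrete, the sketch below shows one way a linter-guided backward pass could decompose a finished program into a sequence of error-free insertion diffs. It is a minimal illustration under stated assumptions, not the authors' LintSeq implementation: the `lint_ok` helper, the random deletion heuristic, and the unified-diff serialization are choices made for this example.

```python
# Minimal sketch of linter-guided edit-sequence generation (not the official
# LintSeq code). Working backwards from a complete program, repeatedly delete
# a small chunk of lines whose removal keeps the program lint-clean, then
# reverse the trajectory so every step is an error-free insertion, serialized
# as a unified diff.
import ast
import difflib
import random


def lint_ok(source: str) -> bool:
    """Placeholder lint check: accept any source that still parses.

    A real pipeline would invoke an actual linter here.
    """
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def sample_edit_sequence(program: str, rng: random.Random) -> list[str]:
    """Decompose `program` into a list of consecutive insertion diffs."""
    states = [program]
    lines = program.splitlines()
    while lines:
        # Try a few random deletions until one keeps the program lint-clean.
        for _ in range(10):
            i = rng.randrange(len(lines))
            j = min(len(lines), i + rng.randint(1, 3))
            candidate = lines[:i] + lines[j:]
            if lint_ok("\n".join(candidate)):
                lines = candidate
                break
        else:
            break  # no lint-clean deletion found; stop early
        states.append("\n".join(lines))

    states.reverse()  # smallest program first, full program last
    diffs = []
    for prev, curr in zip(states, states[1:]):
        diff = difflib.unified_diff(
            prev.splitlines(), curr.splitlines(), lineterm=""
        )
        diffs.append("\n".join(diff))
    return diffs
```

An instruction-tuning example would then pair the natural-language instruction with the concatenated diff sequence rather than with the final program, which is the refactoring of instruction + program pairs the abstract describes.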
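
The "pass@k" coverage metric referenced above is conventionally computed with the unbiased estimator from the Codex evaluation (Chen et al., 2021), averaged over benchmark problems. A small helper (the function name is ours) looks like this:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k samples
    drawn from n generations (of which c are correct) solves the problem."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The HumanEval pass@50 figure quoted in the abstract corresponds to this estimate averaged over the benchmark's problems with k = 50.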