CodeFusion: A Pre-trained Diffusion Model for Code Generation
October 26, 2023
Authors: Mukul Singh, José Cambronero, Sumit Gulwani, Vu Le, Carina Negreanu, Gust Verbruggen
cs.AI
Abstract
Imagine a developer who can only change their last line of code: how often would they have to start writing a function from scratch before getting it right? Auto-regressive models for code generation from natural language have a similar limitation: they do not easily allow reconsidering tokens generated earlier. We introduce CodeFusion, a pre-trained diffusion code generation model that addresses this limitation by iteratively denoising a complete program conditioned on the encoded natural language. We evaluate CodeFusion on the task of natural language to code generation for Bash, Python, and Microsoft Excel conditional formatting (CF) rules. Experiments show that CodeFusion (75M parameters) performs on par with state-of-the-art auto-regressive systems (350M-175B parameters) in top-1 accuracy and outperforms them in top-3 and top-5 accuracy, thanks to its better balance of diversity and quality.
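The key contrast with left-to-right decoding is that a diffusion model revises the entire program representation at every step. The sketch below illustrates that generic conditional-denoising loop in NumPy; `denoise_step` is a hypothetical stand-in for a learned denoiser (CodeFusion's actual architecture is not shown here), and the dimensions and step count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(x_t, nl_encoding, t):
    # Hypothetical denoiser: nudges the whole noisy embedding toward the
    # natural-language conditioning vector. A real model would instead
    # apply a learned network that predicts the clean program embedding.
    return x_t + 0.5 * (nl_encoding - x_t)

def generate(nl_encoding, steps=10, dim=8):
    # Start from pure Gaussian noise over the *entire* program embedding,
    # then refine every position at every step. Unlike auto-regressive
    # decoding, no token is ever frozen once emitted.
    x = rng.standard_normal(dim)
    for t in reversed(range(steps)):
        x = denoise_step(x, nl_encoding, t)
    return x

# Toy conditioning vector standing in for an encoded NL query.
cond = np.ones(8)
program_embedding = generate(cond)
```

Because each step rewrites all positions jointly, early mistakes can be corrected later, which is what allows sampling several high-quality yet diverse candidates (the top-3/top-5 advantage reported above).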