

CodeFusion: A Pre-trained Diffusion Model for Code Generation

October 26, 2023
Authors: Mukul Singh, José Cambronero, Sumit Gulwani, Vu Le, Carina Negreanu, Gust Verbruggen
cs.AI

Abstract

Imagine a developer who can only change their last line of code: how often would they have to start writing a function from scratch before it is correct? Auto-regressive models for code generation from natural language have a similar limitation: they do not easily allow reconsidering earlier generated tokens. We introduce CodeFusion, a pre-trained diffusion code generation model that addresses this limitation by iteratively denoising a complete program conditioned on the encoded natural language. We evaluate CodeFusion on the task of natural language to code generation for Bash, Python, and Microsoft Excel conditional formatting (CF) rules. Experiments show that CodeFusion (75M parameters) performs on par with state-of-the-art auto-regressive systems (350M-175B parameters) in top-1 accuracy and outperforms them in top-3 and top-5 accuracy due to its better balance of diversity versus quality.
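The key idea in the abstract is that, unlike left-to-right auto-regressive decoding, a diffusion decoder refines the *entire* program embedding jointly at every step, so any position can be revised. The toy sketch below illustrates that iterative denoising loop in NumPy. Everything here is an assumption for illustration: `toy_denoiser` stands in for CodeFusion's learned decoder, the linear noise schedule is a simplification of real diffusion schedules, and the `condition` array stands in for the encoded natural-language utterance. This is a minimal sketch of the general technique, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50   # number of denoising steps (hypothetical)
L = 4    # toy program length in tokens
D = 8    # toy embedding dimension per token

# Stand-ins: the clean code embedding we hope to recover, and the
# encoded natural-language condition that guides denoising.
target = rng.normal(size=(L, D))
condition = target.copy()

def toy_denoiser(x_t, t, cond):
    """Toy 'model': predicts the clean embedding directly from the condition.
    A real diffusion decoder would be a learned network taking (x_t, t, cond)."""
    return cond

# Start from pure Gaussian noise over the WHOLE program at once.
x = rng.normal(size=(L, D))
for t in reversed(range(1, T + 1)):
    x0_hat = toy_denoiser(x, t, condition)      # predict clean program embedding
    noise_scale = (t - 1) / T                   # simplified linear schedule
    # Move to the prediction and re-inject scheduled noise; at t=1 the
    # noise vanishes and x becomes the model's final clean estimate.
    x = x0_hat + noise_scale * rng.normal(size=x.shape)

print(float(np.abs(x - target).max()))  # final estimate matches the clean embedding
```

The point of the loop structure is that every denoising step rewrites all `L` positions simultaneously, so "earlier tokens" are reconsidered for free; an auto-regressive decoder, by contrast, commits to each token as it emits it.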