CodeFusion: A Pre-trained Diffusion Model for Code Generation

Abstract

Imagine a developer who can only change their last line of code: how often would they have to start writing a function from scratch before it is correct? Auto-regressive models for code generation from natural language have a similar limitation: they do not easily allow reconsidering tokens generated earlier. We introduce CodeFusion, a pre-trained diffusion code generation model that addresses this limitation by iteratively denoising a complete program conditioned on the encoded natural language. We evaluate CodeFusion on the task of natural language to code generation for Bash, Python, and Microsoft Excel conditional formatting (CF) rules. Experiments show that CodeFusion (75M parameters) performs on par with state-of-the-art auto-regressive systems (350M to 175B parameters) in top-1 accuracy and outperforms them in top-3 and top-5 accuracy, owing to its better balance between diversity and quality.
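To make the top-1, top-3, and top-5 figures concrete, the following is a minimal sketch of how a top-k accuracy could be computed. It assumes a task counts as solved at k if any of the model's first k ranked candidates passes a correctness check; the function name `top_k_accuracy`, the `is_correct` hook, and the default whitespace-insensitive exact match are illustrative assumptions, not the evaluation procedure used for CodeFusion.

```python
from typing import Callable, Sequence


def _exact_match(candidate: str, reference: str) -> bool:
    """Hypothetical default check: exact match after trimming whitespace."""
    return candidate.strip() == reference.strip()


def top_k_accuracy(
    ranked_candidates: Sequence[Sequence[str]],
    references: Sequence[str],
    k: int,
    is_correct: Callable[[str, str], bool] = _exact_match,
) -> float:
    """Fraction of tasks where at least one of the first k candidates is correct.

    ranked_candidates[i] holds the model's candidates for task i, best first.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(ranked_candidates) != len(references):
        raise ValueError("need exactly one candidate list per reference")
    if not references:
        return 0.0
    hits = sum(
        any(is_correct(candidate, reference) for candidate in candidates[:k])
        for candidates, reference in zip(ranked_candidates, references)
    )
    return hits / len(references)


# Toy data (made up for illustration): three tasks, two candidates each.
predictions = [
    ["ls -la", "ls -l"],               # correct at rank 1
    ["grep foo", "grep -r foo ."],     # correct only at rank 2
    ["echo hi", "echo hello"],         # never correct
]
targets = ["ls -la", "grep -r foo .", "printf hi"]

top1 = top_k_accuracy(predictions, targets, k=1)
top2 = top_k_accuracy(predictions, targets, k=2)
```

Under this definition, a model whose candidates are more diverse can gain more from larger k, because later candidates are less likely to repeat the same mistake as the first one.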
