CodeFusion: A Pre-trained Diffusion Model for Code Generation

Abstract

Imagine a developer who can only change their last line of code: how often would they have to start writing a function from scratch before it is correct? Auto-regressive models for code generation from natural language have a similar limitation: they cannot easily reconsider previously generated tokens. We introduce CodeFusion, a pre-trained diffusion code generation model that addresses this limitation by iteratively denoising a complete program conditioned on the encoded natural language. We evaluate CodeFusion on natural-language-to-code generation for Bash, Python, and Microsoft Excel conditional formatting (CF) rules. Experiments show that CodeFusion (75M parameters) performs on par with state-of-the-art auto-regressive systems (350M to 175B parameters) in top-1 accuracy and outperforms them in top-3 and top-5 accuracy, owing to a better balance between diversity and quality.
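To make the top-1, top-3, and top-5 figures concrete, the following is a minimal sketch of a top-k accuracy computation. It assumes a prediction counts as correct at k when any of the first k ranked candidates exactly matches the reference; the matching criterion actually used in the evaluation may differ. The function name `top_k_accuracy` and the toy Bash data are hypothetical and serve only as illustration.

```python
from typing import Sequence


def top_k_accuracy(
    candidates: Sequence[Sequence[str]],
    references: Sequence[str],
    k: int,
) -> float:
    """Fraction of examples whose reference appears among the first k candidates.

    Assumption: correctness is exact string match, which is a simplification.
    """
    if len(candidates) != len(references):
        raise ValueError("candidates and references must have the same length")
    if not references:
        return 0.0
    # Count an example as a hit if the reference is in the top-k slice.
    hits = sum(ref in ranked[:k] for ranked, ref in zip(candidates, references))
    return hits / len(references)


# Hypothetical toy data: ranked model outputs for two natural-language prompts.
ranked_outputs = [
    ["ls -la", "ls -l", "ls"],
    ["grep foo a.txt", "grep -r foo .", "cat a.txt"],
]
gold = ["ls -la", "grep -r foo ."]

top1 = top_k_accuracy(ranked_outputs, gold, k=1)
top3 = top_k_accuracy(ranked_outputs, gold, k=3)
```

In this toy case the second reference is only found at rank 2, so accuracy rises when k grows from 1 to 3. This is the mechanism by which a model with more diverse high-quality candidates can gain at top-3 and top-5 without leading at top-1.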
