

AST-T5: Structure-Aware Pretraining for Code Generation and Understanding

January 5, 2024
Authors: Linyuan Gong, Mostafa Elhoushi, Alvin Cheung
cs.AI

Abstract

Large language models (LLMs) have made significant advancements in code-related tasks, yet many LLMs treat code as simple sequences, neglecting its structured nature. We introduce AST-T5, a novel pretraining paradigm that leverages the Abstract Syntax Tree (AST) for enhanced code generation, transpilation, and understanding. Using dynamic programming, our AST-Aware Segmentation retains code structure, while our AST-Aware Span Corruption objective equips the model to reconstruct various code structures. Unlike other models, AST-T5 avoids intricate program analyses or architectural changes, so it integrates seamlessly with any encoder-decoder Transformer. Evaluations show that AST-T5 consistently outperforms similar-sized LMs across various code-related tasks. Structure-awareness makes AST-T5 particularly powerful in code-to-code tasks, surpassing CodeT5 by 2 points in exact match score for the Bugs2Fix task and by 3 points in exact match score for Java-C# Transpilation in CodeXGLUE. Our code and model are publicly available at https://github.com/gonglinyuan/ast_t5.
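To make the span-corruption idea concrete, below is a minimal, hypothetical sketch (not the authors' implementation) that uses Python's built-in ast module to mask an entire AST subtree with a T5-style sentinel token instead of an arbitrary contiguous token span. The function name ast_aware_span_corruption and the heuristic of masking the first non-definition statement are illustrative assumptions.

import ast

def ast_aware_span_corruption(code: str, sentinel: str = "<extra_id_0>"):
    """Mask one whole AST subtree and return (corrupted_code, masked_span).

    Illustrative approximation of AST-aware span corruption: the masked
    span is aligned to a complete syntactic unit (a statement) rather
    than an arbitrary run of tokens.
    """
    tree = ast.parse(code)
    for node in ast.walk(tree):
        # Skip definitions so we mask a statement *inside* a function,
        # leaving enough context for the model to reconstruct it.
        if isinstance(node, ast.stmt) and not isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            span = ast.get_source_segment(code, node)  # requires Python 3.8+
            if span is not None:
                return code.replace(span, sentinel, 1), span
    return code, ""

source = """
def add(a, b):
    result = a + b
    return result
"""

corrupted, target = ast_aware_span_corruption(source)
print(corrupted)  # body statement replaced by <extra_id_0>
print(target)     # "result = a + b", the span the decoder must reconstruct

A real pretraining pipeline would sample many subtrees per file and pair each sentinel with its target span, but the key property, masks that respect syntactic boundaries, is already visible here.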