AST-T5: Structure-Aware Pretraining for Code Generation and Understanding
January 5, 2024
Authors: Linyuan Gong, Mostafa Elhoushi, Alvin Cheung
cs.AI
Abstract
Large language models (LLMs) have made significant advancements in code-related tasks, yet many LLMs treat code as simple sequences, neglecting its structured nature. We introduce AST-T5, a novel pretraining paradigm that leverages the Abstract Syntax Tree (AST) for enhanced code generation, transpilation, and understanding. Using dynamic programming, our AST-Aware Segmentation retains code structure, while our AST-Aware Span Corruption objective equips the model to reconstruct various code structures. Unlike other models, AST-T5 avoids intricate program analyses or architectural changes, so it integrates seamlessly with any encoder-decoder Transformer. Evaluations show that AST-T5 consistently outperforms similar-sized LMs across various code-related tasks. Structure-awareness makes AST-T5 particularly powerful in code-to-code tasks, surpassing CodeT5 by 2 points in exact match score for the Bugs2Fix task and by 3 points in exact match score for Java-C# Transpilation in CodeXGLUE. Our code and model are publicly available at https://github.com/gonglinyuan/ast_t5.
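
The abstract only names the two pretraining components, so the following is a minimal sketch of what dynamic-programming-based, AST-aware segmentation might look like; it is not the paper's implementation. It splits source code into fixed-budget chunks while choosing split points that cut through as few AST subtrees as possible. The function names, the character-level (rather than token-level) budget, and the use of Python's standard `ast` module as the parser are all simplifying assumptions.

```python
import ast
from itertools import accumulate

def subtree_spans(source: str):
    """(start, end) character offsets of every positioned AST subtree in `source`."""
    line_start = [0]
    for line in source.splitlines(keepends=True):
        line_start.append(line_start[-1] + len(line))
    spans = []
    for node in ast.walk(ast.parse(source)):
        if hasattr(node, "lineno") and getattr(node, "end_lineno", None) is not None:
            spans.append((line_start[node.lineno - 1] + node.col_offset,
                          line_start[node.end_lineno - 1] + node.end_col_offset))
    return spans

def ast_aware_segmentation(source: str, max_chars: int):
    """Split `source` into chunks of <= max_chars characters, picking split
    points via dynamic programming so that as few AST subtrees as possible
    are broken across chunk boundaries."""
    n = len(source)
    # cut_cost[p]: number of subtree spans that a split at offset p would break
    # (computed with a difference array + prefix sum).
    diff = [0] * (n + 2)
    for begin, end in subtree_spans(source):
        diff[begin + 1] += 1
        diff[end] -= 1
    cut_cost = list(accumulate(diff))[: n + 1]

    INF = float("inf")
    best = [INF] * (n + 1)   # best[i]: min total breakage to segment source[:i]
    back = [0] * (n + 1)
    best[0] = 0
    for i in range(1, n + 1):
        penalty = cut_cost[i] if i < n else 0  # the final boundary is free
        for j in range(max(0, i - max_chars), i):
            if best[j] + penalty < best[i]:
                best[i] = best[j] + penalty
                back[i] = j

    # Recover chunk boundaries by backtracking through the DP table.
    bounds, i = [], n
    while i > 0:
        bounds.append(i)
        i = back[i]
    bounds = [0] + bounds[::-1]
    return [source[a:b] for a, b in zip(bounds, bounds[1:])]

code = "def add(a, b):\n    return a + b\n\ndef sub(a, b):\n    return a - b\n"
for chunk in ast_aware_segmentation(code, max_chars=40):
    print(repr(chunk))
```

With the toy input above, the zero-cost split falls on the blank line between the two function definitions, illustrating how the DP prefers boundaries that respect subtree extents.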
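Similarly, the AST-aware span corruption objective can be pictured as ordinary T5 span corruption where each masked span is a whole AST subtree rather than a random token run. Below is another minimal, self-contained sketch under the same assumptions as above (Python's `ast` module as the parser, hypothetical helper names); the `<extra_id_*>` sentinels follow T5's standard denoising format.

```python
import ast
import random

def subtree_spans(source: str):
    """Same helper as in the previous sketch: AST subtree character spans."""
    line_start = [0]
    for line in source.splitlines(keepends=True):
        line_start.append(line_start[-1] + len(line))
    spans = []
    for node in ast.walk(ast.parse(source)):
        if hasattr(node, "lineno") and getattr(node, "end_lineno", None) is not None:
            spans.append((line_start[node.lineno - 1] + node.col_offset,
                          line_start[node.end_lineno - 1] + node.end_col_offset))
    return spans

def ast_aware_span_corruption(source: str, rng: random.Random, n_spans: int = 1):
    """Mask whole AST subtrees with T5 sentinel tokens.
    Returns (corrupted encoder input, decoder reconstruction target)."""
    chosen = sorted(rng.sample(subtree_spans(source), n_spans))
    # Drop overlapping/nested spans so each character is masked at most once.
    kept, last_end = [], 0
    for begin, end in chosen:
        if begin >= last_end:
            kept.append((begin, end))
            last_end = end
    corrupted, target, cursor = [], [], 0
    for k, (begin, end) in enumerate(kept):
        corrupted.append(source[cursor:begin] + f"<extra_id_{k}>")
        target.append(f"<extra_id_{k}>" + source[begin:end])
        cursor = end
    corrupted.append(source[cursor:])
    target.append(f"<extra_id_{len(kept)}>")
    return "".join(corrupted), "".join(target)

code = "def add(a, b):\n    return a + b\n"
src, tgt = ast_aware_span_corruption(code, random.Random(0), n_spans=1)
print(src)
print(tgt)
```

Because every masked span aligns with a subtree boundary, the model is always asked to reconstruct a syntactically complete unit (an expression, statement, or definition), which is the intuition the abstract attributes to this objective.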