AST-T5: Structure-Aware Pretraining for Code Generation and Understanding
January 5, 2024
Authors: Linyuan Gong, Mostafa Elhoushi, Alvin Cheung
cs.AI
Abstract
Large language models (LLMs) have made significant advancements in code-related tasks, yet many LLMs treat code as simple sequences, neglecting its structured nature. We introduce AST-T5, a novel pretraining paradigm that leverages the Abstract Syntax Tree (AST) for enhanced code generation, transpilation, and understanding. Using dynamic programming, our AST-Aware Segmentation retains code structure, while our AST-Aware Span Corruption objective equips the model to reconstruct various code structures. Unlike other models, AST-T5 avoids intricate program analyses or architectural changes, so it integrates seamlessly with any encoder-decoder Transformer. Evaluations show that AST-T5 consistently outperforms similar-sized LMs across various code-related tasks. Structure-awareness makes AST-T5 particularly powerful in code-to-code tasks, surpassing CodeT5 by 2 points in exact match score for the Bugs2Fix task and by 3 points in exact match score for Java-C# Transpilation in CodeXGLUE. Our code and model are publicly available at https://github.com/gonglinyuan/ast_t5.
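
The abstract only names the two pretraining components, so the following is a minimal sketch of what dynamic-programming-based, AST-aware segmentation might look like; it is not the paper's implementation. It splits source code into fixed-budget chunks while choosing split points that cut through as few AST subtrees as possible. The function names, the character-level (rather than token-level) budget, and the use of Python's standard `ast` module as the parser are all simplifying assumptions.

```python
import ast
from itertools import accumulate

def subtree_spans(source: str):
    """(start, end) character offsets of every positioned AST subtree in `source`."""
    line_start = [0]
    for line in source.splitlines(keepends=True):
        line_start.append(line_start[-1] + len(line))
    spans = []
    for node in ast.walk(ast.parse(source)):
        if hasattr(node, "lineno") and getattr(node, "end_lineno", None) is not None:
            spans.append((line_start[node.lineno - 1] + node.col_offset,
                          line_start[node.end_lineno - 1] + node.end_col_offset))
    return spans

def ast_aware_segmentation(source: str, max_chars: int):
    """Split `source` into chunks of <= max_chars characters, picking split
    points via dynamic programming so that as few AST subtrees as possible
    are broken across chunk boundaries."""
    n = len(source)
    # cut_cost[p]: number of subtree spans that a split at offset p would break
    # (computed with a difference array + prefix sum).
    diff = [0] * (n + 2)
    for begin, end in subtree_spans(source):
        diff[begin + 1] += 1
        diff[end] -= 1
    cut_cost = list(accumulate(diff))[: n + 1]

    INF = float("inf")
    best = [INF] * (n + 1)   # best[i]: min total breakage to segment source[:i]
    back = [0] * (n + 1)
    best[0] = 0
    for i in range(1, n + 1):
        penalty = cut_cost[i] if i < n else 0  # the final boundary is free
        for j in range(max(0, i - max_chars), i):
            if best[j] + penalty < best[i]:
                best[i] = best[j] + penalty
                back[i] = j

    # Recover chunk boundaries by backtracking through the DP table.
    bounds, i = [], n
    while i > 0:
        bounds.append(i)
        i = back[i]
    bounds = [0] + bounds[::-1]
    return [source[a:b] for a, b in zip(bounds, bounds[1:])]

code = "def add(a, b):\n    return a + b\n\ndef sub(a, b):\n    return a - b\n"
for chunk in ast_aware_segmentation(code, max_chars=40):
    print(repr(chunk))
```

With the toy input above, the zero-cost split falls on the blank line between the two function definitions, illustrating how the DP prefers boundaries that respect subtree extents.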
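Similarly, the AST-aware span corruption objective can be pictured as ordinary T5 span corruption where each masked span is a whole AST subtree rather than a random token run. Below is another minimal, self-contained sketch under the same assumptions as above (Python's `ast` module as the parser, hypothetical helper names); the `<extra_id_*>` sentinels follow T5's standard denoising format.

```python
import ast
import random

def subtree_spans(source: str):
    """Same helper as in the previous sketch: AST subtree character spans."""
    line_start = [0]
    for line in source.splitlines(keepends=True):
        line_start.append(line_start[-1] + len(line))
    spans = []
    for node in ast.walk(ast.parse(source)):
        if hasattr(node, "lineno") and getattr(node, "end_lineno", None) is not None:
            spans.append((line_start[node.lineno - 1] + node.col_offset,
                          line_start[node.end_lineno - 1] + node.end_col_offset))
    return spans

def ast_aware_span_corruption(source: str, rng: random.Random, n_spans: int = 1):
    """Mask whole AST subtrees with T5 sentinel tokens.
    Returns (corrupted encoder input, decoder reconstruction target)."""
    chosen = sorted(rng.sample(subtree_spans(source), n_spans))
    # Drop overlapping/nested spans so each character is masked at most once.
    kept, last_end = [], 0
    for begin, end in chosen:
        if begin >= last_end:
            kept.append((begin, end))
            last_end = end
    corrupted, target, cursor = [], [], 0
    for k, (begin, end) in enumerate(kept):
        corrupted.append(source[cursor:begin] + f"<extra_id_{k}>")
        target.append(f"<extra_id_{k}>" + source[begin:end])
        cursor = end
    corrupted.append(source[cursor:])
    target.append(f"<extra_id_{len(kept)}>")
    return "".join(corrupted), "".join(target)

code = "def add(a, b):\n    return a + b\n"
src, tgt = ast_aware_span_corruption(code, random.Random(0), n_spans=1)
print(src)
print(tgt)
```

Because every masked span aligns with a subtree boundary, the model is always asked to reconstruct a syntactically complete unit (an expression, statement, or definition), which is the intuition the abstract attributes to this objective.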