AST-T5: Structure-Aware Pretraining for Code Generation and Understanding

Abstract

Large language models (LLMs) have made significant advances in code-related tasks, yet many treat code as a simple token sequence and neglect its structured nature. We introduce AST-T5, a novel pretraining paradigm that leverages the Abstract Syntax Tree (AST) for enhanced code generation, transpilation, and understanding. Our AST-Aware Segmentation uses dynamic programming to retain code structure, while our AST-Aware Span Corruption objective equips the model to reconstruct various code structures. Unlike other models, AST-T5 avoids intricate program analyses or architectural changes, so it integrates seamlessly with any encoder-decoder Transformer. Evaluations show that AST-T5 consistently outperforms similar-sized LMs across various code-related tasks. Structure-awareness makes AST-T5 particularly powerful in code-to-code tasks: it surpasses CodeT5 by 2 points in exact match score on the Bugs2Fix task and by 3 points in exact match score on Java-C# Transpilation in CodeXGLUE. Our code and model are publicly available at https://github.com/gonglinyuan/ast_t5.
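To make the idea of masking along AST boundaries more concrete, the following is a minimal sketch and not the AST-T5 implementation. It assumes Python source, uses the standard-library `ast` module as a stand-in parser, and borrows the T5-style sentinel string `<extra_id_0>`. The helper names `subtree_spans` and `mask_subtree` are hypothetical and chosen only for illustration. It shows a single masked span that coincides with one complete subtree, rather than an arbitrary token range.

```python
import ast


def subtree_spans(source: str):
    """Return sorted (start, end) offsets of every AST node with position info.

    Assumption: the source is ASCII, so the byte-based column offsets
    reported by `ast` coincide with character offsets.
    """
    tree = ast.parse(source)
    # Absolute offset at which each line starts, to convert (line, col) pairs.
    line_starts = [0]
    for line in source.splitlines(keepends=True):
        line_starts.append(line_starts[-1] + len(line))
    spans = set()
    for node in ast.walk(tree):
        # Nodes such as Module or Load carry no position and are skipped.
        if hasattr(node, "lineno") and hasattr(node, "end_lineno"):
            start = line_starts[node.lineno - 1] + node.col_offset
            end = line_starts[node.end_lineno - 1] + node.end_col_offset
            spans.add((start, end))
    return sorted(spans)


def mask_subtree(source: str, span, sentinel: str = "<extra_id_0>"):
    """Replace one subtree span with a sentinel; return (corrupted input, target)."""
    start, end = span
    corrupted = source[:start] + sentinel + source[end:]
    target = sentinel + source[start:end]
    return corrupted, target


code = "def add(a, b):\n    return a + b\n"
spans = subtree_spans(code)
# Pick the span that covers the binary expression `a + b`.
span = next(s for s in spans if code[s[0]:s[1]] == "a + b")
corrupted, target = mask_subtree(code, span)
print(corrupted)
print(target)
```

In this sketch the masked region is always a syntactically complete unit (here an expression), which is the property the abstract attributes to AST-Aware Span Corruption. How the actual method samples subtrees and sizes spans is not specified in the abstract and is not modeled here.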
