CodeT5+: Open Code Large Language Models for Code Understanding and Generation

Abstract

Large language models (LLMs) pretrained on vast amounts of source code have made prominent progress in code intelligence. However, existing code LLMs have two main limitations, one in architecture and one in pretraining tasks. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks. The former paradigm is inflexible in applications, while in the latter the model is treated as a single system for all tasks, which leads to suboptimal performance on a subset of them. Second, they often employ a limited set of pretraining objectives that may not be relevant to some downstream tasks and hence cause substantial performance degradation. To address these limitations, we propose "CodeT5+", a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks. This flexibility is enabled by our proposed mixture of pretraining objectives, which mitigates the pretrain-finetune discrepancy. These objectives cover span denoising, contrastive learning, text-code matching, and causal LM pretraining tasks, on both unimodal and bimodal multilingual code corpora. Furthermore, we propose to initialize CodeT5+ with frozen off-the-shelf LLMs, rather than training from scratch, to scale up our models efficiently, and we explore instruction-tuning to align the models with natural language instructions. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning. We observe state-of-the-art (SoTA) model performance on various code-related tasks, such as code generation and completion, math programming, and text-to-code retrieval. In particular, our instruction-tuned CodeT5+ 16B achieves new SoTA results on the HumanEval code generation task against other open code LLMs.
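To make the first of the listed objectives concrete, the following is a minimal sketch of span denoising on a tokenized code snippet. It is an illustration under stated assumptions, not the CodeT5+ implementation: the abstract only names the objective, so the `<extra_id_N>` sentinel format (borrowed from the T5 convention), the whitespace tokenization, the hypothetical helper `span_denoise`, and the hand-picked spans are all assumptions. In actual pretraining the spans would be sampled randomly.

```python
from typing import List, Sequence, Tuple


def span_denoise(
    tokens: Sequence[str], spans: Sequence[Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Mask the given [start, end) spans in the T5 style (assumed format).

    Returns (corrupted_input, target). Each masked span becomes a single
    sentinel in the input; the target lists each sentinel followed by the
    tokens it hides and ends with a closing sentinel.
    """
    source: List[str] = []
    target: List[str] = []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(sorted(spans)):
        if not (cursor <= start < end <= len(tokens)):
            raise ValueError("spans must be non-empty, in range, and non-overlapping")
        sentinel = f"<extra_id_{sentinel_id}>"
        source.extend(tokens[cursor:start])  # keep the visible tokens
        source.append(sentinel)              # one sentinel per hidden span
        target.append(sentinel)
        target.extend(tokens[start:end])     # tokens the decoder must recover
        cursor = end
    source.extend(tokens[cursor:])           # visible tail after the last span
    target.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return source, target


# Hypothetical example: hide the function name and the returned expression.
code_tokens = "def add ( a , b ) : return a + b".split()
corrupted, target = span_denoise(code_tokens, [(1, 2), (9, 12)])
print(" ".join(corrupted))  # def <extra_id_0> ( a , b ) : return <extra_id_1>
print(" ".join(target))     # <extra_id_0> add <extra_id_1> a + b <extra_id_2>
```

In this setup the encoder reads the corrupted sequence and the decoder is trained to emit the target, so both components are exercised by the same objective.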
