CodeT5+: Open Code Large Language Models for Code Understanding and Generation
May 13, 2023
Authors: Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi D. Q. Bui, Junnan Li, Steven C. H. Hoi
cs.AI
Abstract
Large language models (LLMs) pretrained on vast source code have achieved
prominent progress in code intelligence. However, existing code LLMs have two
main limitations in terms of architecture and pretraining tasks. First, they
often adopt a specific architecture (encoder-only or decoder-only) or rely on a
unified encoder-decoder network for different downstream tasks. The former
paradigm is limited by its inflexibility in applications, while in the latter
the model is treated as a single system for all tasks, leading to suboptimal
performance on a subset of tasks. Second, they often employ a limited set of
pretraining objectives which might not be relevant to some downstream tasks and
hence result in substantial performance degradation. To address these limitations,
we propose "CodeT5+", a family of encoder-decoder LLMs for code in which
component modules can be flexibly combined to suit a wide range of downstream
code tasks. Such flexibility is enabled by our proposed mixture of pretraining
objectives to mitigate the pretrain-finetune discrepancy. These objectives
cover span denoising, contrastive learning, text-code matching, and causal LM
pretraining tasks, on both unimodal and bimodal multilingual code corpora.
Furthermore, we propose to initialize CodeT5+ with frozen off-the-shelf LLMs
without training from scratch to efficiently scale up our models, and explore
instruction-tuning to align with natural language instructions. We extensively
evaluate CodeT5+ on over 20 code-related benchmarks in different settings,
including zero-shot, finetuning, and instruction-tuning. We observe
state-of-the-art (SoTA) model performance on various code-related tasks, such
as code generation and completion, math programming, and text-to-code retrieval
tasks. In particular, our instruction-tuned CodeT5+ 16B achieves new SoTA
results on the HumanEval code generation task compared to other open code LLMs.
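
The abstract names four families of pretraining objectives. As one concrete illustration, the snippet below sketches a text-code contrastive loss of the kind listed above; the InfoNCE formulation, temperature value, and in-batch negatives are illustrative assumptions rather than the exact CodeT5+ recipe.

```python
# Minimal, self-contained sketch of a text-code contrastive objective
# (one of the four objective families mentioned in the abstract).
import torch
import torch.nn.functional as F

def text_code_contrastive_loss(text_emb: torch.Tensor,
                               code_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss pulling each text embedding toward its paired code embedding."""
    text_emb = F.normalize(text_emb, dim=-1)
    code_emb = F.normalize(code_emb, dim=-1)
    logits = text_emb @ code_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(text_emb.size(0))         # i-th text matches i-th code snippet
    # Symmetric loss over the text-to-code and code-to-text directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage with random embeddings standing in for encoder outputs.
loss = text_code_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```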
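
The abstract also describes initializing CodeT5+ from frozen off-the-shelf LLMs instead of training from scratch. A minimal sketch of that idea, assuming a generic setup in which a frozen decoder is paired with a trainable encoder and newly added cross-attention layers (the module split and names are hypothetical placeholders, not the paper's exact architecture):

```python
import torch.nn as nn

def mark_trainable(encoder: nn.Module, frozen_decoder: nn.Module, cross_attention: nn.Module):
    """Freeze the off-the-shelf decoder; only the encoder and new cross-attention layers train."""
    for p in frozen_decoder.parameters():
        p.requires_grad_(False)   # decoder weights stay fixed during continued pretraining
    return list(encoder.parameters()) + list(cross_attention.parameters())

# Toy usage with placeholder modules standing in for real networks.
trainable_params = mark_trainable(nn.Linear(16, 16), nn.Linear(16, 16), nn.Linear(16, 16))
```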
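
Finally, a minimal usage sketch for text-to-code generation with a publicly released checkpoint, assuming the Hugging Face transformers library; the checkpoint name "Salesforce/codet5p-220m" and the prompt are illustrative assumptions, not prescribed by the abstract.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "Salesforce/codet5p-220m"  # assumed public checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Text-to-code generation: the encoder reads the prompt, the decoder generates code.
prompt = "def print_hello_world():"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```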