CodeT5+: Open Code Large Language Models for Code Understanding and Generation
May 13, 2023
Authors: Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi D. Q. Bui, Junnan Li, Steven C. H. Hoi
cs.AI
Abstract
Large language models (LLMs) pretrained on vast source code have achieved
prominent progress in code intelligence. However, existing code LLMs have two
main limitations in terms of architecture and pretraining tasks. First, they
often adopt a specific architecture (encoder-only or decoder-only) or rely on a
unified encoder-decoder network for different downstream tasks. The former
paradigm is limited by inflexibility in applications while in the latter, the
model is treated as a single system for all tasks, leading to suboptimal
performance on a subset of tasks. Second, they often employ a limited set of
pretraining objectives that might not be relevant to some downstream tasks and
hence result in substantial performance degradation. To address these
limitations, we propose "CodeT5+", a family of encoder-decoder LLMs for code in which
component modules can be flexibly combined to suit a wide range of downstream
code tasks. Such flexibility is enabled by our proposed mixture of pretraining
objectives to mitigate the pretrain-finetune discrepancy. These objectives
cover span denoising, contrastive learning, text-code matching, and causal LM
pretraining tasks, on both unimodal and bimodal multilingual code corpora.
Furthermore, we propose to initialize CodeT5+ with frozen off-the-shelf LLMs
without training from scratch to efficiently scale up our models, and explore
instruction-tuning to align with natural language instructions. We extensively
evaluate CodeT5+ on over 20 code-related benchmarks in different settings,
including zero-shot, finetuning, and instruction-tuning. We observe
state-of-the-art (SoTA) model performance on various code-related tasks, such
as code generation and completion, math programming, and text-to-code retrieval
tasks. In particular, our instruction-tuned CodeT5+ 16B achieves new SoTA
results on the HumanEval code generation task, surpassing other open code LLMs.
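To make the zero-shot code generation setting described above concrete, here is a minimal sketch of prompting a CodeT5+ checkpoint through the Hugging Face Transformers seq2seq API. The checkpoint identifier "Salesforce/codet5p-770m-py", the prompt, and the generation settings are assumptions for illustration, not details taken from the abstract.

```python
# Minimal sketch: zero-shot code completion with a CodeT5+ checkpoint.
# Assumes the model is published on the Hugging Face Hub and loads as a
# standard encoder-decoder (seq2seq) model; the checkpoint name below is
# an assumption, not confirmed by the abstract.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "Salesforce/codet5p-770m-py"  # assumed checkpoint identifier
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# A natural-language/code prompt; the encoder reads it and the decoder
# generates the completion.
prompt = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same checkpoints can also be fine-tuned or instruction-tuned as in the paper's other evaluation settings; only the loading and generation path is sketched here.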