CodeT5+: Open Code Large Language Models for Code Understanding and Generation
May 13, 2023
Authors: Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi D. Q. Bui, Junnan Li, Steven C. H. Hoi
cs.AI
Abstract
Large language models (LLMs) pretrained on vast source code have achieved
prominent progress in code intelligence. However, existing code LLMs have two
main limitations in terms of architecture and pretraining tasks. First, they
often adopt a specific architecture (encoder-only or decoder-only) or rely on a
unified encoder-decoder network for different downstream tasks. The former
paradigm is limited by inflexibility in applications while in the latter, the
model is treated as a single system for all tasks, leading to suboptimal
performance on a subset of tasks. Second, they often employ a limited set of
pretraining objectives that might not be relevant to some downstream tasks and
hence result in substantial performance degradation. To address these
limitations, we propose "CodeT5+", a family of encoder-decoder LLMs for code in which
component modules can be flexibly combined to suit a wide range of downstream
code tasks. Such flexibility is enabled by our proposed mixture of pretraining
objectives to mitigate the pretrain-finetune discrepancy. These objectives
cover span denoising, contrastive learning, text-code matching, and causal LM
pretraining tasks, on both unimodal and bimodal multilingual code corpora.
Furthermore, we propose to initialize CodeT5+ with frozen off-the-shelf LLMs
without training from scratch to efficiently scale up our models, and explore
instruction-tuning to align with natural language instructions. We extensively
evaluate CodeT5+ on over 20 code-related benchmarks in different settings,
including zero-shot, finetuning, and instruction-tuning. We observe
state-of-the-art (SoTA) model performance on various code-related tasks, such
as code generation and completion, math programming, and text-to-code retrieval
tasks. In particular, our instruction-tuned CodeT5+ 16B achieves new SoTA
results on the HumanEval code generation task, surpassing other open code LLMs.
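To make the zero-shot code generation setting described above concrete, here is a minimal sketch of prompting a CodeT5+ checkpoint through the Hugging Face Transformers seq2seq API. The checkpoint identifier "Salesforce/codet5p-770m-py", the prompt, and the generation settings are assumptions for illustration, not details taken from the abstract.

```python
# Minimal sketch: zero-shot code completion with a CodeT5+ checkpoint.
# Assumes the model is published on the Hugging Face Hub and loads as a
# standard encoder-decoder (seq2seq) model; the checkpoint name below is
# an assumption, not confirmed by the abstract.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "Salesforce/codet5p-770m-py"  # assumed checkpoint identifier
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# A natural-language/code prompt; the encoder reads it and the decoder
# generates the completion.
prompt = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same checkpoints can also be fine-tuned or instruction-tuned as in the paper's other evaluation settings; only the loading and generation path is sketched here.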