Why Personalizing Deep Learning-Based Code Completion Tools Matters
March 18, 2025
Authors: Alessandro Giagnorio, Alberto Martin-Lopez, Gabriele Bavota
cs.AI
Abstract
Deep learning (DL)-based code completion tools have transformed software
development by enabling advanced code generation. These tools leverage models
trained on vast amounts of code from numerous repositories, capturing general
coding patterns. However, the impact of fine-tuning these models for specific
organizations or developers to boost their performance on such subjects remains
unexplored. In this work, we fill this gap by presenting solid empirical
evidence answering this question. More specifically, we consider 136 developers
from two organizations (Apache and Spring), two model architectures (T5 and
Code Llama), and three model sizes (60M, 750M, and 7B trainable parameters). T5
models (60M, 750M) were pre-trained and fine-tuned on over 2,000 open-source
projects, excluding the subject organizations' data, and compared against
versions fine-tuned on organization- and developer-specific datasets. For the
Code Llama model (7B), we compared the performance of the publicly
available pre-trained model with the same model fine-tuned via
parameter-efficient fine-tuning on organization- and developer-specific
datasets. Our results show that both organization-specific and
developer-specific additional fine-tuning boost prediction capabilities,
with the former being particularly effective. Such a finding
generalizes across (i) the two subject organizations (i.e., Apache and Spring)
and (ii) models of completely different magnitude (from 60M to 7B trainable
parameters). Finally, we show that DL models fine-tuned on an
organization-specific dataset achieve the same completion performance as
pre-trained code models used out of the box that are ∼10× larger, with
consequent savings in terms of deployment and inference costs (e.g.,
smaller GPUs needed).
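To make the parameter-efficient fine-tuning setup concrete, below is a
minimal sketch of how the Code Llama (7B) experiment described above could
be approximated with Hugging Face's `transformers` and `peft` libraries.
The abstract does not specify the PEFT technique, hyperparameters, or data
format, so LoRA, the `apache_methods.txt` corpus file, and all training
settings here are illustrative assumptions, not the authors' actual
configuration.

```python
# Sketch: organization-specific PEFT of Code Llama 7B via LoRA.
# All hyperparameters and the data file are illustrative assumptions.
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model

BASE_MODEL = "codellama/CodeLlama-7b-hf"  # publicly available pre-trained model

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.float16)

# LoRA adapters: only a small fraction of the weights is trained,
# which is what makes fine-tuning a 7B model affordable.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # common choice for Llama-style models
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of 7B parameters

# Hypothetical organization-specific corpus: one code snippet per line,
# e.g., methods mined from Apache (or Spring) repositories.
dataset = load_dataset("text", data_files={"train": "apache_methods.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="codellama-7b-apache-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=50,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("codellama-7b-apache-lora")  # stores only the adapter weights
```

A practical side effect of this design is that only the small adapter
weights are saved, so a single pre-trained base model can be shared across
many organization- or developer-specific adapters rather than duplicating
the full 7B checkpoint per subject.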