

Code Representation Learning At Scale

February 2, 2024
Authors: Dejiao Zhang, Wasi Ahmad, Ming Tan, Hantian Ding, Ramesh Nallapati, Dan Roth, Xiaofei Ma, Bing Xiang
cs.AI

Abstract

Recent studies have shown that code language models at scale demonstrate significant performance gains on downstream tasks, e.g., code generation. However, most existing work on code representation learning trains models at the hundred-million-parameter scale using very limited pretraining corpora. In this work, we fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme. We first train the encoders with a mix of objectives that leverages both the randomness of masked language modeling and the structural aspects of programming languages. We then enhance the representations via contrastive learning, with hard negatives and hard positives constructed in an unsupervised manner. We establish an off-the-shelf encoder model that consistently outperforms existing models on a wide variety of downstream tasks by large margins. To understand the factors contributing to successful code representation learning, we conduct detailed ablations and share our findings on (i) a customized and effective token-level denoising scheme for source code; (ii) the importance of hard negatives and hard positives; (iii) how the proposed bimodal contrastive learning boosts cross-lingual semantic search performance; and (iv) how the pretraining scheme determines how downstream task performance scales with model size.