Code Representation Learning At Scale

February 2, 2024
Authors: Dejiao Zhang, Wasi Ahmad, Ming Tan, Hantian Ding, Ramesh Nallapati, Dan Roth, Xiaofei Ma, Bing Xiang
cs.AI

Abstract

Recent studies have shown that code language models at scale demonstrate significant performance gains on downstream tasks, e.g., code generation. However, most existing work on code representation learning trains models at the hundred-million-parameter scale using very limited pretraining corpora. In this work, we fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme. We first train the encoders via a mix that leverages both the randomness of masked language modeling and the structural aspects of programming languages. We then enhance the representations via contrastive learning, with hard negatives and hard positives constructed in an unsupervised manner. We establish an off-the-shelf encoder model that consistently outperforms existing models on a wide variety of downstream tasks by large margins. To understand the factors contributing to successful code representation learning, we conduct detailed ablations and share our findings on (i) a customized and effective token-level denoising scheme for source code; (ii) the importance of hard negatives and hard positives; (iii) how the proposed bimodal contrastive learning boosts cross-lingual semantic search performance; and (iv) how the pretraining scheme determines how downstream task performance scales with model size.
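
The second-stage objective described above is a bimodal contrastive loss over paired code and natural-language summaries, where the other pairs in a batch act as (potentially hard) negatives. The sketch below is a minimal illustration of such an InfoNCE-style loss in PyTorch; the function name, temperature value, and symmetric two-direction formulation are assumptions for illustration, not the paper's released implementation.

# Minimal sketch (not the paper's code) of bimodal contrastive learning over
# (code, summary) embedding pairs with in-batch negatives. The temperature
# and symmetric formulation are illustrative assumptions.
import torch
import torch.nn.functional as F

def bimodal_contrastive_loss(code_emb: torch.Tensor,
                             text_emb: torch.Tensor,
                             temperature: float = 0.05) -> torch.Tensor:
    """Each code snippet should match its own summary; all other summaries
    in the batch serve as negatives (the hardest ones dominate the loss)."""
    code_emb = F.normalize(code_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = code_emb @ text_emb.t() / temperature   # [B, B] similarity matrix
    targets = torch.arange(code_emb.size(0), device=code_emb.device)
    # Symmetric loss over both retrieval directions (code->text and text->code).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random embeddings standing in for encoder outputs.
if __name__ == "__main__":
    code_emb, text_emb = torch.randn(8, 768), torch.randn(8, 768)
    print(bimodal_contrastive_loss(code_emb, text_emb).item())

In this formulation, sharper temperatures and larger batches implicitly emphasize hard in-batch negatives, which is one plausible way to connect finding (ii) to the loss above.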