Code Execution with Pre-trained Language Models
May 8, 2023
Authors: Chenxiao Liu, Shuai Lu, Weizhu Chen, Daxin Jiang, Alexey Svyatkovskiy, Shengyu Fu, Neel Sundaresan, Nan Duan
cs.AI
Abstract
Code execution is a fundamental aspect of programming language semantics that
reflects the exact behavior of the code. However, most pre-trained models for
code intelligence ignore the execution trace and only rely on source code and
syntactic structures. In this paper, we investigate how well pre-trained models
can understand and perform code execution. We develop a mutation-based data
augmentation technique to create a large-scale and realistic Python dataset and
task for code execution, which challenges existing models such as Codex. We
then present CodeExecutor, a Transformer model that leverages code execution
pre-training and curriculum learning to enhance its semantic comprehension. We
evaluate CodeExecutor on code execution, showing both its promising performance
and its limitations. We also demonstrate its potential benefits for code intelligence
tasks such as zero-shot code-to-code search and text-to-code generation. Our
analysis provides insights into the learning and generalization abilities of
pre-trained models for code execution.
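
For readers unfamiliar with the term, an execution trace records the runtime state of a program as it runs, line by line. The snippet below is a minimal sketch of how such a trace could be collected for a Python function using the standard sys.settrace hook; the trace format used in the paper is not specified here, and the helper names (trace_execution, sample) are purely illustrative.

```python
import sys

def trace_execution(func, *args):
    """Record (line number, local variables) at each executed line of `func`.

    A minimal illustration of what an execution trace can contain; the
    paper's actual trace representation may differ.
    """
    trace = []

    def tracer(frame, event, arg):
        # Only record 'line' events that belong to the traced function's frame.
        if event == "line" and frame.f_code is func.__code__:
            trace.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)
    return result, trace


def sample(n):
    total = 0
    for i in range(n):
        total += i
    return total


result, trace = trace_execution(sample, 3)
print(result)  # 3
for lineno, local_vars in trace:
    print(lineno, local_vars)
```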
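The abstract also mentions a mutation-based data augmentation technique for building the code execution dataset. The sketch below shows one simple way program mutations could be applied with Python's ast module, perturbing integer constants and swapping arithmetic operators so that mutated programs execute differently from the original. This is an illustrative assumption, not the authors' actual pipeline; OperandMutator and mutate are hypothetical names.

```python
import ast
import random

class OperandMutator(ast.NodeTransformer):
    """Randomly perturb integer constants and swap a few arithmetic operators.

    A toy stand-in for mutation-based augmentation; the paper's real mutation
    operators and filtering steps are more elaborate.
    """

    SWAPS = {ast.Add: ast.Sub, ast.Sub: ast.Add, ast.Mult: ast.Add}

    def visit_Constant(self, node):
        # Occasionally shift integer literals to change the program's behavior.
        if isinstance(node.value, int) and random.random() < 0.5:
            mutated = ast.Constant(node.value + random.randint(1, 5))
            return ast.copy_location(mutated, node)
        return node

    def visit_BinOp(self, node):
        self.generic_visit(node)
        # Occasionally replace the arithmetic operator with a different one.
        new_op = self.SWAPS.get(type(node.op))
        if new_op is not None and random.random() < 0.3:
            node.op = new_op()
        return node


def mutate(source: str) -> str:
    """Return a mutated variant of `source` whose execution trace will likely differ."""
    tree = OperandMutator().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # ast.unparse requires Python 3.9+


original = "x = 2\ny = x * 3\nprint(y)"
print(mutate(original))
```

Pairing each mutated program with its freshly collected execution trace is one plausible way to scale up training data for a model like CodeExecutor, since every mutation yields a new program whose ground-truth behavior can be obtained automatically by running it.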