Code Execution with Pre-trained Language Models
May 8, 2023
Authors: Chenxiao Liu, Shuai Lu, Weizhu Chen, Daxin Jiang, Alexey Svyatkovskiy, Shengyu Fu, Neel Sundaresan, Nan Duan
cs.AI
Abstract
Code execution is a fundamental aspect of programming language semantics that
reflects the exact behavior of the code. However, most pre-trained models for
code intelligence ignore the execution trace and only rely on source code and
syntactic structures. In this paper, we investigate how well pre-trained models
can understand and perform code execution. We develop a mutation-based data
augmentation technique to create a large-scale and realistic Python dataset and
task for code execution, which challenges existing models such as Codex. We
then present CodeExecutor, a Transformer model that leverages code execution
pre-training and curriculum learning to enhance its semantic comprehension. We
evaluate CodeExecutor on code execution, showing both its promising performance
and its limitations. We also demonstrate its potential benefits for code intelligence
tasks such as zero-shot code-to-code search and text-to-code generation. Our
analysis provides insights into the learning and generalization abilities of
pre-trained models for code execution.
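
For readers unfamiliar with the term, an execution trace records the runtime state of a program as it runs, line by line. The snippet below is a minimal sketch of how such a trace could be collected for a Python function using the standard sys.settrace hook; the trace format used in the paper is not specified here, and the helper names (trace_execution, sample) are purely illustrative.

```python
import sys

def trace_execution(func, *args):
    """Record (line number, local variables) at each executed line of `func`.

    A minimal illustration of what an execution trace can contain; the
    paper's actual trace representation may differ.
    """
    trace = []

    def tracer(frame, event, arg):
        # Only record 'line' events that belong to the traced function's frame.
        if event == "line" and frame.f_code is func.__code__:
            trace.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)
    return result, trace


def sample(n):
    total = 0
    for i in range(n):
        total += i
    return total


result, trace = trace_execution(sample, 3)
print(result)  # 3
for lineno, local_vars in trace:
    print(lineno, local_vars)
```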
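The abstract also mentions a mutation-based data augmentation technique for building the code execution dataset. The sketch below shows one simple way program mutations could be applied with Python's ast module, perturbing integer constants and swapping arithmetic operators so that mutated programs execute differently from the original. This is an illustrative assumption, not the authors' actual pipeline; OperandMutator and mutate are hypothetical names.

```python
import ast
import random

class OperandMutator(ast.NodeTransformer):
    """Randomly perturb integer constants and swap a few arithmetic operators.

    A toy stand-in for mutation-based augmentation; the paper's real mutation
    operators and filtering steps are more elaborate.
    """

    SWAPS = {ast.Add: ast.Sub, ast.Sub: ast.Add, ast.Mult: ast.Add}

    def visit_Constant(self, node):
        # Occasionally shift integer literals to change the program's behavior.
        if isinstance(node.value, int) and random.random() < 0.5:
            mutated = ast.Constant(node.value + random.randint(1, 5))
            return ast.copy_location(mutated, node)
        return node

    def visit_BinOp(self, node):
        self.generic_visit(node)
        # Occasionally replace the arithmetic operator with a different one.
        new_op = self.SWAPS.get(type(node.op))
        if new_op is not None and random.random() < 0.3:
            node.op = new_op()
        return node


def mutate(source: str) -> str:
    """Return a mutated variant of `source` whose execution trace will likely differ."""
    tree = OperandMutator().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # ast.unparse requires Python 3.9+


original = "x = 2\ny = x * 3\nprint(y)"
print(mutate(original))
```

Pairing each mutated program with its freshly collected execution trace is one plausible way to scale up training data for a model like CodeExecutor, since every mutation yields a new program whose ground-truth behavior can be obtained automatically by running it.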