Code Execution with Pre-trained Language Models

Abstract

Code execution is a fundamental aspect of programming language semantics that reflects the exact behavior of the code. However, most pre-trained models for code intelligence ignore the execution trace and only rely on source code and syntactic structures. In this paper, we investigate how well pre-trained models can understand and perform code execution. We develop a mutation-based data augmentation technique to create a large-scale and realistic Python dataset and task for code execution, which challenges existing models such as Codex. We then present CodeExecutor, a Transformer model that leverages code execution pre-training and curriculum learning to enhance its semantic comprehension. We evaluate CodeExecutor on code execution and show its promising performance and limitations. We also demonstrate its potential benefits for code intelligence tasks such as zero-shot code-to-code search and text-to-code generation. Our analysis provides insights into the learning and generalization abilities of pre-trained models for code execution.
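
To make the two central notions concrete, the following is a minimal Python sketch of what a mutation and an execution trace could look like. It is an illustration under stated assumptions, not the paper's pipeline: the abstract does not specify the mutation operators or the trace format, so `mutate_constants` (which shifts integer literals) and `trace_execution` (which records local variables per executed line via the standard `sys.settrace` hook) are hypothetical helpers chosen for simplicity. `ast.unparse` requires Python 3.9 or later.

```python
import ast
import sys


def mutate_constants(source: str, delta: int = 1) -> str:
    """Return a variant of `source` with every integer literal shifted by `delta`.

    Hypothetical mutation operator for illustration only; the paper's actual
    operators are not described in the abstract.
    """

    class ShiftInts(ast.NodeTransformer):
        def visit_Constant(self, node):
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(node.value, int) and not isinstance(node.value, bool):
                node.value = node.value + delta
            return node

    tree = ShiftInts().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


def trace_execution(source: str):
    """Run `source` and return a list of (line number, variable snapshot) pairs.

    A snapshot is taken before each line executes and once more when the
    program finishes, so the last entry holds the final program state.
    """
    trace = []
    code = compile(source, "<program>", "exec")

    def tracer(frame, event, arg):
        # Only follow frames that belong to the traced program.
        if frame.f_code.co_filename != "<program>":
            return None
        if event in ("line", "return"):
            snapshot = {
                name: value
                for name, value in frame.f_locals.items()
                if not name.startswith("__")  # hide __builtins__ and similar
            }
            trace.append((frame.f_lineno, snapshot))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        exec(code, {})
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return trace


if __name__ == "__main__":
    program = "x = 1\ny = x + 2"
    mutated = mutate_constants(program)
    for label, src in (("original", program), ("mutated", mutated)):
        print(label)
        for lineno, variables in trace_execution(src):
            print(f"  line {lineno}: {variables}")
```

For the program `x = 1` followed by `y = x + 2`, the final snapshot is `{'x': 1, 'y': 3}`; after the mutation shifts both literals by one, it becomes `{'x': 2, 'y': 5}`. Pairing each mutated program with its recorded trace is one plausible way to obtain many distinct (program, trace) training examples from a single seed program.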
