Predicting Code Coverage without Execution

Abstract

Code coverage is a widely used metric for quantifying the extent to which program elements, such as statements or branches, are executed during testing. Calculating code coverage is resource-intensive: the code must be built and executed, and instrumentation adds further overhead. Furthermore, computing the coverage of any code snippet requires the whole program context. Using machine learning to amortize this expensive process could lower the cost of code coverage by requiring only the source code context, and the task of code coverage prediction can serve as a novel benchmark for judging how well models understand code. We propose a novel benchmark task called Code Coverage Prediction for Large Language Models (LLMs). We formalize this task to evaluate the capability of LLMs to understand code execution by determining which lines of a method are executed by a given test case and inputs. We curate and release a dataset, COVERAGEEVAL, built by executing tests and code from the HumanEval dataset and collecting code coverage information. We report the performance of four state-of-the-art LLMs used for code-related tasks, OpenAI's GPT-4 and GPT-3.5-Turbo, Google's BARD, and Anthropic's Claude, on the Code Coverage Prediction task. Finally, we argue that code coverage, as both a metric and a pre-training data source, is valuable for overall LLM performance on software engineering tasks.
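To make the prediction target concrete, the sketch below records which lines of a single function run for a given input, which is the ground truth a model would be asked to predict without executing anything. It is only an illustration built on Python's standard `sys.settrace` hook, not the authors' data collection pipeline; the names `executed_lines` and `classify` are hypothetical and chosen for this example.

```python
import sys


def executed_lines(func, *args, **kwargs):
    """Return the sorted 1-based line offsets inside `func` that ran for this call.

    Offset 1 is the `def` line itself, so body lines start at 2.
    Hypothetical helper for illustration only.
    """
    code = func.__code__
    first_line = code.co_firstlineno
    hits = set()

    def tracer(frame, event, arg):
        # Only follow frames that execute the function under test.
        if frame.f_code is not code:
            return None
        if event == "line":
            hits.add(frame.f_lineno - first_line + 1)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args, **kwargs)
    finally:
        # Restore whatever tracer was active before (for example a debugger).
        sys.settrace(previous)
    return sorted(hits)


def classify(n):
    if n < 0:
        return "negative"
    if n == 0:
        return "zero"
    return "positive"


# Different inputs exercise different lines of the same method.
print(executed_lines(classify, -1))
print(executed_lines(classify, 5))
```

For the input `-1` only the first check and its return statement run, while `5` falls through both checks to the final return. A coverage prediction model would have to infer that difference from the source code and the test input alone.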
