Predicting Code Coverage without Execution

Abstract

Code coverage is a widely used metric for quantifying the extent to which program elements, such as statements or branches, are executed during testing. Calculating code coverage is resource-intensive: it requires building and executing the code, with additional overhead for instrumentation. Furthermore, computing the coverage of any code snippet requires the whole program context. Using machine learning to amortize this expensive process could lower the cost of code coverage by requiring only the source code context, and the task of code coverage prediction can serve as a novel benchmark for judging the ability of models to understand code. We propose a novel benchmark task for Large Language Models (LLMs) called Code Coverage Prediction. We formalize this task to evaluate the capability of LLMs to understand code execution by determining which lines of a method are executed by a given test case and inputs. We curate and release a dataset we call COVERAGEEVAL by executing tests and code from the HumanEval dataset and collecting code coverage information. We report the performance of four state-of-the-art LLMs used for code-related tasks, including OpenAI's GPT-4 and GPT-3.5-Turbo, Google's BARD, and Anthropic's Claude, on the Code Coverage Prediction task. Finally, we argue that code coverage is valuable, both as a metric and as a pre-training data source, for overall LLM performance on software engineering tasks.
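To make the task concrete, the following is a minimal illustrative sketch, not the benchmark's actual harness. It assumes that ground truth is the set of body lines a test input executes, and that a prediction is a set of line numbers. The helper names (`executed_lines`, `classify`, `score`) and the two scores shown are hypothetical; the abstract does not specify the prompt format or the metrics used.

```python
import sys


def executed_lines(func, *args):
    """Run func(*args) and return the body lines it executed.

    Lines are numbered relative to the `def` line (which is line 1).
    Only body lines (2 and up) are recorded.
    """
    code = func.__code__
    first = code.co_firstlineno
    hit = set()

    def tracer(frame, event, arg):
        # Trace only the frame of the function under test.
        if frame.f_code is code:
            if event == "line" and frame.f_lineno > first:
                hit.add(frame.f_lineno - first + 1)
            return tracer
        return None

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)
    return hit


def classify(n):               # line 1
    if n < 0:                  # line 2
        return "negative"      # line 3
    if n == 0:                 # line 4
        return "zero"          # line 5
    return "positive"          # line 6


def score(predicted, actual, body_lines):
    """Return (per-line accuracy, exact match) for a predicted line set."""
    body_lines = list(body_lines)
    correct = sum((ln in predicted) == (ln in actual) for ln in body_lines)
    return correct / len(body_lines), predicted == actual


# Ground truth from real execution versus a hypothetical model prediction.
actual = executed_lines(classify, 5)
predicted = {2, 3, 6}  # assumed model output, for illustration only
accuracy, exact = score(predicted, actual, range(2, 7))
```

Here the ground truth comes from actually running the code, which is the cost the prediction task aims to avoid; a model would be asked to produce `predicted` from the source and the test input alone.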
