

Predicting Code Coverage without Execution

July 25, 2023
Authors: Michele Tufano, Shubham Chandel, Anisha Agarwal, Neel Sundaresan, Colin Clement
cs.AI

Abstract

Code coverage is a widely used metric for quantifying the extent to which program elements, such as statements or branches, are executed during testing. Calculating code coverage is resource-intensive, requiring code building and execution with additional overhead for the instrumentation. Furthermore, computing coverage of any snippet of code requires the whole program context. Using Machine Learning to amortize this expensive process could lower the cost of code coverage by requiring only the source code context, and the task of code coverage prediction can be a novel benchmark for judging the ability of models to understand code. We propose a novel benchmark task called Code Coverage Prediction for Large Language Models (LLMs). We formalize this task to evaluate the capability of LLMs in understanding code execution by determining which lines of a method are executed by a given test case and inputs. We curate and release a dataset we call COVERAGEEVAL by executing tests and code from the HumanEval dataset and collecting code coverage information. We report the performance of four state-of-the-art LLMs used for code-related tasks, including OpenAI's GPT-4 and GPT-3.5-Turbo, Google's BARD, and Anthropic's Claude, on the Code Coverage Prediction task. Finally, we argue that code coverage as a metric and pre-training data source are valuable for overall LLM performance on software engineering tasks.
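To make the prediction task concrete, the ground-truth labels in a dataset like COVERAGEEVAL come from actually executing a method under a test input and recording which lines run. A minimal sketch of that labeling step, using Python's standard `sys.settrace` hook rather than the paper's actual instrumentation pipeline (the function `trace_coverage` and the example `sign` method are illustrative, not from the paper):

```python
import sys

def trace_coverage(func, *args):
    """Run `func(*args)` and return the set of line offsets that executed.

    Offsets are relative to the `def` line (offset 0), so the result is
    independent of where the function sits in the file.
    """
    executed = set()
    first = func.__code__.co_firstlineno

    def tracer(frame, event, arg):
        # Record a 'line' event only for frames running the target function.
        if event == "line" and frame.f_code is func.__code__:
            executed.add(frame.f_lineno - first)
        return tracer

    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(None)  # always detach the tracer
    return executed

def sign(x):            # offset 0
    if x > 0:           # offset 1
        return 1        # offset 2
    else:               # offset 3
        return -1       # offset 4

print(trace_coverage(sign, 5))   # positive input: the if-branch runs
print(trace_coverage(sign, -5))  # negative input: the else-branch runs
```

An LLM performing coverage prediction is asked to produce exactly this kind of per-line executed/not-executed labeling from the source and test input alone, without running anything.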