Predicting Code Coverage without Execution

Abstract

Code coverage is a widely used metric for quantifying the extent to which program elements, such as statements or branches, are executed during testing. Calculating code coverage is resource-intensive: the code must be built and executed, and instrumentation adds further overhead. Furthermore, computing the coverage of any snippet of code requires the context of the whole program. Using machine learning to amortize this expensive process could lower the cost of code coverage by requiring only the source code context, and the task of code coverage prediction can serve as a novel benchmark for judging the ability of models to understand code. We propose a novel benchmark task called Code Coverage Prediction for Large Language Models (LLMs). We formalize this task to evaluate the capability of LLMs in understanding code execution by determining which lines of a method are executed by a given test case and inputs. We curate and release a dataset we call COVERAGEEVAL by executing tests and code from the HumanEval dataset and collecting code coverage information. We report the performance of four state-of-the-art LLMs used for code-related tasks, including OpenAI's GPT-4 and GPT-3.5-Turbo, Google's BARD, and Anthropic's Claude, on the Code Coverage Prediction task. Finally, we argue that code coverage, both as a metric and as a source of pre-training data, is valuable for overall LLM performance on software engineering tasks.
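To make the task concrete, the following is a minimal sketch of how the ground truth that a model is asked to predict could be obtained by actually running the code. It uses Python's standard `sys.settrace` hook to record which lines of a function run for a given input. The helper `executed_lines` and the toy function `sign` are hypothetical names introduced here for illustration; they are not part of COVERAGEEVAL, and the tooling used to build that dataset may differ.

```python
import sys


def executed_lines(func, *args, **kwargs):
    """Run func and return the set of executed line offsets.

    Offsets are relative to the function's first line (0 = the `def` line).
    Hypothetical helper for illustration only.
    """
    code = func.__code__
    hits = set()

    def tracer(frame, event, arg):
        # Only trace frames that belong to the function under test.
        if frame.f_code is code:
            if event == "line":
                hits.add(frame.f_lineno - code.co_firstlineno)
            return tracer
        return None

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args, **kwargs)
    finally:
        # Restore whatever tracer was active before (e.g. a debugger).
        sys.settrace(previous)
    return hits


def sign(x):
    if x > 0:                # offset 1
        return "positive"    # offset 2
    if x < 0:                # offset 3
        return "negative"    # offset 4
    return "zero"            # offset 5


if __name__ == "__main__":
    for value in (5, -1, 0):
        print(value, sorted(executed_lines(sign, value)))
```

For the input `5`, only offsets 1 and 2 are executed; the remaining lines are not covered. A coverage-prediction model would have to infer this labeling from the source and the test input alone, without running anything.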
