A Static Evaluation of Code Completion by Large Language Models
June 5, 2023
Authors: Hantian Ding, Varun Kumar, Yuchen Tian, Zijian Wang, Rob Kwiatkowski, Xiaopeng Li, Murali Krishna Ramanathan, Baishakhi Ray, Parminder Bhatia, Sudipta Sengupta, Dan Roth, Bing Xiang
cs.AI
Abstract
Large language models trained on code have shown great potential to increase
productivity of software developers. Several execution-based benchmarks have
been proposed to evaluate functional correctness of model-generated code on
simple programming problems. Nevertheless, it is expensive to perform the same
evaluation on complex real-world projects considering the execution cost. On
the contrary, static analysis tools such as linters, which can detect errors
without running the program, haven't been well explored for evaluating code
generation models. In this work, we propose a static evaluation framework to
quantify static errors in Python code completions, by leveraging Abstract
Syntax Trees. Compared with execution-based evaluation, our method is not only
more efficient, but also applicable to code in the wild. For experiments, we
collect code context from open source repos to generate one million function
bodies using public models. Our static analysis reveals that Undefined Name and
Unused Variable are the most common errors among others made by language
models. Through extensive studies, we also show the impact of sampling
temperature, model size, and context on static errors in code completions.
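To make the idea of AST-based static checking concrete, here is a minimal sketch of how a model-generated function body could be screened for the error types named above. It is not the authors' released framework: the helper lint_completion, the example snippet, and the choice of the pyflakes linter are illustrative assumptions, shown only to indicate how parseability and linter diagnostics can be collected without executing the code.

import ast
import io

from pyflakes.api import check
from pyflakes.reporter import Reporter

def lint_completion(source):
    # Hypothetical helper: first reject completions that do not even parse.
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return ["SyntaxError: %s (line %s)" % (exc.msg, exc.lineno)]
    # Run the pyflakes linter; its Reporter writes one diagnostic per line to `out`.
    out, err = io.StringIO(), io.StringIO()
    check(source, filename="<completion>", reporter=Reporter(out, err))
    return [line for line in out.getvalue().splitlines() if line]

# A toy "model completion" containing an undefined name and an unused variable.
snippet = """
def mean(values):
    unused = 0
    total = sum(values)
    return total / lenght(values)
"""
print(lint_completion(snippet))
# The diagnostics mention the unused local variable 'unused'
# and the undefined name 'lenght', without running the snippet.

Because such checks never execute the completion, they can be applied to functions whose surrounding project cannot be built or run, which is what makes the approach scale to code collected from real repositories.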