A Static Evaluation of Code Completion by Large Language Models
June 5, 2023
Authors: Hantian Ding, Varun Kumar, Yuchen Tian, Zijian Wang, Rob Kwiatkowski, Xiaopeng Li, Murali Krishna Ramanathan, Baishakhi Ray, Parminder Bhatia, Sudipta Sengupta, Dan Roth, Bing Xiang
cs.AI
Abstract
Large language models trained on code have shown great potential to increase
productivity of software developers. Several execution-based benchmarks have
been proposed to evaluate functional correctness of model-generated code on
simple programming problems. Nevertheless, it is expensive to perform the same
evaluation on complex real-world projects considering the execution cost. On
the contrary, static analysis tools such as linters, which can detect errors
without running the program, haven't been well explored for evaluating code
generation models. In this work, we propose a static evaluation framework to
quantify static errors in Python code completions, by leveraging Abstract
Syntax Trees. Compared with execution-based evaluation, our method is not only
more efficient, but also applicable to code in the wild. For experiments, we
collect code context from open source repos to generate one million function
bodies using public models. Our static analysis reveals that Undefined Name and
Unused Variable are the most common errors among others made by language
models. Through extensive studies, we also show the impact of sampling
temperature, model size, and context on static errors in code completions.
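To make the idea of AST-based static checking concrete, here is a minimal sketch of how a model-generated function body could be screened for the error types named above. It is not the authors' released framework: the helper lint_completion, the example snippet, and the choice of the pyflakes linter are illustrative assumptions, shown only to indicate how parseability and linter diagnostics can be collected without executing the code.

import ast
import io

from pyflakes.api import check
from pyflakes.reporter import Reporter

def lint_completion(source):
    # Hypothetical helper: first reject completions that do not even parse.
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return ["SyntaxError: %s (line %s)" % (exc.msg, exc.lineno)]
    # Run the pyflakes linter; its Reporter writes one diagnostic per line to `out`.
    out, err = io.StringIO(), io.StringIO()
    check(source, filename="<completion>", reporter=Reporter(out, err))
    return [line for line in out.getvalue().splitlines() if line]

# A toy "model completion" containing an undefined name and an unused variable.
snippet = """
def mean(values):
    unused = 0
    total = sum(values)
    return total / lenght(values)
"""
print(lint_completion(snippet))
# The diagnostics mention the unused local variable 'unused'
# and the undefined name 'lenght', without running the snippet.

Because such checks never execute the completion, they can be applied to functions whose surrounding project cannot be built or run, which is what makes the approach scale to code collected from real repositories.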