A Static Evaluation of Code Completion by Large Language Models

Abstract

Large language models trained on code have shown great potential to increase the productivity of software developers. Several execution-based benchmarks have been proposed to evaluate the functional correctness of model-generated code on simple programming problems. However, performing the same evaluation on complex real-world projects is expensive because of the execution cost. In contrast, static analysis tools such as linters, which can detect errors without running the program, have not been well explored for evaluating code generation models. In this work, we propose a static evaluation framework that leverages Abstract Syntax Trees to quantify static errors in Python code completions. Compared with execution-based evaluation, our method is not only more efficient but also applicable to code in the wild. For our experiments, we collect code contexts from open-source repositories and generate one million function bodies using public models. Our static analysis reveals that Undefined Name and Unused Variable are the most common errors made by language models. Through extensive studies, we also show the impact of sampling temperature, model size, and context on static errors in code completions.
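To make the idea of AST-based static checking concrete, the following is a minimal sketch and not the authors' framework. It assumes only Python's standard `ast` module; the function name `find_static_errors`, the error labels, and the sample context and completion are hypothetical and chosen for illustration. The undefined-name check is deliberately coarse (it ignores scoping), so a production linter would report more cases and fewer false negatives.

```python
import ast
import builtins


def find_static_errors(source):
    """Return (kind, name, line) tuples for a few simple static errors.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        # Without a valid AST there is nothing further to inspect.
        return [("syntax-error", exc.msg, exc.lineno)]

    # Coarse, scope-insensitive set of every name bound anywhere in the file.
    bound = set(dir(builtins))
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            bound.add(node.name)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                bound.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, ast.arg):
            bound.add(node.arg)
        elif isinstance(node, ast.ExceptHandler) and node.name:
            bound.add(node.name)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            bound.add(node.id)

    errors = []

    # Undefined name: a name is read but never bound anywhere in the file.
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
            if node.id not in bound:
                errors.append(("undefined-name", node.id, node.lineno))

    # Unused variable: a name is assigned inside a function but never read there.
    for func in ast.walk(tree):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        stored, loaded = {}, set()
        for node in ast.walk(func):
            if isinstance(node, ast.Name):
                if isinstance(node.ctx, ast.Store):
                    stored.setdefault(node.id, node.lineno)
                elif isinstance(node.ctx, ast.Load):
                    loaded.add(node.id)
        for name, line in stored.items():
            if name not in loaded:
                errors.append(("unused-variable", name, line))

    return errors


# Hypothetical code context (from a repository) and model completion.
context = "import os\n\ndef list_dir(path):\n"
completion = (
    "    entries = os.listdir(path)\n"
    "    count = len(entries)\n"
    "    return sorted(entrys)\n"
)

for error in find_static_errors(context + completion):
    print(error)
# ('undefined-name', 'entrys', 6)
# ('unused-variable', 'count', 5)
```

Parsing the context together with the completion matters here: names such as `os` and `path` are defined in the context, so checking the completion alone would flag them spuriously.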
