CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion

Abstract

Code completion models have made significant progress in recent years, yet popular evaluation datasets such as HumanEval and MBPP focus predominantly on code completion within a single file. This oversimplified setting does not represent real-world software development, where repositories span multiple files with numerous cross-file dependencies, and completing code correctly often requires accessing and understanding cross-file context. To fill this gap, we propose CrossCodeEval, a diverse and multilingual code completion benchmark that requires in-depth cross-file contextual understanding to complete the code accurately. CrossCodeEval is built on a diverse set of real-world, open-source, permissively licensed repositories in four popular programming languages: Python, Java, TypeScript, and C#. To create examples that strictly require cross-file context for accurate completion, we propose a straightforward yet efficient static-analysis-based approach to pinpoint the use of cross-file context within the current file. Extensive experiments on state-of-the-art code language models such as CodeGen and StarCoder demonstrate that CrossCodeEval is extremely challenging when the relevant cross-file context is absent, and we see clear improvements when this context is added to the prompt. However, despite these improvements, even the highest-performing model remains well short of the performance ceiling, indicating that CrossCodeEval can also assess a model's ability to leverage extensive context for better code completion. Finally, we benchmark various methods for retrieving cross-file context and show that CrossCodeEval can also be used to measure the capability of code retrievers.
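The abstract names a static-analysis step that pinpoints where cross-file context is used in the current file, without giving details. As a rough illustration only, and not the authors' actual procedure, the idea can be sketched for Python with the standard `ast` module: collect the names a file imports from modules inside its own repository, then report every line where one of those names is read. The function name `cross_file_uses` and the `local_modules` parameter are hypothetical, introduced here for the sketch.

```python
import ast
from typing import List, Set, Tuple


def cross_file_uses(source: str, local_modules: Set[str]) -> List[Tuple[int, str]]:
    """Return sorted (line, name) pairs where a name imported from an
    in-repository module is read in `source`.

    `local_modules` holds the top-level package names that belong to the
    repository (an assumption of this sketch; a real tool would derive
    them from the repository layout).
    """
    tree = ast.parse(source)
    imported: Set[str] = set()

    # Pass 1: record names bound by imports that point inside the repository.
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom):
            top = node.module.split(".")[0] if node.module else ""
            # Relative imports (level > 0) are always in-repository.
            if node.level > 0 or top in local_modules:
                imported.update(a.asname or a.name for a in node.names)
        elif isinstance(node, ast.Import):
            for alias in node.names:
                top = alias.name.split(".")[0]
                if top in local_modules:
                    # `import a.b` binds `a`; `import a.b as c` binds `c`.
                    imported.add(alias.asname or top)

    # Pass 2: every read of such a name is a use of cross-file context.
    uses = [
        (node.lineno, node.id)
        for node in ast.walk(tree)
        if isinstance(node, ast.Name)
        and isinstance(node.ctx, ast.Load)
        and node.id in imported
    ]
    return sorted(uses)
```

Lines flagged this way are candidate completion points: predicting them correctly plausibly depends on definitions that live in another file. The sketch ignores name shadowing and wildcard imports, which a production analysis would need to handle.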
