REPOEXEC: Evaluate Code Generation with a Repository-Level Executable Benchmark

Abstract

The ability of CodeLLMs to generate executable and functionally correct code at repository scale remains largely unexplored. We introduce RepoExec, a novel benchmark for evaluating code generation at the repository level. RepoExec focuses on three main aspects: executability, functional correctness verified through automatically generated test cases with a high coverage rate, and carefully crafted cross-file contexts that support accurate code generation. Our work explores a controlled scenario in which developers specify the necessary code dependencies and the model is challenged to integrate them accurately. Experiments show that while pretrained LLMs outperform instruction-tuned models in correctness, the latter excel at utilizing the provided dependencies and at demonstrating debugging capabilities. We also introduce a new instruction-tuning dataset that focuses on code dependencies and demonstrate that CodeLLMs fine-tuned on it leverage these dependencies more effectively. RepoExec aims to provide a comprehensive evaluation of code functionality and alignment with developer intent, paving the way for more reliable and applicable CodeLLMs in real-world scenarios. The dataset and source code are available at https://github.com/FSoft-AI4Code/RepoExec.
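
To make the idea of "utilizing provided dependencies" concrete, the following is a minimal sketch of one way such usage could be measured: the fraction of developer-specified dependency names that a generated function actually references. This is an illustration under stated assumptions, not RepoExec's own metric or implementation; the function name `dependency_usage` and the sample dependency names are hypothetical.

```python
import ast


def dependency_usage(generated_code: str, dependencies: list[str]) -> float:
    """Return the fraction of the given dependency names referenced in the code.

    Hypothetical helper for illustration only; not part of RepoExec.
    """
    tree = ast.parse(generated_code)
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            # Bare identifiers such as strip_whitespace(...)
            used.add(node.id)
        elif isinstance(node, ast.Attribute):
            # Attribute accesses such as utils.strip_whitespace(...)
            used.add(node.attr)
    if not dependencies:
        return 0.0
    return sum(1 for dep in dependencies if dep in used) / len(dependencies)


# Hypothetical generated function and dependency list.
generated = '''
def normalize_path(path):
    cleaned = strip_whitespace(path)
    return join_parts(split_parts(cleaned))
'''
provided = ["strip_whitespace", "split_parts", "join_parts", "resolve_symlinks"]
ratio = dependency_usage(generated, provided)
print(ratio)  # 3 of the 4 provided names are referenced
```

A name-based check like this only shows that a dependency is mentioned, not that it is called correctly, so in practice it would complement test-based correctness rather than replace it.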
