REPOEXEC: Evaluate Code Generation with a Repository-Level Executable Benchmark
June 17, 2024
Authors: Nam Le Hai, Dung Manh Nguyen, Nghi D. Q. Bui
cs.AI
Abstract
The ability of CodeLLMs to generate executable and functionally correct code at the repository level remains largely unexplored. We introduce RepoExec, a novel benchmark for evaluating code generation at this scale. RepoExec focuses on three main aspects: executability, functional correctness verified through automated test case generation with high coverage, and carefully crafted cross-file contexts that enable accurate code generation. Our work explores a controlled scenario in which developers specify the necessary code dependencies, challenging the model to integrate them accurately. Experiments show that while pretrained LLMs outperform instruction-tuned models in correctness, the latter excel at utilizing the provided dependencies and demonstrate stronger debugging capabilities. We also introduce a new instruction-tuning dataset focused on code dependencies and show that CodeLLMs fine-tuned on it leverage these dependencies more effectively. RepoExec aims to provide a comprehensive evaluation of code functionality and alignment with developer intent, paving the way for more reliable and applicable CodeLLMs in real-world scenarios. The dataset and source code are available at https://github.com/FSoft-AI4Code/RepoExec.
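To make the evaluation setup concrete, the sketch below illustrates how repository-level functional correctness can be checked. This is a minimal illustration, not RepoExec's actual harness: the file layout, the helper name `passes_tests`, and the pytest-based runner are all assumptions. The idea is that the model-generated solution is written next to its specified cross-file dependencies, and the benchmark's generated test cases are then executed against it.

```python
# Minimal sketch (illustrative only; not RepoExec's actual API) of scoring
# a repository-level code generation task by executing generated tests.
import subprocess
import tempfile
from pathlib import Path

def passes_tests(dependency_src: str, generated_src: str, test_src: str,
                 timeout: float = 10.0) -> bool:
    """Return True iff the generated code runs and passes every test case."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Cross-file context: dependencies live in their own module, so the
        # generated solution must import and integrate them correctly.
        (root / "deps.py").write_text(dependency_src)
        (root / "solution.py").write_text(generated_src)
        (root / "test_solution.py").write_text(test_src)
        try:
            proc = subprocess.run(
                ["python", "-m", "pytest", "-q", "test_solution.py"],
                cwd=root, capture_output=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False  # non-terminating code counts as a failure
        return proc.returncode == 0  # 0 => executable and all tests passed

# Hypothetical usage: the solution must call a specified dependency.
ok = passes_tests(
    dependency_src="def util_add(a, b):\n    return a + b\n",
    generated_src=(
        "from deps import util_add\n\n"
        "def add_three(a, b, c):\n"
        "    return util_add(util_add(a, b), c)\n"
    ),
    test_src=(
        "from solution import add_three\n\n"
        "def test_add_three():\n"
        "    assert add_three(1, 2, 3) == 6\n"
    ),
)
```

Running each candidate in an isolated temporary directory means that code which fails to import its dependencies, crashes, or times out is scored as incorrect, reflecting the benchmark's emphasis on executability rather than surface-level similarity to a reference solution.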