REPOEXEC: Evaluate Code Generation with a Repository-Level Executable Benchmark
June 17, 2024
Authors: Nam Le Hai, Dung Manh Nguyen, Nghi D. Q. Bui
cs.AI
Abstract
The ability of CodeLLMs to generate executable and functionally correct code at the repository level remains largely unexplored. We introduce RepoExec, a novel benchmark for evaluating code generation at this scale. RepoExec focuses on three main aspects: executability, functional correctness verified through automated test case generation with high coverage, and carefully crafted cross-file contexts that enable accurate code generation. Our work explores a controlled scenario in which developers specify the necessary code dependencies, challenging the model to integrate them accurately. Experiments show that while pretrained LLMs outperform instruction-tuned models in correctness, the latter excel at utilizing the provided dependencies and demonstrate stronger debugging capabilities. We also introduce a new instruction-tuning dataset focused on code dependencies and show that CodeLLMs fine-tuned on it leverage these dependencies more effectively. RepoExec aims to provide a comprehensive evaluation of code functionality and alignment with developer intent, paving the way for more reliable and applicable CodeLLMs in real-world scenarios. The dataset and source code are available at https://github.com/FSoft-AI4Code/RepoExec.
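To make the evaluation setup concrete, the sketch below illustrates how repository-level functional correctness can be checked. This is a minimal illustration, not RepoExec's actual harness: the file layout, the helper name `passes_tests`, and the pytest-based runner are all assumptions. The idea is that the model-generated solution is written next to its specified cross-file dependencies, and the benchmark's generated test cases are then executed against it.

```python
# Minimal sketch (illustrative only; not RepoExec's actual API) of scoring
# a repository-level code generation task by executing generated tests.
import subprocess
import tempfile
from pathlib import Path

def passes_tests(dependency_src: str, generated_src: str, test_src: str,
                 timeout: float = 10.0) -> bool:
    """Return True iff the generated code runs and passes every test case."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Cross-file context: dependencies live in their own module, so the
        # generated solution must import and integrate them correctly.
        (root / "deps.py").write_text(dependency_src)
        (root / "solution.py").write_text(generated_src)
        (root / "test_solution.py").write_text(test_src)
        try:
            proc = subprocess.run(
                ["python", "-m", "pytest", "-q", "test_solution.py"],
                cwd=root, capture_output=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False  # non-terminating code counts as a failure
        return proc.returncode == 0  # 0 => executable and all tests passed

# Hypothetical usage: the solution must call a specified dependency.
ok = passes_tests(
    dependency_src="def util_add(a, b):\n    return a + b\n",
    generated_src=(
        "from deps import util_add\n\n"
        "def add_three(a, b, c):\n"
        "    return util_add(util_add(a, b), c)\n"
    ),
    test_src=(
        "from solution import add_three\n\n"
        "def test_add_three():\n"
        "    assert add_three(1, 2, 3) == 6\n"
    ),
)
```

Running each candidate in an isolated temporary directory means that code which fails to import its dependencies, crashes, or times out is scored as incorrect, reflecting the benchmark's emphasis on executability rather than surface-level similarity to a reference solution.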