REPOEXEC: Evaluate Code Generation with a Repository-Level Executable Benchmark

June 17, 2024
作者: Nam Le Hai, Dung Manh Nguyen, Nghi D. Q. Bui
cs.AI

Abstract

The ability of CodeLLMs to generate executable and functionally correct code at repository-level scale remains largely unexplored. We introduce RepoExec, a novel benchmark for evaluating code generation at the repository level. RepoExec focuses on three main aspects: executability, functional correctness verified through automated test-case generation with a high coverage rate, and carefully crafted cross-file contexts for accurate code generation. Our work explores a controlled scenario in which developers specify the necessary code dependencies, challenging the model to integrate them accurately. Experiments show that while pretrained LLMs outperform instruction-tuned models in correctness, the latter excel at utilizing the provided dependencies and at debugging. We also introduce a new instruction-tuning dataset focused on code dependencies and demonstrate that CodeLLMs fine-tuned on it leverage these dependencies more effectively. RepoExec aims to provide a comprehensive evaluation of code functionality and alignment with developer intent, paving the way for more reliable and applicable CodeLLMs in real-world scenarios. The dataset and source code are available at https://github.com/FSoft-AI4Code/RepoExec.
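
To make the setup concrete, below is a minimal, hypothetical sketch of what a repository-level evaluation instance and its execution check might look like. The `RepoLevelTask` fields, the `evaluate()` harness, and the sample task are illustrative assumptions based on the abstract's description (cross-file context, developer-specified dependencies, auto-generated tests), not RepoExec's actual schema or API.

```python
# A minimal, hypothetical sketch of a repository-level evaluation instance
# and its execution check. Names and fields are illustrative assumptions,
# not RepoExec's actual schema or API.
from dataclasses import dataclass
import pathlib
import subprocess
import sys
import tempfile
import textwrap


@dataclass
class RepoLevelTask:
    prompt: str              # target function signature and docstring
    cross_file_context: str  # code from other repository files the solution needs
    dependencies: list[str]  # identifiers the developer requires the model to use
    tests: str               # automatically generated, high-coverage test cases


def evaluate(task: RepoLevelTask, completion: str) -> bool:
    """Assemble context + completion + tests into one file and execute it."""
    program = "\n\n".join([task.cross_file_context, completion, task.tests])
    with tempfile.TemporaryDirectory() as tmp:
        path = pathlib.Path(tmp) / "candidate.py"
        path.write_text(program)
        result = subprocess.run(
            [sys.executable, str(path)], capture_output=True, timeout=30
        )
    # Return code 0 means the code is executable and every assertion passed.
    return result.returncode == 0


task = RepoLevelTask(
    prompt="def total_price(cart): ...",
    cross_file_context=textwrap.dedent(
        """\
        # from pricing/utils.py (cross-file dependency)
        def apply_tax(amount, rate=0.1):
            return amount * (1 + rate)
        """
    ),
    dependencies=["apply_tax"],
    tests="assert total_price([10, 20]) == apply_tax(30)",
)

# A completion that correctly integrates the required dependency.
completion = "def total_price(cart):\n    return apply_tax(sum(cart))"

print(evaluate(task, completion))  # True: executable and functionally correct
```

Beyond a pass/fail execution check, a harness in this style could also verify whether each identifier in `dependencies` actually appears in the completion, mirroring the paper's finding that models differ in how well they use the provided dependencies.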
