
REPOEXEC: Evaluate Code Generation with a Repository-Level Executable Benchmark

June 17, 2024
作者: Nam Le Hai, Dung Manh Nguyen, Nghi D. Q. Bui
cs.AI

Abstract

The ability of CodeLLMs to generate executable and functionally correct code at the repository-level scale remains largely unexplored. We introduce RepoExec, a novel benchmark for evaluating code generation at the repository-level scale. RepoExec focuses on three main aspects: executability, functional correctness through automated test case generation with a high coverage rate, and carefully crafted cross-file contexts to accurately generate code. Our work explores a controlled scenario where developers specify necessary code dependencies, challenging the model to integrate these accurately. Experiments show that while pretrained LLMs outperform instruction-tuned models in correctness, the latter excel in utilizing provided dependencies and demonstrating debugging capabilities. We also introduce a new instruction-tuned dataset that focuses on code dependencies and demonstrate that CodeLLMs fine-tuned on our dataset have a better capability to leverage these dependencies effectively. RepoExec aims to provide a comprehensive evaluation of code functionality and alignment with developer intent, paving the way for more reliable and applicable CodeLLMs in real-world scenarios. The dataset and source code can be found at https://github.com/FSoft-AI4Code/RepoExec.
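To make the evaluation protocol concrete, the sketch below shows one way to check a generated solution for executability and functional correctness: assemble the developer-specified dependency context with the model's output, run the result in isolation, and execute automatically generated test cases against it. This is a minimal illustration under our own assumptions, not the actual RepoExec harness; `run_candidate` and the toy task are hypothetical.

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

def run_candidate(dependency_context: str, candidate: str, tests: str,
                  timeout: float = 10.0) -> bool:
    """Execute one generated solution against its test suite in isolation.

    The three pieces mirror the setup the abstract describes: the
    cross-file dependency context the developer provides, the model's
    generated code, and automatically generated test cases.
    """
    program = "\n\n".join([dependency_context, candidate, tests])
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(program)
        try:
            # A non-zero exit code means a failed assertion, an uncaught
            # exception, or a syntax error -- i.e. the sample is either not
            # executable or not functionally correct.
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True, timeout=timeout,
            )
            return result.returncode == 0
        except subprocess.TimeoutExpired:
            return False

# Toy example (hypothetical task, not drawn from the RepoExec dataset):
context = "def base_rate() -> float:\n    return 0.07"
candidate = textwrap.dedent("""
    def total_with_interest(amount: float) -> float:
        # The model must correctly call the provided dependency.
        return amount * (1 + base_rate())
""")
tests = "assert abs(total_with_interest(100.0) - 107.0) < 1e-9"
print(run_candidate(context, candidate, tests))  # True: executable and correct
```

Running each sample in a subprocess with a timeout keeps a hanging or crashing generation from taking down the evaluation loop, which matters when scoring many candidates across a repository.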
