DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories
May 30, 2024
Authors: Jia Li, Ge Li, Yunfei Zhao, Yongmin Li, Huanyu Liu, Hao Zhu, Lecheng Wang, Kaibo Liu, Zheng Fang, Lanshen Wang, Jiazheng Ding, Xuanming Zhang, Yuqi Zhu, Yihong Dong, Zhi Jin, Binhua Li, Fei Huang, Yongbin Li
cs.AI
Abstract
How to evaluate the coding abilities of Large Language Models (LLMs) remains
an open question. We find that existing benchmarks are poorly aligned with
real-world code repositories and are insufficient to evaluate the coding
abilities of LLMs.
To fill this gap, we propose a new benchmark named DevEval, which
has three advances. (1) DevEval aligns with real-world repositories in multiple
dimensions, e.g., code distributions and dependency distributions. (2) DevEval
is annotated by 13 developers and contains comprehensive annotations (e.g.,
requirements, original repositories, reference code, and reference
dependencies). (3) DevEval comprises 1,874 testing samples from 117
repositories, covering 10 popular domains (e.g., Internet, Database). Based on
DevEval, we propose repository-level code generation and evaluate 8 popular
LLMs on DevEval (e.g., gpt-4, gpt-3.5, StarCoder 2, DeepSeek Coder, CodeLLaMa).
Our experiments reveal these LLMs' coding abilities in real-world code
repositories. For example, the highest Pass@1 of gpt-4-turbo in our experiments
is only 53.04%. We also analyze LLMs' failure cases and summarize their
shortcomings. We hope DevEval can facilitate the development of LLMs in real
code repositories. DevEval, prompts, and LLMs' predictions have been released.
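
For context on the reported metric: Pass@1 is the standard execution-based estimate of the probability that a single sampled completion passes a task's tests. The Python sketch below uses the common unbiased pass@k estimator; the per-task counts in it are hypothetical and are not taken from DevEval's results.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator: probability that at least one of k
    # samples, drawn from n generations of which c pass the tests,
    # is correct. For k = 1 this reduces to c / n.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-task counts: (generations, passing generations).
results = [(10, 6), (10, 0), (10, 10)]
pass_at_1 = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
print(f"Pass@1 = {pass_at_1:.2%}")

Benchmark-level Pass@1 is then this per-task probability averaged over all 1,874 testing samples.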