

DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories

May 30, 2024
Authors: Jia Li, Ge Li, Yunfei Zhao, Yongmin Li, Huanyu Liu, Hao Zhu, Lecheng Wang, Kaibo Liu, Zheng Fang, Lanshen Wang, Jiazheng Ding, Xuanming Zhang, Yuqi Zhu, Yihong Dong, Zhi Jin, Binhua Li, Fei Huang, Yongbin Li
cs.AI

Abstract

How to evaluate the coding abilities of Large Language Models (LLMs) remains an open question. We find that existing benchmarks are poorly aligned with real-world code repositories and are insufficient to evaluate the coding abilities of LLMs. To bridge this gap, we propose a new benchmark named DevEval, which has three advantages. (1) DevEval aligns with real-world repositories in multiple dimensions, e.g., code distributions and dependency distributions. (2) DevEval is annotated by 13 developers and contains comprehensive annotations (e.g., requirements, original repositories, reference code, and reference dependencies). (3) DevEval comprises 1,874 testing samples from 117 repositories, covering 10 popular domains (e.g., Internet, Database). Based on DevEval, we propose repository-level code generation and evaluate 8 popular LLMs on DevEval (e.g., gpt-4, gpt-3.5, StarCoder 2, DeepSeek Coder, CodeLLaMa). Our experiments reveal these LLMs' coding abilities in real-world code repositories. For example, in our experiments, the highest Pass@1 of gpt-4-turbo is only 53.04%. We also analyze LLMs' failure cases and summarize their shortcomings. We hope DevEval can facilitate the development of LLMs in real code repositories. DevEval, prompts, and LLMs' predictions have been released.
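The abstract reports results using the Pass@1 metric. As context, below is a minimal sketch of the standard unbiased Pass@k estimator commonly used for execution-based code benchmarks (Chen et al., 2021); whether DevEval computes Pass@1 from multiple samples with this estimator or from a single greedy completion per task is not stated in the abstract, and the example counts are purely illustrative.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: probability that at least one of k completions,
    drawn from n generated samples of which c pass the tests, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Illustrative only: (n, c) pairs per task, i.e. samples generated vs. samples passing.
per_task_results = [(10, 6), (10, 0), (10, 10)]
pass_at_1 = np.mean([pass_at_k(n, c, 1) for n, c in per_task_results])
print(f"Pass@1 = {pass_at_1:.2%}")  # average over tasks
```

With k = 1 the estimator reduces to the per-task fraction of passing samples, averaged over all tasks; larger k rewards models that solve a task in at least one of several attempts.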
