DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories

Abstract

How to evaluate the coding abilities of Large Language Models (LLMs) remains an open question. We find that existing benchmarks are poorly aligned with real-world code repositories and are insufficient to evaluate the coding abilities of LLMs. To address this gap, we propose a new benchmark named DevEval, which advances over existing benchmarks in three ways. (1) DevEval aligns with real-world repositories in multiple dimensions, e.g., code distributions and dependency distributions. (2) DevEval is annotated by 13 developers and contains comprehensive annotations (e.g., requirements, original repositories, reference code, and reference dependencies). (3) DevEval comprises 1,874 testing samples from 117 repositories, covering 10 popular domains (e.g., Internet, Database). Based on DevEval, we propose repository-level code generation and evaluate 8 popular LLMs on DevEval (e.g., gpt-4, gpt-3.5, StarCoder 2, DeepSeek Coder, CodeLLaMa). Our experiments reveal these LLMs' coding abilities in real-world code repositories. For example, in our experiments, the highest Pass@1 of gpt-4-turbo is only 53.04%. We also analyze the LLMs' failure cases and summarize their shortcomings. We hope DevEval can facilitate the development of LLMs in real code repositories. DevEval, the prompts, and the LLMs' predictions have been released.
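The abstract reports results as Pass@1 but does not say how the metric is computed. As a minimal sketch, the snippet below uses the unbiased pass@k estimator that is widely used for code generation benchmarks; whether DevEval uses exactly this estimator is an assumption, and the function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical, chosen only for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task.

    n: number of programs sampled for the task
    c: number of those programs that pass all tests
    k: number of attempts the metric allows

    Assumption: this is the commonly used unbiased estimator,
    1 - C(n - c, k) / C(n, k); the abstract does not confirm
    that DevEval computes Pass@1 this way.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k draws include a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-task pass@k over (n, c) pairs, one pair per task."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With a single sample per task (n = 1, k = 1), the estimate reduces to the fraction of tasks whose one generated program passes its tests, which is the simplest reading of a Pass@1 score.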
