CodeEditorBench: Evaluating Code Editing Capability of Large Language Models

Abstract

Large Language Models (LLMs) for code are evolving rapidly, and code editing is emerging as a critical capability. We introduce CodeEditorBench, an evaluation framework designed to rigorously assess the performance of LLMs on code editing tasks, including debugging, translating, polishing, and requirement switching. Unlike existing benchmarks that focus solely on code generation, CodeEditorBench emphasizes real-world scenarios and practical aspects of software development. We curate diverse coding challenges and scenarios from five sources, covering various programming languages, complexity levels, and editing tasks. An evaluation of 19 LLMs shows that closed-source models, particularly Gemini-Ultra and GPT-4, outperform open-source models on CodeEditorBench, and that model performance varies with problem type and prompt sensitivity. CodeEditorBench aims to catalyze advancements in LLMs by providing a robust platform for assessing code editing capabilities. We will release all prompts and datasets so that the community can expand the dataset and benchmark emerging LLMs. By introducing CodeEditorBench, we contribute to the advancement of LLMs in code editing and provide a valuable resource for researchers and practitioners.
