AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across many domains, and code generation has emerged as a key area of focus. Numerous benchmarks have been proposed to evaluate code generation, but they face two critical limitations. First, they often rely on manual annotation, which is time-consuming and difficult to scale across programming languages and levels of problem complexity. Second, most existing benchmarks focus primarily on Python, and the few multilingual benchmarks suffer from limited difficulty and uneven language distribution. To address these challenges, we propose AutoCodeGen, an automated method for generating high-difficulty multilingual code generation datasets without manual annotation. AutoCodeGen ensures the correctness and completeness of test cases by generating test inputs with LLMs and obtaining test outputs through a multilingual sandbox, and it achieves high data quality through reverse-order problem generation and multiple filtering steps. Using this method, we introduce AutoCodeBench, a large-scale code generation benchmark comprising 3,920 problems evenly distributed across 20 programming languages. It is specifically designed to evaluate LLMs on challenging, diverse, and practical multilingual tasks. We evaluate over 30 leading open-source and proprietary LLMs on AutoCodeBench and its simplified version, AutoCodeBench-Lite. The results show that even the most advanced LLMs struggle with the complexity, diversity, and multilingual nature of these tasks. In addition, we introduce AutoCodeBench-Complete, which is specifically designed to assess the few-shot code generation capabilities of base models. We hope the AutoCodeBench series will serve as a valuable resource and inspire the community to focus on more challenging and practical multilingual code generation scenarios.
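The abstract does not spell out the pipeline, so the following is only a minimal sketch of the general idea of obtaining test outputs by execution rather than by prediction. It rests on several assumptions that are not from the paper: the reference solution is a plain Python callable, the multilingual sandbox is replaced by a direct in-process call, and the proposed inputs are hard-coded rather than produced by a model. The names `build_test_cases` and `reference_gcd` are hypothetical.

```python
from typing import Any, Callable, Iterable


def build_test_cases(
    solution: Callable[..., Any],
    candidate_inputs: Iterable[tuple],
) -> list[tuple[tuple, Any]]:
    """Pair each input with the output the reference solution actually produces."""
    cases = []
    for args in candidate_inputs:
        try:
            # Stand-in for a sandboxed run (assumption: no real sandbox here).
            expected = solution(*args)
        except Exception:
            # Drop inputs that the reference solution cannot handle.
            continue
        cases.append((args, expected))
    return cases


def reference_gcd(a: int, b: int) -> int:
    """Hypothetical reference solution, used only for this illustration."""
    if a == 0 and b == 0:
        raise ValueError("gcd(0, 0) is treated as undefined in this sketch")
    while b:
        a, b = b, a % b
    return abs(a)


# Inputs a model might propose (hard-coded here; no model is called).
proposed_inputs = [(12, 18), (7, 0), (0, 0), (-4, 6)]
test_cases = build_test_cases(reference_gcd, proposed_inputs)
```

Because the expected values come from running the code, each recorded test case matches the reference solution's behavior by construction; the input `(0, 0)` is discarded because the reference solution rejects it.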
