AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators
August 12, 2025
Authors: Jason Chou, Ao Liu, Yuchi Deng, Zhiying Zeng, Tao Zhang, Haotian Zhu, Jianwei Cai, Yue Mao, Chenchen Zhang, Lingyun Tan, Ziyan Xu, Bohui Zhai, Hengyi Liu, Speed Zhu, Wiggin Zhou, Fengzong Lian
cs.AI
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across
various domains, with code generation emerging as a key area of focus. While
numerous benchmarks have been proposed to evaluate their code generation
abilities, these benchmarks face several critical limitations. First, they
often rely on manual annotations, which are time-consuming and difficult to
scale across different programming languages and problem complexities. Second,
most existing benchmarks focus primarily on Python, while the few multilingual
benchmarks suffer from limited difficulty and uneven language distribution. To
address these challenges, we propose AutoCodeGen, an automated method for
generating high-difficulty multilingual code generation datasets without manual
annotations. AutoCodeGen ensures the correctness and completeness of test cases
by generating test inputs with LLMs and obtaining test outputs through a
multilingual sandbox, while achieving high data quality through reverse-order
problem generation and multiple filtering steps. Using this novel method, we
introduce AutoCodeBench, a large-scale code generation benchmark comprising
3,920 problems evenly distributed across 20 programming languages. It is
specifically designed to evaluate LLMs on challenging, diverse, and practical
multilingual tasks. We evaluate over 30 leading open-source and proprietary
LLMs on AutoCodeBench and its simplified version AutoCodeBench-Lite. The
results show that even the most advanced LLMs struggle with the complexity,
diversity, and multilingual nature of these tasks. In addition, we introduce
AutoCodeBench-Complete, tailored to base models, to assess their few-shot code
generation capabilities. We hope the AutoCodeBench series will
serve as a valuable resource and inspire the community to focus on more
challenging and practical multilingual code generation scenarios.
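
To make the workflow concrete, below is a minimal Python sketch of an AutoCodeGen-style pipeline as described in the abstract: an LLM proposes a solution and test inputs, a multilingual sandbox executes the solution to obtain expected outputs, the problem statement is generated in reverse order from the verified solution, and a simple filter discards weak items. The interfaces `call_llm`, `Sandbox.run`, and all prompts are illustrative placeholders, not the authors' actual implementation.

```python
"""Sketch of an AutoCodeGen-style data pipeline (assumed interfaces, not the paper's code)."""

from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Problem:
    language: str
    statement: str
    solution: str
    # (stdin, expected stdout) pairs produced by the sandbox, not the LLM.
    test_cases: list[tuple[str, str]] = field(default_factory=list)


def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call; plug in a real client here (assumption)."""
    raise NotImplementedError


class Sandbox:
    """Placeholder for a multilingual execution sandbox (assumption)."""

    def run(self, language: str, code: str, stdin: str) -> str:
        raise NotImplementedError


def generate_problem(language: str, sandbox: Sandbox, min_tests: int = 5) -> Problem | None:
    # 1. Sample a candidate reference solution in the target language.
    solution = call_llm(f"Write a self-contained {language} program solving a non-trivial task.")

    # 2. Ask the LLM only for diverse test inputs; expected outputs come from
    #    executing the solution in the sandbox, which keeps tests consistent.
    raw_inputs = call_llm(f"Propose 10 varied stdin inputs for this program:\n{solution}")
    inputs = [line for line in raw_inputs.splitlines() if line.strip()]

    test_cases: list[tuple[str, str]] = []
    for test_input in inputs:
        try:
            expected = sandbox.run(language, solution, test_input)
        except Exception:
            continue  # discard inputs the reference solution cannot handle
        test_cases.append((test_input, expected))

    # 3. Reverse-order problem generation: write the natural-language statement
    #    from the already-verified solution and its tests.
    statement = call_llm(
        "Write a clear programming problem statement whose reference solution is "
        f"the following {language} program:\n{solution}"
    )

    # 4. One simple filtering step: keep only problems with enough valid tests.
    if len(test_cases) < min_tests:
        return None
    return Problem(language, statement, solution, test_cases)
```

In this sketch the sandbox is the source of ground truth for test outputs, so any LLM-proposed input that the reference solution cannot execute is simply dropped rather than patched, mirroring the correctness-by-execution idea the abstract attributes to AutoCodeGen.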