McEval: Massively Multilingual Code Evaluation
June 11, 2024
Authors: Linzheng Chai, Shukai Liu, Jian Yang, Yuwei Yin, Ke Jin, Jiaheng Liu, Tao Sun, Ge Zhang, Changyu Ren, Hongcheng Guo, Zekun Wang, Boyang Wang, Xianjie Wu, Bing Wang, Tongliang Li, Liqun Yang, Sufeng Duan, Zhoujun Li
cs.AI
Abstract
Code large language models (LLMs) have shown remarkable advances in code understanding, completion, and generation tasks. Programming benchmarks, composed of a selection of code challenges and corresponding test cases, serve as a standard for evaluating the capability of different LLMs on such tasks. However, most existing benchmarks focus primarily on Python and remain restricted to a limited number of languages, with the other languages translated from the Python samples (e.g., MultiPL-E), which degrades data diversity. To further facilitate research on code LLMs, we propose McEval, a massively multilingual code benchmark covering 40 programming languages with 16K test samples, which substantially pushes the limits of code LLMs in multilingual scenarios. The benchmark contains challenging code completion, understanding, and generation evaluation tasks, together with the finely curated, massively multilingual instruction corpus McEval-Instruct. In addition, we introduce mCoder, an effective multilingual coder trained on McEval-Instruct, to support multilingual programming language generation. Extensive experimental results on McEval show that open-source models still lag substantially behind closed-source LLMs (e.g., GPT-series models) in numerous languages. The instruction corpora, evaluation benchmark, and leaderboard are available at https://mceval.github.io/.
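
To make the evaluation setup the abstract alludes to concrete, the sketch below shows the general pattern of execution-based scoring: a model completion is run against a sample's test cases and counted as passing only if all assertions succeed. This is a minimal illustration, not the McEval harness or its API; the sample fields (`prompt`, `tests`) and the `model.generate` call are hypothetical placeholders.

```python
# Hypothetical sketch of execution-based benchmark scoring (not the McEval API).
def run_sample(model, sample: dict) -> bool:
    """Generate a completion for one sample and run it against the sample's tests."""
    completion = model.generate(sample["prompt"])                # model-produced code
    program = sample["prompt"] + completion + "\n" + sample["tests"]
    try:
        exec(program, {"__name__": "__main__"})                  # run code plus assertions
        return True                                              # all test cases passed
    except Exception:
        return False                                             # compile/runtime/assertion failure

def pass_at_1(model, samples: list[dict]) -> float:
    """Fraction of samples whose single completion passes all tests (pass@1)."""
    results = [run_sample(model, s) for s in samples]
    return sum(results) / len(results)
```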