McEval: Massively Multilingual Code Evaluation
June 11, 2024
Authors: Linzheng Chai, Shukai Liu, Jian Yang, Yuwei Yin, Ke Jin, Jiaheng Liu, Tao Sun, Ge Zhang, Changyu Ren, Hongcheng Guo, Zekun Wang, Boyang Wang, Xianjie Wu, Bing Wang, Tongliang Li, Liqun Yang, Sufeng Duan, Zhoujun Li
cs.AI
Abstract
Code large language models (LLMs) have shown remarkable advances in code
understanding, completion, and generation tasks. Programming benchmarks,
comprised of a selection of code challenges and corresponding test cases, serve
as a standard to evaluate the capability of different LLMs in such tasks.
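As a rough illustration of how such benchmarks score a model (a minimal sketch, not the actual McEval harness; the sample problem, test cases, and function names below are hypothetical), a generated completion can be executed against its test cases and counted as passing only if every assertion holds:

```python
# Minimal sketch of functional-correctness scoring (not the McEval harness).
# A candidate completion passes only if all (args, expected) test cases hold.

def run_candidate(candidate_src: str, test_cases: list, entry_point: str) -> bool:
    """Execute candidate code, then check it against (args, expected) pairs."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)          # define the candidate function
        fn = namespace[entry_point]
        return all(fn(*args) == expected for args, expected in test_cases)
    except Exception:
        return False                            # any runtime error is a failure

# Hypothetical problem: return the sum of even numbers in a list.
candidate = """
def sum_even(nums):
    return sum(n for n in nums if n % 2 == 0)
"""
tests = [(([1, 2, 3, 4],), 6), (([],), 0), (([7],), 0)]
print(run_candidate(candidate, tests, "sum_even"))  # → True
```

Real harnesses additionally sandbox execution and aggregate per-problem results into metrics such as pass@k, but the core pass/fail decision follows this shape.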
However, most existing benchmarks primarily focus on Python and remain
restricted to a limited number of languages, with the other languages
translated from the Python samples (e.g., MultiPL-E), which degrades data
diversity. To further facilitate research on code LLMs, we propose McEval, a
massively multilingual code benchmark covering 40 programming languages with
16K test samples, which substantially pushes the limits of code LLMs in
multilingual scenarios. The benchmark contains challenging code completion,
understanding, and generation evaluation tasks, along with the finely curated,
massively multilingual instruction corpus McEval-Instruct. In addition, we
introduce mCoder, an effective multilingual coder trained on McEval-Instruct
to support code generation across programming languages. Extensive
experimental results on McEval show that a substantial gap remains between
open-source models and closed-source LLMs (e.g., GPT-series models) across
numerous languages. The instruction corpus, evaluation benchmark, and leaderboard are
available at https://mceval.github.io/.