

CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs

October 2, 2024
Authors: Dung Nguyen Manh, Thang Phan Chau, Nam Le Hai, Thong T. Doan, Nam V. Nguyen, Quang Pham, Nghi D. Q. Bui
cs.AI

Abstract

Recent advancements in Code Large Language Models (CodeLLMs) have predominantly focused on open-ended code generation tasks, often neglecting the critical aspect of code understanding and comprehension. To bridge this gap, we present CodeMMLU, a comprehensive multiple-choice question-answer benchmark designed to evaluate the depth of software and code understanding in LLMs. CodeMMLU includes over 10,000 questions sourced from diverse domains, encompassing tasks such as code analysis, defect detection, and software engineering principles across multiple programming languages. Unlike traditional benchmarks, CodeMMLU assesses models' ability to reason about code rather than merely generate it, providing deeper insights into their grasp of complex software concepts and systems. Our extensive evaluation reveals that even state-of-the-art models face significant challenges with CodeMMLU, highlighting deficiencies in comprehension beyond code generation. By underscoring the crucial relationship between code understanding and effective generation, CodeMMLU serves as a vital resource for advancing AI-assisted software development, ultimately aiming to create more reliable and capable coding assistants.
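
To make the multiple-choice format concrete, below is a minimal Python sketch of how a CodeMMLU-style item might be rendered as a prompt and scored by exact-match accuracy. The item schema, field names, and helper functions here are illustrative assumptions, not the benchmark's actual data layout or evaluation harness.

```python
# A minimal sketch of prompting and scoring a CodeMMLU-style
# multiple-choice item. The item structure below is an illustrative
# assumption, not the benchmark's real schema.

def format_prompt(question: str, choices: list[str]) -> str:
    """Render a code-understanding question as a multiple-choice prompt."""
    letters = "ABCD"
    options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
    return (
        f"{question}\n\n{options}\n\n"
        "Answer with the letter of the correct choice."
    )

def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Exact-match accuracy over predicted answer letters."""
    correct = sum(p.strip().upper() == a.upper()
                  for p, a in zip(predictions, answers))
    return correct / len(answers)

# Hypothetical defect-detection item, invented for illustration.
item = {
    "question": ("Which line of this Python snippet raises an IndexError?\n"
                 "1: xs = [1, 2, 3]\n"
                 "2: print(xs[3])"),
    "choices": ["Line 1", "Line 2", "Neither line", "Both lines"],
    "answer": "B",
}

prompt = format_prompt(item["question"], item["choices"])
# In a real harness, the answer letter would be parsed from the model's
# completion for `prompt`; it is hard-coded here for demonstration.
print(accuracy(["B"], [item["answer"]]))  # 1.0
```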

