CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs

Abstract

Recent advances in Code Large Language Models (CodeLLMs) have focused predominantly on open-ended code generation, often neglecting the critical aspect of code comprehension. To bridge this gap, we present CodeMMLU, a comprehensive multiple-choice question-answering benchmark designed to evaluate the depth of software and code understanding in LLMs. CodeMMLU includes over 10,000 questions sourced from diverse domains, covering tasks such as code analysis, defect detection, and software engineering principles across multiple programming languages. Unlike traditional benchmarks, CodeMMLU assesses models' ability to reason about code rather than merely generate it, providing deeper insight into their grasp of complex software concepts and systems. Our extensive evaluation reveals that even state-of-the-art models face significant challenges on CodeMMLU, highlighting deficiencies in comprehension beyond code generation. By underscoring the crucial relationship between code understanding and effective generation, CodeMMLU serves as a vital resource for advancing AI-assisted software development, with the ultimate aim of creating more reliable and capable coding assistants.
