CodeCriticBench: A Holistic Code Critique Benchmark for Large Code Language Models

Abstract

Large language models of code (Code LLMs) have demonstrated remarkable capabilities in code generation and completion. However, their ability to critique code, which involves identifying issues and suggesting improvements, remains understudied. We introduce CodeCriticBench, a holistic benchmark for evaluating the code critique capabilities of Code LLMs. CodeCriticBench comprises three key components: (1) a diverse set of real-world code snippets with intentional issues, spanning multiple programming languages and complexity levels; (2) a taxonomy of code issues covering functional correctness, readability, maintainability, and security; and (3) evaluation metrics that assess both the identification of issues and the quality of the suggested improvements. We evaluate several state-of-the-art Code LLMs on CodeCriticBench, revealing significant gaps in their code critique abilities. Our findings highlight the need for further research into improving the code critique capabilities of Code LLMs, which is crucial for their effective deployment in software development workflows.

Abstract

The critique capacity of Large Language Models (LLMs) is essential to their reasoning abilities, as it allows them to provide necessary suggestions (e.g., detailed analysis and constructive feedback). How to evaluate the critique capacity of LLMs has therefore drawn great attention, and several critique benchmarks have been proposed. However, existing critique benchmarks usually have the following limitations: (1) they focus on diverse reasoning tasks in general domains and evaluate code tasks insufficiently (e.g., covering only the code generation task), and their queries are relatively easy (e.g., the code queries of CriticBench are taken from HumanEval and MBPP); (2) they lack comprehensive evaluation from different dimensions. To address these limitations, we introduce a holistic code critique benchmark for LLMs called CodeCriticBench. Specifically, CodeCriticBench includes two mainstream code tasks (i.e., code generation and code QA) at different difficulty levels. In addition, the evaluation protocols include basic critique evaluation and advanced critique evaluation for different characteristics, with fine-grained evaluation checklists designed for the advanced settings. Finally, we report extensive experimental results on existing LLMs, which demonstrate the effectiveness of CodeCriticBench.
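As a rough illustration of the two evaluation levels mentioned in the abstract, the sketch below scores a basic critique as agreement with a gold correctness label and an advanced critique as the error between predicted and reference checklist scores. The abstract does not specify the metrics, so everything here is an assumption made for illustration: the field names, the checklist dimensions, the use of accuracy, and the use of mean squared error are all hypothetical and are not taken from the benchmark itself.

```python
from dataclasses import dataclass


@dataclass
class CritiqueSample:
    """One benchmark item. Every field name here is hypothetical."""
    task: str                           # e.g. "code_gen" or "code_qa"
    gold_correct: bool                  # is the answer under critique actually correct?
    predicted_correct: bool             # the model's verdict (basic critique)
    gold_scores: dict[str, float]       # reference checklist scores (advanced critique)
    predicted_scores: dict[str, float]  # the model's checklist scores


def basic_accuracy(samples: list[CritiqueSample]) -> float:
    """Fraction of samples where the model's verdict matches the gold label."""
    hits = sum(s.gold_correct == s.predicted_correct for s in samples)
    return hits / len(samples)


def checklist_mse(samples: list[CritiqueSample]) -> float:
    """Mean over samples of the per-sample mean squared checklist error.

    Assumed aggregation: only dimensions present in the gold checklist are
    scored, and a dimension the model omitted raises KeyError.
    """
    per_sample = []
    for s in samples:
        errors = [(s.predicted_scores[dim] - gold) ** 2
                  for dim, gold in s.gold_scores.items()]
        per_sample.append(sum(errors) / len(errors))
    return sum(per_sample) / len(per_sample)


# Illustrative data only; these are not results from the benchmark.
samples = [
    CritiqueSample("code_gen", True, True,
                   {"correctness": 8.0, "readability": 6.0},
                   {"correctness": 6.0, "readability": 6.0}),
    CritiqueSample("code_qa", False, True,
                   {"correctness": 2.0},
                   {"correctness": 3.0}),
]

print("basic accuracy:", basic_accuracy(samples))
print("checklist MSE:", checklist_mse(samples))
```

Keeping the two scores separate mirrors the split the abstract draws between basic and advanced critique evaluation: a model can judge correctness well and still score individual checklist dimensions poorly.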

CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models
