CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models
February 23, 2025
作者: Alexander Zhang, Marcus Dong, Jiaheng Liu, Wei Zhang, Yejie Wang, Jian Yang, Ge Zhang, Tianyu Liu, Zhongyuan Peng, Yingshui Tan, Yuanxing Zhang, Zhexu Wang, Weixun Wang, Yancheng He, Ken Deng, Wangchunshu Zhou, Wenhao Huang, Zhaoxiang Zhang
cs.AI
Abstract
The critique capacity of Large Language Models (LLMs) is essential for their
reasoning abilities, as it provides necessary suggestions (e.g., detailed
analysis and constructive feedback). Consequently, how to evaluate the critique
capacity of LLMs has drawn great attention, and several critique benchmarks
have been proposed. However, existing critique benchmarks usually have the
following limitations: (1) they focus on diverse reasoning tasks in general
domains and evaluate code tasks insufficiently (e.g., covering only the code
generation task), and their queries are relatively easy (e.g., the code queries
of CriticBench come from HumanEval and MBPP); (2) they lack comprehensive
evaluation along different dimensions. To address these limitations, we
introduce a holistic code critique benchmark for LLMs called CodeCriticBench.
Specifically, CodeCriticBench covers two mainstream code tasks (i.e., code
generation and code QA) at different difficulty levels. In addition, its
evaluation protocols include basic critique evaluation and advanced critique
evaluation for different characteristics, where fine-grained evaluation
checklists are carefully designed for the advanced setting. Finally, we conduct
extensive experiments on existing LLMs, and the results demonstrate the
effectiveness of CodeCriticBench.
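
To make the two-level protocol concrete, below is a minimal Python sketch of how a CodeCriticBench-style sample and its scoring might be organized. Everything here is an illustrative assumption rather than the paper's actual schema or metric: the `CritiqueSample` fields, the task and difficulty labels, the example checklist items, and the simple accuracy/averaging scores are hypothetical stand-ins for whatever the benchmark actually defines.

```python
# A hedged sketch of a CodeCriticBench-style sample and its two-level
# evaluation protocol. All field names, labels, checklist items, and the
# scoring scheme are illustrative assumptions, not the paper's actual design.
from dataclasses import dataclass, field

@dataclass
class CritiqueSample:
    task: str                 # "code_generation" or "code_qa" (assumed labels)
    difficulty: str           # e.g., "easy" / "medium" / "hard" (assumed scale)
    query: str                # the coding problem or question
    candidate_solution: str   # model output whose quality is to be critiqued
    is_correct: bool          # gold label used by basic critique evaluation
    checklist: list[str] = field(default_factory=list)  # fine-grained criteria

def basic_critique_score(predicted_correct: bool, sample: CritiqueSample) -> float:
    """Basic evaluation: did the critic judge the solution's correctness right?"""
    return float(predicted_correct == sample.is_correct)

def advanced_critique_score(checklist_verdicts: list[bool], sample: CritiqueSample) -> float:
    """Advanced evaluation: fraction of fine-grained checklist items the
    critique satisfies (a simple averaging scheme, assumed here)."""
    assert len(checklist_verdicts) == len(sample.checklist)
    return sum(checklist_verdicts) / max(len(checklist_verdicts), 1)

# Toy usage: a buggy code-generation sample and a critic's verdicts.
sample = CritiqueSample(
    task="code_generation",
    difficulty="easy",
    query="Write a function that returns the nth Fibonacci number.",
    candidate_solution="def fib(n):\n    return fib(n - 1) + fib(n - 2)",  # no base case
    is_correct=False,
    checklist=[
        "Identifies the missing base case",
        "Notes the exponential time complexity",
        "Suggests a concrete fix",
    ],
)
print(basic_critique_score(predicted_correct=False, sample=sample))  # 1.0
print(advanced_critique_score([True, False, True], sample))          # ~0.67
```

Under these assumptions, the basic score reduces to correctness-judgment accuracy, while the advanced score rewards critiques that address each checklist criterion, which matches the abstract's distinction between the basic and checklist-driven advanced settings.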