CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models
February 23, 2025
作者: Alexander Zhang, Marcus Dong, Jiaheng Liu, Wei Zhang, Yejie Wang, Jian Yang, Ge Zhang, Tianyu Liu, Zhongyuan Peng, Yingshui Tan, Yuanxing Zhang, Zhexu Wang, Weixun Wang, Yancheng He, Ken Deng, Wangchunshu Zhou, Wenhao Huang, Zhaoxiang Zhang
cs.AI
Abstract
The critique capacity of Large Language Models (LLMs) is essential for their
reasoning abilities, as it provides necessary suggestions (e.g., detailed
analysis and constructive feedback). Consequently, how to evaluate the critique
capacity of LLMs has drawn great attention, and several critique benchmarks
have been proposed. However, existing critique benchmarks usually have the
following limitations: (1) they focus on diverse reasoning tasks in general
domains and evaluate code tasks insufficiently (e.g., covering only the code
generation task), and their queries are relatively easy (e.g., the code queries
of CriticBench come from HumanEval and MBPP); (2) they lack comprehensive
evaluation along different dimensions. To address these limitations, we
introduce a holistic code critique benchmark for LLMs called CodeCriticBench.
Specifically, CodeCriticBench covers two mainstream code tasks (i.e., code
generation and code QA) at different difficulty levels. In addition, its
evaluation protocols include basic critique evaluation and advanced critique
evaluation for different characteristics, where fine-grained evaluation
checklists are carefully designed for the advanced setting. Finally, we conduct
extensive experiments on existing LLMs, and the results demonstrate the
effectiveness of CodeCriticBench.
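
To make the two-level protocol concrete, below is a minimal Python sketch of how a CodeCriticBench-style sample and its scoring might be organized. Everything here is an illustrative assumption rather than the paper's actual schema or metric: the `CritiqueSample` fields, the task and difficulty labels, the example checklist items, and the simple accuracy/averaging scores are hypothetical stand-ins for whatever the benchmark actually defines.

```python
# A hedged sketch of a CodeCriticBench-style sample and its two-level
# evaluation protocol. All field names, labels, checklist items, and the
# scoring scheme are illustrative assumptions, not the paper's actual design.
from dataclasses import dataclass, field

@dataclass
class CritiqueSample:
    task: str                 # "code_generation" or "code_qa" (assumed labels)
    difficulty: str           # e.g., "easy" / "medium" / "hard" (assumed scale)
    query: str                # the coding problem or question
    candidate_solution: str   # model output whose quality is to be critiqued
    is_correct: bool          # gold label used by basic critique evaluation
    checklist: list[str] = field(default_factory=list)  # fine-grained criteria

def basic_critique_score(predicted_correct: bool, sample: CritiqueSample) -> float:
    """Basic evaluation: did the critic judge the solution's correctness right?"""
    return float(predicted_correct == sample.is_correct)

def advanced_critique_score(checklist_verdicts: list[bool], sample: CritiqueSample) -> float:
    """Advanced evaluation: fraction of fine-grained checklist items the
    critique satisfies (a simple averaging scheme, assumed here)."""
    assert len(checklist_verdicts) == len(sample.checklist)
    return sum(checklist_verdicts) / max(len(checklist_verdicts), 1)

# Toy usage: a buggy code-generation sample and a critic's verdicts.
sample = CritiqueSample(
    task="code_generation",
    difficulty="easy",
    query="Write a function that returns the nth Fibonacci number.",
    candidate_solution="def fib(n):\n    return fib(n - 1) + fib(n - 2)",  # no base case
    is_correct=False,
    checklist=[
        "Identifies the missing base case",
        "Notes the exponential time complexity",
        "Suggests a concrete fix",
    ],
)
print(basic_critique_score(predicted_correct=False, sample=sample))  # 1.0
print(advanced_critique_score([True, False, True], sample))          # ~0.67
```

Under these assumptions, the basic score reduces to correctness-judgment accuracy, while the advanced score rewards critiques that address each checklist criterion, which matches the abstract's distinction between the basic and checklist-driven advanced settings.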