CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models

Abstract

The critique capacity of Large Language Models (LLMs) is essential to their reasoning ability, as it enables them to provide necessary suggestions (e.g., detailed analysis and constructive feedback). How to evaluate the critique capacity of LLMs has therefore drawn great attention, and several critique benchmarks have been proposed. However, existing critique benchmarks usually have the following limitations: (1) they focus on diverse reasoning tasks in general domains and evaluate code tasks insufficiently (e.g., covering only the code generation task), with relatively easy queries (e.g., the code queries of CriticBench come from HumanEval and MBPP); (2) they lack comprehensive evaluation across different dimensions. To address these limitations, we introduce CodeCriticBench, a holistic code critique benchmark for LLMs. Specifically, CodeCriticBench includes two mainstream code tasks (i.e., code generation and code QA) at different difficulty levels. In addition, the evaluation protocol comprises basic critique evaluation and advanced critique evaluation, with fine-grained evaluation checklists designed for the advanced setting. Finally, we conduct extensive experiments on existing LLMs, and the results demonstrate the effectiveness of CodeCriticBench.
