
JudgeBench: A Benchmark for Evaluating LLM-based Judges

October 16, 2024
Authors: Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Y. Tang, Alejandro Cuadron, Chenguang Wang, Raluca Ada Popa, Ion Stoica
cs.AI

Abstract

LLM-based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models. However, the reliability of LLM-based judges themselves is rarely scrutinized. As LLMs become more advanced, their responses grow more sophisticated, requiring stronger judges to evaluate them. Existing benchmarks primarily focus on a judge's alignment with human preferences, but often fail to account for more challenging tasks where crowdsourced human preference is a poor indicator of factual and logical correctness. To address this, we propose a novel evaluation framework to objectively evaluate LLM-based judges. Based on this framework, we propose JudgeBench, a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. JudgeBench leverages a novel pipeline for converting existing difficult datasets into challenging response pairs with preference labels reflecting objective correctness. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks, with many strong models (e.g., GPT-4o) performing just slightly better than random guessing. Overall, JudgeBench offers a reliable platform for assessing increasingly advanced LLM-based judges. Data and code are available at https://github.com/ScalerLab/JudgeBench.
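
The evaluation protocol described in the abstract is simple to illustrate: a judge is shown a question together with two candidate responses and must pick the objectively correct one, and its accuracy is measured against ground-truth preference labels (so a judge that cannot distinguish them hovers near 50%, i.e., random guessing). The minimal Python sketch below shows what such a scoring loop might look like; the field names (question, response_A, response_B, label), the prompt wording, and the judge_fn hook are illustrative assumptions, not the official JudgeBench data format or evaluation harness, which are available in the repository linked above.

```python
# Minimal sketch of scoring a prompted LLM judge on JudgeBench-style response
# pairs. The data fields and prompt wording below are assumptions for
# illustration; the actual format and evaluation code live in the repository
# at https://github.com/ScalerLab/JudgeBench.
import random

JUDGE_PROMPT = (
    "You are an impartial judge. Given a question and two candidate responses, "
    "decide which response is factually and logically correct.\n\n"
    "Question:\n{question}\n\n"
    "Response A:\n{response_a}\n\n"
    "Response B:\n{response_b}\n\n"
    "Answer with a single letter: A or B."
)


def evaluate_judge(pairs, judge_fn):
    """Return the judge's accuracy against objective correctness labels.

    `pairs` is an iterable of dicts with keys question / response_A /
    response_B / label, where label is "A" or "B" (the objectively correct
    response). `judge_fn` maps a prompt string to the judge's raw text output.
    """
    pairs = list(pairs)
    correct = 0
    for pair in pairs:
        prompt = JUDGE_PROMPT.format(
            question=pair["question"],
            response_a=pair["response_A"],
            response_b=pair["response_B"],
        )
        verdict = judge_fn(prompt).strip().upper()
        # Fall back to a random guess if the judge's answer is unparseable,
        # which keeps the 50% random-guessing baseline meaningful.
        choice = verdict[0] if verdict[:1] in ("A", "B") else random.choice("AB")
        correct += int(choice == pair["label"])
    return correct / len(pairs)


if __name__ == "__main__":
    # Toy example pair; real JudgeBench pairs span knowledge, reasoning,
    # math, and coding, and come from the repository above.
    toy_pairs = [{
        "question": "What is 17 * 24?",
        "response_A": "17 * 24 = 408.",
        "response_B": "17 * 24 = 398.",
        "label": "A",
    }]
    # Plug any model call in as `judge_fn`, e.g. a chat-completion request.
    print(evaluate_judge(toy_pairs, judge_fn=lambda prompt: "A"))
```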
