

JudgeLM: Fine-tuned Large Language Models are Scalable Judges

October 26, 2023
Authors: Lianghui Zhu, Xinggang Wang, Xinlong Wang
cs.AI

Abstract

Evaluating Large Language Models (LLMs) in open-ended scenarios is challenging because existing benchmarks and metrics cannot measure them comprehensively. To address this problem, we propose fine-tuning LLMs as scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively on open-ended benchmarks. We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLM-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges, as well as a new benchmark for evaluating the judges. We train JudgeLM at different scales, from 7B and 13B to 33B parameters, and conduct a systematic analysis of its capabilities and behaviors. We then analyze the key biases that arise when fine-tuning LLMs as judges, categorizing them as position bias, knowledge bias, and format bias. To address these issues, JudgeLM introduces a bag of techniques including swap augmentation, reference support, and reference drop, which clearly enhance the judge's performance. JudgeLM achieves state-of-the-art judge performance on both the existing PandaLM benchmark and our proposed new benchmark. Our JudgeLM is efficient: JudgeLM-7B needs only 3 minutes to judge 5K samples with 8 A100 GPUs. JudgeLM obtains high agreement with the teacher judge, exceeding 90% and even surpassing human-to-human agreement. JudgeLM also demonstrates extended capabilities in judging single answers, multimodal models, multiple answers, and multi-turn chat.
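The swap augmentation mentioned above targets position bias: a judge's verdict should not depend on which answer is presented first. A minimal sketch of the idea is to score both orderings and average the scores. The `judge` function below is a hypothetical stand-in for a real JudgeLM forward pass (here it scores by answer length only so the sketch runs); this is an illustration of the position-bias mitigation, not the authors' implementation.

```python
def judge(question: str, answer_a: str, answer_b: str) -> tuple[float, float]:
    """Hypothetical stand-in for a judge model's forward pass.

    Returns (score_a, score_b). Here we score by length only so the
    sketch is self-contained and runnable.
    """
    return (float(len(answer_a)), float(len(answer_b)))


def swap_consistent_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Judge both orderings and average, so the final verdict cannot
    depend on presentation order (position bias)."""
    s_a1, s_b1 = judge(question, answer_a, answer_b)
    s_b2, s_a2 = judge(question, answer_b, answer_a)  # swapped order
    score_a = (s_a1 + s_a2) / 2
    score_b = (s_b1 + s_b2) / 2
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"
```

A symmetric judge like this always returns "tie" on identical answers, whereas a position-biased judge called once may not; averaging over both orderings removes that asymmetry by construction.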