

JudgeLM: Fine-tuned Large Language Models are Scalable Judges

October 26, 2023
Authors: Lianghui Zhu, Xinggang Wang, Xinlong Wang
cs.AI

Abstract

Evaluating Large Language Models (LLMs) in open-ended scenarios is challenging because existing benchmarks and metrics cannot measure them comprehensively. To address this problem, we propose fine-tuning LLMs as scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively on open-ended benchmarks. We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLM-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges, as well as a new benchmark for evaluating the judges. We train JudgeLM at different scales, from 7B and 13B to 33B parameters, and conduct a systematic analysis of its capabilities and behaviors. We then analyze the key biases that arise when fine-tuning LLMs as judges, categorizing them as position bias, knowledge bias, and format bias. To address these issues, JudgeLM introduces a bag of techniques including swap augmentation, reference support, and reference drop, which clearly enhance the judge's performance. JudgeLM achieves state-of-the-art judge performance on both the existing PandaLM benchmark and our proposed new benchmark. Our JudgeLM is efficient: JudgeLM-7B needs only 3 minutes to judge 5K samples with 8 A100 GPUs. JudgeLM obtains high agreement with the teacher judge, exceeding 90% and even surpassing human-to-human agreement. JudgeLM also demonstrates extended capabilities in judging single answers, multimodal models, multiple answers, and multi-turn chat.
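The swap augmentation mentioned above targets position bias: a judge's verdict should not depend on which answer is presented first. A minimal sketch of the idea is to score both orderings and average the scores. The `judge` function below is a hypothetical stand-in for a real JudgeLM forward pass (here it scores by answer length only so the sketch runs); this is an illustration of the position-bias mitigation, not the authors' implementation.

```python
def judge(question: str, answer_a: str, answer_b: str) -> tuple[float, float]:
    """Hypothetical stand-in for a judge model's forward pass.

    Returns (score_a, score_b). Here we score by length only so the
    sketch is self-contained and runnable.
    """
    return (float(len(answer_a)), float(len(answer_b)))


def swap_consistent_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Judge both orderings and average, so the final verdict cannot
    depend on presentation order (position bias)."""
    s_a1, s_b1 = judge(question, answer_a, answer_b)
    s_b2, s_a2 = judge(question, answer_b, answer_a)  # swapped order
    score_a = (s_a1 + s_a2) / 2
    score_b = (s_b1 + s_b2) / 2
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"
```

A symmetric judge like this always returns "tie" on identical answers, whereas a position-biased judge called once may not; averaging over both orderings removes that asymmetry by construction.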