JudgeLM: Fine-tuned Large Language Models are Scalable Judges
October 26, 2023
Authors: Lianghui Zhu, Xinggang Wang, Xinlong Wang
cs.AI
Abstract
Evaluating Large Language Models (LLMs) in open-ended scenarios is challenging because existing benchmarks and metrics cannot measure them comprehensively. To address this problem, we propose fine-tuning LLMs as scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively on open-ended benchmarks. We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLM-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges, as well as a new benchmark for evaluating the judges. We train JudgeLM at different scales, from 7B and 13B to 33B parameters, and conduct a systematic analysis of its capabilities and behaviors. We then analyze the key biases that arise when fine-tuning an LLM as a judge, categorizing them as position bias, knowledge bias, and format bias. To address these issues, JudgeLM introduces a bag of techniques, including swap augmentation, reference support, and reference drop, which clearly enhance the judge's performance. JudgeLM achieves state-of-the-art judge performance on both the existing PandaLM benchmark and our proposed new benchmark. JudgeLM is also efficient: JudgeLM-7B needs only 3 minutes to judge 5K samples on 8 A100 GPUs. JudgeLM attains high agreement with the teacher judge, exceeding 90%, which even surpasses human-to-human agreement. JudgeLM further demonstrates extended capabilities as a judge of single answers, multimodal models, multiple answers, and multi-turn chat.
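The abstract names swap augmentation as the remedy for position bias: a pairwise judge should not favor an answer merely because it appears first. Below is a minimal Python sketch of that idea under stated assumptions; the `JudgeSample` structure and the `swap_augment`/`augment_dataset` helpers are hypothetical illustrations, not the authors' released code.

```python
# Hypothetical sketch of swap augmentation for a pairwise judge dataset.
# Assumption: each training sample holds two model answers plus the teacher
# judge's verdict; the augmented copy swaps the answer order and mirrors the
# verdict, so answer position carries no signal during fine-tuning.

from dataclasses import dataclass, replace

@dataclass(frozen=True)
class JudgeSample:
    question: str
    answer_a: str
    answer_b: str
    verdict: str  # "A", "B", or "tie": which answer the teacher judge preferred

def swap_augment(sample: JudgeSample) -> JudgeSample:
    """Return the position-swapped copy of a pairwise judging sample."""
    mirrored = {"A": "B", "B": "A", "tie": "tie"}[sample.verdict]
    return replace(sample,
                   answer_a=sample.answer_b,
                   answer_b=sample.answer_a,
                   verdict=mirrored)

def augment_dataset(samples: list[JudgeSample]) -> list[JudgeSample]:
    """Double the dataset with swapped pairs to counteract position bias."""
    return [s for original in samples
            for s in (original, swap_augment(original))]

if __name__ == "__main__":
    demo = JudgeSample("What is 2+2?", answer_a="4", answer_b="5", verdict="A")
    for s in augment_dataset([demo]):
        print(s.verdict, "|", s.answer_a, "vs", s.answer_b)
```

A judge fine-tuned on such mirrored pairs sees each comparison in both orders with consistent labels, so any residual preference for the first position is trained away rather than memorized.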