JudgeLM: Fine-tuned Large Language Models are Scalable Judges
October 26, 2023
Authors: Lianghui Zhu, Xinggang Wang, Xinlong Wang
cs.AI
Abstract
Evaluating Large Language Models (LLMs) in open-ended scenarios is challenging because existing benchmarks and metrics cannot measure them comprehensively. To address this problem, we propose fine-tuning LLMs as scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively on open-ended benchmarks. We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLM-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges, as well as a new benchmark for evaluating the judges. We train JudgeLM at different scales, from 7B and 13B to 33B parameters, and conduct a systematic analysis of its capabilities and behaviors. We then analyze the key biases that arise when fine-tuning an LLM as a judge, categorizing them as position bias, knowledge bias, and format bias. To address these issues, JudgeLM introduces a bag of techniques, including swap augmentation, reference support, and reference drop, which clearly enhance the judge's performance. JudgeLM achieves state-of-the-art judge performance on both the existing PandaLM benchmark and our proposed new benchmark. JudgeLM is also efficient: JudgeLM-7B needs only 3 minutes to judge 5K samples on 8 A100 GPUs. JudgeLM attains high agreement with the teacher judge, exceeding 90%, which even surpasses human-to-human agreement. JudgeLM further demonstrates extended capabilities as a judge of single answers, multimodal models, multiple answers, and multi-turn chat.
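The abstract names swap augmentation as the remedy for position bias: a pairwise judge should not favor an answer merely because it appears first. Below is a minimal Python sketch of that idea under stated assumptions; the `JudgeSample` structure and the `swap_augment`/`augment_dataset` helpers are hypothetical illustrations, not the authors' released code.

```python
# Hypothetical sketch of swap augmentation for a pairwise judge dataset.
# Assumption: each training sample holds two model answers plus the teacher
# judge's verdict; the augmented copy swaps the answer order and mirrors the
# verdict, so answer position carries no signal during fine-tuning.

from dataclasses import dataclass, replace

@dataclass(frozen=True)
class JudgeSample:
    question: str
    answer_a: str
    answer_b: str
    verdict: str  # "A", "B", or "tie": which answer the teacher judge preferred

def swap_augment(sample: JudgeSample) -> JudgeSample:
    """Return the position-swapped copy of a pairwise judging sample."""
    mirrored = {"A": "B", "B": "A", "tie": "tie"}[sample.verdict]
    return replace(sample,
                   answer_a=sample.answer_b,
                   answer_b=sample.answer_a,
                   verdict=mirrored)

def augment_dataset(samples: list[JudgeSample]) -> list[JudgeSample]:
    """Double the dataset with swapped pairs to counteract position bias."""
    return [s for original in samples
            for s in (original, swap_augment(original))]

if __name__ == "__main__":
    demo = JudgeSample("What is 2+2?", answer_a="4", answer_b="5", verdict="A")
    for s in augment_dataset([demo]):
        print(s.verdict, "|", s.answer_a, "vs", s.answer_b)
```

A judge fine-tuned on such mirrored pairs sees each comparison in both orders with consistent labels, so any residual preference for the first position is trained away rather than memorized.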