JudgeLM: Fine-tuned Large Language Models are Scalable Judges

Abstract

Evaluating Large Language Models (LLMs) in open-ended scenarios is challenging because existing benchmarks and metrics cannot measure them comprehensively. To address this problem, we propose fine-tuning LLMs as scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively on open-ended benchmarks. We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLM-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges, as well as a new benchmark for evaluating the judges. We train JudgeLM at different scales, with 7B, 13B, and 33B parameters, and conduct a systematic analysis of its capabilities and behaviors. We then analyze the key biases that arise when fine-tuning an LLM as a judge and categorize them as position bias, knowledge bias, and format bias. To address these biases, JudgeLM introduces a bag of techniques, including swap augmentation, reference support, and reference drop, which clearly enhance the judge's performance. JudgeLM obtains state-of-the-art judge performance on both the existing PandaLM benchmark and our proposed new benchmark. JudgeLM is also efficient: JudgeLM-7B needs only 3 minutes to judge 5K samples with 8 A100 GPUs. JudgeLM obtains high agreement with the teacher judge, exceeding 90%, which even surpasses human-to-human agreement. JudgeLM also demonstrates extended capabilities in judging single answers, multimodal models, multiple answers, and multi-turn chat.
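The abstract names swap augmentation and judge-teacher agreement without spelling them out. The following is a minimal Python sketch of one plausible reading, not the authors' implementation: it assumes a pairwise sample holds two answers with one score each, that swap augmentation exchanges the two answers together with their scores, and that agreement is the fraction of samples on which two judges reach the same verdict. The names `JudgeSample`, `swap_augment`, `verdict`, and `agreement` are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Sequence


@dataclass(frozen=True)
class JudgeSample:
    """Hypothetical pairwise training sample: one question, two scored answers."""
    question: str
    answer_1: str
    answer_2: str
    score_1: float
    score_2: float


def swap_augment(sample: JudgeSample) -> JudgeSample:
    """Return a copy with the two answers (and their scores) exchanged.

    Assumption: training on both orderings discourages the judge from
    favoring an answer because of its position.
    """
    return replace(
        sample,
        answer_1=sample.answer_2,
        answer_2=sample.answer_1,
        score_1=sample.score_2,
        score_2=sample.score_1,
    )


def verdict(score_1: float, score_2: float) -> str:
    """Reduce a pair of scores to a three-way verdict."""
    if score_1 > score_2:
        return "answer_1"
    if score_2 > score_1:
        return "answer_2"
    return "tie"


def agreement(judge: Sequence[str], teacher: Sequence[str]) -> float:
    """Fraction of samples on which the judge matches the teacher's verdict."""
    if len(judge) != len(teacher):
        raise ValueError("verdict lists must have the same length")
    if not judge:
        raise ValueError("verdict lists must not be empty")
    matches = sum(j == t for j, t in zip(judge, teacher))
    return matches / len(judge)
```

Under these assumptions, applying `swap_augment` twice returns the original sample, and a judge unaffected by position should flip its verdict between `answer_1` and `answer_2` when the sample is swapped.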
