JudgeBench: A Benchmark for Evaluating LLM-based Judges

Abstract

LLM-based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models. However, the reliability of LLM-based judges themselves is rarely scrutinized. As LLMs become more advanced, their responses grow more sophisticated, requiring stronger judges to evaluate them. Existing benchmarks primarily focus on a judge's alignment with human preferences, but often fail to account for more challenging tasks where crowdsourced human preference is a poor indicator of factual and logical correctness. To address this, we propose a novel framework for objectively evaluating LLM-based judges. Based on this framework, we introduce JudgeBench, a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. JudgeBench leverages a novel pipeline for converting existing difficult datasets into challenging response pairs with preference labels reflecting objective correctness. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks, with many strong models (e.g., GPT-4o) performing just slightly better than random guessing. Overall, JudgeBench offers a reliable platform for assessing increasingly advanced LLM-based judges. Data and code are available at https://github.com/ScalerLab/JudgeBench.
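As a rough illustration of what scoring a judge against objective preference labels could look like, the sketch below computes a judge's accuracy over a set of response pairs. The `ResponsePair` layout, the `"A"`/`"B"` label encoding, the `judge_accuracy` helper, and the toy questions are all assumptions made for this example; they are not JudgeBench's actual data format or evaluation code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ResponsePair:
    """Hypothetical record: one question with two candidate responses."""
    question: str
    response_a: str
    response_b: str
    label: str  # "A" if response_a is the objectively correct one, else "B"


# A judge is modeled as any callable that returns "A" or "B" for a pair.
Judge = Callable[[str, str, str], str]


def judge_accuracy(pairs: List[ResponsePair], judge: Judge) -> float:
    """Fraction of pairs on which the judge's verdict matches the label."""
    if not pairs:
        raise ValueError("need at least one response pair")
    correct = sum(
        judge(p.question, p.response_a, p.response_b) == p.label
        for p in pairs
    )
    return correct / len(pairs)


def always_a(question: str, response_a: str, response_b: str) -> str:
    """A fully position-biased judge that ignores the content."""
    return "A"


# Toy pairs with the correct response balanced across positions.
pairs = [
    ResponsePair("What is 2 + 2?", "4", "5", "A"),
    ResponsePair("What is 3 * 3?", "6", "9", "B"),
]

print(judge_accuracy(pairs, always_a))
```

On a set where the correct response is balanced across the two positions, a judge that ignores content entirely scores 0.5, which is the random-guessing baseline that the reported results are compared against.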
