

Crowd Comparative Reasoning: Unlocking Comprehensive Evaluations for LLM-as-a-Judge

February 18, 2025
Authors: Qiyuan Zhang, Yufei Wang, Yuxin Jiang, Liangyou Li, Chuhan Wu, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Fuyuan Lyu, Chen Ma
cs.AI

Abstract

LLM-as-a-Judge, which generates chain-of-thought (CoT) judgments, has become a widely adopted auto-evaluation method. However, its reliability is compromised by CoT reasoning's inability to capture comprehensive and deeper details, often leading to incomplete outcomes. Existing methods mainly rely on majority voting or criteria expansion, which are insufficient to address this limitation of CoT. We propose Crowd-based Comparative Evaluation, which introduces additional crowd responses to compare with the candidate responses, thereby exposing deeper and more comprehensive details within the candidate responses. This process effectively guides LLM-as-a-Judge to provide a more detailed CoT judgment. Extensive experiments demonstrate that our approach enhances evaluation reliability, achieving an average accuracy gain of 6.7% across five benchmarks. Moreover, our method produces higher-quality CoTs that facilitate judge distillation and exhibit superior performance in rejection sampling for supervised fine-tuning (SFT), referred to as crowd rejection sampling, thereby enabling more efficient SFT. Our analysis confirms that the CoTs generated by our method are more comprehensive and of higher quality, and that evaluation accuracy improves as inference scales.
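
The following is a minimal sketch of how the crowd-based comparative evaluation described in the abstract might be wired together. The `llm` callable, the prompt wording, and the two-step structure are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of crowd-based comparative evaluation for LLM-as-a-Judge.
# Names such as `llm`, `crowd_comparative_judgment`, and the prompt wording
# are illustrative assumptions, not the paper's exact implementation.
from typing import Callable, List


def crowd_comparative_judgment(
    llm: Callable[[str], str],
    instruction: str,
    candidate_a: str,
    candidate_b: str,
    crowd_responses: List[str],
) -> str:
    """Compare each candidate against crowd responses, then judge A vs. B."""
    # Step 1: contrast every candidate with each crowd response to surface
    # details that a direct A-vs-B comparison tends to miss.
    crowd_insights = []
    for crowd in crowd_responses:
        for name, cand in (("A", candidate_a), ("B", candidate_b)):
            prompt = (
                f"Instruction:\n{instruction}\n\n"
                f"Response {name}:\n{cand}\n\n"
                f"Reference response:\n{crowd}\n\n"
                f"Compare Response {name} with the reference and note "
                f"strengths or weaknesses revealed by the comparison."
            )
            crowd_insights.append(llm(prompt))

    # Step 2: feed the aggregated comparison notes to the judge so its
    # chain-of-thought covers the details exposed above.
    judge_prompt = (
        f"Instruction:\n{instruction}\n\n"
        f"Response A:\n{candidate_a}\n\nResponse B:\n{candidate_b}\n\n"
        f"Comparison notes from additional crowd responses:\n"
        + "\n".join(f"- {note}" for note in crowd_insights)
        + "\n\nUsing these notes, write a detailed chain-of-thought judgment "
          "and conclude with the better response ('A' or 'B')."
    )
    return llm(judge_prompt)
```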
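Similarly, a hedged sketch of what the abstract calls crowd rejection sampling for SFT: score each sampled response with the crowd-informed judge and retain only the highest-scoring one as a fine-tuning target. The `score_with_crowd_judge` helper is hypothetical.

```python
# Minimal sketch of crowd rejection sampling for SFT data selection.
# `score_with_crowd_judge` stands in for a judge guided by the crowd-informed
# CoT above; the helper names are illustrative assumptions.
from typing import Callable, List, Tuple


def crowd_rejection_sampling(
    score_with_crowd_judge: Callable[[str, str], float],
    instruction: str,
    sampled_responses: List[str],
) -> Tuple[str, float]:
    """Keep the sampled response the crowd-informed judge scores highest."""
    scored = [(r, score_with_crowd_judge(instruction, r)) for r in sampled_responses]
    # The best-scoring response is retained for SFT; the rest are rejected.
    return max(scored, key=lambda pair: pair[1])
```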

