Crowd Comparative Reasoning: Unlocking Comprehensive Evaluations for LLM-as-a-Judge
February 18, 2025
Authors: Qiyuan Zhang, Yufei Wang, Yuxin Jiang, Liangyou Li, Chuhan Wu, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Fuyuan Lyu, Chen Ma
cs.AI
Abstract
LLM-as-a-Judge, which generates chain-of-thought (CoT) judgments, has become
a widely adopted auto-evaluation method. However, its reliability is
compromised by the CoT reasoning's inability to capture comprehensive and
deeper details, often leading to incomplete outcomes. Existing methods mainly
rely on majority voting or criteria expansion, which are insufficient to address
this limitation of CoT. We propose Crowd-based Comparative Evaluation, which
introduces additional crowd responses to compare with the candidate responses,
thereby exposing deeper and more comprehensive details within the candidate
responses. This process effectively guides LLM-as-a-Judge to provide a more
detailed CoT judgment. Extensive experiments demonstrate that our approach
enhances evaluation reliability, achieving an average accuracy gain of 6.7%
across five benchmarks. Moreover, our method produces higher-quality CoTs that
facilitate judge distillation and exhibit superior performance in rejection
sampling for supervised fine-tuning (SFT), referred to as crowd rejection
sampling, thereby enabling more efficient SFT. Our analysis confirms that CoTs
generated by our method are more comprehensive and of higher quality, and evaluation
accuracy improves as inference scales.
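As a rough illustration of the idea described in the abstract, the sketch below builds a judge prompt that places crowd responses next to the two candidates and asks for a chain-of-thought verdict, then reuses that judge for crowd rejection sampling over SFT candidates. This is a minimal sketch under assumptions: `judge_llm` is a hypothetical callable wrapping an LLM API, and the prompt wording, function names, and the "Verdict:" convention are illustrative, not the authors' implementation.

```python
# Minimal sketch of crowd-based comparative evaluation (illustrative only).
# Assumptions: `judge_llm` is a hypothetical callable that takes a prompt
# string and returns the judge LLM's text output; prompt wording, function
# names, and the "Verdict:" convention are not taken from the paper.
from typing import Callable, List


def build_crowd_prompt(instruction: str, candidate_a: str, candidate_b: str,
                       crowd_responses: List[str]) -> str:
    """Show crowd responses alongside both candidates so the CoT judgment
    can surface details a plain pairwise comparison might miss."""
    crowd_block = "\n\n".join(
        f"[Crowd response {i + 1}]\n{r}" for i, r in enumerate(crowd_responses)
    )
    return (
        f"Instruction:\n{instruction}\n\n"
        f"{crowd_block}\n\n"
        f"[Candidate A]\n{candidate_a}\n\n"
        f"[Candidate B]\n{candidate_b}\n\n"
        "Compare each candidate against the crowd responses above, reason "
        "step by step about completeness and correctness, and finish with "
        "one line: 'Verdict: A' or 'Verdict: B'."
    )


def crowd_comparative_judge(judge_llm: Callable[[str], str], instruction: str,
                            candidate_a: str, candidate_b: str,
                            crowd_responses: List[str]) -> str:
    """Return 'A' or 'B' from a single crowd-augmented CoT judgment."""
    judgment = judge_llm(
        build_crowd_prompt(instruction, candidate_a, candidate_b, crowd_responses)
    )
    return "A" if "Verdict: A" in judgment else "B"


def crowd_rejection_sample(judge_llm: Callable[[str], str], instruction: str,
                           candidates: List[str],
                           crowd_responses: List[str]) -> str:
    """Keep the candidate that wins successive crowd-augmented comparisons;
    a rough stand-in for crowd rejection sampling when filtering SFT data."""
    best = candidates[0]
    for challenger in candidates[1:]:
        winner = crowd_comparative_judge(
            judge_llm, instruction, best, challenger, crowd_responses
        )
        best = best if winner == "A" else challenger
    return best
```

In practice one would also handle ties, randomize candidate order to reduce position bias, and aggregate several judgments, but the structure above captures the core mechanism: letting crowd responses expose finer-grained details that guide the judge toward a more comprehensive CoT.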