Crowd Comparative Reasoning: Unlocking Comprehensive Evaluations for LLM-as-a-Judge

Abstract

LLM-as-a-Judge, which generates chain-of-thought (CoT) judgments, has become a widely adopted auto-evaluation method. However, its reliability is compromised by the inability of CoT reasoning to capture comprehensive and deeper details, which often leads to incomplete outcomes. Existing methods mainly rely on majority voting or criteria expansion, which are insufficient to address this limitation of CoT. We propose Crowd-based Comparative Evaluation, which introduces additional crowd responses to compare with the candidate responses, thereby exposing deeper and more comprehensive details within the candidate responses. This process effectively guides LLM-as-a-Judge to provide a more detailed CoT judgment. Extensive experiments demonstrate that our approach enhances evaluation reliability, achieving an average accuracy gain of 6.7% across five benchmarks. Moreover, our method produces higher-quality CoTs that facilitate judge distillation and exhibit superior performance in rejection sampling for supervised fine-tuning (SFT), referred to as crowd rejection sampling, thereby enabling more efficient SFT. Our analysis confirms that CoTs generated by our method are more comprehensive and of higher quality, and that evaluation accuracy improves as inference scales.
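
The idea of comparing each candidate against extra crowd responses before the final judgment can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `crowd_notes` and `judge_pair`, the prompt wording, the A/B verdict format, and the `LLM` callable type are all assumptions made for illustration, and any real model client would need to be wrapped to fit that callable.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt string to a model reply.
LLM = Callable[[str], str]


def crowd_notes(llm: LLM, instruction: str, candidate: str, crowd: List[str]) -> List[str]:
    """Compare one candidate with each crowd response and collect the judge's notes."""
    notes = []
    for other in crowd:
        prompt = "\n\n".join([
            f"Instruction:\n{instruction}",
            f"Response 1:\n{candidate}",
            f"Response 2:\n{other}",
            "Compare the two responses and list concrete strengths and weaknesses of Response 1.",
        ])
        notes.append(llm(prompt))
    return notes


def judge_pair(llm: LLM, instruction: str, response_a: str, response_b: str,
               crowd: List[str]) -> str:
    """Judge A against B, with crowd-comparison notes added to the prompt as context."""
    notes_a = crowd_notes(llm, instruction, response_a, crowd)
    notes_b = crowd_notes(llm, instruction, response_b, crowd)
    prompt = "\n\n".join([
        f"Instruction:\n{instruction}",
        f"Response A:\n{response_a}",
        f"Response B:\n{response_b}",
        "Notes from comparing A with crowd responses:\n" + "\n".join(notes_a),
        "Notes from comparing B with crowd responses:\n" + "\n".join(notes_b),
        "Reason step by step, then end with a final line containing only A or B.",
    ])
    reply = llm(prompt)
    # Assumed output convention: the verdict is the last line of the reply.
    lines = reply.strip().splitlines()
    verdict = lines[-1].strip().upper() if lines else ""
    return verdict if verdict in ("A", "B") else "tie"
```

In this sketch the number of model calls grows linearly with the number of crowd responses (two comparison calls per crowd response plus one final judgment), which is one simple way to spend more inference compute on a single evaluation.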
