JuStRank: Benchmarking LLM Judges for System Ranking

Abstract

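Given the rapid progress of generative AI, there is a pressing need to systematically compare and select among the many available models and configurations. Because of the scale and versatility such evaluations demand, LLM-based judges are an attractive solution. Crucially, this approach first requires validating the quality of the LLM judge itself. Previous work has focused on instance-based evaluation of LLM judges, in which a judge is assessed over a set of responses or response pairs without regard to the systems that produced them. We argue that this setting overlooks important factors that affect system-level ranking, such as a judge's positive or negative bias toward particular systems. To close this gap, we conduct the first large-scale study of LLM judges as system rankers. System scores are produced by aggregating judgment scores across multiple system outputs, and judge quality is assessed by comparing the resulting system ranking with a human-based ranking. Beyond overall judge assessment, our analysis offers a fine-grained characterization of judge behavior, including decisiveness and bias.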

English

Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose among the numerous models and configurations available. The scale and versatility of such evaluations make the use of LLM-based judges a compelling solution for this challenge. Crucially, this approach first requires validating the quality of the LLM judge itself. Previous work has focused on instance-based assessment of LLM judges, where a judge is evaluated over a set of responses, or response pairs, while being agnostic to their source systems. We argue that this setting overlooks critical factors affecting system-level ranking, such as a judge's positive or negative bias towards certain systems. To address this gap, we conduct the first large-scale study of LLM judges as system rankers. System scores are generated by aggregating judgment scores over multiple system outputs, and the judge's quality is assessed by comparing the resulting system ranking to a human-based ranking. Beyond overall judge assessment, our analysis provides a fine-grained characterization of judge behavior, including their decisiveness and bias.
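
The system-level evaluation described in the abstract can be illustrated with a minimal sketch. The abstract does not say which aggregation or which rank-agreement measure is used, so the mean aggregation and Kendall's tau below are assumptions chosen for illustration, and all system names and scores are hypothetical.

```python
from itertools import combinations
from statistics import mean


def system_scores(judgments):
    """Aggregate per-response judge scores into one score per system.

    Assumption: a simple mean; other aggregations are possible.
    """
    return {system: mean(scores) for system, scores in judgments.items()}


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts over the same systems."""
    systems = sorted(scores_a)
    if len(systems) < 2:
        raise ValueError("need at least two systems to compare rankings")
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # Positive product: both scorers order the pair the same way.
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical judge scores, one list of per-response scores per system.
judge_scores = {
    "system_a": [0.9, 0.8, 0.7],
    "system_b": [0.6, 0.5, 0.7],
    "system_c": [0.2, 0.4, 0.3],
}
# Hypothetical human-based system scores.
human_scores = {"system_a": 0.7, "system_b": 0.9, "system_c": 0.1}

aggregated = system_scores(judge_scores)
ranking = sorted(aggregated, key=aggregated.get, reverse=True)
tau = kendall_tau(aggregated, human_scores)
print(ranking, round(tau, 3))
```

In this toy data the judge and the human ranking disagree on one of the three system pairs, so the agreement is well below 1. A higher value would indicate a judge whose system ranking tracks the human-based ranking more closely.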
