Learning an Efficient Multi-Turn Dialogue Evaluator from Multiple Judges

Abstract

Evaluating the conversational abilities of large language models (LLMs) remains a challenging task. Current mainstream approaches rely primarily on the "LLM-as-a-judge" paradigm, in which an LLM is prompted to act as an evaluator and assess dialogue quality. However, such methods often suffer from various biases, which undermine the reliability and consistency of the evaluation results. To mitigate these biases, recent methods employ multiple LLMs as judges and aggregate their judgments to select the optimal assessment. Although effective, this multi-judge approach incurs significant computational overhead during inference. In this paper, we propose an efficient multi-turn dialogue evaluator that captures the collective wisdom of multiple LLM judges by aggregating their preference knowledge into a single model. Our approach preserves the advantages of diverse multi-judge feedback while drastically reducing evaluation cost, enabling fast and flexible dialogue quality assessment. Extensive experiments on seven single-rating and pairwise-comparison dialogue evaluation benchmarks demonstrate that our method outperforms existing baselines across diverse scenarios, showcasing its efficiency and robustness.
