
Learning an Efficient Multi-Turn Dialogue Evaluator from Multiple Judges

August 1, 2025
Authors: Yuqi Tang, Kehua Feng, Yunfeng Wang, Zhiwen Chen, Chengfei Lv, Gang Yu, Qiang Zhang, Keyan Ding
cs.AI

Abstract

Evaluating the conversational abilities of large language models (LLMs) remains a challenging task. Current mainstream approaches primarily rely on the "LLM-as-a-judge" paradigm, where an LLM is prompted to serve as an evaluator and assess dialogue quality. However, such methods often suffer from various biases, which undermine the reliability and consistency of the evaluation results. To mitigate these biases, recent methods employ multiple LLMs as judges and aggregate their judgments to select the optimal assessment. Although effective, this multi-judge approach incurs significant computational overhead during inference. In this paper, we propose an efficient multi-turn dialogue evaluator that captures the collective wisdom of multiple LLM judges by aggregating their preference knowledge into a single model. Our approach preserves the advantages of diverse multi-judge feedback while drastically reducing the evaluation cost, enabling fast and flexible dialogue quality assessment. Extensive experiments on seven single-rating and pairwise-comparison dialogue evaluation benchmarks demonstrate that our method outperforms existing baselines across diverse scenarios, showcasing its efficiency and robustness.
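The abstract does not spell out how the preference knowledge of multiple judges is distilled into one model, but the general recipe it describes can be illustrated with a minimal sketch: collect pairwise preference votes from several judge LLMs, aggregate them (here by simple majority vote), and train a single lightweight evaluator with a Bradley-Terry-style pairwise loss. The names below (`ToyEvaluator`, `aggregate_votes`) and the majority-vote aggregation are illustrative assumptions, not the authors' actual method.

```python
# Minimal sketch (assumed, not the paper's implementation): distilling
# multi-judge pairwise preferences into a single evaluator model.
import torch
import torch.nn as nn

class ToyEvaluator(nn.Module):
    """Maps a dialogue embedding to a scalar quality score."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scorer(x).squeeze(-1)

def aggregate_votes(judge_votes: torch.Tensor) -> torch.Tensor:
    """Majority vote: judge_votes[i, j] = 1 if judge j prefers dialogue A
    over dialogue B for pair i, else 0. Returns a 0/1 target per pair."""
    return (judge_votes.float().mean(dim=1) > 0.5).float()

# Toy data: 64 dialogue pairs with 16-dim embeddings, 5 judges casting votes.
torch.manual_seed(0)
emb_a, emb_b = torch.randn(64, 16), torch.randn(64, 16)
judge_votes = torch.randint(0, 2, (64, 5))
labels = aggregate_votes(judge_votes)  # 1 -> A preferred, 0 -> B preferred

model = ToyEvaluator()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    # Bradley-Terry pairwise loss: P(A preferred) = sigmoid(score_A - score_B)
    logits = model(emb_a) - model(emb_b)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final training loss: {loss.item():.4f}")
```

At inference time only the single distilled evaluator is queried, which is what yields the cost reduction relative to prompting every judge LLM for each dialogue.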