Learning an Efficient Multi-Turn Dialogue Evaluator from Multiple Judges
August 1, 2025
Authors: Yuqi Tang, Kehua Feng, Yunfeng Wang, Zhiwen Chen, Chengfei Lv, Gang Yu, Qiang Zhang, Keyan Ding
cs.AI
Abstract
Evaluating the conversational abilities of large language models (LLMs) remains a challenging task. Current mainstream approaches primarily rely on the "LLM-as-a-judge" paradigm, where an LLM is prompted to serve as an evaluator to assess dialogue quality. However, such methods often suffer from various biases, which undermine the reliability and consistency of the evaluation results. To mitigate these biases, recent methods employ multiple LLMs as judges and aggregate their judgments to select the optimal assessment. Although effective, this multi-judge approach incurs significant computational overhead during inference. In this paper, we propose an efficient multi-turn dialogue evaluator that captures the collective wisdom of multiple LLM judges by aggregating their preference knowledge into a single model. Our approach preserves the advantages of diverse multi-judge feedback while drastically reducing the evaluation cost, enabling fast and flexible dialogue quality assessment. Extensive experiments on seven single-rating and pairwise-comparison dialogue evaluation benchmarks demonstrate that our method outperforms existing baselines across diverse scenarios, showcasing its efficiency and robustness.
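
As an illustrative sketch of the general idea only (not the paper's actual training recipe), the snippet below shows how preference labels aggregated from several LLM judges could be distilled into a single lightweight evaluator via a Bradley-Terry-style pairwise loss. The evaluator architecture, embedding dimension, and training data here are hypothetical placeholders.

```python
# Hypothetical sketch: distilling aggregated multi-judge preferences into a
# single scalar-valued dialogue evaluator. Not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DialogueEvaluator(nn.Module):
    """Maps a dialogue embedding to a scalar quality score."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scorer(x).squeeze(-1)


def pairwise_loss(score_pref: torch.Tensor, score_rej: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: the dialogue preferred by the aggregated judges
    # should receive a higher score than the rejected one.
    return -F.logsigmoid(score_pref - score_rej).mean()


# Toy training step on random embeddings; in practice each (preferred, rejected)
# pair would be labeled by aggregating (e.g., voting over) multiple LLM judges.
model = DialogueEvaluator()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
pref_emb, rej_emb = torch.randn(8, 768), torch.randn(8, 768)

opt.zero_grad()
loss = pairwise_loss(model(pref_emb), model(rej_emb))
loss.backward()
opt.step()
```

At inference time such a distilled evaluator needs only a single forward pass per dialogue (or pair), which is where the cost saving over querying multiple LLM judges would come from.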