Learning an Efficient Multi-Turn Dialogue Evaluator from Multiple Judges

Abstract

Evaluating the conversational abilities of large language models (LLMs) remains a challenging task. Current mainstream approaches rely primarily on the "LLM-as-a-judge" paradigm, in which an LLM is prompted to act as an evaluator and assess dialogue quality. However, such methods often suffer from various biases, which undermine the reliability and consistency of the evaluation results. To mitigate these biases, recent methods employ multiple LLMs as judges and aggregate their judgments to select the optimal assessment. Although effective, this multi-judge approach incurs significant computational overhead during inference. In this paper, we propose an efficient multi-turn dialogue evaluator that captures the collective wisdom of multiple LLM judges by aggregating their preference knowledge into a single model. Our approach preserves the advantages of diverse multi-judge feedback while drastically reducing evaluation cost, enabling fast and flexible dialogue quality assessment. Extensive experiments on seven single-rating and pairwise-comparison dialogue evaluation benchmarks show that our method outperforms existing baselines across diverse scenarios, demonstrating its efficiency and robustness.

