Fusion-Eval: Integrating Evaluators with LLMs

Abstract

Evaluating Large Language Models (LLMs) is a complex task, especially given the intricacies of natural language understanding and the expectations for high-level reasoning. Traditional evaluations typically rely on human-based, model-based, or automatic-metrics-based paradigms, each with its own advantages and shortcomings. We introduce "Fusion-Eval", a system that employs LLMs not solely for direct evaluations, but to integrate insights from diverse evaluators. This gives Fusion-Eval flexibility, enabling it to work effectively across diverse tasks and make optimal use of multiple references. In testing on the SummEval dataset, Fusion-Eval achieved a Spearman correlation of 0.96, outperforming other evaluators. The success of Fusion-Eval underscores the potential of LLMs to produce evaluations that closely align with human perspectives, setting a new standard in the field of LLM evaluation.
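The headline result is a Spearman correlation. As a rough illustration of what that number measures, here is a minimal sketch that computes it between two lists of scores. The function names and the scores are hypothetical and chosen only for illustration. They are not taken from SummEval or from the Fusion-Eval implementation.

```python
from statistics import mean


def average_ranks(values):
    """Assign 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for five summaries (illustrative only).
human_scores = [4.0, 2.5, 3.0, 5.0, 1.0]
evaluator_scores = [3.2, 2.0, 3.5, 4.9, 1.2]

rho = spearman(human_scores, evaluator_scores)
print(round(rho, 2))  # 0.9
```

A value near 1 means the evaluator orders the items almost exactly as the human raters do, regardless of the absolute scale of the scores.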
