Fusion-Eval: Integrating Evaluators with LLMs
November 15, 2023
Authors: Lei Shu, Nevan Wichers, Liangchen Luo, Yun Zhu, Yinxiao Liu, Jindong Chen, Lei Meng
cs.AI
Abstract
Evaluating Large Language Models (LLMs) is a complex task, especially
considering the intricacies of natural language understanding and the
expectations for high-level reasoning. Traditional evaluations typically lean
on human-based, model-based, or automatic-metrics-based paradigms, each with
its own advantages and shortcomings. We introduce "Fusion-Eval", a system that
employs LLMs not solely for direct evaluations, but to skillfully integrate
insights from diverse evaluators. This gives Fusion-Eval flexibility, enabling
it to work effectively across diverse tasks and make optimal use of multiple
references. In testing on the SummEval dataset, Fusion-Eval achieved a Spearman
correlation of 0.96, outperforming other evaluators. The success of Fusion-Eval
underscores the potential of LLMs to produce evaluations that closely align
with human perspectives, setting a new standard in the field of LLM evaluation.
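
The abstract gives no implementation details, but the pipeline it describes, where assistant evaluators score an output and an LLM integrates those signals into one judgment that is then meta-evaluated by rank correlation with human ratings, can be sketched as follows. The evaluator functions, prompt template, and `call_llm` stub are illustrative assumptions, not the paper's code; only the use of Spearman correlation reflects the reported metric.

```python
# Minimal sketch of the pipeline described in the abstract. The auxiliary
# evaluators, prompt template, and `call_llm` stub are illustrative
# assumptions, not the paper's implementation.
from scipy.stats import spearmanr


def assistant_scores(source: str, summary: str) -> dict:
    """Toy stand-ins for assistant evaluators (real systems might use
    ROUGE, BERTScore, or NLI-based consistency checkers)."""
    src, summ = set(source.split()), set(summary.split())
    return {
        "coverage": len(src & summ) / max(len(src), 1),
        "compression": min(1.0, len(summ) / max(len(src), 1)),
    }


def build_fusion_prompt(source: str, summary: str, scores: dict) -> str:
    """Fold the auxiliary scores into a single prompt so the LLM can
    integrate them into one final judgment."""
    score_lines = "\n".join(f"- {k}: {v:.2f}" for k, v in scores.items())
    return (
        "Rate the summary on a 1-5 scale.\n"
        f"Source:\n{source}\n\nSummary:\n{summary}\n\n"
        f"Auxiliary evaluator scores:\n{score_lines}\n\n"
        "Weigh these signals and reply with a single number."
    )


def call_llm(prompt: str) -> float:
    """Placeholder for an actual LLM call (assumption, not implemented)."""
    raise NotImplementedError


# Meta-evaluation as reported on SummEval: Spearman rank correlation
# between system scores and human judgments (made-up numbers here).
human_ratings = [4.0, 2.5, 3.0, 5.0]
system_scores = [3.8, 2.0, 3.2, 4.9]
rho, _ = spearmanr(human_ratings, system_scores)
print(f"Spearman correlation: {rho:.2f}")
```

The key design point the abstract emphasizes is that the LLM is not the sole judge: it fuses the signals of other evaluators, which is what lets the system adapt across tasks and exploit multiple references.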