Fusion-Eval: Integrating Evaluators with LLMs
November 15, 2023
Authors: Lei Shu, Nevan Wichers, Liangchen Luo, Yun Zhu, Yinxiao Liu, Jindong Chen, Lei Meng
cs.AI
Abstract
Evaluating Large Language Models (LLMs) is a complex task, especially
considering the intricacies of natural language understanding and the
expectations for high-level reasoning. Traditional evaluations typically lean
on human-based, model-based, or automatic-metrics-based paradigms, each with
its own advantages and shortcomings. We introduce "Fusion-Eval", a system that
employs LLMs not solely for direct evaluations, but to skillfully integrate
insights from diverse evaluators. This gives Fusion-Eval flexibility, enabling
it to work effectively across diverse tasks and make optimal use of multiple
references. In testing on the SummEval dataset, Fusion-Eval achieved a Spearman
correlation of 0.96, outperforming other evaluators. The success of Fusion-Eval
underscores the potential of LLMs to produce evaluations that closely align
with human perspectives, setting a new standard in the field of LLM evaluation.
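
The abstract gives no implementation details, but the pipeline it describes, where assistant evaluators score an output and an LLM integrates those signals into one judgment that is then meta-evaluated by rank correlation with human ratings, can be sketched as follows. The evaluator functions, prompt template, and `call_llm` stub are illustrative assumptions, not the paper's code; only the use of Spearman correlation reflects the reported metric.

```python
# Minimal sketch of the pipeline described in the abstract. The auxiliary
# evaluators, prompt template, and `call_llm` stub are illustrative
# assumptions, not the paper's implementation.
from scipy.stats import spearmanr


def assistant_scores(source: str, summary: str) -> dict:
    """Toy stand-ins for assistant evaluators (real systems might use
    ROUGE, BERTScore, or NLI-based consistency checkers)."""
    src, summ = set(source.split()), set(summary.split())
    return {
        "coverage": len(src & summ) / max(len(src), 1),
        "compression": min(1.0, len(summ) / max(len(src), 1)),
    }


def build_fusion_prompt(source: str, summary: str, scores: dict) -> str:
    """Fold the auxiliary scores into a single prompt so the LLM can
    integrate them into one final judgment."""
    score_lines = "\n".join(f"- {k}: {v:.2f}" for k, v in scores.items())
    return (
        "Rate the summary on a 1-5 scale.\n"
        f"Source:\n{source}\n\nSummary:\n{summary}\n\n"
        f"Auxiliary evaluator scores:\n{score_lines}\n\n"
        "Weigh these signals and reply with a single number."
    )


def call_llm(prompt: str) -> float:
    """Placeholder for an actual LLM call (assumption, not implemented)."""
    raise NotImplementedError


# Meta-evaluation as reported on SummEval: Spearman rank correlation
# between system scores and human judgments (made-up numbers here).
human_ratings = [4.0, 2.5, 3.0, 5.0]
system_scores = [3.8, 2.0, 3.2, 4.9]
rho, _ = spearmanr(human_ratings, system_scores)
print(f"Spearman correlation: {rho:.2f}")
```

The key design point the abstract emphasizes is that the LLM is not the sole judge: it fuses the signals of other evaluators, which is what lets the system adapt across tasks and exploit multiple references.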