

Fusion-Eval: Integrating Evaluators with LLMs

November 15, 2023
Authors: Lei Shu, Nevan Wichers, Liangchen Luo, Yun Zhu, Yinxiao Liu, Jindong Chen, Lei Meng
cs.AI

Abstract

Evaluating Large Language Models (LLMs) is a complex task, especially considering the intricacies of natural language understanding and the expectations for high-level reasoning. Traditional evaluations typically lean on human-based, model-based, or automatic-metrics-based paradigms, each with its own advantages and shortcomings. We introduce "Fusion-Eval", a system that employs LLMs not solely for direct evaluations, but to skillfully integrate insights from diverse evaluators. This gives Fusion-Eval flexibility, enabling it to work effectively across diverse tasks and make optimal use of multiple references. In testing on the SummEval dataset, Fusion-Eval achieved a Spearman correlation of 0.96, outperforming other evaluators. The success of Fusion-Eval underscores the potential of LLMs to produce evaluations that closely align with human perspectives, setting a new standard in the field of LLM evaluation.
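As a rough illustration of the reported metric (not code from the paper), the sketch below computes the Spearman rank correlation between fused evaluator scores and human ratings. The scores are hypothetical, and the simple averaging used here is a stand-in for Fusion-Eval's actual LLM-based integration of evaluators:

```python
# Minimal sketch: measuring agreement between a fused evaluator score
# and human ratings via Spearman correlation, the metric on which
# Fusion-Eval reports 0.96 on SummEval. All scores are hypothetical.
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-summary scores from three different evaluators
# (e.g., an automatic metric, a model-based scorer, an LLM judge).
evaluator_scores = np.array([
    [0.70, 0.65, 0.80],  # summary 1
    [0.40, 0.50, 0.45],  # summary 2
    [0.90, 0.85, 0.95],  # summary 3
    [0.20, 0.30, 0.25],  # summary 4
])

# Hypothetical human ratings for the same summaries.
human_ratings = np.array([4.0, 2.5, 4.8, 1.5])

# Naive fusion by averaging; the paper instead uses an LLM to
# integrate the evaluators' insights.
fused_scores = evaluator_scores.mean(axis=1)

# Spearman correlation compares the *rankings* induced by the two
# score lists, so it is robust to differing score scales.
rho, _ = spearmanr(fused_scores, human_ratings)
print(f"Spearman correlation: {rho:.2f}")
```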