Fusion-Eval: Integrating Evaluators with LLMs
November 15, 2023
Authors: Lei Shu, Nevan Wichers, Liangchen Luo, Yun Zhu, Yinxiao Liu, Jindong Chen, Lei Meng
cs.AI
Abstract
Evaluating Large Language Models (LLMs) is a complex task, especially
considering the intricacies of natural language understanding and the
expectations for high-level reasoning. Traditional evaluations typically lean
on human-based, model-based, or automatic-metrics-based paradigms, each with
its own advantages and shortcomings. We introduce "Fusion-Eval", a system that
employs LLMs not only for direct evaluation but also to integrate
insights from diverse evaluators. This gives Fusion-Eval flexibility, enabling
it to work effectively across diverse tasks and make optimal use of multiple
references. In testing on the SummEval dataset, Fusion-Eval achieved a Spearman
correlation of 0.96, outperforming other evaluators. The success of Fusion-Eval
underscores the potential of LLMs to produce evaluations that closely align
with human perspectives, setting a new standard in the field of LLM evaluation.
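
To make the idea concrete, here is a minimal Python sketch of a Fusion-Eval-style judge, assuming a generic LLM completion API. The `call_llm` helper, the prompt wording, and the 1-5 rating scale are illustrative assumptions rather than the paper's actual templates; only the two ingredients stated in the abstract are taken from the source: feeding auxiliary evaluator scores to the LLM as additional context, and measuring agreement with human ratings via Spearman correlation.

```python
# A minimal sketch of a Fusion-Eval-style judge, assuming a generic
# text-completion API. `call_llm`, the prompt wording, and the choice of
# auxiliary evaluators are illustrative assumptions, not the paper's
# actual implementation.

from scipy.stats import spearmanr


def call_llm(prompt: str) -> str:
    """Hypothetical LLM client; replace with a real completion call."""
    raise NotImplementedError


def fusion_eval(source: str, summary: str, aux_scores: dict[str, float]) -> float:
    """Ask the LLM to fuse auxiliary evaluator scores into one 1-5 judgment."""
    # Surface each assistant evaluator's score as extra context for the judge.
    score_lines = "\n".join(f"- {name}: {s:.3f}" for name, s in aux_scores.items())
    prompt = (
        "Rate the summary of the document below on a 1-5 quality scale.\n\n"
        f"Document:\n{source}\n\n"
        f"Summary:\n{summary}\n\n"
        f"Assistant evaluator scores:\n{score_lines}\n\n"
        "Reply with a single number from 1 to 5."
    )
    return float(call_llm(prompt).strip())


# The reported agreement metric is a Spearman rank correlation between the
# fused LLM scores and human ratings, e.g. with toy data:
human_ratings = [4, 2, 5, 3, 1]           # toy human judgments
fused_scores = [4.0, 2.5, 5.0, 3.0, 1.5]  # toy Fusion-Eval outputs
rho, _ = spearmanr(human_ratings, fused_scores)
print(f"Spearman correlation: {rho:.2f}")  # 1.00 for this toy ranking
```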