Fusion-Eval: Integrating Evaluators with LLMs
November 15, 2023
Authors: Lei Shu, Nevan Wichers, Liangchen Luo, Yun Zhu, Yinxiao Liu, Jindong Chen, Lei Meng
cs.AI
Abstract
Evaluating Large Language Models (LLMs) is a complex task, especially
considering the intricacies of natural language understanding and the
expectations for high-level reasoning. Traditional evaluations typically lean
on human-based, model-based, or automatic-metrics-based paradigms, each with
its own advantages and shortcomings. We introduce "Fusion-Eval", a system that
employs LLMs not only for direct evaluation but also to integrate
insights from diverse evaluators. This gives Fusion-Eval flexibility, enabling
it to work effectively across diverse tasks and make optimal use of multiple
references. In testing on the SummEval dataset, Fusion-Eval achieved a Spearman
correlation of 0.96, outperforming other evaluators. The success of Fusion-Eval
underscores the potential of LLMs to produce evaluations that closely align
with human perspectives, setting a new standard in the field of LLM evaluation.
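
To make the idea concrete, here is a minimal Python sketch of a Fusion-Eval-style judge, assuming a generic LLM completion API. The `call_llm` helper, the prompt wording, and the 1-5 rating scale are illustrative assumptions rather than the paper's actual templates; only the two ingredients stated in the abstract are taken from the source: feeding auxiliary evaluator scores to the LLM as additional context, and measuring agreement with human ratings via Spearman correlation.

```python
# A minimal sketch of a Fusion-Eval-style judge, assuming a generic
# text-completion API. `call_llm`, the prompt wording, and the choice of
# auxiliary evaluators are illustrative assumptions, not the paper's
# actual implementation.

from scipy.stats import spearmanr


def call_llm(prompt: str) -> str:
    """Hypothetical LLM client; replace with a real completion call."""
    raise NotImplementedError


def fusion_eval(source: str, summary: str, aux_scores: dict[str, float]) -> float:
    """Ask the LLM to fuse auxiliary evaluator scores into one 1-5 judgment."""
    # Surface each assistant evaluator's score as extra context for the judge.
    score_lines = "\n".join(f"- {name}: {s:.3f}" for name, s in aux_scores.items())
    prompt = (
        "Rate the summary of the document below on a 1-5 quality scale.\n\n"
        f"Document:\n{source}\n\n"
        f"Summary:\n{summary}\n\n"
        f"Assistant evaluator scores:\n{score_lines}\n\n"
        "Reply with a single number from 1 to 5."
    )
    return float(call_llm(prompt).strip())


# The reported agreement metric is a Spearman rank correlation between the
# fused LLM scores and human ratings, e.g. with toy data:
human_ratings = [4, 2, 5, 3, 1]           # toy human judgments
fused_scores = [4.0, 2.5, 5.0, 3.0, 1.5]  # toy Fusion-Eval outputs
rho, _ = spearmanr(human_ratings, fused_scores)
print(f"Spearman correlation: {rho:.2f}")  # 1.00 for this toy ranking
```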