Fusion-Eval: Integrating Evaluators with LLMs

Abstract

Evaluating Large Language Models (LLMs) is a complex task, given the intricacies of natural language understanding and the expectation of high-level reasoning. Traditional evaluations typically rely on human-based, model-based, or automatic-metrics-based paradigms, each with its own advantages and shortcomings. We introduce "Fusion-Eval", a system that uses LLMs not only for direct evaluation but also to integrate insights from diverse evaluators. This gives Fusion-Eval the flexibility to work effectively across diverse tasks and to make optimal use of multiple references. In testing on the SummEval dataset, Fusion-Eval achieved a Spearman correlation of 0.96, outperforming other evaluators. The success of Fusion-Eval underscores the potential of LLMs to produce evaluations that closely align with human perspectives, setting a new standard in the field of LLM evaluation.
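The headline result is a Spearman correlation between evaluator scores and human judgments. As a minimal sketch of how such a figure is computed in general, the snippet below ranks two score lists (averaging ranks for ties) and takes the Pearson correlation of the ranks. The function names and the score values are hypothetical and invented for illustration; they are not data from the paper, and the abstract does not state whether the authors correlate at the system level or the summary level.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for five summaries (illustrative only, not from the paper).
human_scores = [4.5, 3.0, 2.0, 5.0, 1.5]
evaluator_scores = [4.2, 2.9, 3.1, 4.8, 1.0]

rho = spearman(human_scores, evaluator_scores)
print(f"Spearman correlation: {rho:.2f}")
```

Because the measure depends only on rank order, an evaluator can score on a different scale from the human annotators and still correlate highly, provided it orders the outputs the same way.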
