FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
July 20, 2023
Authors: Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, Minjoon Seo
cs.AI
Abstract
Evaluation of Large Language Models (LLMs) is challenging because aligning to human values requires the composition of multiple skills, and the required set of skills varies depending on the instruction. Recent studies have evaluated the performance of LLMs in two ways: (1) automatic evaluation on several independent benchmarks, and (2) human- or machine-based evaluation that assigns an overall score to the response. However, both settings are coarse-grained evaluations that do not consider the nature of user instructions requiring instance-wise skill composition, which limits the interpretation of the true capabilities of LLMs. In this paper, we introduce FLASK (Fine-grained Language Model Evaluation based on Alignment SKill Sets), a fine-grained evaluation protocol that can be used for both model-based and human-based evaluation and that decomposes coarse-level scoring into instance-wise, skill set-level scoring. Specifically, we define 12 fine-grained skills needed for LLMs to follow open-ended user instructions and construct an evaluation set by allocating a set of skills to each instance. Additionally, by annotating the target domain and difficulty level of each instance, FLASK provides a holistic view and a comprehensive analysis of a model's performance depending on skill, domain, and difficulty. Using FLASK, we compare multiple open-source and proprietary LLMs and observe highly correlated findings between model-based and human-based evaluations. FLASK enables developers to measure model performance more accurately and to improve it by analyzing the factors that make LLMs proficient in particular skills. For practitioners, FLASK can be used to recommend suitable models for particular situations through a comprehensive comparison of various LLMs. We release the evaluation data and code implementation at https://github.com/kaistAI/FLASK.
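
For illustration, below is a minimal Python sketch of how a FLASK-style evaluation instance (annotated with a skill set, domain, and difficulty) and per-skill score aggregation could be represented. The skill labels, field names, and 1-5 rating scale are assumptions made for this example; the actual data schema and evaluation code are those released in the linked repository.

```python
from dataclasses import dataclass
from statistics import mean

# Illustrative skill taxonomy: FLASK defines 12 fine-grained skills; the exact
# labels in the released data may differ from these placeholder names.
SKILLS = [
    "logical_robustness", "logical_correctness", "logical_efficiency",
    "factuality", "commonsense_understanding", "comprehension",
    "insightfulness", "completeness", "metacognition",
    "readability", "conciseness", "harmlessness",
]

@dataclass
class EvalInstance:
    """One evaluation instance with its annotated metadata (hypothetical schema)."""
    instruction: str   # open-ended user instruction
    skills: list[str]  # subset of SKILLS allocated to this instance
    domain: str        # annotated target domain, e.g. "coding"
    difficulty: int    # annotated difficulty level, e.g. 1 (easy) to 5 (hard)

def aggregate_by_skill(instance_scores: list[dict[str, int]]) -> dict[str, float]:
    """Average per-skill ratings (e.g. 1-5 scores from a model- or human-based
    evaluator) across all instances annotated with each skill."""
    per_skill: dict[str, list[int]] = {}
    for scores in instance_scores:
        for skill, rating in scores.items():
            per_skill.setdefault(skill, []).append(rating)
    return {skill: mean(ratings) for skill, ratings in per_skill.items()}

# Example: two scored responses, each rated only on the skills allocated
# to its instance.
print(aggregate_by_skill([
    {"factuality": 4, "completeness": 3},
    {"factuality": 5, "readability": 4},
]))  # -> {'factuality': 4.5, 'completeness': 3, 'readability': 4}
```

Aggregating ratings per skill (and, analogously, per domain or difficulty level) is what yields the fine-grained, instance-wise view the protocol describes, rather than a single overall score per response.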