FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
July 20, 2023
作者: Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, Minjoon Seo
cs.AI
Abstract
Evaluation of Large Language Models (LLMs) is challenging because aligning to human values requires the composition of multiple skills, and the required set of skills varies depending on the instruction. Recent studies have evaluated the performance of LLMs in two ways: (1) automatic evaluation on several independent benchmarks, and (2) human- or machine-based evaluation that assigns an overall score to the response. However, both settings are coarse-grained evaluations that do not account for the nature of user instructions, which require instance-wise skill composition, limiting the interpretation of the true capabilities of LLMs. In this paper, we introduce FLASK (Fine-grained Language Model Evaluation based on Alignment SKill Sets), a fine-grained evaluation protocol that can be used for both model-based and human-based evaluation and that decomposes coarse-level scoring into instance-wise, skill set-level scoring. Specifically, we define 12 fine-grained skills needed for LLMs to follow open-ended user instructions and construct an evaluation set by allocating a set of skills to each instance. Additionally, by annotating the target domain and difficulty level of each instance, FLASK provides a holistic view with a comprehensive analysis of a model's performance depending on skill, domain, and difficulty. Using FLASK, we compare multiple open-source and proprietary LLMs and observe highly correlated findings between model-based and human-based evaluations. FLASK enables developers to measure model performance more accurately and to identify how it can be improved by analyzing the factors that make LLMs proficient in particular skills. For practitioners, FLASK can be used to recommend suitable models for particular situations through comprehensive comparison among various LLMs. We release the evaluation data and code implementation at https://github.com/kaistAI/FLASK.
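To make the instance-wise, skill set-level scoring concrete, the following is a minimal Python sketch of how a skill-annotated evaluation instance and per-skill score aggregation could be represented. The `FlaskInstance` class, the example skill names, and the 1-5 score scale are illustrative assumptions rather than the paper's actual schema; see the linked repository for the official data format and evaluation code.

```python
# Illustrative sketch (not the official FLASK schema): an evaluation instance
# annotated with a skill set, domain, and difficulty, plus per-skill aggregation.
from dataclasses import dataclass, field
from collections import defaultdict
from statistics import mean

@dataclass
class FlaskInstance:
    instruction: str                  # open-ended user instruction
    skills: list[str]                 # subset of the 12 fine-grained skills (names assumed)
    domain: str                       # annotated target domain
    difficulty: int                   # annotated difficulty level
    skill_scores: dict[str, int] = field(default_factory=dict)  # evaluator scores, assumed 1-5

def aggregate_by_skill(instances: list[FlaskInstance]) -> dict[str, float]:
    """Average per-skill scores across all instances annotated with each skill."""
    buckets: dict[str, list[int]] = defaultdict(list)
    for inst in instances:
        for skill, score in inst.skill_scores.items():
            buckets[skill].append(score)
    return {skill: mean(scores) for skill, scores in buckets.items()}

# Example: two annotated instances scored by a human or LLM evaluator.
data = [
    FlaskInstance("Prove that sqrt(2) is irrational.",
                  ["logical correctness", "background knowledge"],
                  domain="math", difficulty=4,
                  skill_scores={"logical correctness": 4, "background knowledge": 5}),
    FlaskInstance("Summarize this article in three sentences.",
                  ["conciseness", "comprehension"],
                  domain="general", difficulty=2,
                  skill_scores={"conciseness": 5, "comprehension": 4}),
]
print(aggregate_by_skill(data))
```

Because each instance also carries a domain and difficulty annotation, the same aggregation can be bucketed by those fields to reproduce the skill-by-domain and skill-by-difficulty breakdowns described in the abstract.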