FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
July 20, 2023
作者: Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, Minjoon Seo
cs.AI
Abstract
Evaluation of Large Language Models (LLMs) is challenging because aligning to human values requires the composition of multiple skills, and the required set of skills varies depending on the instruction. Recent studies have evaluated the performance of LLMs in two ways: (1) automatic evaluation on several independent benchmarks, and (2) human- or machine-based evaluation that assigns an overall score to the response. However, both settings are coarse-grained evaluations that do not account for the nature of user instructions, which require instance-wise skill composition, limiting the interpretation of the true capabilities of LLMs. In this paper, we introduce FLASK (Fine-grained Language Model Evaluation based on Alignment SKill Sets), a fine-grained evaluation protocol that can be used for both model-based and human-based evaluation and that decomposes coarse-level scoring into instance-wise, skill set-level scoring. Specifically, we define 12 fine-grained skills needed for LLMs to follow open-ended user instructions and construct an evaluation set by allocating a set of skills to each instance. Additionally, by annotating the target domain and difficulty level of each instance, FLASK provides a holistic view with a comprehensive analysis of a model's performance depending on skill, domain, and difficulty. Using FLASK, we compare multiple open-source and proprietary LLMs and observe highly correlated findings between model-based and human-based evaluations. FLASK enables developers to measure model performance more accurately and to identify how it can be improved by analyzing the factors that make LLMs proficient in particular skills. For practitioners, FLASK can be used to recommend suitable models for particular situations through comprehensive comparison among various LLMs. We release the evaluation data and code implementation at https://github.com/kaistAI/FLASK.
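To make the instance-wise, skill set-level scoring concrete, the following is a minimal Python sketch of how a skill-annotated evaluation instance and per-skill score aggregation could be represented. The `FlaskInstance` class, the example skill names, and the 1-5 score scale are illustrative assumptions rather than the paper's actual schema; see the linked repository for the official data format and evaluation code.

```python
# Illustrative sketch (not the official FLASK schema): an evaluation instance
# annotated with a skill set, domain, and difficulty, plus per-skill aggregation.
from dataclasses import dataclass, field
from collections import defaultdict
from statistics import mean

@dataclass
class FlaskInstance:
    instruction: str                  # open-ended user instruction
    skills: list[str]                 # subset of the 12 fine-grained skills (names assumed)
    domain: str                       # annotated target domain
    difficulty: int                   # annotated difficulty level
    skill_scores: dict[str, int] = field(default_factory=dict)  # evaluator scores, assumed 1-5

def aggregate_by_skill(instances: list[FlaskInstance]) -> dict[str, float]:
    """Average per-skill scores across all instances annotated with each skill."""
    buckets: dict[str, list[int]] = defaultdict(list)
    for inst in instances:
        for skill, score in inst.skill_scores.items():
            buckets[skill].append(score)
    return {skill: mean(scores) for skill, scores in buckets.items()}

# Example: two annotated instances scored by a human or LLM evaluator.
data = [
    FlaskInstance("Prove that sqrt(2) is irrational.",
                  ["logical correctness", "background knowledge"],
                  domain="math", difficulty=4,
                  skill_scores={"logical correctness": 4, "background knowledge": 5}),
    FlaskInstance("Summarize this article in three sentences.",
                  ["conciseness", "comprehension"],
                  domain="general", difficulty=2,
                  skill_scores={"conciseness": 5, "comprehension": 4}),
]
print(aggregate_by_skill(data))
```

Because each instance also carries a domain and difficulty annotation, the same aggregation can be bucketed by those fields to reproduce the skill-by-domain and skill-by-difficulty breakdowns described in the abstract.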