FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
July 20, 2023
Authors: Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, Minjoon Seo
cs.AI
Abstract
Evaluation of Large Language Models (LLMs) is challenging because aligning to human values requires the composition of multiple skills, and the required set of skills varies depending on the instruction. Recent studies have evaluated the performance of LLMs in two ways: (1) automatic evaluation on several independent benchmarks, and (2) human- or machine-based evaluation that assigns an overall score to the response. However, both settings are coarse-grained evaluations that do not consider the nature of user instructions requiring instance-wise skill composition, which limits the interpretation of the true capabilities of LLMs. In this paper, we introduce FLASK (Fine-grained Language Model Evaluation based on Alignment SKill Sets), a fine-grained evaluation protocol that can be used for both model-based and human-based evaluation and that decomposes coarse-level scoring into instance-wise, skill set-level scoring. Specifically, we define 12 fine-grained skills needed for LLMs to follow open-ended user instructions and construct an evaluation set by allocating a set of skills to each instance. Additionally, by annotating the target domain and difficulty level of each instance, FLASK provides a holistic view and a comprehensive analysis of a model's performance depending on skill, domain, and difficulty. Using FLASK, we compare multiple open-source and proprietary LLMs and observe highly correlated findings between model-based and human-based evaluations. FLASK enables developers to measure model performance more accurately and to improve it by analyzing the factors that make LLMs proficient in particular skills. For practitioners, FLASK can be used to recommend suitable models for particular situations through a comprehensive comparison of various LLMs. We release the evaluation data and code implementation at https://github.com/kaistAI/FLASK.
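
For illustration, below is a minimal Python sketch of how a FLASK-style evaluation instance (annotated with a skill set, domain, and difficulty) and per-skill score aggregation could be represented. The skill labels, field names, and 1-5 rating scale are assumptions made for this example; the actual data schema and evaluation code are those released in the linked repository.

```python
from dataclasses import dataclass
from statistics import mean

# Illustrative skill taxonomy: FLASK defines 12 fine-grained skills; the exact
# labels in the released data may differ from these placeholder names.
SKILLS = [
    "logical_robustness", "logical_correctness", "logical_efficiency",
    "factuality", "commonsense_understanding", "comprehension",
    "insightfulness", "completeness", "metacognition",
    "readability", "conciseness", "harmlessness",
]

@dataclass
class EvalInstance:
    """One evaluation instance with its annotated metadata (hypothetical schema)."""
    instruction: str   # open-ended user instruction
    skills: list[str]  # subset of SKILLS allocated to this instance
    domain: str        # annotated target domain, e.g. "coding"
    difficulty: int    # annotated difficulty level, e.g. 1 (easy) to 5 (hard)

def aggregate_by_skill(instance_scores: list[dict[str, int]]) -> dict[str, float]:
    """Average per-skill ratings (e.g. 1-5 scores from a model- or human-based
    evaluator) across all instances annotated with each skill."""
    per_skill: dict[str, list[int]] = {}
    for scores in instance_scores:
        for skill, rating in scores.items():
            per_skill.setdefault(skill, []).append(rating)
    return {skill: mean(ratings) for skill, ratings in per_skill.items()}

# Example: two scored responses, each rated only on the skills allocated
# to its instance.
print(aggregate_by_skill([
    {"factuality": 4, "completeness": 3},
    {"factuality": 5, "readability": 4},
]))  # -> {'factuality': 4.5, 'completeness': 3, 'readability': 4}
```

Aggregating ratings per skill (and, analogously, per domain or difficulty level) is what yields the fine-grained, instance-wise view the protocol describes, rather than a single overall score per response.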