FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets

Abstract

Evaluation of Large Language Models (LLMs) is challenging because aligning to human values requires the composition of multiple skills, and the required set of skills varies depending on the instruction. Recent studies have evaluated the performance of LLMs in two ways: (1) automatic evaluation on several independent benchmarks and (2) human or machine-based evaluation that gives an overall score to the response. However, both settings are coarse-grained evaluations that do not consider the nature of user instructions, which require instance-wise skill composition, and this limits the interpretation of the true capabilities of LLMs. In this paper, we introduce FLASK (Fine-grained Language Model Evaluation based on Alignment SKill Sets), a fine-grained evaluation protocol that can be used for both model-based and human-based evaluation and that decomposes coarse-level scoring into instance-wise, skill set-level scoring. Specifically, we define 12 fine-grained skills needed for LLMs to follow open-ended user instructions and construct an evaluation set by allocating a set of skills to each instance. Additionally, by annotating the target domains and difficulty level of each instance, FLASK provides a holistic view with a comprehensive analysis of a model's performance depending on skill, domain, and difficulty. Using FLASK, we compare multiple open-source and proprietary LLMs and observe highly correlated findings between model-based and human-based evaluations. FLASK enables developers to measure model performance more accurately and to see how it can be improved by analyzing the factors that make LLMs proficient in particular skills. For practitioners, FLASK can be used to recommend suitable models for particular situations through comprehensive comparison among various LLMs. We release the evaluation data and code implementation at https://github.com/kaistAI/FLASK.
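The difference between a single overall score and instance-wise, skill set-level scoring can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the released FLASK implementation: the `ScoredInstance` fields, the skill names, the integer scores and difficulty levels, and the choice of a plain mean for aggregation are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class ScoredInstance:
    """One evaluated instruction. All field names are hypothetical."""
    domain: str          # annotated target domain
    difficulty: int      # annotated difficulty level (scale is an assumption)
    skill_scores: dict   # skill name -> score, only for skills assigned to this instance


def aggregate(instances, key):
    """Average skill-level scores after grouping them by key(instance, skill)."""
    buckets = defaultdict(list)
    for inst in instances:
        for skill, score in inst.skill_scores.items():
            buckets[key(inst, skill)].append(score)
    return {group: mean(scores) for group, scores in buckets.items()}


# Toy data: each instance carries its own subset of skills (names are placeholders).
instances = [
    ScoredInstance("math", 3, {"logical_correctness": 4, "conciseness": 2}),
    ScoredInstance("history", 1, {"factuality": 5, "conciseness": 4}),
    ScoredInstance("math", 5, {"logical_correctness": 2}),
]

by_skill = aggregate(instances, lambda inst, skill: skill)
by_domain = aggregate(instances, lambda inst, skill: inst.domain)
by_difficulty = aggregate(instances, lambda inst, skill: inst.difficulty)

print(by_skill)
print(by_domain)
print(by_difficulty)
```

Because every score is stored with its skill, domain, and difficulty, the same records can be sliced along any of the three axes, which a single overall score per response would not allow.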
