Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models

Abstract

Assessing how effectively large language models (LLMs) handle diverse tasks is essential for understanding their strengths and weaknesses. Conventional evaluation techniques typically apply a single prompting strategy uniformly across a dataset, without accounting for differences in task complexity. We introduce the Hierarchical Prompting Taxonomy (HPT), a taxonomy that uses a Hierarchical Prompt Framework (HPF) composed of five distinct prompting strategies, ordered from the simplest to the most complex, to assess LLMs more precisely and offer a clearer perspective. The taxonomy assigns a score, called the Hierarchical Prompting Score (HP-Score), to both datasets and LLMs according to its rules. This score gives a nuanced view of an LLM's ability to solve diverse tasks and serves as a universal measure of task complexity. We also introduce the Adaptive Hierarchical Prompt framework, which automates the selection of an appropriate prompting strategy for each task. We compare the manual and adaptive hierarchical prompt frameworks using four instruction-tuned LLMs (Llama 3 8B, Phi 3 3.8B, Mistral 7B, and Gemma 7B) across four datasets: BoolQ, CommonSenseQA (CSQA), IWSLT-2017 en-fr (IWSLT), and SamSum. The experiments demonstrate the effectiveness of HPT and provide a reliable way to compare different tasks and LLM capabilities. This work contributes toward a universal evaluation metric that can assess both the complexity of datasets and the capabilities of LLMs. The implementations of both the manual HPF and the adaptive HPF are publicly available.
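The abstract does not spell out the rules by which the HP-Score is computed, so the following is only a minimal sketch of one way a hierarchical score of this kind could work. It assumes that each sample is tried against the five strategies in order of increasing complexity, that a sample's score is the first level at which it is solved, and that the dataset-level score is the mean over samples. The names first_solving_level, hp_score, and solves_at_level, as well as the penalty applied to samples that no level solves, are hypothetical and not taken from the paper.

```python
from typing import Callable, Optional, Sequence, TypeVar

Sample = TypeVar("Sample")

# The abstract states that the framework has five strategies,
# ordered from the simplest to the most complex.
NUM_LEVELS = 5


def first_solving_level(
    sample: Sample,
    solves_at_level: Callable[[Sample, int], bool],
) -> Optional[int]:
    """Return the lowest level (1..NUM_LEVELS) that solves the sample, or None."""
    for level in range(1, NUM_LEVELS + 1):
        if solves_at_level(sample, level):
            return level
    return None


def hp_score(
    samples: Sequence[Sample],
    solves_at_level: Callable[[Sample, int], bool],
    penalty: float = NUM_LEVELS + 1,  # assumption: unsolved samples cost one step beyond the top level
) -> float:
    """Average the first solving level over all samples (lower means simpler prompting sufficed)."""
    if not samples:
        raise ValueError("samples must not be empty")
    total = 0.0
    for sample in samples:
        level = first_solving_level(sample, solves_at_level)
        total += level if level is not None else penalty
    return total / len(samples)


if __name__ == "__main__":
    # Toy stand-in for a model: each sample is the minimum level it needs.
    def toy_solver(required_level: int, level: int) -> bool:
        return level >= required_level

    # 99 marks a sample that no level can solve.
    print(hp_score([1, 3, 99], toy_solver))
```

Under these assumptions, a lower score indicates that a model solves the dataset with simpler prompting, which is the sense in which a single number could reflect both model capability and task complexity.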
