Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models
June 18, 2024
Authors: Devichand Budagam, Sankalp KJ, Ashutosh Kumar, Vinija Jain, Aman Chadha
cs.AI
Abstract
Assessing the effectiveness of large language models (LLMs) in addressing
diverse tasks is essential for comprehending their strengths and weaknesses.
Conventional evaluation techniques typically apply a single prompting strategy
uniformly across datasets, without accounting for the varying degrees of task
complexity. We introduce the Hierarchical Prompting Taxonomy (HPT), a taxonomy
that employs a Hierarchical Prompt Framework (HPF) composed of five unique
prompting strategies, arranged from the simplest to the most complex, to assess
LLMs more precisely and to offer a clearer perspective on their capabilities.
This taxonomy assigns a
score, called the Hierarchical Prompting Score (HP-Score), to datasets as well
as LLMs based on the rules of the taxonomy, providing a nuanced understanding
of their ability to solve diverse tasks and offering a universal measure of
task complexity. Additionally, we introduce the Adaptive Hierarchical Prompt
framework, which automates the selection of appropriate prompting strategies
for each task. This study compares manual and adaptive hierarchical prompt
frameworks using four instruction-tuned LLMs, namely Llama 3 8B, Phi 3 3.8B,
Mistral 7B, and Gemma 7B, across four datasets: BoolQ, CommonSenseQA (CSQA),
IWSLT-2017 en-fr (IWSLT), and SamSum. Experiments demonstrate the effectiveness
of HPT, providing a reliable way to compare different tasks and LLM
capabilities. This paper leads to the development of a universal evaluation
metric that can be used to evaluate both the complexity of the datasets and the
capabilities of LLMs. The implementation of both manual HPF and adaptive HPF is
publicly available.Summary
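To make the hierarchical traversal concrete, below is a minimal Python sketch of the manual HPF loop as the abstract describes it: prompting strategies are tried in order from simplest to most complex, and the level at which the model first solves the task yields its HP-Score. The placeholder level labels, the `build_prompt`, `llm`, and `check` callables, and the unsolved-task convention are illustrative assumptions, not the paper's exact implementation.

```python
from typing import Callable, List

# Ordered from simplest to most complex. These placeholder labels are
# assumptions for illustration; see the paper for the actual five strategies.
PROMPT_LEVELS: List[str] = [
    "level-1 (simplest)",
    "level-2",
    "level-3",
    "level-4",
    "level-5 (most complex)",
]


def hp_score(
    task: str,
    reference: str,
    build_prompt: Callable[[str, str], str],  # (level, task) -> prompt text
    llm: Callable[[str], str],                # prompt -> model response
    check: Callable[[str, str], bool],        # (response, reference) -> solved?
    levels: List[str] = PROMPT_LEVELS,
) -> int:
    """Return the 1-based index of the first prompting level at which the
    model solves the task; lower scores indicate less prompt scaffolding
    was needed. Unsolved tasks get len(levels) + 1 (an assumed convention).
    """
    for score, level in enumerate(levels, start=1):
        response = llm(build_prompt(level, task))
        if check(response, reference):
            return score
    return len(levels) + 1
```

An adaptive HPF, as described in the abstract, would replace this fixed bottom-up ascent with a selector (for example, a separate model call) that automatically picks the appropriate prompting level for each task.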