
Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models

June 18, 2024
Authors: Devichand Budagam, Sankalp KJ, Ashutosh Kumar, Vinija Jain, Aman Chadha
cs.AI

Abstract

Assessing the effectiveness of large language models (LLMs) across diverse tasks is essential for understanding their strengths and weaknesses. Conventional evaluation techniques typically apply a single prompting strategy uniformly across a dataset, without accounting for variations in task complexity. We introduce the Hierarchical Prompting Taxonomy (HPT), which employs a Hierarchical Prompt Framework (HPF) of five distinct prompting strategies, ordered from simplest to most complex, to assess LLMs more precisely and to offer a clearer perspective on their capabilities. Based on its rules, the taxonomy assigns a Hierarchical Prompting Score (HP-Score) to both datasets and LLMs, providing a nuanced view of their ability to solve diverse tasks and a universal measure of task complexity. We also introduce the Adaptive Hierarchical Prompt framework, which automates the selection of an appropriate prompting strategy for each task. This study compares the manual and adaptive hierarchical prompt frameworks using four instruction-tuned LLMs, namely Llama 3 8B, Phi 3 3.8B, Mistral 7B, and Gemma 7B, across four datasets: BoolQ, CommonSenseQA (CSQA), IWSLT-2017 en-fr (IWSLT), and SamSum. Experiments demonstrate the effectiveness of HPT, providing a reliable way to compare tasks and LLM capabilities. This work lays the groundwork for a universal evaluation metric that can measure both dataset complexity and LLM capability. The implementations of both the manual and adaptive HPF are publicly available.
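The abstract describes the scoring rule only at a high level. Below is a minimal Python sketch of how a manual HPF evaluation loop might look, assuming a task's HP-Score is the index of the simplest prompting level at which the model succeeds. The five prompt templates, the `llm` callable, and the `is_correct` checker are illustrative placeholders, not the paper's actual implementation.

```python
from typing import Callable

# Hypothetical prompt templates, ordered simplest -> most complex.
# The paper's HPF defines five concrete strategies; these stand-ins
# only illustrate the control flow of a manual hierarchical sweep.
LEVELS = [
    "Answer the question: {task}",
    "You are a domain expert. Answer the question: {task}",
    "Think step by step, then answer: {task}",
    "Break the problem into sub-problems, solve each, then answer: {task}",
    "First list relevant facts, then use them to answer: {task}",
]

def hp_score(llm: Callable[[str], str],
             task: str,
             is_correct: Callable[[str], bool]) -> int:
    """Return the 1-based index of the first level at which the model
    answers correctly; a lower HP-Score means the task was easier for
    this model. Tasks failed at every level get a maximum penalty."""
    for level, template in enumerate(LEVELS, start=1):
        if is_correct(llm(template.format(task=task))):
            return level
    return len(LEVELS) + 1  # no level succeeded

if __name__ == "__main__":
    # Stub model and checker, just to show the call shape.
    demo = hp_score(llm=lambda prompt: "yes",
                    task="Is the sky blue?",
                    is_correct=lambda answer: "yes" in answer.lower())
    print(f"HP-Score: {demo}")  # 1: the stub succeeds at level 1
```

Under this reading, the adaptive HPF would replace the linear sweep with a selector that picks a prompting level directly for each task, and a dataset-level HP-Score could be the mean of per-task scores; both are assumptions about details the abstract leaves open.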

