

StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation

August 6, 2024
Authors: Boxi Cao, Mengjie Ren, Hongyu Lin, Xianpei Han, Feng Zhang, Junfeng Zhan, Le Sun
cs.AI

Abstract

Evaluation is the baton for the development of large language models. Current evaluations typically employ a single-item assessment paradigm for each atomic test objective, which struggles to discern whether a model genuinely possesses the required capabilities or merely memorizes/guesses the answers to specific questions. To this end, we propose a novel evaluation framework referred to as StructEval. Starting from an atomic test objective, StructEval deepens and broadens the evaluation by conducting a structured assessment across multiple cognitive levels and critical concepts, and therefore offers a comprehensive, robust and consistent evaluation for LLMs. Experiments on three widely-used benchmarks demonstrate that StructEval serves as a reliable tool for resisting the risk of data contamination and reducing the interference of potential biases, thereby providing more reliable and consistent conclusions regarding model capabilities. Our framework also sheds light on the design of future principled and trustworthy LLM evaluation protocols.
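To make the core idea concrete, below is a minimal, illustrative sketch of the kind of structured scoring the abstract describes. It is not the authors' implementation: the class, function names, level labels, and the averaging rule are all assumptions chosen for clarity. The point it illustrates is that expanding an atomic test objective into items across cognitive levels, and aggregating within and then across levels, keeps a model from getting full credit by memorizing a single question.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical sketch of StructEval-style aggregation (names are illustrative,
# not the paper's actual API): instead of scoring one item per objective,
# score a structured set of items spanning cognitive levels and key concepts.

@dataclass
class TestItem:
    objective: str        # atomic test objective, e.g. "photosynthesis"
    cognitive_level: str  # e.g. "remember", "understand", "apply"
    concept: str          # critical concept probed by this item
    correct: bool         # whether the model answered this item correctly

def structured_score(items: list[TestItem]) -> float:
    """Average accuracy within each cognitive level, then across levels,
    so memorizing a single item cannot inflate the objective-level score."""
    by_level: dict[str, list[bool]] = {}
    for item in items:
        by_level.setdefault(item.cognitive_level, []).append(item.correct)
    return mean(mean(answers) for answers in by_level.values())

# Example: a model that only memorized the single "remember" item.
items = [
    TestItem("photosynthesis", "remember", "definition", True),
    TestItem("photosynthesis", "understand", "light reaction", False),
    TestItem("photosynthesis", "apply", "limiting factors", False),
]
print(structured_score(items))  # ~0.33 here, vs. 1.0 under single-item evaluation
```

Under a single-item paradigm, the memorized "remember" question alone would mark the objective as passed; under the structured aggregation sketched above, the failures at deeper cognitive levels pull the score down, which is the consistency property the abstract claims.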
