
StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation

August 6, 2024
Authors: Boxi Cao, Mengjie Ren, Hongyu Lin, Xianpei Han, Feng Zhang, Junfeng Zhan, Le Sun
cs.AI

Abstract

Evaluation is the baton for the development of large language models. Current evaluations typically employ a single-item assessment paradigm for each atomic test objective, which struggles to discern whether a model genuinely possesses the required capabilities or merely memorizes/guesses the answers to specific questions. To this end, we propose a novel evaluation framework referred to as StructEval. Starting from an atomic test objective, StructEval deepens and broadens the evaluation by conducting a structured assessment across multiple cognitive levels and critical concepts, and therefore offers a comprehensive, robust and consistent evaluation for LLMs. Experiments on three widely-used benchmarks demonstrate that StructEval serves as a reliable tool for resisting the risk of data contamination and reducing the interference of potential biases, thereby providing more reliable and consistent conclusions regarding model capabilities. Our framework also sheds light on the design of future principled and trustworthy LLM evaluation protocols.
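As a rough illustration of the "deepen and broaden" idea described in the abstract, the sketch below shows how a single atomic test objective might be expanded into a structured suite of items spanning several cognitive levels and key concepts, and how per-cell results could be aggregated into one score. The cognitive levels, class names, and templated question generator here are illustrative assumptions, not the paper's actual pipeline or released code.

```python
from dataclasses import dataclass

# Cognitive levels loosely following Bloom's taxonomy; the levels actually
# used by StructEval may differ (assumption for illustration only).
COGNITIVE_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]


@dataclass
class TestItem:
    objective: str   # the atomic test objective this item probes
    level: str       # cognitive level being exercised
    concept: str     # critical concept the item focuses on
    question: str    # concrete question posed to the model


def build_structured_suite(objective: str, key_concepts: list[str]) -> list[TestItem]:
    """Expand one atomic test objective into a structured suite of items
    covering multiple cognitive levels and critical concepts.

    In a real system each question would be authored or generated separately;
    a simple template stands in for that step here (hypothetical).
    """
    suite = []
    for concept in key_concepts:
        for level in COGNITIVE_LEVELS:
            question = (
                f"[{level}] Regarding '{concept}' (objective: {objective}), "
                f"answer a question that requires you to {level} this concept."
            )
            suite.append(TestItem(objective, level, concept, question))
    return suite


def aggregate(correct_by_cell: dict[tuple[str, str], bool]) -> float:
    """Score = fraction of (level, concept) cells answered correctly.

    A model that merely memorized the original seed question should fail
    most other cells, so the structured score is harder to inflate.
    """
    return sum(correct_by_cell.values()) / len(correct_by_cell)


if __name__ == "__main__":
    items = build_structured_suite(
        objective="Newton's second law",
        key_concepts=["force", "mass", "acceleration"],
    )
    print(f"{len(items)} structured test items generated")
    print(items[0].question)
```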
