KoLA: Carefully Benchmarking World Knowledge of Large Language Models
June 15, 2023
Authors: Jifan Yu, Xiaozhi Wang, Shangqing Tu, Shulin Cao, Daniel Zhang-Li, Xin Lv, Hao Peng, Zijun Yao, Xiaohan Zhang, Hanming Li, Chunyang Li, Zheyuan Zhang, Yushi Bai, Yantao Liu, Amy Xin, Nianyi Lin, Kaifeng Yun, Linlu Gong, Jianhui Chen, Zhili Wu, Yunjia Qi, Weikai Li, Yong Guan, Kaisheng Zeng, Ji Qi, Hailong Jin, Jinxin Liu, Yu Gu, Yuan Yao, Ning Ding, Lei Hou, Zhiyuan Liu, Bin Xu, Jie Tang, Juanzi Li
cs.AI
Abstract
The unprecedented performance of large language models (LLMs) necessitates
improvements in evaluations. Rather than merely exploring the breadth of LLM
abilities, we believe meticulous and thoughtful designs are essential to
thorough, unbiased, and applicable evaluations. Given the importance of world
knowledge to LLMs, we construct a Knowledge-oriented LLM Assessment benchmark
(KoLA), in which we carefully design three crucial factors: (1) For ability
modeling, we mimic human cognition to form a four-level taxonomy of
knowledge-related abilities, covering 19 tasks. (2) For data, to ensure fair
comparisons, we use both Wikipedia, a corpus on which LLMs are commonly
pre-trained, and continuously collected emerging corpora, aiming to evaluate
the capacity to handle unseen data and evolving knowledge. (3) For evaluation
criteria, we adopt a contrastive system, including overall standard scores for
better numerical comparability across tasks and models, and a unique
self-contrast metric for automatically evaluating knowledge hallucination. We
evaluate 21 open-source and commercial LLMs and obtain some intriguing
findings. The KoLA dataset and open-participation leaderboard are publicly
released at https://kola.xlore.cn and will be continuously updated to provide
references for developing LLMs and knowledge-related systems.
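The "overall standard scores" criterion refers to rescaling raw task scores so results are numerically comparable across tasks with different metrics. A minimal sketch of that idea, assuming a standard score means z-score normalization of one task's raw scores across the evaluated models (the paper's exact rescaling may differ); the raw scores below are hypothetical:

```python
def standard_scores(raw_scores):
    """Map each model's raw score on one task to a z-score over all models,
    so scores from tasks with different metrics share a common scale."""
    n = len(raw_scores)
    mean = sum(raw_scores) / n
    var = sum((s - mean) ** 2 for s in raw_scores) / n  # population variance
    std = var ** 0.5
    if std == 0:  # all models tied on this task: no spread to normalize
        return [0.0] * n
    return [(s - mean) / std for s in raw_scores]

# Hypothetical per-model raw scores (e.g. F1) on a single task.
raw = [0.62, 0.48, 0.55, 0.71]
print([round(z, 2) for z in standard_scores(raw)])
```

Because each task's scores are centered and scaled the same way, a model's standardized scores can then be averaged across the 19 tasks for an overall comparison.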