

KoLA: Carefully Benchmarking World Knowledge of Large Language Models

June 15, 2023
Authors: Jifan Yu, Xiaozhi Wang, Shangqing Tu, Shulin Cao, Daniel Zhang-Li, Xin Lv, Hao Peng, Zijun Yao, Xiaohan Zhang, Hanming Li, Chunyang Li, Zheyuan Zhang, Yushi Bai, Yantao Liu, Amy Xin, Nianyi Lin, Kaifeng Yun, Linlu Gong, Jianhui Chen, Zhili Wu, Yunjia Qi, Weikai Li, Yong Guan, Kaisheng Zeng, Ji Qi, Hailong Jin, Jinxin Liu, Yu Gu, Yuan Yao, Ning Ding, Lei Hou, Zhiyuan Liu, Bin Xu, Jie Tang, Juanzi Li
cs.AI

Abstract

The unprecedented performance of large language models (LLMs) necessitates improvements in evaluations. Rather than merely exploring the breadth of LLM abilities, we believe meticulous and thoughtful designs are essential to thorough, unbiased, and applicable evaluations. Given the importance of world knowledge to LLMs, we construct a Knowledge-oriented LLM Assessment benchmark (KoLA), in which we carefully design three crucial factors: (1) For ability modeling, we mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering 19 tasks. (2) For data, to ensure fair comparisons, we use both Wikipedia, a corpus prevalently pre-trained by LLMs, and continuously collected emerging corpora, aiming to evaluate the capacity to handle unseen data and evolving knowledge. (3) For evaluation criteria, we adopt a contrastive system, including overall standard scores for better numerical comparability across tasks and models, and a unique self-contrast metric for automatically evaluating knowledge hallucination. We evaluate 21 open-source and commercial LLMs and obtain some intriguing findings. The KoLA dataset and open-participation leaderboard are publicly released at https://kola.xlore.cn and will be continuously updated to provide references for developing LLMs and knowledge-related systems.
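The abstract mentions "overall standard scores" for cross-task comparability but does not spell out the formula. As a rough, hedged illustration only (not the paper's exact definition), the sketch below applies a per-task z-score rescaling, a common way to make metrics with different scales comparable across models; the helper name `standardized_scores` and the toy numbers are assumptions introduced here for illustration.

```python
import numpy as np

def standardized_scores(raw: np.ndarray) -> np.ndarray:
    """Rescale raw task scores so heterogeneous metrics become comparable.

    raw: (n_models, n_tasks) array of task-specific metrics (e.g. F1, ROUGE).
    Returns per-task z-scores, so each task has zero mean and unit variance
    across models. NOTE: this is an illustrative assumption, not the exact
    standard-score formula used by KoLA.
    """
    mean = raw.mean(axis=0, keepdims=True)
    std = raw.std(axis=0, keepdims=True) + 1e-8  # avoid division by zero
    return (raw - mean) / std

# Toy usage: 3 models evaluated on 2 tasks whose raw metrics live on
# different scales (a 0-1 score vs. a 0-100 score).
raw = np.array([[0.62, 31.0],
                [0.55, 28.5],
                [0.71, 35.2]])
print(standardized_scores(raw))
```

After such a rescaling, a model's scores on different tasks can be averaged or compared directly, which is the kind of numerical comparability the abstract refers to.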