Are We Done with MMLU?
June 6, 2024
Authors: Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, Claire Barale, Robert McHardy, Joshua Harris, Jean Kaddour, Emile van Krieken, Pasquale Minervini
cs.AI
Abstract
Maybe not. We identify and analyse errors in the popular Massive Multitask
Language Understanding (MMLU) benchmark. Even though MMLU is widely adopted,
our analysis reveals numerous ground-truth errors that obscure the true
capabilities of LLMs. For example, we find that 57% of the analysed questions
in the Virology subset contain errors. To address this issue, we introduce a
comprehensive framework for identifying dataset errors using a novel error
taxonomy. Then, we create MMLU-Redux, which is a subset of 3,000 manually
re-annotated questions across 30 MMLU subjects. Using MMLU-Redux, we
demonstrate significant discrepancies with the model performance metrics that
were originally reported. Our results strongly advocate for revising MMLU's
error-ridden questions to enhance its future utility and reliability as a
benchmark. Therefore, we open up MMLU-Redux for additional annotation:
https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux.
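For readers who want to inspect the re-annotated questions directly, the dataset can be loaded from the Hugging Face Hub with the `datasets` library. The sketch below is illustrative only: the subject configuration name (`virology`), the split name, and the record layout are assumptions based on typical MMLU-style datasets, not details confirmed in the abstract.

```python
# Minimal sketch: loading MMLU-Redux from the Hugging Face Hub.
# Assumes the `datasets` library is installed. The configuration name
# "virology" and the split "test" are assumptions following common
# MMLU conventions; check the dataset card for the actual names.
from datasets import load_dataset

# MMLU-style datasets typically expose each subject as a configuration.
redux = load_dataset("edinburgh-dawg/mmlu-redux", "virology", split="test")

# Print a few re-annotated records to inspect their fields.
for example in redux.select(range(3)):
    print(example)
```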