

Are We Done with MMLU?

June 6, 2024
作者: Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, Claire Barale, Robert McHardy, Joshua Harris, Jean Kaddour, Emile van Krieken, Pasquale Minervini
cs.AI

Abstract

Maybe not. We identify and analyse errors in the popular Massive Multitask Language Understanding (MMLU) benchmark. Even though MMLU is widely adopted, our analysis demonstrates numerous ground truth errors that obscure the true capabilities of LLMs. For example, we find that 57% of the analysed questions in the Virology subset contain errors. To address this issue, we introduce a comprehensive framework for identifying dataset errors using a novel error taxonomy. Then, we create MMLU-Redux, which is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects. Using MMLU-Redux, we demonstrate significant discrepancies with the model performance metrics that were originally reported. Our results strongly advocate for revising MMLU's error-ridden questions to enhance its future utility and reliability as a benchmark. Therefore, we open up MMLU-Redux for additional annotation: https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux.
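To illustrate how per-subject re-annotations like those in MMLU-Redux might be analysed, here is a minimal sketch that computes the share of questions flagged with any error in each subject. The record layout and the error-type labels below are hypothetical stand-ins, not the dataset's actual schema:

```python
from collections import Counter

# Hypothetical records mimicking re-annotated MMLU questions:
# each entry carries an error_type, where "ok" means no error was found.
records = [
    {"subject": "virology", "error_type": "wrong_groundtruth"},
    {"subject": "virology", "error_type": "bad_question_clarity"},
    {"subject": "virology", "error_type": "ok"},
    {"subject": "college_chemistry", "error_type": "ok"},
    {"subject": "college_chemistry", "error_type": "ok"},
]

def error_rate_by_subject(records):
    """Return, per subject, the fraction of questions flagged with any error."""
    totals, errors = Counter(), Counter()
    for r in records:
        totals[r["subject"]] += 1
        if r["error_type"] != "ok":
            errors[r["subject"]] += 1
    return {s: errors[s] / totals[s] for s in totals}

print(error_rate_by_subject(records))
```

With these toy records, the virology subset would show an error rate of about 0.67, echoing the kind of per-subject discrepancy the paper reports.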

