Are We Done with MMLU?

Abstract

Maybe not. We identify and analyse errors in the popular Massive Multitask Language Understanding (MMLU) benchmark. Although MMLU is widely adopted, our analysis reveals numerous ground-truth errors that obscure the true capabilities of large language models (LLMs). For example, we find that 57% of the analysed questions in the Virology subset contain errors. To address this issue, we introduce a comprehensive framework for identifying dataset errors using a novel error taxonomy. We then create MMLU-Redux, a subset of 3,000 manually re-annotated questions across 30 MMLU subjects. Using MMLU-Redux, we demonstrate significant discrepancies with the model performance metrics that were originally reported. Our results strongly advocate for revising MMLU's error-ridden questions to enhance its future utility and reliability as a benchmark. We therefore open up MMLU-Redux for additional annotation at https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux.
