MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
June 3, 2024
Authors: Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, Wenhu Chen
cs.AI
Abstract
In the age of large-scale language models, benchmarks like the Massive
Multitask Language Understanding (MMLU) have been pivotal in pushing the
boundaries of what AI can achieve in language comprehension and reasoning
across diverse domains. However, as models continue to improve, their
performance on these benchmarks has begun to plateau, making it increasingly
difficult to discern differences in model capabilities. This paper introduces
MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven
MMLU benchmark by integrating more challenging, reasoning-focused questions and
expanding the choice set from four to ten options. Additionally, MMLU-Pro
eliminates the trivial and noisy questions in MMLU. Our experimental results
show that MMLU-Pro not only raises the challenge, causing accuracy to drop by
16% to 33% compared to MMLU, but also demonstrates greater stability
under varying prompts. With 24 different prompt styles tested, the sensitivity
of model scores to prompt variations decreased from 4-5% in MMLU to just 2% in
MMLU-Pro. Additionally, we found that models utilizing Chain of Thought (CoT)
reasoning achieved better performance on MMLU-Pro compared to direct answering,
which is in stark contrast to the findings on the original MMLU, indicating
that MMLU-Pro includes more complex reasoning questions. Our assessments
confirm that MMLU-Pro is a more discriminative benchmark to better track
progress in the field.
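As a concrete illustration of the setup the abstract describes (ten lettered options per question and an optional Chain-of-Thought cue), here is a minimal Python sketch. It assumes, beyond anything stated above, that the dataset is published on the Hugging Face Hub as "TIGER-Lab/MMLU-Pro" with "question" and "options" fields; adjust the identifier and field names to the actual release if they differ.

import string

from datasets import load_dataset


def format_prompt(example: dict, use_cot: bool = True) -> str:
    """Render one question with lettered choices A..J (up to ten options)."""
    lines = [example["question"]]
    for letter, option in zip(string.ascii_uppercase, example["options"]):
        lines.append(f"{letter}. {option}")
    # The abstract reports that CoT prompting outperforms direct answering
    # on MMLU-Pro, so the step-by-step cue is the default here.
    lines.append("Answer: Let's think step by step." if use_cot else "Answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Assumed dataset location; requires the `datasets` package.
    test_set = load_dataset("TIGER-Lab/MMLU-Pro", split="test")
    print(format_prompt(test_set[0], use_cot=True))

To mirror the prompt-sensitivity analysis, one would score the same model under each of the 24 prompt styles and report the spread of accuracies (for example, max minus min); per the abstract, that spread is roughly 2 points on MMLU-Pro versus 4-5 points on MMLU.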