MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
June 3, 2024
Authors: Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, Wenhu Chen
cs.AI
Abstract
In the age of large-scale language models, benchmarks like the Massive
Multitask Language Understanding (MMLU) have been pivotal in pushing the
boundaries of what AI can achieve in language comprehension and reasoning
across diverse domains. However, as models continue to improve, their
performance on these benchmarks has begun to plateau, making it increasingly
difficult to discern differences in model capabilities. This paper introduces
MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven
MMLU benchmark by integrating more challenging, reasoning-focused questions and
expanding the choice set from four to ten options. Additionally, MMLU-Pro
eliminates the trivial and noisy questions in MMLU. Our experimental results
show that MMLU-Pro not only raises the difficulty, dropping accuracy by 16% to
33% relative to MMLU, but also demonstrates greater stability
under varying prompts. With 24 different prompt styles tested, the sensitivity
of model scores to prompt variations decreased from 4-5% in MMLU to just 2% in
MMLU-Pro. Additionally, we found that models utilizing Chain of Thought (CoT)
reasoning perform better on MMLU-Pro than with direct answering,
which is in stark contrast to the findings on the original MMLU, indicating
that MMLU-Pro includes more complex reasoning questions. Our assessments
confirm that MMLU-Pro is a more discriminative benchmark to better track
progress in the field.
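To make the evaluation setup concrete, the sketch below shows one way to score a model under MMLU-Pro's ten-option format, with a chain-of-thought cue in the prompt, and to probe prompt sensitivity as the max-min accuracy spread across paraphrased instructions. It assumes the dataset published on the Hugging Face Hub as TIGER-Lab/MMLU-Pro, with question/options/answer fields per the public dataset card; the ask_model stub and the instruction paraphrases are illustrative placeholders, not the authors' evaluation harness.

```python
# Minimal sketch: accuracy on MMLU-Pro's ten-option questions plus a
# prompt-sensitivity probe. The dataset name and field names follow the
# public Hugging Face dataset card and are assumptions, not the paper's
# official harness.
from datasets import load_dataset

OPTION_LABELS = "ABCDEFGHIJ"  # MMLU-Pro expands MMLU's 4 choices to up to 10

def ask_model(prompt: str) -> str:
    """Stand-in for the model under evaluation. Replace with a real LLM
    call that returns a single letter in "A"-"J"."""
    return "A"  # placeholder so the sketch runs end to end

def format_question(example: dict, instruction: str) -> str:
    """Render one question with lettered options and a CoT-style cue."""
    lines = [instruction, "", f"Question: {example['question']}"]
    for label, option in zip(OPTION_LABELS, example["options"]):
        lines.append(f"{label}. {option}")
    lines.append("Answer: Let's think step by step.")
    return "\n".join(lines)

def accuracy(dataset, instruction: str) -> float:
    """Fraction of questions where the model's letter matches the gold label."""
    correct = sum(
        ask_model(format_question(ex, instruction)) == ex["answer"]
        for ex in dataset
    )
    return correct / len(dataset)

# Small sample to keep the sketch fast; drop .select() for a full run.
test_set = load_dataset("TIGER-Lab/MMLU-Pro", split="test").select(range(100))

# Prompt-sensitivity probe: score the same model under paraphrased
# instructions and report the spread (max - min) of accuracy.
instructions = [
    "Answer the following multiple-choice question.",
    "Choose the single best option for the question below.",
    "Select the correct answer from the given options.",
]
scores = [accuracy(test_set, inst) for inst in instructions]
print(f"accuracy spread across prompts: {max(scores) - min(scores):.3f}")
```

Under this probe, a smaller spread corresponds to the lower prompt sensitivity the abstract reports: roughly 2% across 24 prompt styles on MMLU-Pro versus 4-5% on MMLU.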