BIG-Bench Extra Hard

February 26, 2025
Authors: Mehran Kazemi, Bahare Fatemi, Hritik Bansal, John Palowitch, Chrysovalantis Anastasiou, Sanket Vaibhav Mehta, Lalit K. Jain, Virginia Aglietti, Disha Jindal, Peter Chen, Nishanth Dikkala, Gladys Tyen, Xin Liu, Uri Shalit, Silvia Chiappa, Kate Olszewska, Yi Tay, Vinh Q. Tran, Quoc V. Le, Orhan Firat
cs.AI

Abstract

Large language models (LLMs) are increasingly deployed in everyday applications, demanding robust general reasoning capabilities and diverse reasoning skillset. However, current LLM reasoning benchmarks predominantly focus on mathematical and coding abilities, leaving a gap in evaluating broader reasoning proficiencies. One particular exception is the BIG-Bench dataset, which has served as a crucial benchmark for evaluating the general reasoning capabilities of LLMs, thanks to its diverse set of challenging tasks that allowed for a comprehensive assessment of general reasoning across various skills within a unified framework. However, recent advances in LLMs have led to saturation on BIG-Bench, and its harder version BIG-Bench Hard (BBH). State-of-the-art models achieve near-perfect scores on many tasks in BBH, thus diminishing its utility. To address this limitation, we introduce BIG-Bench Extra Hard (BBEH), a new benchmark designed to push the boundaries of LLM reasoning evaluation. BBEH replaces each task in BBH with a novel task that probes a similar reasoning capability but exhibits significantly increased difficulty. We evaluate various models on BBEH and observe a (harmonic) average accuracy of 9.8% for the best general-purpose model and 44.8% for the best reasoning-specialized model, indicating substantial room for improvement and highlighting the ongoing challenge of achieving robust general reasoning in LLMs. We release BBEH publicly at: https://github.com/google-deepmind/bbeh.
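
The abstract reports a (harmonic) average accuracy across the benchmark's tasks. As a minimal illustration, the Python sketch below shows how such a harmonic-mean aggregate over per-task accuracies could be computed; the task names and scores are hypothetical, and the small floor used to guard against zero scores is an assumption for illustration, not the aggregation rule defined by BBEH.

from statistics import harmonic_mean

def aggregate_accuracy(per_task_accuracy: dict[str, float]) -> float:
    """Harmonic-mean aggregate of per-task accuracies (in percent).

    The harmonic mean is dominated by the weakest tasks, so a high
    aggregate requires broad competence rather than a few strong tasks.
    """
    # A zero accuracy would drive the harmonic mean to zero, so apply a
    # tiny floor purely for illustration; this guard is an assumption,
    # not the aggregation rule used by BBEH.
    floored = [max(score, 1e-9) for score in per_task_accuracy.values()]
    return harmonic_mean(floored)

# Hypothetical per-task scores, for illustration only.
scores = {"task_a": 35.0, "task_b": 12.0, "task_c": 4.0}
print(f"Harmonic mean accuracy: {aggregate_accuracy(scores):.1f}%")

Because the harmonic mean weights low scores heavily, a model that fails badly on even a few tasks receives a low aggregate, which is consistent with the gap the abstract reports between general-purpose and reasoning-specialized models.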
