MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence
August 19, 2025
作者: Sonal Kumar, Šimon Sedláček, Vaibhavi Lokegaonkar, Fernando López, Wenyi Yu, Nishit Anand, Hyeonggon Ryu, Lichang Chen, Maxim Plička, Miroslav Hlaváček, William Fineas Ellingwood, Sathvik Udupa, Siyuan Hou, Allison Ferner, Sara Barahona, Cecilia Bolaños, Satish Rahi, Laura Herrera-Alarcón, Satvik Dixit, Siddhi Patil, Soham Deshmukh, Lasha Koroshinadze, Yao Liu, Leibny Paola Garcia Perera, Eleni Zanou, Themos Stafylakis, Joon Son Chung, David Harwath, Chao Zhang, Dinesh Manocha, Alicia Lozano-Diez, Santosh Kesiraju, Sreyan Ghosh, Ramani Duraiswami
cs.AI
Abstract
Audio comprehension, including speech, non-speech sounds, and music, is
essential for achieving human-level intelligence. Consequently, AI agents must
demonstrate holistic audio understanding to qualify as generally intelligent.
However, evaluating auditory intelligence comprehensively remains challenging.
To address this gap, we introduce MMAU-Pro, the most comprehensive and
rigorously curated benchmark for assessing audio intelligence in AI systems.
MMAU-Pro contains 5,305 instances, where each instance has one or more audios
paired with human expert-generated question-answer pairs, spanning speech,
sound, music, and their combinations. Unlike existing benchmarks, MMAU-Pro
evaluates auditory intelligence across 49 unique skills and multiple complex
dimensions, including long-form audio comprehension, spatial audio reasoning,
multi-audio understanding, among others. All questions are meticulously
designed to require deliberate multi-hop reasoning and span both
multiple-choice and open-ended response formats. Importantly, audio data is
sourced directly "from the wild" rather than from existing datasets with known
distributions. We evaluate 22 leading open-source and proprietary multimodal AI
models, revealing significant limitations: even state-of-the-art models such as
Gemini 2.5 Flash and Audio Flamingo 3 achieve only 59.2% and 51.7% accuracy,
respectively, approaching random performance in multiple categories. Our
extensive analysis highlights specific shortcomings and provides novel
insights, offering actionable perspectives for the community to enhance future
AI systems' progression toward audio general intelligence. The benchmark and
code are available at https://sonalkum.github.io/mmau-pro.