MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models
August 2, 2024
Authors: Benno Weck, Ilaria Manco, Emmanouil Benetos, Elio Quinton, George Fazekas, Dmitry Bogdanov
cs.AI
Abstract
Multimodal models that jointly process audio and language hold great promise
in audio understanding and are increasingly being adopted in the music domain.
By allowing users to query via text and obtain information about a given audio
input, these models have the potential to enable a variety of music
understanding tasks via language-based interfaces. However, their evaluation
poses considerable challenges, and it remains unclear how to effectively assess
their ability to correctly interpret music-related inputs with current methods.
Motivated by this, we introduce MuChoMusic, a benchmark for evaluating music
understanding in multimodal language models focused on audio. MuChoMusic
comprises 1,187 multiple-choice questions, all validated by human annotators,
on 644 music tracks sourced from two publicly available music datasets, and
covering a wide variety of genres. Questions in the benchmark are crafted to
assess knowledge and reasoning abilities across several dimensions that cover
fundamental musical concepts and their relation to cultural and functional
contexts. Through the holistic analysis afforded by the benchmark, we evaluate
five open-source models and identify several pitfalls, including an
over-reliance on the language modality, pointing to a need for better
multimodal integration. Data and code are open-sourced.