MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models

August 2, 2024
Authors: Benno Weck, Ilaria Manco, Emmanouil Benetos, Elio Quinton, George Fazekas, Dmitry Bogdanov
cs.AI

Abstract

Multimodal models that jointly process audio and language hold great promise in audio understanding and are increasingly being adopted in the music domain. By allowing users to query via text and obtain information about a given audio input, these models have the potential to enable a variety of music understanding tasks via language-based interfaces. However, their evaluation poses considerable challenges, and it remains unclear how to effectively assess their ability to correctly interpret music-related inputs with current methods. Motivated by this, we introduce MuChoMusic, a benchmark for evaluating music understanding in multimodal language models focused on audio. MuChoMusic comprises 1,187 multiple-choice questions, all validated by human annotators, on 644 music tracks sourced from two publicly available music datasets, and covering a wide variety of genres. Questions in the benchmark are crafted to assess knowledge and reasoning abilities across several dimensions that cover fundamental musical concepts and their relation to cultural and functional contexts. Through the holistic analysis afforded by the benchmark, we evaluate five open-source models and identify several pitfalls, including an over-reliance on the language modality, pointing to a need for better multimodal integration. Data and code are open-sourced.
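To illustrate how a multiple-choice benchmark of this kind is typically scored, the sketch below computes accuracy from a model's answers. It is a minimal, hypothetical example: the field names (`audio_path`, `question`, `options`, `answer_index`) and the `ask_model` callable are assumptions made for illustration and do not reflect the actual MuChoMusic data schema or released code.

```python
# Minimal sketch of multiple-choice evaluation over an audio QA benchmark.
# Field names and ask_model() are hypothetical; consult the released
# MuChoMusic data and code for the actual format and evaluation protocol.
import json
import random
from typing import Callable

def evaluate_mcq(dataset_path: str,
                 ask_model: Callable[[str, str, list[str]], int]) -> float:
    """Return the model's accuracy over a list of multiple-choice questions."""
    with open(dataset_path) as f:
        items = json.load(f)

    correct = 0
    for item in items:
        # Shuffle the answer options so that positional bias in the model
        # does not inflate (or deflate) the measured score.
        order = list(range(len(item["options"])))
        random.shuffle(order)
        shuffled = [item["options"][i] for i in order]
        gold = order.index(item["answer_index"])

        # ask_model receives the audio path, the question text, and the
        # candidate answers, and returns the index of the option it selects.
        pred = ask_model(item["audio_path"], item["question"], shuffled)
        correct += int(pred == gold)

    return correct / len(items)
```

A fuller evaluation would likely also break scores down by the knowledge and reasoning dimensions the benchmark defines, rather than reporting only overall accuracy.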
