CMI-Bench: A Comprehensive Benchmark for Evaluating Music Instruction Following
June 14, 2025
Authors: Yinghao Ma, Siyou Li, Juntao Yu, Emmanouil Benetos, Akira Maezawa
cs.AI
Abstract
Recent advances in audio-text large language models (LLMs) have opened new
possibilities for music understanding and generation. However, existing
benchmarks are limited in scope, often relying on simplified tasks or
multiple-choice evaluations that fail to reflect the complexity of real-world
music analysis. We reinterpret a broad range of traditional music information
retrieval (MIR) annotations as instruction-following formats and introduce
CMI-Bench, a comprehensive music instruction-following benchmark designed to
evaluate audio-text LLMs on a diverse set of MIR tasks. These include genre
classification, emotion regression, emotion tagging, instrument classification,
pitch estimation, key detection, lyrics transcription, melody extraction, vocal
technique recognition, instrument performance technique detection, music
tagging, music captioning, and (down)beat tracking, reflecting core challenges
in MIR research. Unlike previous benchmarks, CMI-Bench adopts standardized
evaluation metrics consistent with previous state-of-the-art MIR models,
ensuring direct comparability with supervised approaches. We provide an
evaluation toolkit supporting all open-source audio-text LLMs, including
LTU, Qwen-audio, SALMONN, and MusiLingo. Experimental results reveal significant
performance gaps between LLMs and supervised models, along with the LLMs' cultural,
chronological, and gender biases, highlighting the potential and limitations of
current models in addressing MIR tasks. CMI-Bench establishes a unified
foundation for evaluating music instruction following, driving progress in
music-aware LLMs.
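
To make the idea of recasting MIR annotations as instruction-following samples more concrete, here is a minimal Python sketch for genre classification. It is only an illustration under stated assumptions: the prompt template, the `InstructionSample` class, and the helpers `to_instruction_sample`, `parse_label`, and `accuracy` are hypothetical names and are not taken from the CMI-Bench toolkit. The sketch maps a free-text model response back to a label so that a standard classification metric (plain accuracy here) can be computed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InstructionSample:
    audio_path: str   # path to the audio clip given to the model
    instruction: str  # natural-language prompt
    reference: str    # ground-truth label from the original annotation


# Hypothetical prompt template; the benchmark's actual prompts may differ.
GENRE_TEMPLATE = "Listen to the audio and name its genre. Choose one of: {labels}."


def to_instruction_sample(audio_path: str, genre: str,
                          label_set: List[str]) -> InstructionSample:
    """Wrap a (clip, genre) annotation as an instruction-following sample."""
    if genre not in label_set:
        raise ValueError(f"unknown genre label: {genre!r}")
    prompt = GENRE_TEMPLATE.format(labels=", ".join(label_set))
    return InstructionSample(audio_path, prompt, genre)


def parse_label(response: str, label_set: List[str]) -> Optional[str]:
    """Return the single label mentioned in a free-text response.

    Returns None when no label or more than one label is mentioned.
    Uses simple case-insensitive substring matching (an assumption).
    """
    text = response.lower()
    hits = [label for label in label_set if label.lower() in text]
    return hits[0] if len(hits) == 1 else None


def accuracy(responses: List[str], references: List[str],
             label_set: List[str]) -> float:
    """Fraction of responses whose parsed label equals the reference."""
    correct = sum(parse_label(resp, label_set) == ref
                  for resp, ref in zip(responses, references))
    return correct / len(references)


labels = ["blues", "jazz", "rock"]
sample = to_instruction_sample("clip.wav", "jazz", labels)
score = accuracy(
    ["This sounds like jazz.", "Probably rock or blues.", "Rock"],
    ["jazz", "rock", "rock"],
    labels,
)
```

In this sketch an ambiguous answer that names several labels is scored as incorrect, which is one possible design choice; a real evaluation would need task-specific parsing rules and the metrics customary for each MIR task.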