CMI-Bench: A Comprehensive Benchmark for Evaluating Music Instruction Following

Abstract

Recent advances in audio-text large language models (LLMs) have opened new possibilities for music understanding and generation. However, existing benchmarks are limited in scope, often relying on simplified tasks or multiple-choice evaluations that fail to reflect the complexity of real-world music analysis. We reinterpret a broad range of traditional music information retrieval (MIR) annotations as instruction-following formats and introduce CMI-Bench, a comprehensive music instruction-following benchmark designed to evaluate audio-text LLMs on a diverse set of MIR tasks. These include genre classification, emotion regression, emotion tagging, instrument classification, pitch estimation, key detection, lyrics transcription, melody extraction, vocal technique recognition, instrument performance technique detection, music tagging, music captioning, and (down)beat tracking, reflecting core challenges in MIR research. Unlike previous benchmarks, CMI-Bench adopts standardized evaluation metrics consistent with previous state-of-the-art MIR models, ensuring direct comparability with supervised approaches. We provide an evaluation toolkit supporting all open-source audio-text LLMs, including LTU, Qwen-audio, SALMONN, and MusiLingo. Experimental results reveal significant performance gaps between LLMs and supervised models, along with cultural, chronological, and gender biases, highlighting both the potential and the limitations of current models in addressing MIR tasks. CMI-Bench establishes a unified foundation for evaluating music instruction following, driving progress in music-aware LLMs.
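As a rough illustration of what recasting a traditional MIR annotation as an instruction-following example might look like, the sketch below turns a genre label into a prompt and reference answer, then scores free-text model outputs by exact-match accuracy. Everything here is an assumption made for illustration: the names `InstructionExample`, `genre_annotation_to_instruction`, and `accuracy`, the prompt wording, and the choice of exact-match scoring are hypothetical and are not taken from the CMI-Bench toolkit.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InstructionExample:
    """One instruction-following item (hypothetical structure)."""
    audio_path: str
    instruction: str
    reference: str


def genre_annotation_to_instruction(
    audio_path: str, genre: str, candidate_genres: List[str]
) -> InstructionExample:
    # Assumed prompt template; the benchmark's actual wording may differ.
    options = ", ".join(candidate_genres)
    instruction = (
        "Listen to the audio clip and name its genre. "
        f"Choose one of: {options}."
    )
    return InstructionExample(audio_path, instruction, genre)


def accuracy(predictions: List[str], references: List[str]) -> float:
    # Exact-match accuracy after trimming whitespace and lowercasing.
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal in length")
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(references)
```

Scoring the model's text with the same metric a supervised classifier would be scored with is what makes the two kinds of systems directly comparable; tasks such as beat tracking or melody extraction would need their own task-specific metrics in place of `accuracy`.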
