WildScore: Benchmarking MLLMs in-the-Wild Symbolic Music Reasoning

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across a range of vision-language tasks. Their reasoning abilities in the multimodal symbolic music domain, however, remain largely unexplored. We introduce WildScore, the first in-the-wild benchmark for multimodal symbolic music reasoning and analysis, designed to evaluate how well MLLMs interpret real-world music scores and answer complex musicological queries. Each instance in WildScore is sourced from a genuine musical composition and is accompanied by authentic user-generated questions and discussions, capturing the intricacies of practical music analysis. To support systematic evaluation, we propose a taxonomy comprising both high-level and fine-grained musicological ontologies. We further frame complex music reasoning as multiple-choice question answering, which enables controlled and scalable assessment of MLLMs' symbolic music understanding. Empirical benchmarking of state-of-the-art MLLMs on WildScore reveals intriguing patterns in their visual-symbolic reasoning, uncovering both promising directions and persistent challenges for MLLMs in symbolic music reasoning and analysis. We release the dataset and code.
