WildScore: Evaluating the Symbolic Music Reasoning of Multimodal Large Language Models in the Wild

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across various vision-language tasks. However, their reasoning abilities in the multimodal symbolic music domain remain largely unexplored. We introduce WildScore, the first in-the-wild multimodal symbolic music reasoning and analysis benchmark, designed to evaluate MLLMs' capacity to interpret real-world music scores and answer complex musicological queries. Each instance in WildScore is sourced from a genuine musical composition and is accompanied by authentic user-generated questions and discussions, capturing the intricacies of practical music analysis. To facilitate systematic evaluation, we propose a taxonomy comprising both high-level and fine-grained musicological ontologies. Furthermore, we frame complex music reasoning as multiple-choice question answering, enabling controlled and scalable assessment of MLLMs' symbolic music understanding. Empirical benchmarking of state-of-the-art MLLMs on WildScore reveals intriguing patterns in their visual-symbolic reasoning, uncovering both promising directions and persistent challenges for MLLMs in symbolic music reasoning and analysis. We release the dataset and code.
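To illustrate why the multiple-choice framing makes evaluation controlled and scalable, the following is a minimal sketch of per-category accuracy scoring. It is not the released WildScore code: the `MCQItem` record layout, the `normalize_choice` and `accuracy_by_category` helpers, and the category names used in the usage note are all hypothetical, and the sketch assumes each model reply contains only an option label, possibly with surrounding punctuation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class MCQItem:
    # Hypothetical record layout; the actual WildScore schema may differ.
    question: str
    options: dict   # option label -> option text, e.g. {"A": "...", "B": "..."}
    answer: str     # gold option label
    category: str   # taxonomy category the question belongs to


def normalize_choice(response, valid_labels):
    """Map a raw model reply such as ' b) ' to an option label, or None."""
    label = response.strip().strip("().:").upper()
    return label if label in valid_labels else None


def accuracy_by_category(items, responses):
    """Return {category: fraction of items answered correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, response in zip(items, responses):
        total[item.category] += 1
        # An unparseable reply normalizes to None and counts as incorrect.
        if normalize_choice(response, item.options.keys()) == item.answer:
            correct[item.category] += 1
    return {cat: correct[cat] / total[cat] for cat in total}
```

Because every question reduces to a label comparison, scoring needs no human judgment, and grouping results by category (for example, illustrative labels such as "Harmony" or "Rhythm") shows where a model is strong or weak.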