WildScore: Benchmarking MLLMs in-the-Wild Symbolic Music Reasoning

September 5, 2025
Authors: Gagan Mundada, Yash Vishe, Amit Namburi, Xin Xu, Zachary Novack, Julian McAuley, Junda Wu
cs.AI

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across various vision-language tasks. However, their reasoning abilities in the multimodal symbolic music domain remain largely unexplored. We introduce WildScore, the first in-the-wild multimodal symbolic music reasoning and analysis benchmark, designed to evaluate MLLMs' capacity to interpret real-world music scores and answer complex musicological queries. Each instance in WildScore is sourced from genuine musical compositions and accompanied by authentic user-generated questions and discussions, capturing the intricacies of practical music analysis. To facilitate systematic evaluation, we propose a systematic taxonomy, comprising both high-level and fine-grained musicological ontologies. Furthermore, we frame complex music reasoning as multiple-choice question answering, enabling controlled and scalable assessment of MLLMs' symbolic music understanding. Empirical benchmarking of state-of-the-art MLLMs on WildScore reveals intriguing patterns in their visual-symbolic reasoning, uncovering both promising directions and persistent challenges for MLLMs in symbolic music reasoning and analysis. We release the dataset and code.