MusiXQA: Advancing Visual Music Understanding in Multimodal Large Language Models
June 28, 2025
Authors: Jian Chen, Wenye Ma, Penghang Liu, Wei Wang, Tengwei Song, Ming Li, Chenguang Wang, Ruiyi Zhang, Changyou Chen
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable visual
reasoning abilities in natural images, text-rich documents, and graphic
designs. However, their ability to interpret music sheets remains
underexplored. To bridge this gap, we introduce MusiXQA, the first
comprehensive dataset for evaluating and advancing MLLMs in music sheet
understanding. MusiXQA features high-quality synthetic music sheets generated
via MusiXTeX, with structured annotations covering note pitch and duration,
chords, clefs, key/time signatures, and text, enabling diverse visual QA tasks.
Through extensive evaluations, we reveal significant limitations of current
state-of-the-art MLLMs in this domain. Beyond benchmarking, we develop
Phi-3-MusiX, an MLLM fine-tuned on our dataset, which achieves substantial
performance gains over GPT-based methods. The proposed dataset and model
establish a foundation for future advances in MLLMs for music sheet
understanding. Code, data, and models will be released upon acceptance.