

MusiXQA: Advancing Visual Music Understanding in Multimodal Large Language Models

June 28, 2025
Authors: Jian Chen, Wenye Ma, Penghang Liu, Wei Wang, Tengwei Song, Ming Li, Chenguang Wang, Ruiyi Zhang, Changyou Chen
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable visual reasoning abilities in natural images, text-rich documents, and graphic designs. However, their ability to interpret music sheets remains underexplored. To bridge this gap, we introduce MusiXQA, the first comprehensive dataset for evaluating and advancing MLLMs in music sheet understanding. MusiXQA features high-quality synthetic music sheets generated via MusiXTeX, with structured annotations covering note pitch and duration, chords, clefs, key/time signatures, and text, enabling diverse visual QA tasks. Through extensive evaluations, we reveal significant limitations of current state-of-the-art MLLMs in this domain. Beyond benchmarking, we develop Phi-3-MusiX, an MLLM fine-tuned on our dataset, which achieves substantial performance gains over GPT-based methods. The proposed dataset and model establish a foundation for future advances in MLLMs for music sheet understanding. Code, data, and model will be released upon acceptance.
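For concreteness, here is a minimal sketch of what one step of a MusiXTeX-based synthesis pipeline could look like. The paper does not publish its generator or annotation schema, so the helper name `make_bar`, the annotation fields, and the pitch-letter mapping (MusiXTeX's conventional `c`..`j` for C4..C5 on a treble staff) are illustrative assumptions, not the authors' implementation:

```python
import json
import random

# Hypothetical sketch: emit a one-bar MusiXTeX fragment together with a
# structured annotation, mirroring the paired (score, ground-truth labels)
# design the abstract describes. Pitch letters follow MusiXTeX's common
# convention where 'c'..'j' denote C4..C5 on a treble staff.
PITCHES = {"c": "C4", "d": "D4", "e": "E4", "f": "F4",
           "g": "G4", "h": "A4", "i": "B4", "j": "C5"}

def make_bar(rng: random.Random, n_notes: int = 4):
    letters = [rng.choice(list(PITCHES)) for _ in range(n_notes)]
    # \qa{...} typesets auto-stemmed quarter notes in MusiXTeX;
    # \generalmeter{\meterfrac44} sets a 4/4 time signature.
    tex = (
        "\\begin{music}\n"
        "\\generalmeter{\\meterfrac44}\n"
        "\\startextract\n"
        f"\\Notes \\qa{{{''.join(letters)}}}\\en\n"
        "\\endextract\n"
        "\\end{music}\n"
    )
    annotation = {
        "clef": "treble",
        "time_signature": "4/4",
        "notes": [{"pitch": PITCHES[l], "duration": "quarter"} for l in letters],
    }
    return tex, annotation

if __name__ == "__main__":
    tex, ann = make_bar(random.Random(0))
    print(tex)
    print(json.dumps(ann, indent=2))
```

Compiling the emitted fragment with a MusiXTeX-enabled LaTeX toolchain would yield the rendered score image, while the JSON annotation carries the ground-truth labels from which visual QA pairs could be derived.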