MusiXQA: Advancing Visual Music Understanding in Multimodal Large Language Models
June 28, 2025
Authors: Jian Chen, Wenye Ma, Penghang Liu, Wei Wang, Tengwei Song, Ming Li, Chenguang Wang, Ruiyi Zhang, Changyou Chen
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable visual
reasoning abilities in natural images, text-rich documents, and graphic
designs. However, their ability to interpret music sheets remains
underexplored. To bridge this gap, we introduce MusiXQA, the first
comprehensive dataset for evaluating and advancing MLLMs in music sheet
understanding. MusiXQA features high-quality synthetic music sheets generated
via MusiXTeX, with structured annotations covering note pitch and duration,
chords, clefs, key/time signatures, and text, enabling diverse visual QA tasks.
Through extensive evaluations, we reveal significant limitations of current
state-of-the-art MLLMs in this domain. Beyond benchmarking, we develop
Phi-3-MusiX, an MLLM fine-tuned on our dataset, which achieves substantial
performance gains over GPT-based methods. The proposed dataset and model
establish a foundation for future advances in MLLMs for music sheet
understanding. Code, data, and models will be released upon acceptance.