MusiXQA: Advancing Visual Music Understanding in Multimodal Large Language Models

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable visual reasoning abilities on natural images, text-rich documents, and graphic designs. However, their ability to interpret music sheets remains underexplored. To bridge this gap, we introduce MusiXQA, the first comprehensive dataset for evaluating and advancing MLLMs in music sheet understanding. MusiXQA consists of high-quality synthetic music sheets generated via MusiXTeX, with structured annotations covering note pitch and duration, chords, clefs, key and time signatures, and text, enabling diverse visual question-answering tasks. Through extensive evaluations, we reveal significant limitations of current state-of-the-art MLLMs in this domain. Beyond benchmarking, we develop Phi-3-MusiX, an MLLM fine-tuned on our dataset, which achieves significant performance gains over GPT-based methods. The proposed dataset and model establish a foundation for future advances in MLLMs for music sheet understanding. Code, data, and model will be released upon acceptance.
