BMdataset: A Musicologically Curated LilyPond Dataset
April 12, 2026
Authors: Matteo Spanio, Ilay Guler, Antonio Rodà
cs.AI
Abstract
Symbolic music research has relied almost exclusively on MIDI-based datasets; text-based engraving formats such as LilyPond remain unexplored for music understanding. We present BMdataset, a musicologically curated dataset of 393 LilyPond scores (2,646 movements) transcribed by experts directly from original Baroque manuscripts, with metadata covering composer, musical form, instrumentation, and sectional attributes. Building on this resource, we introduce LilyBERT (weights available at https://huggingface.co/csc-unipd/lilybert), a CodeBERT-based encoder adapted to symbolic music by extending the vocabulary with 115 LilyPond-specific tokens and continuing masked-language-model pre-training. Linear probing on the out-of-domain Mutopia corpus shows that, despite BMdataset's modest size (~90M tokens), fine-tuning on it alone outperforms continued pre-training on the full PDMX corpus (~15B tokens) for both composer and style classification, demonstrating that small, expertly curated datasets can be more effective than large, noisy corpora for music understanding. Combining broad pre-training with domain-specific fine-tuning yields the best results overall (84.3% composer accuracy), confirming that the two data regimes are complementary. We release the dataset, tokenizer, and model to establish a baseline for representation learning on LilyPond.
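The masked-language-model pre-training the abstract refers to follows the standard BERT masking recipe. As a minimal illustration (the token list below is hypothetical, not the paper's actual 115-token LilyPond vocabulary, and this is a sketch of the general technique rather than the authors' code):

```python
import random

# Hypothetical LilyPond-flavored tokens for illustration only;
# the paper's actual 115-token vocabulary extension is not reproduced here.
LILYPOND_TOKENS = ["\\relative", "\\clef", "\\time", "\\key", "c'4", "d'8", "\\bar"]

def mask_tokens(tokens, mask_token="<mask>", mask_prob=0.15, rng=None):
    """BERT-style masking: each selected position becomes a prediction
    target; of those, 80% are replaced by the mask token, 10% by a random
    token, and 10% are left unchanged. `labels` records the original token
    at target positions and None elsewhere (ignored in the loss)."""
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)          # model must recover the original token
            r = rng.random()
            if r < 0.8:
                inputs.append(mask_token)
            elif r < 0.9:
                inputs.append(rng.choice(LILYPOND_TOKENS))
            else:
                inputs.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)         # position excluded from the loss
    return inputs, labels

inputs, labels = mask_tokens(["\\relative", "c'4", "d'8", "\\bar", "\\time"])
```

In practice this masking is applied on top of the extended tokenizer, so the new LilyPond-specific tokens (commands like `\relative` or pitch-duration atoms like `c'4`) are treated as single units rather than being fragmented by CodeBERT's original code-oriented vocabulary.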