BMdataset: A Musicologically Curated LilyPond Dataset
April 12, 2026
Authors: Matteo Spanio, Ilay Guler, Antonio Rodà
cs.AI
Abstract
Symbolic music research has relied almost exclusively on MIDI-based datasets; text-based engraving formats such as LilyPond remain unexplored for music understanding. We present BMdataset, a musicologically curated dataset of 393 LilyPond scores (2,646 movements) transcribed by experts directly from original Baroque manuscripts, with metadata covering composer, musical form, instrumentation, and sectional attributes. Building on this resource, we introduce LilyBERT (weights can be found at https://huggingface.co/csc-unipd/lilybert), a CodeBERT-based encoder adapted to symbolic music through vocabulary extension with 115 LilyPond-specific tokens and masked-language-model pre-training. Linear probing on the out-of-domain Mutopia corpus shows that, despite its modest size (~90M tokens), fine-tuning on BMdataset alone outperforms continued pre-training on the full PDMX corpus (~15B tokens) for both composer and style classification, demonstrating that small, expertly curated datasets can be more effective than large, noisy corpora for music understanding. Combining broad pre-training with domain-specific fine-tuning yields the best results overall (84.3% composer accuracy), confirming that the two data regimes are complementary. We release the dataset, tokenizer, and model to establish a baseline for representation learning on LilyPond.
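The vocabulary-extension step described above can be sketched as follows. This is a minimal, self-contained illustration of appending domain-specific tokens to a subword vocabulary, as done for LilyBERT's 115 LilyPond tokens; the token list and `extend_vocab` helper here are illustrative stand-ins, not the actual tokenizer code or token set.

```python
def extend_vocab(vocab: dict[str, int], new_tokens: list[str]) -> int:
    """Append tokens not already in the vocabulary; return how many were added."""
    added = 0
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)  # new token gets the next free id
            added += 1
    return added

# Stand-in for the base (CodeBERT) vocabulary.
base_vocab = {"<s>": 0, "</s>": 1, "note": 2}
# Illustrative subset of LilyPond commands, not the real 115 tokens.
lily_tokens = ["\\relative", "\\clef", "\\time", "note"]

n = extend_vocab(base_vocab, lily_tokens)
# n == 3: "note" is already present, so only three tokens are appended.
```

After extending the tokenizer this way, the model's token-embedding matrix must be enlarged to match the new vocabulary size before masked-language-model pre-training can proceed.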