BMdataset: A Musicologically Curated LilyPond Dataset

Abstract

Symbolic music research has relied almost exclusively on MIDI-based datasets; text-based engraving formats such as LilyPond remain unexplored for music understanding. We present BMdataset, a musicologically curated dataset of 393 LilyPond scores (2,646 movements) transcribed by experts directly from original Baroque manuscripts, with metadata covering composer, musical form, instrumentation, and sectional attributes. Building on this resource, we introduce LilyBERT (weights can be found at https://huggingface.co/csc-unipd/lilybert), a CodeBERT-based encoder adapted to symbolic music through vocabulary extension with 115 LilyPond-specific tokens and masked language model pre-training. Linear probing on the out-of-domain Mutopia corpus shows that, despite its modest size (~90M tokens), fine-tuning on BMdataset alone outperforms continuous pre-training on the full PDMX corpus (~15B tokens) for both composer and style classification. This demonstrates that small, expertly curated datasets can be more effective than large, noisy corpora for music understanding. Combining broad pre-training with domain-specific fine-tuning yields the best results overall (84.3% composer accuracy), confirming that the two data regimes are complementary. We release the dataset, tokenizer, and model to establish a baseline for representation learning on LilyPond.
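As a rough illustration of what vocabulary extension means in practice, the sketch below uses a toy character-level vocabulary and a greedy longest-match tokenizer. Everything in it is an assumption made for illustration: the four added commands are real LilyPond keywords but are not taken from the 115 tokens used by LilyBERT, the tokenizer is not CodeBERT's, the helper names `extend_vocab` and `tokenize` are hypothetical, and initialising new embedding rows with the mean of the existing rows is only one common choice.

```python
def extend_vocab(vocab, embeddings, new_tokens):
    """Append unseen tokens to vocab and add one embedding row per new token."""
    dim = len(embeddings[0])
    # Assumption: new rows start as the mean of the existing embedding rows.
    mean_row = [sum(row[i] for row in embeddings) / len(embeddings) for i in range(dim)]
    added = 0
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)          # new IDs are appended at the end
            embeddings.append(list(mean_row))
            added += 1
    return added


def tokenize(text, vocab):
    """Greedy longest-match tokenizer; whitespace is skipped, unknown characters become <unk>."""
    by_length = sorted(vocab, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        match = next((t for t in by_length if text.startswith(t, i)), None)
        if match is None:
            tokens.append("<unk>")
            i += 1
        else:
            tokens.append(match)
            i += len(match)
    return tokens


# Toy base vocabulary: single characters only (hypothetical, for illustration).
vocab = {"<unk>": 0}
for ch in "abcdefghijklmnopqrstuvwxyz0123456789{}'/\\":
    vocab[ch] = len(vocab)
embeddings = [[float(i)] * 4 for i in range(len(vocab))]

snippet = r"\relative c'"
before = tokenize(snippet, vocab)   # the command is split into single characters

# Hypothetical domain tokens; the actual LilyBERT token list is not reproduced here.
added = extend_vocab(vocab, embeddings, ["\\relative", "\\key", "\\major", "\\time"])
after = tokenize(snippet, vocab)    # the command is now a single token
```

In this toy setting the command `\relative` is fragmented into individual characters before the extension and kept as one token afterwards, which is the kind of effect domain-specific tokens are meant to achieve.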
