BMdataset: A Musicologically Curated LilyPond Dataset

Abstract

Symbolic music research has relied almost exclusively on MIDI-based datasets; text-based engraving formats such as LilyPond remain unexplored for music understanding. We present BMdataset, a musicologically curated dataset of 393 LilyPond scores (2,646 movements) transcribed by experts directly from original Baroque manuscripts, with metadata covering composer, musical form, instrumentation, and sectional attributes. Building on this resource, we introduce LilyBERT (weights available at https://huggingface.co/csc-unipd/lilybert), a CodeBERT-based encoder adapted to symbolic music through vocabulary extension with 115 LilyPond-specific tokens and masked language model pre-training. Linear probing on the out-of-domain Mutopia corpus shows that fine-tuning on BMdataset alone, despite the dataset's modest size (~90M tokens), outperforms continued pre-training on the full PDMX corpus (~15B tokens) for both composer and style classification, demonstrating that small, expertly curated datasets can be more effective than large, noisy corpora for music understanding. Combining broad pre-training with domain-specific fine-tuning yields the best results overall (84.3% composer accuracy), confirming that the two data regimes are complementary. We release the dataset, tokenizer, and model to establish a baseline for representation learning on LilyPond.
