PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing

Abstract

The recent explosion of generative AI music systems has raised numerous concerns over data copyright, the licensing of music from musicians, and the conflict between open-source AI and large prestige companies. Such issues highlight the need for publicly available, copyright-free musical data, of which there is a large shortage, particularly for symbolic music data. To alleviate this issue, we present PDMX: a large-scale open-source dataset of over 250K public domain MusicXML scores collected from the score-sharing forum MuseScore, making it, to our knowledge, the largest available copyright-free symbolic music dataset. PDMX additionally includes a wealth of both tag and user-interaction metadata, allowing us to efficiently analyze the dataset and filter for high-quality user-generated scores. Given the additional metadata afforded by our data collection process, we conduct multitrack music generation experiments evaluating how different representative subsets of PDMX lead to different behaviors in downstream models, and how user-rating statistics can be used as an effective measure of data quality. Examples can be found at https://pnlong.github.io/PDMX.demo/.