StemGen: A music generation model that listens
December 14, 2023
Authors: Julian D. Parker, Janne Spijkervet, Katerina Kosta, Furkan Yesiler, Boris Kuznetsov, Ju-Chiang Wang, Matt Avent, Jitong Chen, Duc Le
cs.AI
Abstract
End-to-end generation of musical audio using deep learning techniques has
seen an explosion of activity recently. However, most models concentrate on
generating fully mixed music in response to abstract conditioning information.
In this work, we present an alternative paradigm for producing music generation
models that can listen and respond to musical context. We describe how such a
model can be constructed using a non-autoregressive, transformer-based model
architecture and present a number of novel architectural and sampling
improvements. We train the described architecture on both an open-source and a
proprietary dataset. We evaluate the produced models using standard quality
metrics and a new approach based on music information retrieval descriptors.
The resulting model reaches the audio quality of state-of-the-art
text-conditioned models while exhibiting strong musical coherence with
its context.
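The abstract mentions a non-autoregressive, transformer-based architecture with novel sampling. The paper's exact procedure is not given here, but non-autoregressive token models of this family are often sampled by iterative confidence-based unmasking (as in MaskGIT-style decoding). The following is a minimal hypothetical sketch of that general idea, not StemGen's actual algorithm: `score_fn` is a stand-in for the trained transformer, and all names and schedule choices are illustrative assumptions.

```python
import numpy as np

MASK = -1  # sentinel id for a masked audio-token position

def iterative_masked_decode(score_fn, seq_len, vocab_size, steps=8, rng=None):
    """Hypothetical MaskGIT-style non-autoregressive sampling sketch.

    Start fully masked; at each step sample tokens everywhere, keep the
    most confident predictions, and re-mask the rest, so that the whole
    sequence is committed after `steps` parallel passes instead of
    seq_len autoregressive ones.
    """
    rng = np.random.default_rng(rng)
    tokens = np.full(seq_len, MASK, dtype=np.int64)
    confidence = np.full(seq_len, -np.inf)
    for step in range(steps):
        # score_fn stands in for the transformer: per-position logits
        # over the token vocabulary, given the partially masked sequence.
        logits = score_fn(tokens)                       # (seq_len, vocab_size)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        sampled = np.array([rng.choice(vocab_size, p=p) for p in probs])
        conf = probs[np.arange(seq_len), sampled]
        # Cosine schedule: fraction of positions left masked after this step.
        mask_ratio = np.cos((step + 1) / steps * np.pi / 2)
        n_keep = seq_len - int(np.floor(mask_ratio * seq_len))
        # Fill every masked slot, then re-mask all but the n_keep most
        # confident positions (a confidence-based re-masking variant).
        newly = tokens == MASK
        tokens[newly] = sampled[newly]
        confidence[newly] = conf[newly]
        remask = np.argsort(-confidence)[n_keep:]
        tokens[remask] = MASK
        confidence[remask] = -np.inf
    return tokens
```

At the final step the cosine schedule reaches zero, so every position is committed; conditioning on musical context would enter through whatever inputs the real model's `score_fn` receives.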