

StemGen: A music generation model that listens

December 14, 2023
Authors: Julian D. Parker, Janne Spijkervet, Katerina Kosta, Furkan Yesiler, Boris Kuznetsov, Ju-Chiang Wang, Matt Avent, Jitong Chen, Duc Le
cs.AI

Abstract

End-to-end generation of musical audio using deep learning techniques has seen an explosion of activity recently. However, most models concentrate on generating fully mixed music in response to abstract conditioning information. In this work, we present an alternative paradigm for producing music generation models that can listen and respond to musical context. We describe how such a model can be constructed using a non-autoregressive, transformer-based model architecture and present a number of novel architectural and sampling improvements. We train the described architecture on both an open-source and a proprietary dataset. We evaluate the produced models using standard quality metrics and a new approach based on music information retrieval descriptors. The resulting model reaches the audio quality of state-of-the-art text-conditioned models, as well as exhibiting strong musical coherence with its context.
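The abstract mentions a non-autoregressive, transformer-based architecture with sampling improvements. Non-autoregressive token generation is commonly implemented with confidence-based iterative unmasking (MaskGIT-style decoding): start from a fully masked sequence and, at each step, commit the model's most confident predictions. The sketch below illustrates that general scheme only; the `predict_fn` interface, the unmasking schedule, and the toy "model" are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

MASK = -1  # sentinel for a not-yet-generated token position (illustrative choice)

def iterative_masked_decode(predict_fn, seq_len, n_steps=8):
    """Confidence-based iterative decoding, a common non-autoregressive
    sampling scheme. `predict_fn(tokens)` is assumed to return a
    (seq_len, vocab) array of per-position probability distributions."""
    tokens = np.full(seq_len, MASK, dtype=int)
    for step in range(n_steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        probs = predict_fn(tokens)
        proposal = probs.argmax(axis=-1)   # most likely token per position
        conf = probs.max(axis=-1)          # model confidence per position
        # Commit the most confident masked positions now; spread the
        # remainder evenly over the steps that are left.
        n_unmask = int(np.ceil(masked.size / (n_steps - step)))
        chosen = masked[np.argsort(-conf[masked])][:n_unmask]
        tokens[chosen] = proposal[chosen]
    return tokens

# Toy stand-in for the model: fixed per-position distributions over a
# 4-token vocabulary, so the run is deterministic.
rng = np.random.default_rng(0)
fixed = rng.random((16, 4))
fixed /= fixed.sum(axis=-1, keepdims=True)
result = iterative_masked_decode(lambda tokens: fixed, seq_len=16, n_steps=4)
```

In this scheme every position can attend to the full (partially generated) context at each step, which is what lets such a model condition on, and stay coherent with, surrounding musical material.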