
Audio Conditioning for Music Generation via Discrete Bottleneck Features

July 17, 2024
作者: Simon Rouard, Yossi Adi, Jade Copet, Axel Roebel, Alexandre Défossez
cs.AI

Abstract

While most music generation models use textual or parametric conditioning (e.g. tempo, harmony, musical genre), we propose to condition a language-model-based music generation system with audio input. Our exploration involves two distinct strategies. The first strategy, termed textual inversion, leverages a pre-trained text-to-music model to map audio input to corresponding "pseudowords" in the textual embedding space. For the second model, we train a music language model from scratch jointly with a text conditioner and a quantized audio feature extractor. At inference time, we can mix textual and audio conditioning and balance them thanks to a novel double classifier-free guidance method. We conduct automatic and human studies that validate our approach. We will release the code, and we provide music samples on https://musicgenstyle.github.io in order to show the quality of our model.
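The abstract only names the double classifier-free guidance idea; the sketch below shows one plausible cascaded formulation of combining two conditioning signals with independent guidance scales. The function name `double_cfg`, the specific cascading order (text first, then audio on top), and the scale values are illustrative assumptions, not the paper's confirmed scheme.

```python
def double_cfg(logits_uncond, logits_text, logits_text_audio,
               alpha_text, alpha_audio):
    """Hypothetical double classifier-free guidance.

    Cascades two guidance terms: first push the unconditional logits
    toward the text-conditioned ones, then push the result toward the
    logits conditioned on both text and audio. alpha_text and
    alpha_audio independently balance the two conditioning signals.
    """
    # Standard CFG step on the text condition.
    guided = logits_uncond + alpha_text * (logits_text - logits_uncond)
    # Second guidance step adds the audio condition's contribution.
    return guided + alpha_audio * (logits_text_audio - logits_text)


# Toy scalar example (real usage would pass per-token logit tensors):
out = double_cfg(0.0, 1.0, 2.0, alpha_text=3.0, alpha_audio=0.5)
```

Setting `alpha_audio = 0` recovers plain text-only classifier-free guidance, which is consistent with the abstract's claim that text and audio conditioning can be mixed and balanced at inference time.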
