Audio Conditioning for Music Generation via Discrete Bottleneck Features
July 17, 2024
作者: Simon Rouard, Yossi Adi, Jade Copet, Axel Roebel, Alexandre Défossez
cs.AI
Abstract
While most music generation models use textual or parametric conditioning
(e.g. tempo, harmony, musical genre), we propose to condition a
language-model-based music generation system on audio input. Our exploration
involves two distinct strategies. The first strategy, termed textual inversion,
leverages a pre-trained text-to-music model to map audio input to corresponding
"pseudowords" in the textual embedding space. For the second model, we train a
music language model from scratch jointly with a text conditioner and a
quantized audio feature extractor. At inference time, we can mix textual and
audio conditioning and balance them thanks to a novel double classifier-free
guidance method. We conduct automatic and human studies that validate our
approach. We will release the code, and we provide music samples at
https://musicgenstyle.github.io to demonstrate the quality of our model.
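The abstract mentions balancing text and audio conditioning through a double classifier-free guidance method. A common way to compose two guidance scales is to nest the standard CFG extrapolation: one term pushes from the unconditional logits toward the text-conditioned logits, and a second term pushes from text-only toward text-plus-audio. The sketch below illustrates that nested composition; the function name `double_cfg`, the scales `alpha`/`beta`, and the exact combination rule are illustrative assumptions, not the paper's published formula.

```python
import numpy as np

def double_cfg(l_uncond, l_text, l_text_audio, alpha=3.0, beta=3.0):
    """Illustrative nested classifier-free guidance over two conditions.

    l_uncond:     logits with all conditioning dropped
    l_text:       logits conditioned on text only
    l_text_audio: logits conditioned on both text and audio
    alpha:        guidance scale for text relative to unconditional
    beta:         guidance scale for audio given text (hypothetical values)
    """
    return (l_uncond
            + alpha * (l_text - l_uncond)        # amplify text conditioning
            + beta * (l_text_audio - l_text))    # amplify audio conditioning
```

With `alpha = beta = 1` the expression collapses to the fully conditioned logits `l_text_audio`, and setting `beta = 0` recovers ordinary single-condition CFG on text, which is how the two scales let inference trade off the text prompt against the audio reference.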