Audio Conditioning for Music Generation via Discrete Bottleneck Features

Abstract

While most music generation models rely on textual or parametric conditioning (e.g. tempo, harmony, musical genre), we propose to condition a language-model-based music generation system on audio input. Our exploration involves two distinct strategies. The first, termed textual inversion, leverages a pre-trained text-to-music model to map audio input to corresponding "pseudowords" in the textual embedding space. For the second, we train a music language model from scratch jointly with a text conditioner and a quantized audio feature extractor. At inference time, we can mix textual and audio conditioning and balance them thanks to a novel double classifier-free guidance method. We conduct automatic and human studies that validate our approach. We will release the code, and we provide music samples at https://musicgenstyle.github.io to demonstrate the quality of our model.
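The abstract does not spell out how the double classifier-free guidance combines its inputs. As a rough illustration only, one plausible way to balance two conditions is to nest two ordinary guidance steps: first push the audio-only prediction toward the text-plus-audio prediction, then push the unconditional prediction toward that result. The function name, the weights `alpha` and `beta`, and the nesting order in this sketch are all assumptions and are not taken from the source.

```python
import numpy as np


def nested_double_guidance(logits_uncond, logits_audio, logits_text_audio,
                           alpha=3.0, beta=5.0):
    """Hypothetical nested guidance over next-token logits.

    ASSUMPTION: this nesting and the weights alpha/beta are illustrative;
    the abstract does not give the actual formula.
    """
    # Inner step: move from the audio-only logits toward text+audio logits.
    inner = logits_audio + beta * (logits_text_audio - logits_audio)
    # Outer step: move from the unconditional logits toward the inner result.
    return logits_uncond + alpha * (inner - logits_uncond)


# Toy logits over a 4-token vocabulary (made-up numbers).
l_uncond = np.array([0.0, 0.1, 0.2, 0.3])
l_audio = np.array([0.5, 0.1, 0.0, 0.3])
l_text_audio = np.array([0.9, 0.0, 0.1, 0.2])

guided = nested_double_guidance(l_uncond, l_audio, l_text_audio)
```

In this sketch, setting `alpha = 1` and `beta = 1` recovers the text-plus-audio logits, `alpha = 1` with `beta = 0` recovers the audio-only logits, and `alpha = 0` falls back to the unconditional logits, so the two weights act as separate dials for overall conditioning strength and for the influence of text relative to audio.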
