Audio Conditioning for Music Generation via Discrete Bottleneck Features

Abstract

While most music generation models use textual or parametric conditioning (e.g., tempo, harmony, musical genre), we propose to condition a language-model-based music generation system on audio input. Our exploration involves two distinct strategies. The first, termed textual inversion, leverages a pre-trained text-to-music model to map audio input to corresponding "pseudowords" in the textual embedding space. For the second, we train a music language model from scratch jointly with a text conditioner and a quantized audio feature extractor. At inference time, we can mix textual and audio conditioning and balance them thanks to a novel double classifier-free guidance method. We conduct automatic and human studies that validate our approach. We will release the code, and we provide music samples at https://musicgenstyle.github.io to show the quality of our model.
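The abstract does not spell out the double classifier-free guidance rule, so the following is only a minimal sketch of how two guidance terms could be nested to balance text and audio conditioning. The function name `double_cfg`, the coefficients `alpha` and `beta`, and the exact combination formula are assumptions made for illustration, not the authors' stated method. It assumes the model can produce next-token logits under three settings: no conditioning, audio conditioning only, and text plus audio conditioning.

```python
from typing import List, Sequence


def double_cfg(
    logits_uncond: Sequence[float],
    logits_audio: Sequence[float],
    logits_text_audio: Sequence[float],
    alpha: float = 3.0,
    beta: float = 5.0,
) -> List[float]:
    """Combine three logit vectors with two nested guidance terms (hypothetical).

    Assumed rule, applied per vocabulary entry:
        out = uncond + alpha * (audio + beta * (text_audio - audio) - uncond)

    beta scales how far the text condition pushes away from the audio-only
    prediction; alpha scales the overall push away from the unconditional one.
    """
    if not (len(logits_uncond) == len(logits_audio) == len(logits_text_audio)):
        raise ValueError("all logit vectors must have the same length")
    return [
        u + alpha * (a + beta * (ta - a) - u)
        for u, a, ta in zip(logits_uncond, logits_audio, logits_text_audio)
    ]


if __name__ == "__main__":
    uncond = [0.0, 1.0]
    audio = [1.0, 2.0]
    text_audio = [2.0, 4.0]
    # alpha = beta = 1 reduces to the fully conditioned logits.
    print(double_cfg(uncond, audio, text_audio, alpha=1.0, beta=1.0))
    # beta = 0 with alpha = 1 ignores the text condition entirely.
    print(double_cfg(uncond, audio, text_audio, alpha=1.0, beta=0.0))
```

Under this assumed rule, raising `beta` gives the text description more weight relative to the audio prompt, while `alpha` controls overall guidance strength, which is one way to realize the balancing described in the abstract.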
