MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies

Abstract

Diffusion models have shown promising results in cross-modal generation tasks, including text-to-image and text-to-audio generation. However, generating music, as a special type of audio, presents unique challenges due to the limited availability of music data and sensitive issues related to copyright and plagiarism. In this paper, to tackle these challenges, we first construct a state-of-the-art text-to-music model, MusicLDM, that adapts Stable Diffusion and AudioLDM architectures to the music domain. We achieve this by retraining the contrastive language-audio pretraining model (CLAP) and the Hifi-GAN vocoder, as components of MusicLDM, on a collection of music data samples. Then, to address the limitations of training data and to avoid plagiarism, we leverage a beat tracking model and propose two different mixup strategies for data augmentation: beat-synchronous audio mixup and beat-synchronous latent mixup, which recombine training audio directly or via a latent embedding space, respectively. Such mixup strategies encourage the model to interpolate between musical training samples and generate new music within the convex hull of the training data, making the generated music more diverse while still staying faithful to the corresponding style. In addition to popular evaluation metrics, we design several new evaluation metrics based on the CLAP score to demonstrate that our proposed MusicLDM and beat-synchronous mixup strategies improve both the quality and novelty of generated music, as well as the correspondence between input text and generated music.
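
As a rough illustration of the idea behind beat-synchronous audio mixup, the sketch below aligns two clips at their first beat and then forms a convex combination of the aligned samples. It is not the authors' implementation: the function name `beat_sync_mixup`, the use of a single first-beat index per clip, the assumption that both clips share a tempo, and the caller-supplied mixing weight `lam` are all assumptions made for this example. In practice the beat positions would come from a beat tracking model.

```python
from typing import List, Sequence


def beat_sync_mixup(
    x1: Sequence[float],
    x2: Sequence[float],
    first_beat1: int,
    first_beat2: int,
    lam: float,
) -> List[float]:
    """Mix two waveforms after aligning their first beats.

    first_beat1 / first_beat2 are sample indices of the first beat in each
    clip (hypothetical inputs, assumed to come from a beat tracker).
    lam is the mixing weight in [0, 1] (assumed to be chosen by the caller).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if not (0 <= first_beat1 < len(x1) and 0 <= first_beat2 < len(x2)):
        raise ValueError("beat indices must fall inside their clips")

    # Trim leading samples so both first beats land on the same offset.
    shift = min(first_beat1, first_beat2)
    a = x1[first_beat1 - shift:]
    b = x2[first_beat2 - shift:]

    # Keep only the overlapping region of the two aligned clips.
    n = min(len(a), len(b))

    # Each output sample is a convex combination of the two inputs.
    return [lam * a[i] + (1.0 - lam) * b[i] for i in range(n)]


if __name__ == "__main__":
    clip_a = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]  # first beat at index 2
    clip_b = [0.0, 3.0, 0.0, 0.0]            # first beat at index 1
    print(beat_sync_mixup(clip_a, clip_b, 2, 1, lam=0.5))
```

Because `lam` stays in [0, 1], every mixed sample lies between the two aligned inputs, which is the sense in which mixup keeps new examples inside the convex hull of the training data.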

