Music Consistency Models

Abstract

Consistency models have shown remarkable capability for efficient image and video generation, enabling synthesis with very few sampling steps. They have proven advantageous in mitigating the computational burden associated with diffusion models. Nevertheless, the application of consistency models to music generation remains largely unexplored. To address this gap, we present Music Consistency Models (MusicCM), which leverage the concept of consistency models to efficiently synthesize mel-spectrograms for music clips, maintaining high quality while minimizing the number of sampling steps. Building upon existing text-to-music diffusion models, MusicCM incorporates consistency distillation and adversarial discriminator training. Moreover, we find it beneficial to generate extended, coherent music by incorporating multiple diffusion processes with shared constraints. Experimental results demonstrate the effectiveness of our model in terms of computational efficiency, fidelity, and naturalness. Notably, MusicCM achieves seamless music synthesis with only four sampling steps, e.g., only one second per minute of the music clip, showcasing the potential for real-time application.
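To make the idea of few-step sampling concrete, the following is a minimal sketch of generic multistep consistency sampling with four model evaluations. It is not MusicCM's implementation: the noise levels, the constants EPS and T_MAX, and the function names are assumptions chosen for illustration, and toy_consistency_fn is a hypothetical closed-form stand-in for a trained network that would, in the setting described above, output a mel-spectrogram.

```python
import math
import random

EPS = 0.002    # smallest noise level; assumed value for illustration
T_MAX = 80.0   # largest noise level; assumed value for illustration


def toy_consistency_fn(x, t, target):
    """Hypothetical stand-in for a trained consistency model.

    Assumes the data distribution is the single point `target` and that
    noising is x_t = x_0 + t * z. Under that assumption the probability-flow
    ODE trajectories are straight lines through `target`, so this map is
    exact and satisfies the boundary condition f(x, EPS) = x.
    """
    scale = EPS / t
    return [m + (xi - m) * scale for xi, m in zip(x, target)]


def multistep_consistency_sample(consistency_fn, size, noise_levels, rng):
    """Generic multistep consistency sampling.

    `noise_levels` is a decreasing list that starts at the largest noise
    level; the model is evaluated exactly once per entry.
    """
    # Step 1: draw pure noise at the largest level and map it to a sample.
    x = [rng.gauss(0.0, noise_levels[0]) for _ in range(size)]
    x = consistency_fn(x, noise_levels[0])
    # Remaining steps: re-noise to a smaller level, then denoise again.
    for tau in noise_levels[1:]:
        std = math.sqrt(tau * tau - EPS * EPS)
        x_tau = [xi + std * rng.gauss(0.0, 1.0) for xi in x]
        x = consistency_fn(x_tau, tau)
    return x


if __name__ == "__main__":
    target = [0.5] * 8  # toy "clean" signal standing in for a mel-spectrogram
    fn = lambda x, t: toy_consistency_fn(x, t, target)
    levels = [T_MAX, 20.0, 5.0, 1.0]  # four levels, so four model evaluations
    print(multistep_consistency_sample(fn, len(target), levels, random.Random(0)))
```

Each pass through the loop trades a little extra compute for a chance to correct errors from the previous estimate, which is why a handful of steps can already be enough where a standard diffusion sampler needs many more.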