Long-form music generation with latent diffusion

Abstract

Audio-based generative models for music have made great strides recently, but so far they have not managed to produce full-length music tracks with coherent musical structure. We show that by training a generative model on long temporal contexts it is possible to produce long-form music of up to 4 minutes 45 seconds. Our model consists of a diffusion transformer operating on a highly downsampled continuous latent representation (latent rate of 21.5 Hz). It obtains state-of-the-art generations according to metrics on audio quality and prompt alignment, and subjective tests reveal that it produces full-length music with coherent structure.
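To give a feel for why a heavily downsampled latent makes this context length tractable, the following minimal sketch works out the sequence length implied by the two figures quoted in the abstract (a latent rate of 21.5 Hz and a maximum duration of 4m45s). The 44.1 kHz sample rate is an assumption made purely for illustration, since the abstract does not state it, and the function names are hypothetical.

```python
import math

# Figures taken from the abstract.
LATENT_RATE_HZ = 21.5
MAX_DURATION_S = 4 * 60 + 45  # 4m45s = 285 seconds

# Assumption for illustration only: the abstract does not state the sample rate.
ASSUMED_SAMPLE_RATE_HZ = 44_100


def latent_frames(duration_s: float, latent_rate_hz: float = LATENT_RATE_HZ) -> int:
    """Number of latent frames needed to cover a clip of the given duration."""
    return math.ceil(duration_s * latent_rate_hz)


def raw_samples(duration_s: float, sample_rate_hz: int = ASSUMED_SAMPLE_RATE_HZ) -> int:
    """Number of audio samples per channel at the assumed sample rate."""
    return math.ceil(duration_s * sample_rate_hz)


def implied_downsampling(sample_rate_hz: int = ASSUMED_SAMPLE_RATE_HZ,
                         latent_rate_hz: float = LATENT_RATE_HZ) -> float:
    """Ratio of audio samples to latent frames under the assumed sample rate."""
    return sample_rate_hz / latent_rate_hz


if __name__ == "__main__":
    print("latent frames:", latent_frames(MAX_DURATION_S))
    print("raw samples per channel:", raw_samples(MAX_DURATION_S))
    print("implied downsampling factor:", round(implied_downsampling()))
```

Under these assumptions, a full-length track corresponds to a few thousand latent frames rather than millions of raw samples, which is the scale at which a transformer can attend over the whole piece.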
