Music ControlNet: Multiple Time-varying Controls for Music Generation

Abstract

Text-to-music generation models can now generate high-quality music audio in a broad range of styles. However, text control is primarily suited to manipulating global musical attributes such as genre, mood, and tempo, and is less suited to precise control over time-varying attributes such as the positions of beats in time or the changing dynamics of the music. We propose Music ControlNet, a diffusion-based music generation model that offers multiple precise, time-varying controls over generated audio. To imbue text-to-music models with time-varying control, we propose an approach analogous to the pixel-wise control of the image-domain ControlNet method. Specifically, we extract controls from training audio, yielding paired data, and fine-tune a diffusion-based conditional generative model over audio spectrograms given melody, dynamics, and rhythm controls. While the image-domain Uni-ControlNet method already allows generation with any subset of controls, we devise a new strategy that allows creators to input controls that are only partially specified in time. We evaluate on both controls extracted from audio and controls we expect creators to provide, demonstrating that we can generate realistic music that corresponds to the control inputs in both settings. While few comparable music generation models exist, we benchmark against MusicGen, a recent model that accepts text and melody input, and show that our model generates music that is 49% more faithful to input melodies despite having 35x fewer parameters, training on 11x less data, and enabling two additional forms of time-varying control. Sound examples can be found at https://MusicControlNet.github.io/web/.
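
To make the idea of a time-varying control that is "only partially specified in time" more concrete, the following is a minimal illustrative sketch in Python with NumPy. It is not the authors' implementation. The function names (frame_dynamics_db, mask_in_time), the frame and hop sizes, the use of per-frame RMS energy in dB as a stand-in for a dynamics control, and the zero fill value for unspecified frames are all assumptions made for illustration only.

```python
import numpy as np


def frame_dynamics_db(audio, frame_len=1024, hop=512, eps=1e-10):
    """Hypothetical dynamics control: per-frame RMS energy in dB.

    Assumption: this is only a stand-in for a dynamics curve; the paper's
    actual feature extraction is not reproduced here.
    """
    n_frames = 1 + (len(audio) - frame_len) // hop
    frames = np.stack(
        [audio[i * hop : i * hop + frame_len] for i in range(n_frames)]
    )
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return 20.0 * np.log10(rms + eps)


def mask_in_time(control, keep_start, keep_end, fill=0.0):
    """Keep control[keep_start:keep_end] and mark all other frames unspecified.

    Returns the masked control and a boolean mask (True = specified frame).
    The fill value for unspecified frames is an arbitrary assumption.
    """
    mask = np.zeros(len(control), dtype=bool)
    mask[keep_start:keep_end] = True
    masked = np.where(mask, control, fill)
    return masked, mask


# Toy input: one second of a 440 Hz tone with a linear crescendo.
sr = 16000
t = np.arange(sr) / sr
audio = np.linspace(0.1, 1.0, sr) * np.sin(2 * np.pi * 440.0 * t)

dynamics = frame_dynamics_db(audio)

# Specify the control only for the first half of the clip.
masked, mask = mask_in_time(dynamics, 0, len(dynamics) // 2)
```

In a setup like this, a conditional model would receive both the masked control and the mask, so that it can distinguish frames where the creator specified a value from frames left free. How Music ControlNet actually encodes unspecified regions is described in the paper itself and is not captured by this sketch.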
