Music ControlNet: Multiple Time-varying Controls for Music Generation

November 13, 2023
Authors: Shih-Lun Wu, Chris Donahue, Shinji Watanabe, Nicholas J. Bryan
cs.AI

Abstract

Text-to-music generation models are now capable of generating high-quality music audio in broad styles. However, text control is primarily suitable for the manipulation of global musical attributes like genre, mood, and tempo, and is less suitable for precise control over time-varying attributes such as the positions of beats in time or the changing dynamics of the music. We propose Music ControlNet, a diffusion-based music generation model that offers multiple precise, time-varying controls over generated audio. To imbue text-to-music models with time-varying control, we propose an approach analogous to pixel-wise control of the image-domain ControlNet method. Specifically, we extract controls from training audio yielding paired data, and fine-tune a diffusion-based conditional generative model over audio spectrograms given melody, dynamics, and rhythm controls. While the image-domain Uni-ControlNet method already allows generation with any subset of controls, we devise a new strategy to allow creators to input controls that are only partially specified in time. We evaluate both on controls extracted from audio and controls we expect creators to provide, demonstrating that we can generate realistic music that corresponds to control inputs in both settings. While few comparable music generation models exist, we benchmark against MusicGen, a recent model that accepts text and melody input, and show that our model generates music that is 49% more faithful to input melodies despite having 35x fewer parameters, training on 11x less data, and enabling two additional forms of time-varying control. Sound examples can be found at https://MusicControlNet.github.io/web/.
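The abstract describes extracting melody, dynamics, and rhythm controls from training audio to form paired data, and letting creators supply controls that are only partially specified in time. Below is a minimal, illustrative sketch of how such frame-level control signals might be extracted and partially masked using librosa and NumPy. This is an assumption-laden example, not the authors' pipeline: the paper's actual feature extractors, frame rates, and masking scheme are not given here and may differ.

```python
# Illustrative sketch (not the authors' code): extract frame-level melody,
# dynamics, and rhythm signals from an audio file, then partially mask them
# in time so only a chosen segment is specified as a control.
import numpy as np
import librosa

def extract_controls(path, sr=22050, hop_length=512):
    y, sr = librosa.load(path, sr=sr)

    # Melody: one-hot argmax over a 12-bin chromagram, per frame.
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr, hop_length=hop_length)  # (12, T)
    melody = np.zeros_like(chroma)
    melody[chroma.argmax(axis=0), np.arange(chroma.shape[1])] = 1.0

    # Dynamics: frame-wise RMS energy in decibels.
    rms = librosa.feature.rms(y=y, hop_length=hop_length)          # (1, T)
    dynamics = librosa.amplitude_to_db(rms, ref=np.max)[0]         # (T,)

    # Rhythm: onset strength with detected beat positions marked.
    onset_env = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop_length)
    _, beat_frames = librosa.beat.beat_track(onset_envelope=onset_env, sr=sr,
                                             hop_length=hop_length)
    rhythm = np.zeros_like(onset_env)
    rhythm[beat_frames] = 1.0                                      # (T,)

    return melody, dynamics, rhythm

def mask_in_time(control, start_frame, end_frame):
    """Keep a control only on [start_frame, end_frame); zero it elsewhere and
    return a binary mask channel marking where the control is specified."""
    control = np.atleast_2d(control)
    mask = np.zeros(control.shape[-1], dtype=control.dtype)
    mask[start_frame:end_frame] = 1.0
    return control * mask, mask

if __name__ == "__main__":
    melody, dynamics, rhythm = extract_controls("example.wav")  # hypothetical file
    # Specify melody only for roughly the first 10 seconds (sr=22050, hop=512).
    partial_melody, melody_mask = mask_in_time(melody, 0, int(10 * 22050 / 512))
```

In a conditional generation setup of this kind, the binary mask would typically be passed alongside the masked control so the model can distinguish "control absent" from "control equal to zero"; how Music ControlNet encodes partially specified controls is detailed in the paper itself.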