Music ControlNet: Multiple Time-varying Controls for Music Generation

Abstract

Text-to-music generation models can now generate high-quality music audio in a broad range of styles. However, text control is primarily suited to manipulating global musical attributes such as genre, mood, and tempo, and is less suited to precise control over time-varying attributes such as the positions of beats in time or the changing dynamics of the music. We propose Music ControlNet, a diffusion-based music generation model that offers multiple precise, time-varying controls over generated audio. To imbue text-to-music models with time-varying control, we propose an approach analogous to the pixel-wise control of the image-domain ControlNet method. Specifically, we extract controls from training audio to obtain paired data, and fine-tune a diffusion-based conditional generative model over audio spectrograms given melody, dynamics, and rhythm controls. While the image-domain Uni-ControlNet method already allows generation with any subset of controls, we devise a new strategy that lets creators input controls that are only partially specified in time. We evaluate on both controls extracted from audio and controls we expect creators to provide, demonstrating that we can generate realistic music that corresponds to the control inputs in both settings. While few comparable music generation models exist, we benchmark against MusicGen, a recent model that accepts text and melody input, and show that our model generates music that is 49% more faithful to input melodies despite having 35x fewer parameters, training on 11x less data, and enabling two additional forms of time-varying control. Sound examples can be found at https://MusicControlNet.github.io/web/.
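To make the idea of a time-varying control concrete, the following is a minimal sketch of two steps the abstract mentions: extracting a control signal from audio, and specifying a control only partially in time. It is not the authors' implementation. The abstract does not say how each control is computed, so this sketch assumes a frame-wise RMS energy in decibels as a stand-in for the dynamics control and a simple binary frame mask for partial specification. The function names `dynamics_control` and `mask_in_time`, and the parameters `frame_len`, `hop`, and `fill`, are hypothetical.

```python
import numpy as np


def dynamics_control(audio, frame_len=1024, hop=256, eps=1e-10):
    """Return one loudness value (dB) per frame of a mono waveform.

    Assumption: frame-wise RMS energy in dB stands in for the paper's
    dynamics control; the abstract does not give the exact definition.
    """
    audio = np.asarray(audio, dtype=np.float64)
    if audio.ndim != 1 or len(audio) < frame_len:
        raise ValueError("audio must be 1-D and at least frame_len samples long")
    n_frames = 1 + (len(audio) - frame_len) // hop
    # Slice the waveform into overlapping frames of length frame_len.
    frames = np.stack(
        [audio[i * hop : i * hop + frame_len] for i in range(n_frames)]
    )
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return 20.0 * np.log10(rms + eps)


def mask_in_time(control, start, end, fill=0.0):
    """Keep a control only on frames [start, end).

    Returns the masked control and the boolean mask. Assumption: frames
    outside the specified span are replaced by a constant fill value and
    the mask tells a model which frames are actually specified.
    """
    control = np.asarray(control, dtype=np.float64)
    mask = np.zeros(control.shape[0], dtype=bool)
    mask[start:end] = True
    masked = np.where(mask, control, fill)
    return masked, mask


if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    # One second of a 440 Hz tone with a linear crescendo.
    audio = np.linspace(0.1, 1.0, sr) * np.sin(2 * np.pi * 440.0 * t)
    dyn = dynamics_control(audio)
    partial, mask = mask_in_time(dyn, start=10, end=30)
    print(dyn.shape, int(mask.sum()))
```

In this sketch the control is paired with the audio it was extracted from, which mirrors the paired-data setup described above. The mask marks which frames a creator has specified, leaving the rest unconstrained.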

