Efficient Neural Music Generation

Abstract

Recent progress in music generation has been driven largely by the state-of-the-art MusicLM, which comprises a hierarchy of three LMs for semantic, coarse acoustic, and fine acoustic modeling, respectively. Yet sampling with MusicLM requires passing through these LMs one by one to obtain the fine-grained acoustic tokens, making it computationally expensive and prohibitive for real-time generation. Efficient music generation with quality on par with MusicLM remains a significant challenge. In this paper, we present MeLoDy (M for music; L for LM; D for diffusion), an LM-guided diffusion model that generates music audio of state-of-the-art quality while reducing the number of forward passes in MusicLM by 95.7% or 99.6% for sampling 10 s or 30 s of music, respectively. MeLoDy inherits the highest-level LM from MusicLM for semantic modeling, and applies a novel dual-path diffusion (DPD) model and an audio VAE-GAN to efficiently decode the conditioning semantic tokens into a waveform. DPD simultaneously models the coarse and fine acoustics by incorporating the semantic information into segments of latents via cross-attention at each denoising step. Our experimental results suggest the superiority of MeLoDy, not only in its practical advantages in sampling speed and infinitely continuable generation, but also in its state-of-the-art musicality, audio quality, and text correlation. Our samples are available at https://Efficient-MeLoDy.github.io/.
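
The abstract says that semantic information is incorporated into segments of latents via cross-attention at each denoising step. The following is a minimal NumPy sketch of that one idea only, not the authors' DPD architecture: the function name `cross_attention`, the projection matrices `w_q`, `w_k`, `w_v`, the residual connection, and all tensor sizes are assumptions chosen for illustration. The denoising loop, the dual-path structure, and the VAE-GAN decoder are left out.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(latents, semantic, w_q, w_k, w_v):
    """Let every latent frame attend over the semantic token sequence.

    latents:  (num_segments, seg_len, d_latent)  noisy latents, split into segments
    semantic: (num_tokens, d_sem)                embeddings of the semantic tokens
    Returns the updated latents and the attention weights.
    """
    q = latents @ w_q                          # (S, L, d_attn): queries from latents
    k = semantic @ w_k                         # (T, d_attn):    keys from semantic tokens
    v = semantic @ w_v                         # (T, d_latent):  values from semantic tokens
    scores = q @ k.T / np.sqrt(k.shape[-1])    # (S, L, T) scaled dot products
    weights = softmax(scores, axis=-1)         # each latent frame's weights sum to 1
    # Residual connection (an assumption of this sketch, not stated in the abstract).
    return latents + weights @ v, weights


# Hypothetical sizes, chosen only to make the example runnable.
rng = np.random.default_rng(0)
num_segments, seg_len, d_latent = 4, 8, 16
num_tokens, d_sem, d_attn = 10, 12, 16

latents = rng.normal(size=(num_segments, seg_len, d_latent))
semantic = rng.normal(size=(num_tokens, d_sem))
w_q = rng.normal(size=(d_latent, d_attn)) * 0.1
w_k = rng.normal(size=(d_sem, d_attn)) * 0.1
w_v = rng.normal(size=(d_sem, d_latent)) * 0.1

out, weights = cross_attention(latents, semantic, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

In a diffusion sampler, a conditioning step of this kind would be applied inside the denoising network at every step, so the same semantic tokens guide each refinement of the latents.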
