VampNet: Music Generation via Masked Acoustic Token Modeling

Abstract

We introduce VampNet, a masked acoustic token modeling approach to music synthesis, compression, inpainting, and variation. We use a variable masking schedule during training, which allows us to sample coherent music from the model by applying a variety of masking approaches (called prompts) during inference. VampNet is non-autoregressive, leveraging a bidirectional transformer architecture that attends to all tokens in a forward pass. With just 36 sampling passes, VampNet can generate coherent high-fidelity musical waveforms. We show that by prompting VampNet in various ways, we can apply it to tasks like music compression, inpainting, outpainting, continuation, and looping with variation (vamping). Appropriately prompted, VampNet is capable of maintaining style, genre, instrumentation, and other high-level aspects of the music. This flexible prompting capability makes VampNet a powerful music co-creation tool. Code and audio samples are available online.
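To make the idea of masks as prompts more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single one-dimensional sequence of integer tokens, a hypothetical `MASK` id, and a hypothetical `predict` callable standing in for the bidirectional transformer. The rule of committing an even share of the remaining tokens on each pass is an assumption chosen for simplicity; the abstract only states that 36 sampling passes are used, not how tokens are scheduled across them.

```python
import numpy as np

MASK = -1  # hypothetical id marking a masked token (assumption)

def inpaint_mask(n_steps, keep_prefix, keep_suffix):
    """Boolean mask, True = generate. Keeps context on both sides."""
    mask = np.ones(n_steps, dtype=bool)
    mask[:keep_prefix] = False
    if keep_suffix > 0:
        mask[n_steps - keep_suffix:] = False
    return mask

def continuation_mask(n_steps, keep_prefix):
    """Keep only the beginning and generate everything after it."""
    return inpaint_mask(n_steps, keep_prefix, 0)

def outpaint_mask(n_steps, start, stop):
    """Keep only the span [start, stop) and generate both sides."""
    mask = np.ones(n_steps, dtype=bool)
    mask[start:stop] = False
    return mask

def iterative_sample(tokens, mask, predict, n_passes=36):
    """Fill masked positions over at most n_passes passes.

    predict(tokens) -> (candidates, confidence), two arrays shaped like
    tokens. It is a hypothetical stand-in for the model's forward pass.
    """
    tokens = np.where(mask, MASK, tokens)  # hide the positions to generate
    remaining = mask.copy()
    for i in range(n_passes):
        n_left = int(remaining.sum())
        if n_left == 0:
            break
        candidates, confidence = predict(tokens)
        # Assumed schedule: commit an even share of what is left, so the
        # final pass always commits every remaining position.
        n_commit = int(np.ceil(n_left / (n_passes - i)))
        conf = np.where(remaining, confidence, -np.inf)
        chosen = np.argsort(conf)[::-1][:n_commit]  # most confident first
        tokens[chosen] = candidates[chosen]
        remaining[chosen] = False
    return tokens

def dummy_predict(tokens):
    """Random candidates and confidences, used only to exercise the loop."""
    rng = np.random.default_rng(0)
    return rng.integers(0, 1024, size=tokens.shape), rng.random(tokens.shape)

# Usage: inpaint the middle 40 of 100 steps, keeping 30 on each side.
original = np.arange(100)
prompt = inpaint_mask(100, keep_prefix=30, keep_suffix=30)
filled = iterative_sample(original, prompt, dummy_predict)
```

In this sketch the choice of task is entirely a matter of which mask is passed in: the sampling loop is the same for inpainting, outpainting, and continuation, which mirrors the abstract's point that one model can serve several tasks depending on how it is prompted.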
