Masked Audio Generation using a Single Non-Autoregressive Transformer

Abstract

We introduce MAGNeT, a masked generative sequence modeling method that operates directly over several streams of audio tokens. Unlike prior work, MAGNeT consists of a single-stage, non-autoregressive transformer. During training, we predict spans of masked tokens obtained from a masking scheduler, while during inference we gradually construct the output sequence over several decoding steps. To further enhance the quality of the generated audio, we introduce a novel rescoring method in which we leverage an external pre-trained model to rescore and rank predictions from MAGNeT, which are then used in later decoding steps. Lastly, we explore a hybrid version of MAGNeT, in which we fuse autoregressive and non-autoregressive models to generate the first few seconds in an autoregressive manner while the rest of the sequence is decoded in parallel. We demonstrate the efficiency of MAGNeT for the tasks of text-to-music and text-to-audio generation and conduct an extensive empirical evaluation, considering both objective metrics and human studies. The proposed approach is comparable to the evaluated baselines while being significantly faster (7x faster than the autoregressive baseline). Through ablation studies and analysis, we shed light on the importance of each of the components comprising MAGNeT and point to the trade-offs between autoregressive and non-autoregressive modeling, considering latency, throughput, and generation quality. Samples are available on our demo page: https://pages.cs.huji.ac.il/adiyoss-lab/MAGNeT.
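To make the inference procedure described in the abstract more concrete, the following is a minimal sketch of iterative masked decoding with rescoring. It is not the authors' implementation. The cosine masking schedule, the single token stream, the per-token (rather than per-span) masking, the blend weight `w`, and the names `toy_model`, `toy_external_score`, and `iterative_decode` are all assumptions made for illustration, and both the model and the external scorer are random stand-ins.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a position that is still masked


def toy_model(tokens, vocab_size, rng):
    """Random stand-in for the transformer: one (token, confidence) pair per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def toy_external_score(token, rng):
    """Random stand-in for an external pre-trained model scoring a candidate token."""
    return rng.random()


def iterative_decode(length, vocab_size, steps, w=0.5, seed=0):
    """Fill a fully masked sequence over `steps` parallel decoding passes."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        # Assumed cosine schedule: fraction of positions left masked after this step.
        masked_fraction = math.cos(math.pi / 2 * step / steps)
        n_keep_masked = int(length * masked_fraction)

        masked = [i for i, t in enumerate(tokens) if t == MASK]
        preds = toy_model(tokens, vocab_size, rng)

        # Rescoring (assumed form): blend model confidence with the external score.
        scores = {
            i: (1 - w) * preds[i][1] + w * toy_external_score(preds[i][0], rng)
            for i in masked
        }

        # Commit the highest-scoring predictions; the rest stay masked for later steps.
        masked.sort(key=lambda i: scores[i], reverse=True)
        n_commit = max(0, len(masked) - n_keep_masked)
        for i in masked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    print(iterative_decode(length=16, vocab_size=8, steps=4))
```

In this sketch every decoding step predicts all positions in parallel, and the schedule reaches zero masked positions on the final step, so the output is always fully filled. The actual method operates on spans across several token streams, which this toy version does not attempt to reproduce.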
