Masked Audio Generation using a Single Non-Autoregressive Transformer
January 9, 2024
Authors: Alon Ziv, Itai Gat, Gael Le Lan, Tal Remez, Felix Kreuk, Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi
cs.AI
Abstract
We introduce MAGNeT, a masked generative sequence modeling method that operates directly over several streams of audio tokens. Unlike prior work, MAGNeT consists of a single-stage, non-autoregressive transformer. During training, we predict spans of masked tokens obtained from a masking scheduler, while during inference we gradually construct the output sequence over several decoding steps. To further enhance the quality of the generated audio, we introduce a novel rescoring method in which we leverage an external pre-trained model to rescore and rank predictions from MAGNeT, which are then used in later decoding steps. Lastly, we explore a hybrid version of MAGNeT that fuses autoregressive and non-autoregressive models: the first few seconds are generated autoregressively, while the rest of the sequence is decoded in parallel. We demonstrate the efficiency of MAGNeT for text-to-music and text-to-audio generation and conduct an extensive empirical evaluation, considering both objective metrics and human studies. The proposed approach is comparable to the evaluated baselines while being significantly faster (7x faster than the autoregressive baseline). Through ablation studies and analysis, we shed light on the importance of each component of MAGNeT and point to the trade-offs between autoregressive and non-autoregressive modeling in terms of latency, throughput, and generation quality. Samples are available on our demo page https://pages.cs.huji.ac.il/adiyoss-lab/MAGNeT.
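For intuition, here is a minimal sketch of the kind of iterative masked decoding loop with external rescoring that the abstract describes, in the spirit of MaskGIT-style confidence-based decoding. Everything concrete here is an assumption for illustration, not the paper's implementation: the cosine masking schedule, the `MASK_ID` placeholder, the `rescore_w` blending weight, and the single-stream, per-token (rather than span-level, multi-codebook) setup; `model` and `rescorer` are hypothetical stand-ins for the MAGNeT transformer and the external pre-trained rescoring model.

```python
import math
import torch

# Hypothetical sketch of iterative masked decoding with rescoring.
# `model` and `rescorer` are stand-ins: any module mapping a
# (batch, seq_len) LongTensor of token ids to (batch, seq_len, vocab)
# logits. MASK_ID and the cosine schedule are assumptions borrowed
# from common masked-generation setups, not the paper's exact choices.

MASK_ID = 0  # placeholder id for the mask token


@torch.no_grad()
def masked_decode(model, rescorer, seq_len, steps=20, rescore_w=0.5):
    # Start from a fully masked sequence.
    tokens = torch.full((1, seq_len), MASK_ID, dtype=torch.long)
    for i in range(steps):
        # Cosine schedule: fraction of positions left masked after
        # this step; decays from ~1 to 0 so the final step fills all.
        n_masked = int(seq_len * math.cos(math.pi / 2 * (i + 1) / steps))

        logits = model(tokens)                        # (1, seq_len, vocab)
        probs = logits.softmax(-1)
        cand = probs.argmax(-1)                       # greedy candidates
        conf = probs.gather(-1, cand.unsqueeze(-1)).squeeze(-1)

        # Rescoring: blend the model's own confidence with an external
        # pre-trained model's probability of the candidate tokens.
        ext = rescorer(cand).softmax(-1)
        ext_conf = ext.gather(-1, cand.unsqueeze(-1)).squeeze(-1)
        score = (1 - rescore_w) * conf + rescore_w * ext_conf

        # Already-fixed positions never compete for re-masking.
        score = score.masked_fill(tokens != MASK_ID, float("inf"))

        # Commit candidates, then re-mask the n_masked lowest-scoring
        # positions so they are re-predicted in later steps.
        tokens = torch.where(tokens == MASK_ID, cand, tokens)
        if n_masked > 0:
            remask = score.topk(n_masked, largest=False).indices
            tokens.scatter_(1, remask, MASK_ID)
    return tokens
```

Because the schedule is monotonically decreasing, each step fixes a few more high-scoring tokens and re-masks the low-scoring rest, so the whole sequence is produced in `steps` parallel passes rather than one pass per token, which is the source of the speedup over autoregressive decoding reported in the abstract.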