Masked Audio Generation using a Single Non-Autoregressive Transformer
January 9, 2024
Authors: Alon Ziv, Itai Gat, Gael Le Lan, Tal Remez, Felix Kreuk, Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi
cs.AI
Abstract
We introduce MAGNeT, a masked generative sequence modeling method that
operates directly over several streams of audio tokens. Unlike prior work,
MAGNeT consists of a single-stage, non-autoregressive transformer. During
training, we predict spans of masked tokens obtained from a masking scheduler,
while during inference we gradually construct the output sequence over several
decoding steps. To further enhance the quality of the generated audio, we
introduce a novel rescoring method in which we leverage an external
pre-trained model to rescore and rank predictions from MAGNeT, which are
then used in later decoding steps. Lastly, we explore a hybrid version of
MAGNeT that fuses autoregressive and non-autoregressive models: the first few
seconds are generated autoregressively while the rest of the sequence is
decoded in parallel. We demonstrate the efficiency of MAGNeT on
text-to-music and text-to-audio generation tasks and conduct an extensive
empirical evaluation, considering both objective metrics and human studies.
The proposed approach is comparable to the evaluated baselines while being
significantly faster (7x faster than the autoregressive baseline). Through
ablation studies and analysis, we shed light on the importance of each
component of MAGNeT and point to the trade-offs between autoregressive and
non-autoregressive modeling in terms of latency, throughput, and generation
quality. Samples are available on our demo page
https://pages.cs.huji.ac.il/adiyoss-lab/MAGNeT.
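
The iterative decoding loop described in the abstract (mask everything, predict all positions in parallel, keep the most confident predictions, re-mask the rest according to a schedule) can be sketched as follows. This is a minimal MaskGIT-style illustration with a cosine masking schedule, not MAGNeT's actual implementation; `MASK`, `predict_fn`, and every other name here are hypothetical stand-ins for the model and its token streams.

```python
import math
import random

MASK = None  # hypothetical sentinel for a masked position


def cosine_schedule(step, total_steps):
    """Fraction of positions left masked after decoding step `step`."""
    return math.cos(0.5 * math.pi * step / total_steps)


def iterative_decode(seq_len, total_steps, predict_fn):
    """Sketch of non-autoregressive iterative masked decoding.

    `predict_fn(tokens)` stands in for the non-autoregressive
    transformer: it returns a (token_id, confidence) pair for
    every position in the sequence.
    """
    tokens = [MASK] * seq_len
    for step in range(1, total_steps + 1):
        preds = predict_fn(tokens)
        # Commit predictions at the positions that are currently masked.
        committed = []
        for i in range(seq_len):
            if tokens[i] is MASK:
                tokens[i] = preds[i][0]
                committed.append((preds[i][1], i))
        # Re-mask the least confident of this step's commitments so that
        # the schedule's fraction of the sequence stays masked.
        n_masked = int(cosine_schedule(step, total_steps) * seq_len)
        committed.sort()  # ascending confidence
        for _, i in committed[:n_masked]:
            tokens[i] = MASK
    return tokens
```

Because the schedule decays to zero at the final step, every position ends up committed; a rescoring model, as proposed in the paper, would replace the raw confidences used to decide which positions to re-mask.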