

From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion

August 2, 2023
Authors: Robin San Roman, Yossi Adi, Antoine Deleforge, Romain Serizel, Gabriel Synnaeve, Alexandre Défossez
cs.AI

Abstract

Deep generative models can generate high-fidelity audio conditioned on various types of representations (e.g., mel-spectrograms, Mel-frequency Cepstral Coefficients (MFCC)). Recently, such models have been used to synthesize audio waveforms conditioned on highly compressed representations. Although such methods produce impressive results, they are prone to generating audible artifacts when the conditioning is flawed or imperfect. An alternative modeling approach is to use diffusion models. However, these have mainly been used as speech vocoders (i.e., conditioned on mel-spectrograms) or to generate relatively low-sampling-rate signals. In this work, we propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality (e.g., speech, music, environmental sounds) from low-bitrate discrete representations. At an equal bit rate, the proposed approach outperforms state-of-the-art generative techniques in terms of perceptual quality. Training and evaluation code, along with audio samples, are available on the facebookresearch/audiocraft GitHub page.
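To make the idea concrete, here is a minimal sketch of one plausible reading of "multi-band diffusion": the waveform is decomposed into frequency bands, each band is generated by its own diffusion sampler conditioned on the codec latents derived from the discrete tokens, and the bands are summed. This is not the AudioCraft implementation; `band_edges_hz`, `BandDenoiser`, `sample_band`, the DDPM schedule, and the conditioning shape are all illustrative assumptions.

```python
# Minimal sketch of the multi-band diffusion idea, assuming:
#  - the band layout, denoiser, and noise schedule are illustrative,
#    NOT the AudioCraft API;
#  - `cond` stands in for codec latents upsampled to the waveform rate.
import torch
import torch.nn as nn

SAMPLE_RATE = 24_000
band_edges_hz = [0, 1_500, 6_000, 12_000]  # hypothetical band boundaries

def split_bands(wav: torch.Tensor) -> list[torch.Tensor]:
    """Split a mono waveform [..., T] into frequency bands with an FFT
    brick-wall filter; the bands sum back to the original signal, which is
    one way per-band training targets could be built."""
    spec = torch.fft.rfft(wav)
    freqs = torch.fft.rfftfreq(wav.shape[-1], d=1.0 / SAMPLE_RATE)
    edges = band_edges_hz + [SAMPLE_RATE // 2]
    return [
        torch.fft.irfft(spec * ((freqs >= lo) & (freqs < hi)), n=wav.shape[-1])
        for lo, hi in zip(edges[:-1], edges[1:])
    ]

class BandDenoiser(nn.Module):
    """Tiny stand-in for a per-band noise-prediction network (the real
    model would be a much larger conditioned architecture)."""
    def __init__(self, cond_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1 + cond_dim, 32, 5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 1, 5, padding=2),
        )

    def forward(self, x_t: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # Predict the noise present in x_t given the codec conditioning.
        return self.net(torch.cat([x_t, cond], dim=1))

@torch.no_grad()
def sample_band(model: BandDenoiser, cond: torch.Tensor,
                num_steps: int = 50) -> torch.Tensor:
    """Plain DDPM ancestral sampling for one band, conditioned on codec
    latents `cond` of shape [B, C, T]."""
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas, alpha_bar = 1.0 - betas, torch.cumprod(1.0 - betas, dim=0)
    x = torch.randn(cond.shape[0], 1, cond.shape[-1])  # start from noise
    for t in reversed(range(num_steps)):
        eps = model(x, cond)
        mean = (x - betas[t] / (1.0 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise
    return x

# Decoding: one independent sampler per band, then sum the bands.
cond = torch.randn(1, 8, SAMPLE_RATE)  # stand-in for upsampled token embeddings
models = [BandDenoiser(cond_dim=8) for _ in band_edges_hz]
wav = sum(sample_band(m, cond) for m in models)  # [1, 1, T] waveform
```

Generating each band with an independent sampler is one plausible motivation for a multi-band design: errors in one frequency range cannot leak into another, and the noise schedule can be tuned per band. The actual band filters, training objective, and conditioning pipeline are in the facebookresearch/audiocraft repository.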