From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion
August 2, 2023
Authors: Robin San Roman, Yossi Adi, Antoine Deleforge, Romain Serizel, Gabriel Synnaeve, Alexandre Défossez
cs.AI
Abstract
Deep generative models can generate high-fidelity audio conditioned on
various types of representations (e.g., mel-spectrograms, Mel-frequency
Cepstral Coefficients (MFCC)). Recently, such models have been used to
synthesize audio waveforms conditioned on highly compressed representations.
Although such methods produce impressive results, they are prone to generating
audible artifacts when the conditioning is flawed or imperfect. An alternative
modeling approach is to use diffusion models. However, these have mainly been
used as speech vocoders (i.e., conditioned on mel-spectrograms) or to generate
relatively low-sampling-rate signals. In this work, we propose a high-fidelity
multi-band diffusion-based framework that generates any type of audio modality
(e.g., speech, music, environmental sounds) from low-bitrate discrete
representations. At an equal bit rate, the proposed approach outperforms
state-of-the-art generative techniques in terms of perceptual quality. Training
and evaluation code, along with audio samples, is available on the
facebookresearch/audiocraft GitHub page.
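
Since the abstract only names the ingredients, here is a minimal conceptual sketch of the multi-band idea, not the authors' implementation: the waveform is partitioned into frequency bands, a separate diffusion process generates each band conditioned on the low-bitrate discrete tokens, and the bands are summed back into a full waveform. All class names, band edges, the hop size, and the noise schedule below are illustrative assumptions; the actual models live in the facebookresearch/audiocraft repository.

```python
import torch
import torch.nn as nn
import torchaudio


class BandSplitter(nn.Module):
    """Splits a waveform into complementary frequency bands with cascaded
    low-pass filters; at training time each band would serve as the target
    of its own diffusion model. Band edges here are illustrative."""

    def __init__(self, sample_rate=24_000, cutoffs=(1_500, 4_000, 8_000)):
        super().__init__()
        self.sample_rate = sample_rate
        self.cutoffs = cutoffs

    def forward(self, wav):  # wav: (batch, time)
        bands, residual = [], wav
        for fc in self.cutoffs:
            low = torchaudio.functional.lowpass_biquad(residual, self.sample_rate, fc)
            bands.append(low)
            residual = residual - low
        bands.append(residual)  # whatever remains above the last cutoff
        return bands


class BandDenoiser(nn.Module):
    """Tiny stand-in for the score network of one band: predicts the noise in
    a noisy band given the token conditioning and the diffusion step."""

    def __init__(self, cond_dim=128, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1 + cond_dim + 1, hidden, 5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, 1, 5, padding=2),
        )

    def forward(self, noisy, cond, t):  # noisy: (B, T), cond: (B, C, T), t in [0, 1]
        t_map = torch.full_like(noisy, t).unsqueeze(1)  # broadcast step as a channel
        x = torch.cat([noisy.unsqueeze(1), cond, t_map], dim=1)
        return self.net(x).squeeze(1)  # predicted noise for this band


@torch.no_grad()
def decode_tokens(tokens, codebook, denoisers, steps=50, hop=320):
    """Reverse-diffusion sketch: each band is generated independently from
    Gaussian noise, conditioned on embedded tokens, then the bands are summed."""
    cond = codebook(tokens).transpose(1, 2)              # (B, C, frames)
    total = tokens.shape[1] * hop                        # assumed samples per token frame
    cond = nn.functional.interpolate(cond, size=total)   # stretch conditioning to audio rate
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = torch.cumprod(1.0 - betas, dim=0)           # cumulative noise schedule
    wav = torch.zeros(tokens.shape[0], total)
    for denoiser in denoisers:                           # one diffusion process per band
        x = torch.randn_like(wav)
        for i in reversed(range(steps)):
            eps = denoiser(x, cond, i / steps)
            a = alphas[i]
            a_prev = alphas[i - 1] if i > 0 else torch.tensor(1.0)
            x0 = (x - (1.0 - a).sqrt() * eps) / a.sqrt()          # predicted clean band
            x = a_prev.sqrt() * x0 + (1.0 - a_prev).sqrt() * eps  # deterministic DDIM step
        wav = wav + x                                    # recombine bands by summation
    return wav
```

A toy invocation with random codes (the weights are untrained, so the output is noise, but the shapes line up):

```python
codebook = nn.Embedding(1024, 128)                   # stand-in for the token embedding table
denoisers = [BandDenoiser() for _ in range(4)]       # three cutoffs plus a residual band
tokens = torch.randint(0, 1024, (1, 50))             # 50 frames of compressed codes
wav = decode_tokens(tokens, codebook, denoisers)     # -> (1, 50 * 320) waveform
```

Generating each band with its own diffusion process and summing at the end is the move the title refers to; the full-scale score networks, schedules, and EnCodec tokenizer are in the facebookresearch/audiocraft repository.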