From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion
August 2, 2023
Authors: Robin San Roman, Yossi Adi, Antoine Deleforge, Romain Serizel, Gabriel Synnaeve, Alexandre Défossez
cs.AI
Abstract
Deep generative models can generate high-fidelity audio conditioned on
various types of representations (e.g., mel-spectrograms, Mel-frequency
Cepstral Coefficients (MFCC)). Recently, such models have been used to
synthesize audio waveforms conditioned on highly compressed representations.
Although such methods produce impressive results, they are prone to generating
audible artifacts when the conditioning is flawed or imperfect. An alternative
modeling approach is to use diffusion models. However, these have mainly been
used as speech vocoders (i.e., conditioned on mel-spectrograms) or to generate
relatively low-sampling-rate signals. In this work, we propose a high-fidelity
multi-band diffusion-based framework that generates any type of audio modality
(e.g., speech, music, environmental sounds) from low-bitrate discrete
representations. At an equal bit rate, the proposed approach outperforms
state-of-the-art generative techniques in terms of perceptual quality. Training
and evaluation code, along with audio samples, is available on the
facebookresearch/audiocraft GitHub page.
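
Since the abstract only names the ingredients, here is a minimal conceptual sketch of the multi-band idea, not the authors' implementation: the waveform is partitioned into frequency bands, a separate diffusion process generates each band conditioned on the low-bitrate discrete tokens, and the bands are summed back into a full waveform. All class names, band edges, the hop size, and the noise schedule below are illustrative assumptions; the actual models live in the facebookresearch/audiocraft repository.

```python
import torch
import torch.nn as nn
import torchaudio


class BandSplitter(nn.Module):
    """Splits a waveform into complementary frequency bands with cascaded
    low-pass filters; at training time each band would serve as the target
    of its own diffusion model. Band edges here are illustrative."""

    def __init__(self, sample_rate=24_000, cutoffs=(1_500, 4_000, 8_000)):
        super().__init__()
        self.sample_rate = sample_rate
        self.cutoffs = cutoffs

    def forward(self, wav):  # wav: (batch, time)
        bands, residual = [], wav
        for fc in self.cutoffs:
            low = torchaudio.functional.lowpass_biquad(residual, self.sample_rate, fc)
            bands.append(low)
            residual = residual - low
        bands.append(residual)  # whatever remains above the last cutoff
        return bands


class BandDenoiser(nn.Module):
    """Tiny stand-in for the score network of one band: predicts the noise in
    a noisy band given the token conditioning and the diffusion step."""

    def __init__(self, cond_dim=128, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1 + cond_dim + 1, hidden, 5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, 1, 5, padding=2),
        )

    def forward(self, noisy, cond, t):  # noisy: (B, T), cond: (B, C, T), t in [0, 1]
        t_map = torch.full_like(noisy, t).unsqueeze(1)  # broadcast step as a channel
        x = torch.cat([noisy.unsqueeze(1), cond, t_map], dim=1)
        return self.net(x).squeeze(1)  # predicted noise for this band


@torch.no_grad()
def decode_tokens(tokens, codebook, denoisers, steps=50, hop=320):
    """Reverse-diffusion sketch: each band is generated independently from
    Gaussian noise, conditioned on embedded tokens, then the bands are summed."""
    cond = codebook(tokens).transpose(1, 2)              # (B, C, frames)
    total = tokens.shape[1] * hop                        # assumed samples per token frame
    cond = nn.functional.interpolate(cond, size=total)   # stretch conditioning to audio rate
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = torch.cumprod(1.0 - betas, dim=0)           # cumulative noise schedule
    wav = torch.zeros(tokens.shape[0], total)
    for denoiser in denoisers:                           # one diffusion process per band
        x = torch.randn_like(wav)
        for i in reversed(range(steps)):
            eps = denoiser(x, cond, i / steps)
            a = alphas[i]
            a_prev = alphas[i - 1] if i > 0 else torch.tensor(1.0)
            x0 = (x - (1.0 - a).sqrt() * eps) / a.sqrt()          # predicted clean band
            x = a_prev.sqrt() * x0 + (1.0 - a_prev).sqrt() * eps  # deterministic DDIM step
        wav = wav + x                                    # recombine bands by summation
    return wav
```

A toy invocation with random codes (the weights are untrained, so the output is noise, but the shapes line up):

```python
codebook = nn.Embedding(1024, 128)                   # stand-in for the token embedding table
denoisers = [BandDenoiser() for _ in range(4)]       # three cutoffs plus a residual band
tokens = torch.randint(0, 1024, (1, 50))             # 50 frames of compressed codes
wav = decode_tokens(tokens, codebook, denoisers)     # -> (1, 50 * 320) waveform
```

Generating each band with its own diffusion process and summing at the end is the move the title refers to; the full-scale score networks, schedules, and EnCodec tokenizer are in the facebookresearch/audiocraft repository.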