From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion

Abstract

Deep generative models can generate high-fidelity audio conditioned on various types of representations (e.g., mel-spectrograms, Mel-frequency Cepstral Coefficients (MFCC)). Recently, such models have been used to synthesize audio waveforms conditioned on highly compressed representations. Although such methods produce impressive results, they are prone to generating audible artifacts when the conditioning is flawed or imperfect. An alternative modeling approach is to use diffusion models. However, these have mainly been used as speech vocoders (i.e., conditioned on mel-spectrograms) or to generate relatively low sampling rate signals. In this work, we propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality (e.g., speech, music, environmental sounds) from low-bitrate discrete representations. At equal bit rate, the proposed approach outperforms state-of-the-art generative techniques in terms of perceptual quality. Training and evaluation code, along with audio samples, are available on the facebookresearch/audiocraft GitHub page.
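To make "low-bitrate discrete representations" concrete, the bitrate of a token stream can be estimated from its frame rate, the number of parallel codebooks, and the codebook size. The sketch below is only illustrative: the function name `token_bitrate_bps` is hypothetical, and the settings (75 frames per second, 1024-entry codebooks, 2 to 8 codebooks) are assumptions typical of neural audio codecs, not values stated in this abstract.

```python
import math


def token_bitrate_bps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Bits per second carried by a stream of discrete audio tokens.

    Each token selects one entry from a codebook, so it carries
    log2(codebook_size) bits; every frame emits one token per codebook.
    """
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_token


# Assumed, illustrative settings (not taken from the abstract).
FRAME_RATE_HZ = 75
CODEBOOK_SIZE = 1024

for num_codebooks in (2, 4, 8):
    kbps = token_bitrate_bps(FRAME_RATE_HZ, num_codebooks, CODEBOOK_SIZE) / 1000
    print(f"{num_codebooks} codebooks -> {kbps:.1f} kbps")
```

Under these assumed settings the stream carries 1.5, 3.0, and 6.0 kbps, respectively, which is orders of magnitude below the bitrate of uncompressed waveform audio.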
