FreGrad: Lightweight and Fast Frequency-aware Diffusion Vocoder

Abstract

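The goal of this paper is to generate realistic audio with FreGrad, a lightweight and fast diffusion-based vocoder. Our framework consists of three key components: (1) we employ a discrete wavelet transform that decomposes a complicated waveform into sub-band wavelets, which lets FreGrad operate on a simple and concise feature space; (2) we design a frequency-aware dilated convolution that improves frequency awareness, so that the generated speech carries accurate frequency information; and (3) we introduce a bag of tricks that boosts the generation quality of the proposed model. In our experiments, FreGrad trains 3.7 times faster and runs inference 2.2 times faster than our baseline, while reducing the model size to 0.6 times that of the baseline (only 1.78M parameters) without sacrificing output quality. Audio samples are available at: https://mm.kaist.ac.kr/projects/FreGrad.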
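To make component (1) concrete, the following is a minimal sketch of a one-level discrete wavelet transform on a 1-D signal. It assumes a Haar wavelet, which the abstract does not specify, and the function names `haar_dwt` and `haar_idwt` are hypothetical helpers for illustration, not part of FreGrad's code. It shows how a waveform can be split into a low-frequency and a high-frequency sub-band, each half the original length, and then reconstructed without loss.

```python
import math
import random


def haar_dwt(signal):
    """One-level Haar DWT (assumed wavelet): return (low, high) sub-bands."""
    samples = list(signal)
    if len(samples) % 2:
        samples.append(0.0)  # zero-pad odd-length input so samples pair up
    scale = math.sqrt(2.0)
    pairs = list(zip(samples[0::2], samples[1::2]))
    low = [(a + b) / scale for a, b in pairs]   # scaled pairwise sums
    high = [(a - b) / scale for a, b in pairs]  # scaled pairwise differences
    return low, high


def haar_idwt(low, high):
    """Inverse of haar_dwt: interleave the recovered even and odd samples."""
    scale = math.sqrt(2.0)
    out = []
    for l, h in zip(low, high):
        out.append((l + h) / scale)  # even-indexed sample
        out.append((l - h) / scale)  # odd-indexed sample
    return out


# Toy stand-in for a waveform: 1000 Gaussian samples.
random.seed(0)
wave = [random.gauss(0.0, 1.0) for _ in range(1000)]
low, high = haar_dwt(wave)
recon = haar_idwt(low, high)
```

Each sub-band has half as many samples as the input, which is one plausible reading of why operating on sub-bands yields a more compact feature space; the transform is invertible, so no information is discarded by the decomposition itself.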

