FreGrad: Lightweight and Fast Frequency-aware Diffusion Vocoder

January 18, 2024
Authors: Tan Dat Nguyen, Ji-Hoon Kim, Youngjoon Jang, Jaehun Kim, Joon Son Chung
cs.AI

Abstract

The goal of this paper is to generate realistic audio with a lightweight and fast diffusion-based vocoder, named FreGrad. Our framework consists of three key components: (1) we employ a discrete wavelet transform that decomposes a complicated waveform into sub-band wavelets, which helps FreGrad operate on a simple and concise feature space; (2) we design a frequency-aware dilated convolution that elevates frequency awareness, resulting in speech with accurate frequency information; and (3) we introduce a bag of tricks that boosts the generation quality of the proposed model. In our experiments, FreGrad achieves 3.7 times faster training and 2.2 times faster inference than our baseline, while shrinking the model to 0.6 times the baseline size (only 1.78M parameters) without sacrificing output quality. Audio samples are available at: https://mm.kaist.ac.kr/projects/FreGrad.
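The abstract gives no implementation details, but a minimal PyTorch sketch may help make components (1) and (2) concrete. Everything below is illustrative: `haar_dwt`, `haar_idwt`, and `FreqDilatedConv` are hypothetical names, the Haar wavelet is just one common choice of DWT, and the layer shapes are assumptions rather than the paper's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SQRT_HALF = 0.5 ** 0.5  # Haar filter coefficient, 1 / sqrt(2)

def haar_dwt(x):
    """Single-level Haar DWT: split (B, C, T) features into a low-band
    (approximation) and a high-band (detail) part, each of length T // 2."""
    C = x.shape[1]
    lo = x.new_tensor([SQRT_HALF, SQRT_HALF]).view(1, 1, 2).repeat(C, 1, 1)
    hi = x.new_tensor([SQRT_HALF, -SQRT_HALF]).view(1, 1, 2).repeat(C, 1, 1)
    return (F.conv1d(x, lo, stride=2, groups=C),
            F.conv1d(x, hi, stride=2, groups=C))

def haar_idwt(xl, xh):
    """Inverse Haar DWT: perfectly reconstructs the full-length signal."""
    C = xl.shape[1]
    lo = xl.new_tensor([SQRT_HALF, SQRT_HALF]).view(1, 1, 2).repeat(C, 1, 1)
    hi = xl.new_tensor([SQRT_HALF, -SQRT_HALF]).view(1, 1, 2).repeat(C, 1, 1)
    return (F.conv_transpose1d(xl, lo, stride=2, groups=C)
            + F.conv_transpose1d(xh, hi, stride=2, groups=C))

class FreqDilatedConv(nn.Module):
    """Hypothetical frequency-aware dilated convolution block: decompose
    hidden features into wavelet sub-bands, apply a dilated convolution on
    the stacked (half-length) bands, then recombine via the inverse DWT."""
    def __init__(self, channels, dilation):
        super().__init__()
        # padding = dilation preserves the sequence length with kernel_size=3
        self.conv = nn.Conv1d(2 * channels, 2 * channels, kernel_size=3,
                              dilation=dilation, padding=dilation)

    def forward(self, x):                      # x: (B, C, T), T even
        xl, xh = haar_dwt(x)                   # two (B, C, T // 2) sub-bands
        y = self.conv(torch.cat([xl, xh], 1))  # dilated conv in wavelet space
        yl, yh = y.chunk(2, dim=1)
        return haar_idwt(yl, yh)               # back to (B, C, T)

# Example: a 32-channel hidden feature over 16,000 time steps.
x = torch.randn(1, 32, 16000)
y = FreqDilatedConv(channels=32, dilation=2)(x)
print(y.shape)  # torch.Size([1, 32, 16000])
```

Operating on two half-length sub-bands lets a small dilated convolution cover the same receptive field at roughly half the temporal cost, which is consistent with the training- and inference-speed gains the abstract reports.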