
FreGrad: Lightweight and Fast Frequency-aware Diffusion Vocoder

January 18, 2024
Authors: Tan Dat Nguyen, Ji-Hoon Kim, Youngjoon Jang, Jaehun Kim, Joon Son Chung
cs.AI

Abstract

The goal of this paper is to generate realistic audio with a lightweight and fast diffusion-based vocoder, named FreGrad. Our framework consists of the following three key components: (1) We employ a discrete wavelet transform that decomposes a complicated waveform into sub-band wavelets, which helps FreGrad operate on a simple and concise feature space; (2) We design a frequency-aware dilated convolution that elevates frequency awareness, resulting in speech generated with accurate frequency information; and (3) We introduce a bag of tricks that boosts the generation quality of the proposed model. In our experiments, FreGrad achieves 3.7 times faster training and 2.2 times faster inference than our baseline while reducing the model size to 0.6 times that of the baseline (only 1.78M parameters), without sacrificing output quality. Audio samples are available at: https://mm.kaist.ac.kr/projects/FreGrad.
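To make the first component concrete: a single-level discrete wavelet transform splits a waveform into a low-frequency (approximation) and a high-frequency (detail) sub-band, each at half the original length, and is exactly invertible. The sketch below uses a plain Haar wavelet in NumPy as an illustrative stand-in; it is not the authors' code, and FreGrad's actual wavelet choice may differ.

```python
import numpy as np

# Illustrative Haar DWT (assumption: a minimal stand-in for the
# paper's discrete wavelet transform, not FreGrad's implementation).
def haar_dwt(x):
    """Split an even-length signal into low/high sub-bands at half length."""
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2)   # approximation (low-frequency sub-band)
    high = (even - odd) / np.sqrt(2)  # detail (high-frequency sub-band)
    return low, high

def haar_idwt(low, high):
    """Losslessly invert haar_dwt."""
    even = (low + high) / np.sqrt(2)
    odd = (low - high) / np.sqrt(2)
    x = np.empty(even.size + odd.size)
    x[0::2], x[1::2] = even, odd
    return x

# Toy 440 Hz tone at 16 kHz (hypothetical input, not from the paper).
sr = 16000
t = np.arange(sr) / sr
waveform = np.sin(2 * np.pi * 440 * t)

low, high = haar_dwt(waveform)
recon = haar_idwt(low, high)
print(low.size, np.allclose(recon, waveform))  # 8000 True
```

Because each sub-band has half the temporal resolution of the input, a vocoder operating on the two sub-bands processes shorter sequences per step, which is one way the halved feature space can translate into faster training and inference.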