FreGrad: Lightweight and Fast Frequency-aware Diffusion Vocoder

Abstract

The goal of this paper is to generate realistic audio with a lightweight and fast diffusion-based vocoder, named FreGrad. Our framework consists of three key components: (1) we employ a discrete wavelet transform that decomposes a complicated waveform into sub-band wavelets, which lets FreGrad operate on a simple and concise feature space; (2) we design a frequency-aware dilated convolution that elevates frequency awareness, resulting in speech with accurate frequency information; and (3) we introduce a bag of tricks that boosts the generation quality of the proposed model. In our experiments, FreGrad achieves 3.7 times faster training and 2.2 times faster inference than our baseline while reducing the model size by 0.6 times (only 1.78M parameters), without sacrificing output quality. Audio samples are available at: https://mm.kaist.ac.kr/projects/FreGrad.
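To make the sub-band decomposition in component (1) concrete, the following is a minimal sketch of a one-level discrete wavelet transform on a toy waveform. The abstract does not state which wavelet FreGrad uses, so the Haar wavelet here is an assumption chosen for simplicity, and the function names `haar_dwt` and `haar_idwt` are hypothetical helpers, not part of the authors' code.

```python
import math

def haar_dwt(signal):
    """One-level Haar DWT (assumed wavelet): split an even-length
    signal into a low-frequency and a high-frequency sub-band."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = 1.0 / math.sqrt(2.0)
    # Scaled pairwise sums keep the slowly varying part of the signal.
    low = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    # Scaled pairwise differences keep the rapidly varying part.
    high = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return low, high

def haar_idwt(low, high):
    """Inverse of haar_dwt: interleave the two sub-bands back into a waveform."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for l, h in zip(low, high):
        out.append((l + h) * s)
        out.append((l - h) * s)
    return out

# Toy waveform: a slow sine plus a fast alternating component.
n = 64
wave = [math.sin(2 * math.pi * 2 * t / n) + 0.25 * (-1) ** t for t in range(n)]

low, high = haar_dwt(wave)
recon = haar_idwt(low, high)
max_err = max(abs(a - b) for a, b in zip(wave, recon))

print(len(wave), len(low), len(high))  # prints: 64 32 32
print(max_err < 1e-9)                  # prints: True
```

Each sub-band has half the length of the input, and the inverse transform recovers the waveform up to floating-point error. This illustrates how a model could work on shorter, simpler sequences and still produce a full-resolution waveform at the end. How FreGrad feeds the sub-bands to its network is not described in the abstract.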
