From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion

Abstract

Deep generative models can generate high-fidelity audio conditioned on various types of representations (e.g., mel-spectrograms, Mel-frequency Cepstral Coefficients (MFCC)). Recently, such models have been used to synthesize audio waveforms conditioned on highly compressed representations. Although such methods produce impressive results, they are prone to generating audible artifacts when the conditioning is flawed or imperfect. An alternative modeling approach is to use diffusion models. However, these have mainly been used as speech vocoders (i.e., conditioned on mel-spectrograms) or to generate signals at relatively low sampling rates. In this work, we propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality (e.g., speech, music, environmental sounds) from low-bitrate discrete representations. At equal bit rate, the proposed approach outperforms state-of-the-art generative techniques in terms of perceptual quality. Training and evaluation code, along with audio samples, are available on the facebookresearch/audiocraft GitHub page.
