EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis

Abstract

Audio diffusion models can synthesize a wide variety of sounds. Existing models often operate in a latent domain and rely on cascaded phase recovery modules to reconstruct the waveform, which poses challenges when generating high-fidelity audio. In this paper, we propose EDMSound, a diffusion-based generative model that operates in the spectrogram domain under the framework of elucidated diffusion models (EDM). Combined with an efficient deterministic sampler, EDMSound achieves a Fréchet audio distance (FAD) score similar to that of the top-ranked baseline with only 10 steps, and reaches state-of-the-art performance with 50 steps on the DCASE2023 foley sound generation benchmark. We also reveal a potential concern with diffusion-based audio generation models: they tend to generate samples with high perceptual similarity to the training data. Project page: https://agentcooper2002.github.io/EDMSound/
