
EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis

November 15, 2023
作者: Ge Zhu, Yutong Wen, Marc-André Carbonneau, Zhiyao Duan
cs.AI

Abstract

Audio diffusion models can synthesize a wide variety of sounds. Existing models often operate in the latent domain with cascaded phase-recovery modules to reconstruct the waveform, which poses challenges for generating high-fidelity audio. In this paper, we propose EDMSound, a diffusion-based generative model in the spectrogram domain built on the elucidated diffusion model (EDM) framework. Combined with an efficient deterministic sampler, we achieve a Fréchet audio distance (FAD) score comparable to the top-ranked baseline with only 10 sampling steps and reach state-of-the-art performance with 50 steps on the DCASE2023 Foley sound generation benchmark. We also reveal a potential concern with diffusion-based audio generation models: they tend to generate samples with high perceptual similarity to their training data. Project page: https://agentcooper2002.github.io/EDMSound/
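For context, the "efficient deterministic sampler" in the EDM framework typically refers to a second-order (Heun) solver for the probability-flow ODE, which is what allows good sample quality in as few as 10 steps. The sketch below is a minimal illustration of that idea only, not EDMSound's released code; the `denoiser` callable, the sigma-schedule hyperparameters, and the spectrogram shape are assumptions.

```python
import torch

def edm_sigma_schedule(n_steps=10, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Karras et al. (EDM) noise schedule: rho-spaced sigmas from sigma_max down to sigma_min."""
    ramp = torch.linspace(0, 1, n_steps)
    sigmas = (sigma_max ** (1 / rho)
              + ramp * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    return torch.cat([sigmas, torch.zeros(1)])  # append sigma = 0 for the final step

@torch.no_grad()
def heun_sampler(denoiser, shape, n_steps=10, device="cpu"):
    """Deterministic second-order (Heun) EDM sampler over a spectrogram-shaped tensor.

    `denoiser(x, sigma)` is assumed to return the denoised estimate D(x; sigma).
    """
    sigmas = edm_sigma_schedule(n_steps).to(device)
    x = torch.randn(shape, device=device) * sigmas[0]   # start from pure noise at sigma_max
    for i in range(n_steps):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        d = (x - denoiser(x, sigma)) / sigma             # dx/dsigma of the probability-flow ODE
        x_euler = x + (sigma_next - sigma) * d           # Euler step
        if sigma_next > 0:                               # Heun correction, skipped on the last step
            d_next = (x_euler - denoiser(x_euler, sigma_next)) / sigma_next
            x = x + (sigma_next - sigma) * 0.5 * (d + d_next)
        else:
            x = x_euler
    return x  # generated spectrogram; a vocoder or inverse STFT would map it to a waveform
```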