
EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis

November 15, 2023
作者: Ge Zhu, Yutong Wen, Marc-André Carbonneau, Zhiyao Duan
cs.AI

Abstract

Audio diffusion models can synthesize a wide variety of sounds. Existing models often operate in the latent domain with cascaded phase-recovery modules to reconstruct the waveform, which poses challenges for generating high-fidelity audio. In this paper, we propose EDMSound, a diffusion-based generative model in the spectrogram domain built on the elucidated diffusion model (EDM) framework. Combined with an efficient deterministic sampler, we achieve a Fréchet audio distance (FAD) score comparable to the top-ranked baseline with only 10 sampling steps and reach state-of-the-art performance with 50 steps on the DCASE2023 Foley sound generation benchmark. We also reveal a potential concern with diffusion-based audio generation models: they tend to generate samples with high perceptual similarity to their training data. Project page: https://agentcooper2002.github.io/EDMSound/
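For context, the "efficient deterministic sampler" in the EDM framework typically refers to a second-order (Heun) solver for the probability-flow ODE, which is what allows good sample quality in as few as 10 steps. The sketch below is a minimal illustration of that idea only, not EDMSound's released code; the `denoiser` callable, the sigma-schedule hyperparameters, and the spectrogram shape are assumptions.

```python
import torch

def edm_sigma_schedule(n_steps=10, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Karras et al. (EDM) noise schedule: rho-spaced sigmas from sigma_max down to sigma_min."""
    ramp = torch.linspace(0, 1, n_steps)
    sigmas = (sigma_max ** (1 / rho)
              + ramp * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    return torch.cat([sigmas, torch.zeros(1)])  # append sigma = 0 for the final step

@torch.no_grad()
def heun_sampler(denoiser, shape, n_steps=10, device="cpu"):
    """Deterministic second-order (Heun) EDM sampler over a spectrogram-shaped tensor.

    `denoiser(x, sigma)` is assumed to return the denoised estimate D(x; sigma).
    """
    sigmas = edm_sigma_schedule(n_steps).to(device)
    x = torch.randn(shape, device=device) * sigmas[0]   # start from pure noise at sigma_max
    for i in range(n_steps):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        d = (x - denoiser(x, sigma)) / sigma             # dx/dsigma of the probability-flow ODE
        x_euler = x + (sigma_next - sigma) * d           # Euler step
        if sigma_next > 0:                               # Heun correction, skipped on the last step
            d_next = (x_euler - denoiser(x_euler, sigma_next)) / sigma_next
            x = x + (sigma_next - sigma) * 0.5 * (d + d_next)
        else:
            x = x_euler
    return x  # generated spectrogram; a vocoder or inverse STFT would map it to a waveform
```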