EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis
November 15, 2023
作者: Ge Zhu, Yutong Wen, Marc-André Carbonneau, Zhiyao Duan
cs.AI
Abstract
Audio diffusion models can synthesize a wide variety of sounds. Existing models often operate in a latent domain with cascaded phase-recovery modules to reconstruct waveforms, which poses challenges for generating high-fidelity audio. In this paper, we propose EDMSound, a diffusion-based generative model in the spectrogram domain built on the elucidated diffusion model (EDM) framework. Combined with an efficient deterministic sampler, EDMSound matches the Fréchet audio distance (FAD) of the top-ranked baseline with only 10 sampling steps and reaches state-of-the-art performance with 50 steps on the DCASE2023 Foley sound generation benchmark. We also reveal a potential concern: diffusion-based audio generation models tend to produce samples with high perceptual similarity to their training data. Project page: https://agentcooper2002.github.io/EDMSound/
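
The abstract credits an efficient deterministic sampler under the EDM framework for reaching competitive FAD in only 10 steps. Below is a minimal sketch of the kind of Heun-based deterministic probability-flow ODE sampler described in the EDM paper (Karras et al.), applied to a spectrogram-shaped tensor. The `denoiser` callable, step count, and sigma-schedule parameters are illustrative assumptions, not EDMSound's exact configuration.

```python
# Minimal sketch of an EDM-style deterministic (Heun) sampler.
# The denoiser, step count, and sigma schedule are illustrative
# assumptions, not EDMSound's exact settings.
import torch


def edm_sigma_schedule(num_steps=10, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Karras noise schedule: interpolate sigma^(1/rho) linearly, then append 0."""
    ramp = torch.linspace(0, 1, num_steps)
    sigmas = (sigma_max ** (1 / rho)
              + ramp * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    return torch.cat([sigmas, torch.zeros(1)])  # final sigma = 0


@torch.no_grad()
def heun_sampler(denoiser, shape, num_steps=10, device="cpu"):
    """Deterministic second-order (Heun) ODE sampler.

    `denoiser(x, sigma)` is assumed to return the model's estimate of the
    clean spectrogram given a noisy input at noise level `sigma`.
    """
    sigmas = edm_sigma_schedule(num_steps).to(device)
    # Start from pure Gaussian noise at the largest noise level.
    x = torch.randn(shape, device=device) * sigmas[0]
    for i in range(num_steps):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        # Euler step: d = (x - D(x, sigma)) / sigma is the ODE derivative.
        d = (x - denoiser(x, sigma)) / sigma
        x_next = x + (sigma_next - sigma) * d
        # Heun correction (skipped at the final step where sigma_next == 0).
        if sigma_next > 0:
            d_next = (x_next - denoiser(x_next, sigma_next)) / sigma_next
            x_next = x + (sigma_next - sigma) * 0.5 * (d + d_next)
        x = x_next
    return x  # generated spectrogram, to be inverted to a waveform
```

With a trained denoiser, a call such as `heun_sampler(model, (1, 1, n_mels, n_frames), num_steps=10)` would yield a spectrogram that a separate vocoder or inverse-STFT stage converts to audio; the shape and variable names here are hypothetical.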