EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis

Abstract

Audio diffusion models can synthesize a wide variety of sounds. Existing models often operate in a latent domain and rely on cascaded phase recovery modules to reconstruct the waveform, which poses challenges for generating high-fidelity audio. In this paper, we propose EDMSound, a diffusion-based generative model in the spectrogram domain under the framework of elucidated diffusion models (EDM). Combined with an efficient deterministic sampler, EDMSound achieves a Fréchet audio distance (FAD) score similar to that of the top-ranked baseline with only 10 steps and reaches state-of-the-art performance with 50 steps on the DCASE2023 foley sound generation benchmark. We also reveal a potential concern with diffusion-based audio generation models: they tend to generate samples with high perceptual similarity to the training data. Project page: https://agentcooper2002.github.io/EDMSound/
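For readers unfamiliar with the FAD metric mentioned above, the following is a minimal sketch of the Fréchet distance between two Gaussians fitted to embedding sets, which is the quantity FAD reports. It is not the evaluation code used for the benchmark. The function names `gaussian_stats` and `frechet_distance` are hypothetical, and the inputs are assumed to be arrays of shape (n_samples, dim) with dim of at least 2, produced by some audio embedding model that is not part of this sketch.

```python
import numpy as np


def gaussian_stats(embeddings):
    """Return the mean vector and covariance matrix of an (n_samples, dim) array."""
    mu = embeddings.mean(axis=0)
    sigma = np.cov(embeddings, rowvar=False)
    return mu, sigma


def frechet_distance(mu_r, sigma_r, mu_g, sigma_g):
    """Frechet distance between N(mu_r, sigma_r) and N(mu_g, sigma_g).

    d^2 = ||mu_r - mu_g||^2 + Tr(sigma_r) + Tr(sigma_g)
          - 2 * Tr((sigma_r sigma_g)^(1/2))
    """
    diff = mu_r - mu_g
    # The trace of the matrix square root equals the sum of the square roots
    # of the eigenvalues of sigma_r @ sigma_g. For covariance matrices these
    # eigenvalues are real and non-negative up to numerical error, so the
    # real part is taken and small negative values are clipped to zero.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * trace_sqrt)
```

In use, one would call `gaussian_stats` once on embeddings of reference audio and once on embeddings of generated audio, then pass the four results to `frechet_distance`. Lower values indicate that the two embedding distributions are closer.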
