PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation

Abstract

Universal waveform generation has recently been studied under a variety of out-of-distribution conditions. GAN-based methods are strong at fast waveform generation, but they are vulnerable to train-inference mismatch scenarios such as two-stage text-to-speech. Diffusion-based models have shown powerful generative performance in other domains, yet they have received little attention for waveform generation because of their slow inference speed. Above all, no existing generator architecture can explicitly disentangle the natural periodic features of high-resolution waveform signals. In this paper, we propose PeriodWave, a novel universal waveform generation model. First, we introduce a period-aware flow matching estimator that captures the periodic features of the waveform signal when estimating the vector fields. We further use a multi-period estimator whose periods avoid overlaps, so that it captures different periodic features of the waveform signal. Increasing the number of periods improves performance significantly, but it also increases the computational cost. To mitigate this, we propose a single period-conditional universal estimator that runs its feed-forward pass in parallel through period-wise batch inference. In addition, we use the discrete wavelet transform to losslessly disentangle the frequency information of waveform signals for high-frequency modeling, and we introduce FreeU to reduce high-frequency noise in waveform generation. Experimental results demonstrate that our model outperforms previous models on both Mel-spectrogram reconstruction and text-to-speech tasks. All source code will be available at https://github.com/sh-lee-prml/PeriodWave.
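As a rough illustration of two ideas named in the abstract, the following NumPy sketch shows (a) one possible way to expose periodic structure, by viewing a 1D waveform as a 2D array whose columns hold samples spaced one period apart, and (b) a one-level Haar discrete wavelet transform that splits a signal into low and high bands and reconstructs it exactly. This is a sketch under stated assumptions, not the PeriodWave implementation: the abstract does not specify how periods are applied, which wavelet is used, or how the tail is padded, so the function names `reshape_by_period`, `haar_dwt`, and `haar_idwt`, the zero padding, and the choice of the Haar wavelet are all hypothetical.

```python
import numpy as np


def reshape_by_period(wave: np.ndarray, period: int) -> np.ndarray:
    """View a 1D waveform as a 2D array of shape (frames, period).

    Column j holds samples j, j + period, j + 2 * period, ...
    The tail is zero-padded so the length divides evenly
    (the padding scheme is an assumption made for this sketch).
    """
    pad = (-len(wave)) % period
    padded = np.pad(wave, (0, pad))
    return padded.reshape(-1, period)


def haar_dwt(wave: np.ndarray):
    """One-level Haar DWT of an even-length signal: (low band, high band)."""
    even, odd = wave[0::2], wave[1::2]
    low = (even + odd) / np.sqrt(2.0)
    high = (even - odd) / np.sqrt(2.0)
    return low, high


def haar_idwt(low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Invert haar_dwt by interleaving the recovered even and odd samples."""
    out = np.empty(2 * len(low), dtype=low.dtype)
    out[0::2] = (low + high) / np.sqrt(2.0)
    out[1::2] = (low - high) / np.sqrt(2.0)
    return out


if __name__ == "__main__":
    wave = np.arange(10, dtype=np.float64)
    # Hypothetical periods, chosen only for illustration.
    for period in (2, 3, 5):
        print(period, reshape_by_period(wave, period).shape)

    signal = np.random.default_rng(0).standard_normal(16)
    low, high = haar_dwt(signal)
    print(np.allclose(haar_idwt(low, high), signal))
```

Each band returned by `haar_dwt` has half the length of the input, and `haar_idwt` recovers the input up to floating-point error, which is the sense in which such a decomposition is lossless.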
