Accelerating High-Fidelity Waveform Generation via Adversarial Flow Matching Optimization
August 15, 2024
Authors: Sang-Hoon Lee, Ha-Yeong Choi, Seong-Whan Lee
cs.AI
Abstract
This paper introduces PeriodWave-Turbo, a high-fidelity and highly efficient
waveform generation model based on adversarial flow matching optimization. Recently,
conditional flow matching (CFM) generative models have been successfully
adopted for waveform generation tasks, leveraging a single vector field
estimation objective for training. Although these models can generate
high-fidelity waveform signals, they require significantly more ODE steps
compared to GAN-based models, which only need a single generation step.
Additionally, the generated samples often lack high-frequency information due
to noisy vector field estimation, which fails to ensure high-frequency
reproduction. To address this limitation, we enhance pre-trained CFM-based
generative models by incorporating a fixed-step generator modification. We
utilize reconstruction losses and adversarial feedback to accelerate
high-fidelity waveform generation. Through adversarial flow matching
optimization, only 1,000 steps of fine-tuning are required to achieve
state-of-the-art performance across various objective metrics. Moreover, we
significantly reduce the number of inference steps from 16 to 2 or 4.
Additionally, by scaling up the backbone of PeriodWave from 29M to 70M
parameters for improved generalization, PeriodWave-Turbo achieves unprecedented
performance, with a perceptual evaluation of speech quality (PESQ) score of
4.454 on the LibriTTS dataset. Audio samples, source code and checkpoints will
be available at https://github.com/sh-lee-prml/PeriodWave.
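
To make the notion of "ODE steps" concrete, the following is a minimal sketch of fixed-step sampling from a flow matching model. It assumes a plain Euler solver on the interval t in [0, 1]; the abstract does not state which solver PeriodWave-Turbo uses. The names `euler_sample` and `make_toy_field` are hypothetical, and the toy vector field is an analytic stand-in for a trained network, not the PeriodWave model.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_sample(
    vector_field: Callable[[Vector, float], Vector],
    x0: Vector,
    num_steps: int,
) -> Vector:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with a fixed number of Euler steps."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    dt = 1.0 / num_steps
    x = list(x0)
    for i in range(num_steps):
        t = i * dt
        v = vector_field(x, t)  # one model evaluation per step
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_field(target: Vector) -> Callable[[Vector, float], Vector]:
    """Toy stand-in for a trained model (assumption, not PeriodWave).

    For the straight-line path x_t = (1 - t) * x0 + t * target, the
    conditional vector field is (target - x_t) / (1 - t).
    """
    def field(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for xi, ti in zip(x, target)]
    return field


random.seed(0)
target = [0.5, -0.25, 0.125]                 # pretend "waveform" samples
noise = [random.gauss(0.0, 1.0) for _ in target]
field = make_toy_field(target)

# Each step costs one model evaluation, so going from 16 steps to 2 or 4
# reduces the number of network calls proportionally.
results = {steps: euler_sample(field, noise, steps) for steps in (2, 4, 16)}
for steps, out in results.items():
    print(steps, out)
```

In this toy case the path is exactly straight, so every step count reaches the target. A learned vector field is only approximately straight, which is why few-step sampling generally needs additional training such as the fine-tuning described in the abstract.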