Fast Text-to-Audio Generation with Adversarial Post-Training
May 13, 2025
Authors: Zachary Novack, Zach Evans, Zack Zukowski, Josiah Taylor, CJ Carr, Julian Parker, Adnan Al-Sinan, Gian Marco Iodice, Julian McAuley, Taylor Berg-Kirkpatrick, Jordi Pons
cs.AI
Abstract
Text-to-audio systems, while increasingly performant, are slow at inference
time, making their latency impractical for many creative applications. We
present Adversarial Relativistic-Contrastive (ARC) post-training, the first
adversarial acceleration algorithm for diffusion/flow models not based on
distillation. While past adversarial post-training methods have struggled to
compare against their expensive distillation counterparts, ARC post-training is
a simple procedure that (1) extends a recent relativistic adversarial
formulation to diffusion/flow post-training and (2) combines it with a novel
contrastive discriminator objective to encourage better prompt adherence. We
pair ARC post-training with a number of optimizations to Stable Audio Open and
build a model capable of generating approximately 12 s of 44.1 kHz stereo audio
in approximately 75 ms on an H100, and in approximately 7 s on a mobile edge
device, the fastest text-to-audio model to our knowledge.