Fast Text-to-Audio Generation with Adversarial Post-Training
May 13, 2025
Authors: Zachary Novack, Zach Evans, Zack Zukowski, Josiah Taylor, CJ Carr, Julian Parker, Adnan Al-Sinan, Gian Marco Iodice, Julian McAuley, Taylor Berg-Kirkpatrick, Jordi Pons
cs.AI
Abstract
Text-to-audio systems, while increasingly performant, are slow at inference
time, which makes their latency impractical for many creative applications. We
present Adversarial Relativistic-Contrastive (ARC) post-training, the first
adversarial acceleration algorithm for diffusion/flow models not based on
distillation. While past adversarial post-training methods have struggled to
compete with their expensive distillation counterparts, ARC post-training is
a simple procedure that (1) extends a recent relativistic adversarial
formulation to diffusion/flow post-training and (2) combines it with a novel
contrastive discriminator objective to encourage better prompt adherence. We
pair ARC post-training with a number of optimizations to Stable Audio Open and
build a model capable of generating approximately 12 s of 44.1 kHz stereo audio
in approximately 75 ms on an H100, and in approximately 7 s on a mobile edge
device, the fastest text-to-audio model to our knowledge.
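To put the reported latencies in perspective, the short sketch below converts them into a real-time factor (seconds of audio produced per second of compute). The helper name `real_time_factor` and the constants are illustrative; the figures are the approximate values quoted in the abstract, and the mobile figure is read as the time needed to generate the same 12 s clip.

```python
def real_time_factor(audio_seconds: float, wall_clock_seconds: float) -> float:
    """Return seconds of audio generated per second of wall-clock compute."""
    if wall_clock_seconds <= 0:
        raise ValueError("wall_clock_seconds must be positive")
    return audio_seconds / wall_clock_seconds


# Approximate figures quoted in the abstract (assumed to describe the same clip length).
AUDIO_S = 12.0           # length of the generated stereo clip, in seconds
H100_LATENCY_S = 0.075   # about 75 ms on an H100
MOBILE_LATENCY_S = 7.0   # about 7 s on a mobile edge device

rtf_h100 = real_time_factor(AUDIO_S, H100_LATENCY_S)
rtf_mobile = real_time_factor(AUDIO_S, MOBILE_LATENCY_S)

print(f"H100:   ~{rtf_h100:.0f}x real time")    # H100:   ~160x real time
print(f"Mobile: ~{rtf_mobile:.1f}x real time")  # Mobile: ~1.7x real time
```

Under these assumptions, generation runs roughly 160 times faster than playback on the H100 and still slightly faster than playback on the mobile device.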