Presto! Distilling Steps and Layers for Accelerating Music Generation
October 7, 2024
Authors: Zachary Novack, Ge Zhu, Jonah Casebeer, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas J. Bryan
cs.AI
Abstract
Despite advances in diffusion-based text-to-music (TTM) methods, efficient, high-quality generation remains a challenge. We introduce Presto!, an approach to inference acceleration for score-based diffusion transformers via reducing both sampling steps and cost per step. To reduce steps, we develop a new score-based distribution matching distillation (DMD) method for the EDM-family of diffusion models, the first GAN-based distillation method for TTM. To reduce the cost per step, we develop a simple, but powerful improvement to a recent layer distillation method that improves learning via better preserving hidden state variance. Finally, we combine our step and layer distillation methods together for a dual-faceted approach. We evaluate our step and layer distillation methods independently and show each yields best-in-class performance. Our combined distillation method can generate high-quality outputs with improved diversity, accelerating our base model by 10-18x (230/435ms latency for 32-second mono/stereo 44.1kHz audio, 15x faster than comparable SOTA) -- the fastest high-quality TTM to our knowledge. Sound examples can be found at https://presto-music.github.io/web/.
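
To make the step-distillation idea concrete, below is a minimal, hypothetical PyTorch sketch of a generic distribution matching distillation (DMD) style generator update: the few-step student's output is re-noised, and the gap between denoised estimates under the frozen teacher (the real-data score) and an online "fake" score model gives the gradient direction for the student. The function name, arguments, and normalization choice are illustrative assumptions, not the paper's implementation, which further adapts DMD to the EDM family and incorporates a GAN-based objective.

```python
# Hypothetical illustration of a DMD-style step-distillation loss (PyTorch).
# All names here are placeholders, not the paper's code.
import torch
import torch.nn.functional as F

def dmd_generator_loss(generator, teacher, fake_score, noise, text_emb, sigma):
    """Distribution-matching loss for a few-step student generator.

    generator  : student mapping (noise, text) -> audio latent in one/few steps
    teacher    : frozen diffusion model (score of the real data distribution)
    fake_score : diffusion model trained online on the student's own outputs
    sigma      : noise level at which the two score estimates are compared
    """
    # 1) Few-step sample from the student.
    x = generator(noise, text_emb)

    # 2) Re-noise the sample at the chosen diffusion level.
    x_t = x + sigma * torch.randn_like(x)

    # 3) Denoised estimates under the real and fake score models.
    with torch.no_grad():
        denoised_real = teacher(x_t, sigma, text_emb)
        denoised_fake = fake_score(x_t, sigma, text_emb)
        # Approximate KL gradient: move x toward regions the teacher prefers.
        grad = denoised_fake - denoised_real
        grad = grad / grad.abs().mean().clamp(min=1e-8)  # crude normalization
        target = (x - grad).detach()

    # Surrogate objective whose gradient w.r.t. x equals `grad`.
    return 0.5 * F.mse_loss(x, target, reduction="sum") / x.shape[0]
```

In a full training loop, the fake score model would be updated in alternation with a standard denoising loss on the student's samples, and a GAN-style discriminator term would be added; both are omitted here for brevity.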
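The layer-distillation claim ("better preserving hidden state variance") can be illustrated with one plausible mechanism: when a transformer block is skipped at reduced depth, rescale the residual stream so its statistics match what downstream layers observed at full depth. The sketch below is an assumption for illustration only; the class name, EMA statistics, and rescaling rule are invented and should not be read as the paper's actual procedure.

```python
# Hypothetical sketch: layer dropping with hidden-state variance preservation.
import torch
import torch.nn as nn

class VariancePreservingStack(nn.Module):
    def __init__(self, blocks, momentum=0.99):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.momentum = momentum
        # Running estimate of the hidden-state std after each block at full depth.
        self.register_buffer("target_std", torch.ones(len(blocks)))

    def forward(self, h, keep_mask=None):
        # keep_mask[i] == False means block i is skipped on this pass.
        for i, block in enumerate(self.blocks):
            if keep_mask is None or keep_mask[i]:
                h = block(h)
                if self.training:
                    with torch.no_grad():
                        self.target_std[i].mul_(self.momentum).add_(
                            (1 - self.momentum) * h.std())
            else:
                # Skipped block: rescale so the hidden-state variance matches
                # the statistics the downstream layers were trained to expect.
                h = h * (self.target_std[i] / h.std().clamp(min=1e-8))
        return h
```

Rescaling with a running statistic keeps the skipped path training-free (no extra learned parameters); whether the paper uses this or a different variance-preservation strategy is not specified by the abstract.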