

DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation

May 30, 2024
Authors: Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas Bryan
cs.AI

Abstract

Controllable music generation methods are critical for human-centered AI-based music creation, but are currently limited by speed, quality, and control design trade-offs. Diffusion Inference-Time T-Optimization (DITTO), in particular, offers state-of-the-art results, but is over 10x slower than real-time, limiting practical use. We propose Distilled Diffusion Inference-Time T-Optimization (or DITTO-2), a new method to speed up inference-time optimization-based control and unlock faster-than-real-time generation for a wide variety of applications such as music inpainting, outpainting, intensity, melody, and musical structure control. Our method works by (1) distilling a pre-trained diffusion model for fast sampling via an efficient, modified consistency or consistency trajectory distillation process, (2) performing inference-time optimization using our distilled model with one-step sampling as an efficient surrogate optimization task, and (3) running a final multi-step sampling generation (decoding) using our estimated noise latents for best-quality, fast, controllable generation. Through thorough evaluation, we find our method not only speeds up generation by 10-20x, but simultaneously improves control adherence and generation quality. Furthermore, we apply our approach to a new application of maximizing text adherence (CLAP score) and show we can convert an unconditional diffusion model without text inputs into a model that yields state-of-the-art text control. Sound examples can be found at https://ditto-music.github.io/ditto2/.
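The three-stage recipe in the abstract (distill, optimize, decode) is concrete enough to sketch. Below is a minimal, hypothetical PyTorch illustration of stages (2) and (3): the initial noise latent is optimized by backpropagating through a single consistency-model step as a cheap surrogate objective, then decoded with a few sampling steps for quality. The model, control loss, and noise schedule are toy stand-ins invented for this sketch, not the authors' implementation.

```python
# Minimal, hypothetical sketch of DITTO-2 stages (2) and (3).
# TinyConsistencyModel, control_loss, and the re-noising schedule
# are toy stand-ins, not the authors' released code.
import torch
import torch.nn as nn

class TinyConsistencyModel(nn.Module):
    """Stand-in for the distilled (consistency) model: maps a noisy latent
    x_t and time t directly to an estimate of the clean latent x_0."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim)
        )

    def forward(self, x_t, t):
        return self.net(torch.cat([x_t, t.expand(x_t.shape[0], 1)], dim=-1))

def control_loss(x0_est, target):
    """Hypothetical differentiable control objective; a feature-matching or
    CLAP-score term would slot in here for the applications in the paper."""
    return ((x0_est - target) ** 2).mean()

@torch.no_grad()
def multistep_decode(model, x_T, n_steps=4):
    """Stage (3): multi-step consistency sampling (decode) from the
    optimized noise latent, with a simplified re-noising schedule."""
    x = x_T
    for i in range(n_steps, 0, -1):
        x0 = model(x, torch.tensor([[i / n_steps]]))
        if i > 1:  # re-noise toward the next (smaller) time step
            x = x0 + ((i - 1) / n_steps) * torch.randn_like(x0)
        else:
            x = x0
    return x

def ditto2_optimize(model, target, dim=64, iters=50, lr=1e-1):
    """Stage (2): inference-time optimization of the initial noise latent,
    backpropagating through a single sampling step per iteration."""
    for p in model.parameters():  # the distilled model stays frozen
        p.requires_grad_(False)
    x_T = torch.randn(1, dim, requires_grad=True)
    opt = torch.optim.Adam([x_T], lr=lr)
    t_final = torch.tensor([[1.0]])
    for _ in range(iters):
        loss = control_loss(model(x_T, t_final), target)  # one-step surrogate
        opt.zero_grad()
        loss.backward()
        opt.step()
    return multistep_decode(model, x_T.detach())

model = TinyConsistencyModel()      # stage (1), distillation, assumed done
latent = ditto2_optimize(model, target=torch.zeros(1, 64))
```

The claimed speedup follows from this structure: each optimization iteration backpropagates through one network evaluation of the distilled model rather than the full diffusion sampling chain required by the original DITTO, and the multi-step decode is only run once at the end.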
