DITTO: Diffusion Inference-Time T-Optimization for Music Generation
January 22, 2024
Authors: Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas J. Bryan
cs.AI
Abstract
We propose Diffusion Inference-Time T-Optimization (DITTO), a general-purpose framework for controlling pre-trained text-to-music diffusion models at inference time by optimizing the initial noise latents. Our method can optimize through any differentiable feature-matching loss to achieve a target (stylized) output, and leverages gradient checkpointing for memory efficiency. We demonstrate a surprisingly wide range of applications for music generation, including inpainting, outpainting, and looping, as well as intensity, melody, and musical structure control, all without ever fine-tuning the underlying model. When we compare our approach against related training-, guidance-, and optimization-based methods, we find that DITTO achieves state-of-the-art performance on nearly all tasks, including outperforming comparable approaches on controllability, audio quality, and computational efficiency, thus opening the door to high-quality, flexible, training-free control of diffusion models. Sound examples can be found at https://DITTO-Music.github.io/web/.
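
To make the mechanism concrete, here is a minimal PyTorch sketch of the idea the abstract describes: the pre-trained diffusion model stays frozen, each reverse-diffusion step is wrapped in gradient checkpointing, and only the initial noise latent is optimized against a differentiable feature-matching loss. The names `step_fn`, `feat`, `ditto`, the MSE loss, and all hyperparameters are illustrative assumptions, not the authors' released code.

```python
import torch
from torch.utils.checkpoint import checkpoint

def sample_with_checkpointing(step_fn, x_T, text_emb, timesteps):
    """Run the reverse diffusion chain from the initial noise x_T.

    Each step is checkpointed, so only one step's activations are
    held in memory at a time during backpropagation.
    """
    x = x_T
    for t in timesteps:
        # step_fn(x, t, text_emb) -> x at the next (less noisy) timestep;
        # a hypothetical denoising-step function, assumed differentiable.
        x = checkpoint(step_fn, x, t, text_emb, use_reentrant=False)
    return x

def ditto(step_fn, feat, text_emb, target_feat, shape, timesteps,
          iters=100, lr=1e-2):
    """Optimize the initial noise latent so that features of the
    generated output match target_feat (e.g. melody or intensity)."""
    x_T = torch.randn(shape, requires_grad=True)  # the only trainable tensor
    opt = torch.optim.Adam([x_T], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        out = sample_with_checkpointing(step_fn, x_T, text_emb, timesteps)
        # Any differentiable feature-matching loss works here; MSE is
        # just a placeholder choice for this sketch.
        loss = torch.nn.functional.mse_loss(feat(out), target_feat)
        loss.backward()  # gradients flow back through sampling to x_T
        opt.step()       # model weights are never updated, only x_T
    return x_T.detach()
```

In this setup the model's parameters would be frozen beforehand (e.g. `model.requires_grad_(False)`); checkpointing each sampling step is what makes backpropagating through the full sampling chain memory-feasible, at the cost of recomputing each step's forward pass during the backward pass.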