DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation
May 30, 2024
Authors: Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas Bryan
cs.AI
Abstract
Controllable music generation methods are critical for human-centered AI-based music creation, but are currently limited by speed, quality, and control design trade-offs. Diffusion Inference-Time T-Optimization (DITTO), in particular, offers state-of-the-art results, but is over 10x slower than real-time, limiting practical use. We propose Distilled Diffusion Inference-Time T-Optimization (or DITTO-2), a new method to speed up inference-time optimization-based control and unlock faster-than-real-time generation for a wide variety of applications such as music inpainting, outpainting, intensity, melody, and musical structure control. Our method works by (1) distilling a pre-trained diffusion model for fast sampling via an efficient, modified consistency or consistency trajectory distillation process, (2) performing inference-time optimization using our distilled model with one-step sampling as an efficient surrogate optimization task, and (3) running a final multi-step sampling generation (decoding) using our estimated noise latents for best-quality, fast, controllable generation. Through thorough evaluation, we find our method not only speeds up generation by 10-20x, but simultaneously improves control adherence and generation quality. Furthermore, we apply our approach to a new application of maximizing text adherence (CLAP score) and show we can convert an unconditional diffusion model without text inputs into a model that yields state-of-the-art text control. Sound examples can be found at https://ditto-music.github.io/ditto2/.
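The three-step pipeline described in the abstract (distill a fast sampler, optimize the initial noise latent against a control loss using one-step sampling as the surrogate, then run a multi-step decode from the optimized latent) can be illustrated with a minimal numerical sketch. Everything below is a hypothetical stand-in, not the DITTO-2 implementation: a well-conditioned diagonal linear map plays the role of the distilled one-step sampler, the "control objective" is a simple squared error to a target vector, and the multi-step decode just walks a toy trajectory toward the one-step sample.

```python
import numpy as np

def one_step_sample(model_weights, z):
    """Toy stand-in for the distilled model's single-step sampler."""
    return model_weights @ z

def optimize_noise_latents(model_weights, target, z0, lr=0.1, steps=500):
    """Surrogate inference-time optimization (step 2): gradient descent on
    the initial noise latent z so the cheap one-step sample matches a
    control target. Gradient is the analytic grad of ||A z - target||^2."""
    z = z0.astype(float).copy()
    for _ in range(steps):
        residual = one_step_sample(model_weights, z) - target
        z -= lr * 2.0 * model_weights.T @ residual
    return z

def multi_step_decode(model_weights, z, num_steps=4):
    """Toy multi-step decode (step 3): take several small steps along a
    straight-line trajectory from the latent toward the one-step sample.
    A real decoder would instead run several denoising steps."""
    x_hat = one_step_sample(model_weights, z)
    x = z.astype(float).copy()
    for k in range(1, num_steps + 1):
        t = k / num_steps
        x = (1.0 - t) * z + t * x_hat
    return x

# Hypothetical toy setup: diagonal "model" and an arbitrary control target.
A = np.diag([0.5, 0.8, 1.0, 1.2])
target = np.array([1.0, -0.5, 0.3, 2.0])
z_star = optimize_noise_latents(A, target, np.zeros(4))
x_final = multi_step_decode(A, z_star)
```

The key idea the sketch preserves is that the expensive control loop runs entirely through the cheap one-step sampler, and only the final render uses multi-step sampling from the optimized latent.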