DITTO: Diffusion Inference-Time T-Optimization for Music Generation

January 22, 2024
Authors: Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas J. Bryan
cs.AI

Abstract

We propose Diffusion Inference-Time T-Optimization (DITTO), a general-purpose framework for controlling pre-trained text-to-music diffusion models at inference time by optimizing initial noise latents. Our method can optimize through any differentiable feature-matching loss to achieve a target (stylized) output, and leverages gradient checkpointing for memory efficiency. We demonstrate a surprisingly wide range of applications for music generation, including inpainting, outpainting, and looping, as well as intensity, melody, and musical structure control, all without ever fine-tuning the underlying model. When we compare our approach against related training-, guidance-, and optimization-based methods, we find DITTO achieves state-of-the-art performance on nearly all tasks, including outperforming comparable approaches on controllability, audio quality, and computational efficiency, thus opening the door for high-quality, flexible, training-free control of diffusion models. Sound examples can be found at https://DITTO-Music.github.io/web/.
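
To make the core idea concrete, here is a minimal, hypothetical PyTorch sketch of optimizing an initial noise latent through a frozen sampling loop with gradient checkpointing. The names `denoise_step`, `feature_fn`, and the MSE loss choice are illustrative placeholders under our own assumptions, not the authors' implementation.

```python
import torch
from torch.utils.checkpoint import checkpoint

def ditto_optimize(denoise_step, x_T, timesteps, feature_fn, target_features,
                   n_iters=100, lr=1e-2):
    """Sketch of inference-time T-optimization: tune the initial noise
    latent x_T so the final sample matches target features.

    denoise_step(x_t, t) -> x_{t-1}: one step of a frozen pre-trained sampler.
    feature_fn(x_0) -> features:     any differentiable feature extractor.
    """
    x_T = x_T.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([x_T], lr=lr)

    for _ in range(n_iters):
        optimizer.zero_grad()
        x = x_T
        # Run the full sampling chain. Checkpointing discards intermediate
        # activations on the forward pass and recomputes them during backward,
        # trading compute for memory across the many diffusion steps.
        for t in timesteps:
            x = checkpoint(denoise_step, x, t, use_reentrant=False)
        # Differentiable feature-matching loss on the generated output.
        loss = torch.nn.functional.mse_loss(feature_fn(x), target_features)
        loss.backward()
        optimizer.step()
    return x_T.detach()
```

In practice, `denoise_step` would wrap a pre-trained text-to-music diffusion sampler, and `feature_fn` would be chosen per task, e.g., a melody, intensity, or structure feature extractor for the corresponding control application described in the abstract.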