MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners
June 23, 2025
Authors: Fang-Duo Tsai, Shih-Lun Wu, Weijaw Lee, Sheng-Ping Yang, Bo-Rui Chen, Hao-Chung Cheng, Yi-Hsuan Yang
cs.AI
Abstract
We propose MuseControlLite, a lightweight mechanism designed to fine-tune
text-to-music generation models for precise conditioning using various
time-varying musical attributes and reference audio signals. The key finding is
that positional embeddings, which have seldom been used in the conditioners of
text-to-music generation models for text conditions, are critical when the
condition of interest is a function of time. Using melody control as an
example, our experiments show that simply adding rotary positional embeddings
to the decoupled cross-attention layers increases control accuracy from 56.6%
to 61.1%, while requiring 6.75 times fewer trainable parameters than
state-of-the-art fine-tuning mechanisms, using the same pre-trained diffusion
Transformer model as Stable Audio Open. We evaluate various forms of musical
attribute control, audio inpainting, and audio outpainting, demonstrating
improved controllability over MusicGen-Large and Stable Audio Open ControlNet
at a significantly lower fine-tuning cost, with only 85M trainable parameters.
Source code, model checkpoints, and demo examples are available at:
https://musecontrollite.github.io/web/.
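
To make the core idea concrete, below is a minimal, single-head PyTorch sketch of what adding rotary positional embeddings to a decoupled cross-attention layer could look like. This is an illustrative reconstruction under stated assumptions, not the authors' released implementation: the module names, the single-head simplification, the scale factor, and the tensor dimensions are all assumptions; only the high-level mechanism (a frozen text branch plus a new trainable condition branch whose queries and keys receive RoPE) follows the abstract.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def rotary_embed(x: torch.Tensor) -> torch.Tensor:
        """Apply rotary positional embeddings (RoPE) along the sequence axis.

        x: (batch, seq_len, dim) with even dim. Channel pairs are rotated by a
        position-dependent angle, making attention scores position-aware.
        """
        b, n, d = x.shape
        half = d // 2
        # Standard RoPE frequencies: theta_i = 10000^(-i/half).
        freqs = 10000 ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)
        angles = torch.arange(n, dtype=x.dtype, device=x.device)[:, None] * freqs[None, :]
        cos, sin = angles.cos(), angles.sin()          # each (n, half)
        x1, x2 = x[..., :half], x[..., half:]
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

    class DecoupledCrossAttention(nn.Module):
        """Hypothetical sketch: frozen text cross-attention plus a trainable
        branch for a time-varying condition (e.g. melody), with RoPE applied
        only on the new branch. Single-head for brevity."""

        def __init__(self, dim: int, scale: float = 1.0):
            super().__init__()
            self.scale = scale
            # Projections inherited from the pre-trained model (kept frozen).
            self.to_q = nn.Linear(dim, dim, bias=False)
            self.to_k_text = nn.Linear(dim, dim, bias=False)
            self.to_v_text = nn.Linear(dim, dim, bias=False)
            # New, trainable projections for the time-varying condition;
            # these are the only parameters the fine-tuning would update.
            self.to_k_cond = nn.Linear(dim, dim, bias=False)
            self.to_v_cond = nn.Linear(dim, dim, bias=False)
            self.to_out = nn.Linear(dim, dim, bias=False)

        def forward(self, x, text_tokens, cond_tokens):
            q = self.to_q(x)
            # Original text branch: no positional embedding, as in the base model.
            text_out = F.scaled_dot_product_attention(
                q, self.to_k_text(text_tokens), self.to_v_text(text_tokens))
            # Decoupled condition branch: RoPE on queries and keys, so the
            # attention respects where in time each condition token sits.
            cond_out = F.scaled_dot_product_attention(
                rotary_embed(q),
                rotary_embed(self.to_k_cond(cond_tokens)),
                self.to_v_cond(cond_tokens))
            return self.to_out(text_out + self.scale * cond_out)

    # Usage with made-up shapes: latent audio tokens attend to a text
    # condition and to a time-aligned melody condition.
    attn = DecoupledCrossAttention(dim=64)
    x = torch.randn(2, 128, 64)          # latent audio tokens
    text = torch.randn(2, 32, 64)        # text-condition tokens
    melody = torch.randn(2, 128, 64)     # time-varying melody condition
    print(attn(x, text, melody).shape)   # torch.Size([2, 128, 64])

Because only the condition-branch projections are trained while the rest of the diffusion Transformer stays frozen, a scheme of this shape keeps the trainable parameter count small, consistent with the lightweight fine-tuning the abstract describes.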