MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners
June 23, 2025
Authors: Fang-Duo Tsai, Shih-Lun Wu, Weijaw Lee, Sheng-Ping Yang, Bo-Rui Chen, Hao-Chung Cheng, Yi-Hsuan Yang
cs.AI
Abstract
We propose MuseControlLite, a lightweight mechanism designed to fine-tune
text-to-music generation models for precise conditioning using various
time-varying musical attributes and reference audio signals. The key finding is
that positional embeddings, which have seldom been used in the conditioners of
text-to-music generation models for text conditions, are critical when the
condition of interest is a function of time. Using melody control as an
example, our experiments show that simply adding rotary positional embeddings
to the decoupled cross-attention layers increases control accuracy from 56.6%
to 61.1%, while requiring 6.75 times fewer trainable parameters than
state-of-the-art fine-tuning mechanisms, using the same pre-trained diffusion
Transformer model as Stable Audio Open. We evaluate various forms of musical
attribute control, audio inpainting, and audio outpainting, demonstrating
improved controllability over MusicGen-Large and Stable Audio Open ControlNet
at a significantly lower fine-tuning cost, with only 85M trainable parameters.
Source code, model checkpoints, and demo examples are available at:
https://musecontrollite.github.io/web/.
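
To make the core idea concrete, below is a minimal, single-head PyTorch sketch of what adding rotary positional embeddings to a decoupled cross-attention layer could look like. This is an illustrative reconstruction under stated assumptions, not the authors' released implementation: the module names, the single-head simplification, the scale factor, and the tensor dimensions are all assumptions; only the high-level mechanism (a frozen text branch plus a new trainable condition branch whose queries and keys receive RoPE) follows the abstract.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def rotary_embed(x: torch.Tensor) -> torch.Tensor:
        """Apply rotary positional embeddings (RoPE) along the sequence axis.

        x: (batch, seq_len, dim) with even dim. Channel pairs are rotated by a
        position-dependent angle, making attention scores position-aware.
        """
        b, n, d = x.shape
        half = d // 2
        # Standard RoPE frequencies: theta_i = 10000^(-i/half).
        freqs = 10000 ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)
        angles = torch.arange(n, dtype=x.dtype, device=x.device)[:, None] * freqs[None, :]
        cos, sin = angles.cos(), angles.sin()          # each (n, half)
        x1, x2 = x[..., :half], x[..., half:]
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

    class DecoupledCrossAttention(nn.Module):
        """Hypothetical sketch: frozen text cross-attention plus a trainable
        branch for a time-varying condition (e.g. melody), with RoPE applied
        only on the new branch. Single-head for brevity."""

        def __init__(self, dim: int, scale: float = 1.0):
            super().__init__()
            self.scale = scale
            # Projections inherited from the pre-trained model (kept frozen).
            self.to_q = nn.Linear(dim, dim, bias=False)
            self.to_k_text = nn.Linear(dim, dim, bias=False)
            self.to_v_text = nn.Linear(dim, dim, bias=False)
            # New, trainable projections for the time-varying condition;
            # these are the only parameters the fine-tuning would update.
            self.to_k_cond = nn.Linear(dim, dim, bias=False)
            self.to_v_cond = nn.Linear(dim, dim, bias=False)
            self.to_out = nn.Linear(dim, dim, bias=False)

        def forward(self, x, text_tokens, cond_tokens):
            q = self.to_q(x)
            # Original text branch: no positional embedding, as in the base model.
            text_out = F.scaled_dot_product_attention(
                q, self.to_k_text(text_tokens), self.to_v_text(text_tokens))
            # Decoupled condition branch: RoPE on queries and keys, so the
            # attention respects where in time each condition token sits.
            cond_out = F.scaled_dot_product_attention(
                rotary_embed(q),
                rotary_embed(self.to_k_cond(cond_tokens)),
                self.to_v_cond(cond_tokens))
            return self.to_out(text_out + self.scale * cond_out)

    # Usage with made-up shapes: latent audio tokens attend to a text
    # condition and to a time-aligned melody condition.
    attn = DecoupledCrossAttention(dim=64)
    x = torch.randn(2, 128, 64)          # latent audio tokens
    text = torch.randn(2, 32, 64)        # text-condition tokens
    melody = torch.randn(2, 128, 64)     # time-varying melody condition
    print(attn(x, text, melody).shape)   # torch.Size([2, 128, 64])

Because only the condition-branch projections are trained while the rest of the diffusion Transformer stays frozen, a scheme of this shape keeps the trainable parameter count small, consistent with the lightweight fine-tuning the abstract describes.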