MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners
June 23, 2025
Authors: Fang-Duo Tsai, Shih-Lun Wu, Weijaw Lee, Sheng-Ping Yang, Bo-Rui Chen, Hao-Chung Cheng, Yi-Hsuan Yang
cs.AI
Abstract
We propose MuseControlLite, a lightweight mechanism designed to fine-tune
text-to-music generation models for precise conditioning using various
time-varying musical attributes and reference audio signals. The key finding is
that positional embeddings, which are seldom used in the conditioners of
text-to-music generation models for text conditions, are critical when the
condition of interest is a function of time. Using melody control as an
example, our experiments show that simply adding rotary positional embeddings
to the decoupled cross-attention layers increases control accuracy from 56.6%
to 61.1%, while requiring 6.75 times fewer trainable parameters than
state-of-the-art fine-tuning mechanisms, using the same pre-trained diffusion
Transformer model of Stable Audio Open. We evaluate various forms of musical
attribute control, audio inpainting, and audio outpainting, demonstrating
improved controllability over MusicGen-Large and Stable Audio Open ControlNet
at a significantly lower fine-tuning cost, with only 85M trainable parameters.
Source code, model checkpoints, and demo examples are available at:
https://musecontrollite.github.io/web/.
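The core idea described above is that the decoupled cross-attention branch handling a time-varying condition (e.g., a melody sequence) benefits from rotary positional embeddings (RoPE), while the frozen text branch uses none. The sketch below is a minimal, single-head PyTorch illustration of that design under stated assumptions; the module names, the single-head simplification, and the half-split RoPE variant are our choices for illustration, not the authors' implementation.

```python
# Illustrative sketch (not the MuseControlLite source code): a decoupled
# cross-attention layer where only the trainable condition branch applies
# rotary positional embeddings to its queries and keys.
import torch
import torch.nn as nn
import torch.nn.functional as F


def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs of x (batch, seq, dim) by position-dependent angles."""
    _, n, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32, device=x.device) / half)
    angles = torch.arange(n, dtype=torch.float32, device=x.device)[:, None] * freqs
    cos, sin = angles.cos(), angles.sin()          # each (n, half)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class DecoupledCrossAttentionWithRoPE(nn.Module):
    """Frozen text cross-attention plus a trainable branch for a time-varying condition."""

    def __init__(self, dim: int, cond_dim: int, scale: float = 1.0):
        super().__init__()
        # Projections inherited from the pre-trained model (kept frozen).
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k_text = nn.Linear(dim, dim, bias=False)
        self.to_v_text = nn.Linear(dim, dim, bias=False)
        for p in [*self.to_q.parameters(), *self.to_k_text.parameters(),
                  *self.to_v_text.parameters()]:
            p.requires_grad_(False)
        # New, trainable projections for the musical-attribute condition.
        self.to_k_cond = nn.Linear(cond_dim, dim, bias=False)
        self.to_v_cond = nn.Linear(cond_dim, dim, bias=False)
        self.scale = scale

    def forward(self, x, text_emb, cond_emb):
        q = self.to_q(x)
        # Text branch: standard cross-attention, no positional embedding.
        text_out = F.scaled_dot_product_attention(
            q, self.to_k_text(text_emb), self.to_v_text(text_emb))
        # Condition branch: RoPE on queries and keys, since the condition
        # is a function of time and temporal alignment matters.
        cond_out = F.scaled_dot_product_attention(
            apply_rope(q),
            apply_rope(self.to_k_cond(cond_emb)),
            self.to_v_cond(cond_emb))
        return text_out + self.scale * cond_out


# Quick shape check with dummy tensors.
layer = DecoupledCrossAttentionWithRoPE(dim=64, cond_dim=32)
x = torch.randn(2, 100, 64)          # latent audio tokens
text = torch.randn(2, 20, 64)        # text condition embeddings
melody = torch.randn(2, 100, 32)     # time-varying melody condition
print(layer(x, text, melody).shape)  # torch.Size([2, 100, 64])
```

Because only the condition-branch projections are trainable and the pre-trained text attention stays frozen, this style of adapter keeps the fine-tuning parameter count small, consistent with the 85M trainable parameters reported in the abstract.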