MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation

July 21, 2024
Authors: Yun-Han Lan, Wen-Yi Hsiao, Hao-Chung Cheng, Yi-Hsuan Yang
cs.AI

Abstract

Existing text-to-music models can produce high-quality audio with great diversity. However, textual prompts alone cannot precisely control temporal musical features such as the chords and rhythm of the generated music. To address this challenge, we introduce MusiConGen, a temporally conditioned Transformer-based text-to-music model that builds upon the pretrained MusicGen framework. Our innovation lies in an efficient finetuning mechanism, tailored for consumer-grade GPUs, that integrates automatically extracted rhythm and chords as the condition signal. During inference, the condition can be either musical features extracted from a reference audio signal or a user-defined symbolic chord sequence, BPM, and textual prompt. Our performance evaluation on two datasets, one derived from extracted features and the other from user-created inputs, demonstrates that MusiConGen can generate realistic backing-track music that aligns well with the specified conditions. We open-source the code and model checkpoints and provide audio examples online at https://musicongen.github.io/musicongen_demo/.
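
To make the inference-time conditioning concrete, the sketch below shows how a user-defined symbolic chord sequence and BPM could be expanded into a frame-aligned label sequence of the kind a MusicGen-style decoder consumes. This is a minimal illustration only: the 50 Hz frame rate (matching EnCodec's token rate in MusicGen), the one-chord-per-bar convention, the 4/4 meter, and the function `chords_to_frames` are assumptions for exposition, not the paper's exact conditioning format.

```python
# Illustrative sketch: expand symbolic conditions (chords + BPM) into a
# per-frame chord label sequence. All format choices here are assumptions.

FRAME_RATE = 50  # frames per second; assumed to match the codec token rate

def chords_to_frames(chords: list[str], bpm: float,
                     beats_per_bar: int = 4) -> list[str]:
    """Expand a one-chord-per-bar progression into per-frame chord labels."""
    seconds_per_bar = beats_per_bar * 60.0 / bpm
    frames_per_bar = round(seconds_per_bar * FRAME_RATE)
    frame_labels: list[str] = []
    for chord in chords:
        frame_labels.extend([chord] * frames_per_bar)
    return frame_labels

# Example: a four-bar progression at 120 BPM.
frames = chords_to_frames(["C:maj", "A:min", "F:maj", "G:maj"], bpm=120)
print(len(frames))             # 400 frames -> 8 seconds at 50 Hz
print(frames[0], frames[-1])   # 'C:maj' 'G:maj'
```

A frame-level sequence like this, paired with a BPM-derived rhythm signal and the textual prompt, matches the kind of time-varying condition the abstract describes; the model's released code defines the actual interface.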
