
FLUX that Plays Music

September 1, 2024
Authors: Zhengcong Fei, Mingyuan Fan, Changqian Yu, Junshi Huang
cs.AI

Abstract

This paper explores a simple extension of diffusion-based rectified flow Transformers for text-to-music generation, termed FluxMusic. Generally, following the design of the advanced Flux model (https://github.com/black-forest-labs/flux), we transfer it into a latent VAE space of the mel-spectrogram. The model first applies a sequence of independent attention layers to the double text-music stream, followed by a stacked single music stream for denoised patch prediction. We employ multiple pre-trained text encoders to sufficiently capture caption semantics and to allow inference flexibility. In between, coarse textual information, in conjunction with timestep embeddings, is used in a modulation mechanism, while fine-grained textual details are concatenated with the music patch sequence as inputs. Through an in-depth study, we demonstrate that rectified flow training with an optimized architecture significantly outperforms established diffusion methods for the text-to-music task, as evidenced by various automatic metrics and human preference evaluations. Our experimental data, code, and model weights are made publicly available at: https://github.com/feizc/FluxMusic.
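The abstract describes a two-path conditioning scheme: coarse text features plus timestep embeddings drive a modulation (scale/shift) of the music patch sequence, while fine-grained text tokens are concatenated with the patches as stream input. The following is a minimal NumPy sketch of that idea under stated assumptions; all dimensions, weights, and function names are illustrative, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-6):
    # Normalize each patch vector over its feature dimension.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def modulate(patches, coarse_text, t_emb, W):
    # Coarse conditioning: combine caption-level text features with the
    # timestep embedding, then project to a per-channel scale and shift
    # (an AdaLN-style modulation; W is a hypothetical learned projection).
    cond = coarse_text + t_emb              # (d,)
    scale, shift = np.split(cond @ W, 2)    # each (d,)
    return layer_norm(patches) * (1 + scale) + shift

d = 8                                       # illustrative feature width
n_patches, n_fine = 16, 4                   # music patches, fine text tokens
patches = rng.standard_normal((n_patches, d))
coarse = rng.standard_normal(d)             # pooled caption embedding
t_emb = rng.standard_normal(d)              # timestep embedding
W = rng.standard_normal((d, 2 * d)) * 0.02

modulated = modulate(patches, coarse, t_emb, W)

# Fine-grained text tokens are concatenated with the music patch
# sequence to form the input of the attention stream.
fine_tokens = rng.standard_normal((n_fine, d))
stream_input = np.concatenate([fine_tokens, modulated], axis=0)
print(stream_input.shape)  # (20, 8)
```

In the actual architecture these operations would sit inside each Transformer block of the double and single streams; the sketch only shows how the two kinds of text conditioning enter the computation differently.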

