FLUX that Plays Music
September 1, 2024
Authors: Zhengcong Fei, Mingyuan Fan, Changqian Yu, Junshi Huang
cs.AI
Abstract
This paper explores a simple extension of diffusion-based rectified flow Transformers for text-to-music generation, termed FluxMusic. Generally, building on the design of the advanced Flux model (https://github.com/black-forest-labs/flux), we transfer it into a latent VAE space of the mel-spectrogram. The model first applies a sequence of independent attention operations to the double text-music stream, followed by a stack of single music-stream blocks for denoised patch prediction. We employ multiple pre-trained text encoders to sufficiently capture caption semantics and to allow flexibility at inference. Within the network, coarse textual information, together with timestep embeddings, drives a modulation mechanism, while fine-grained textual details are concatenated with the music patch sequence as input. Through an in-depth study, we demonstrate that rectified flow training with an optimized architecture significantly outperforms established diffusion methods for the text-to-music task, as evidenced by various automatic metrics and human preference evaluations. Our experimental data, code, and model weights are publicly available at: https://github.com/feizc/FluxMusic.
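
To make the described layout concrete, below is a minimal PyTorch sketch of the main ideas: coarse text and timestep embeddings drive an AdaLN-style modulation signal, fine-grained text tokens are joined with mel-latent music patches, a stack of double-stream blocks (separate text and music weights) is followed by single-stream blocks, and an output head predicts the denoised music patches. All class names, layer sizes, and the toy rectified-flow loss at the end are illustrative assumptions, not the authors' actual implementation; see the linked repository for the real code.

```python
# Minimal sketch only; names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class DoubleStreamBlock(nn.Module):
    """Joint attention over text + music tokens, with separate MLPs per stream
    and an AdaLN-style scale/shift driven by the conditioning vector."""
    def __init__(self, dim, heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.music_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.mod = nn.Linear(dim, 2 * dim)  # conditioning -> (scale, shift)

    def forward(self, text, music, cond):
        scale, shift = self.mod(cond).unsqueeze(1).chunk(2, dim=-1)
        x = torch.cat([text, music], dim=1) * (1 + scale) + shift
        x = x + self.attn(x, x, x, need_weights=False)[0]
        t_tok, m_tok = x[:, :text.size(1)], x[:, text.size(1):]
        return t_tok + self.text_mlp(t_tok), m_tok + self.music_mlp(m_tok)


class SingleStreamBlock(nn.Module):
    """Shared attention + MLP over the fused text-music sequence."""
    def __init__(self, dim, heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        x = x + self.attn(x, x, x, need_weights=False)[0]
        return x + self.mlp(x)


class FluxMusicSketch(nn.Module):
    def __init__(self, dim=512, heads=8, n_double=4, n_single=8,
                 patch_dim=64, text_dim=512, time_dim=256):
        super().__init__()
        self.patch_in = nn.Linear(patch_dim, dim)   # mel-latent patches -> tokens
        self.text_in = nn.Linear(text_dim, dim)     # fine-grained text tokens (e.g. a T5-style encoder)
        self.cond_in = nn.Linear(text_dim, dim)     # coarse/pooled text embedding (e.g. a CLIP-style encoder)
        self.time_in = nn.Linear(time_dim, dim)     # timestep features (e.g. sinusoidal embedding)
        self.double = nn.ModuleList([DoubleStreamBlock(dim, heads) for _ in range(n_double)])
        self.single = nn.ModuleList([SingleStreamBlock(dim, heads) for _ in range(n_single)])
        self.patch_out = nn.Linear(dim, patch_dim)  # denoised-patch prediction head

    def forward(self, music_patches, text_tokens, coarse_text, time_feat):
        cond = self.cond_in(coarse_text) + self.time_in(time_feat)   # modulation signal
        text, music = self.text_in(text_tokens), self.patch_in(music_patches)
        for blk in self.double:                                      # double text-music stream
            text, music = blk(text, music, cond)
        x = torch.cat([text, music], dim=1)
        for blk in self.single:                                      # stacked single music stream
            x = blk(x)
        return self.patch_out(x[:, text.size(1):])                   # keep only music tokens


if __name__ == "__main__":
    # Toy rectified-flow style step: interpolate noise and data, regress the velocity x1 - x0.
    model = FluxMusicSketch()
    B, L_m, L_t = 2, 32, 16
    x1 = torch.randn(B, L_m, 64)        # "clean" mel-latent patches (random stand-in)
    x0 = torch.randn_like(x1)           # Gaussian noise
    t = torch.rand(B, 1, 1)
    xt = (1 - t) * x0 + t * x1          # linear interpolation path
    text_tokens = torch.randn(B, L_t, 512)
    coarse_text = torch.randn(B, 512)
    time_feat = torch.randn(B, 256)     # stand-in for an embedding of t
    v_pred = model(xt, text_tokens, coarse_text, time_feat)
    loss = ((v_pred - (x1 - x0)) ** 2).mean()
    print("toy rectified-flow loss:", loss.item())
```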