FLUX that Plays Music

Abstract

This paper explores a simple extension of diffusion-based rectified flow Transformers for text-to-music generation, termed FluxMusic. Building on the design of the advanced Flux model (https://github.com/black-forest-labs/flux), we transfer it into a latent VAE space of the mel-spectrum. The model first applies a sequence of independent attention operations to the double text-music stream, followed by a stack of single music-stream blocks for denoised patch prediction. We employ multiple pre-trained text encoders to capture the semantic information in captions sufficiently and to provide flexibility at inference time. Coarse textual information, in conjunction with time step embeddings, is used in a modulation mechanism, while fine-grained textual details are concatenated with the music patch sequence as inputs. Through an in-depth study, we demonstrate that rectified flow training with an optimized architecture significantly outperforms established diffusion methods on the text-to-music task, as evidenced by various automatic metrics and human preference evaluations. Our experimental data, code, and model weights are publicly available at: https://github.com/feizc/FluxMusic.
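
To make the rectified flow training mentioned in the abstract more concrete, the sketch below shows one common formulation in plain Python. It is an illustration under stated assumptions, not the FluxMusic implementation: the function names are hypothetical, the time convention (t = 0 is data, t = 1 is noise) is assumed, and the uniform sampling of t is a simplification that may differ from the paper's actual schedule. A real model would replace the oracle velocity with a Transformer operating on latent mel-spectrum patches.

```python
import random


def rectified_flow_pair(x0, noise, t):
    """Build one training pair on the straight line between data and noise.

    Assumed convention: t = 0 gives the data sample, t = 1 gives pure noise.
    Returns the interpolated point x_t and the constant target velocity.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]
    velocity = [b - a for a, b in zip(x0, noise)]
    return x_t, velocity


def mse(pred, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def euler_sample(velocity_fn, noise, steps=10):
    """Integrate dx/dt = v(x, t) from t = 1 (noise) back to t = 0 (data)."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v = velocity_fn(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical toy "latent" standing in for a flattened mel-spectrum patch.
x0 = [1.0, -2.0, 0.5]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in x0]
t = rng.random()  # uniform time sampling, assumed for simplicity

x_t, target = rectified_flow_pair(x0, noise, t)

# A perfect model would predict exactly the target velocity, giving zero loss.
loss = mse(target, target)

# With that oracle velocity, Euler integration recovers the data sample.
recovered = euler_sample(lambda x, t: target, noise, steps=10)
```

Because the target velocity is constant along each straight path, a well-trained model can in principle be sampled with few integration steps, which is one motivation usually given for rectified flow over curved diffusion trajectories.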
