FLUX that Plays Music

Abstract

This paper explores a simple extension of diffusion-based rectified flow Transformers to text-to-music generation, termed FluxMusic. Building on the design of the advanced Flux model (https://github.com/black-forest-labs/flux), we transfer it into a latent VAE space of the mel-spectrum. The model first applies a sequence of independent attention operations to the double text-music stream, followed by a stacked single music stream for denoised patch prediction. We employ multiple pre-trained text encoders to capture the semantic information of captions and to allow flexibility at inference. Coarse textual information, in conjunction with time step embeddings, is used in a modulation mechanism, while fine-grained textual details are concatenated with the music patch sequence as inputs. Through an in-depth study, we demonstrate that rectified flow training with an optimized architecture significantly outperforms established diffusion methods on the text-to-music task, as evidenced by various automatic metrics and human preference evaluations. Our experimental data, code, and model weights are publicly available at https://github.com/feizc/FluxMusic.
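As a rough illustration of what rectified flow training means, the sketch below builds one training pair and an MSE loss using the common rectified flow formulation (a straight-line interpolation between a clean latent and Gaussian noise, with the constant velocity as the regression target). The abstract does not state the exact formulation, time sampling, or loss weighting used by FluxMusic, so everything here is an assumption: the function names `rectified_flow_pair` and `rectified_flow_loss`, the uniform sampling of `t`, and the `model(x_t, t)` call signature are hypothetical, and text conditioning is omitted.

```python
import numpy as np


def rectified_flow_pair(x0, noise, t):
    """Build one rectified-flow training pair (assumed standard formulation).

    x0:    clean latents, shape (batch, ...)
    noise: Gaussian noise with the same shape as x0
    t:     per-sample times in [0, 1], shape (batch,)
    """
    # Reshape t so it broadcasts over all non-batch axes.
    t = t.reshape(-1, *([1] * (x0.ndim - 1)))
    # Straight-line path from data (t = 0) to noise (t = 1).
    x_t = (1.0 - t) * x0 + t * noise
    # The velocity along that path is constant in t.
    velocity = noise - x0
    return x_t, velocity


def rectified_flow_loss(model, x0, rng):
    """Mean squared error between the predicted and the target velocity.

    `model` is a hypothetical callable taking (x_t, t); the real network
    would also receive text conditioning, which is left out here.
    """
    noise = rng.standard_normal(x0.shape)
    t = rng.uniform(0.0, 1.0, size=x0.shape[0])  # assumed: uniform time sampling
    x_t, target = rectified_flow_pair(x0, noise, t)
    pred = model(x_t, t)
    return float(np.mean((pred - target) ** 2))
```

For example, `rectified_flow_loss(lambda x_t, t: np.zeros_like(x_t), latents, np.random.default_rng(0))` evaluates the loss for a placeholder model that always predicts zero velocity; a trained network would take its place.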
