FLUX that Plays Music

Abstract

This paper explores a simple extension of diffusion-based rectified flow Transformers for text-to-music generation, termed FluxMusic. Following the design of the advanced Flux model (https://github.com/black-forest-labs/flux), we transfer it into a latent VAE space of mel-spectrum. The model first applies a sequence of independent attention to the double text-music stream, followed by a stacked single music stream for denoised patch prediction. We employ multiple pre-trained text encoders to capture sufficient semantic information from the captions and to allow flexibility at inference. Coarse textual information, in conjunction with time step embeddings, is used in a modulation mechanism, while fine-grained textual details are concatenated with the music patch sequence as inputs. Through an in-depth study, we demonstrate that rectified flow training with an optimized architecture significantly outperforms established diffusion methods for the text-to-music task, as evidenced by various automatic metrics and human preference evaluations. Our experimental data, code, and model weights are publicly available at: https://github.com/feizc/FluxMusic.
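As a rough illustration of what rectified flow training involves, the sketch below assumes the common linear-interpolation formulation, in which a noisy sample lies on a straight line between a data latent and Gaussian noise and the network regresses the constant velocity along that line. The abstract does not spell out the exact formulation used in FluxMusic, so the function names (interpolate, velocity_target, euler_sample) and the toy three-value "latent patch" are hypothetical and only meant to make the idea concrete.

```python
import random

def interpolate(x0, noise, t):
    """Point on the straight path from data (t=0) to noise (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]

def velocity_target(x0, noise):
    """Velocity of that straight path, d x_t / dt = noise - x0 (assumed target)."""
    return [b - a for a, b in zip(x0, noise)]

def mse(pred, target):
    """Mean squared error, the regression loss a velocity model would minimize."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

def euler_sample(velocity_fn, noise, steps=10):
    """Integrate from t=1 (noise) back to t=0 (data) with fixed Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v = velocity_fn(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x

# Toy stand-in for one latent music patch (hypothetical values).
x0 = [0.5, -1.0, 2.0]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in x0]

x_half = interpolate(x0, noise, 0.5)   # training input at t = 0.5
target = velocity_target(x0, noise)    # what the model should predict there

# An oracle that always returns the true velocity stands in for a trained model.
recovered = euler_sample(lambda x, t: target, noise, steps=8)
```

In this toy setting the oracle velocity makes the Euler integration return to the original latent up to floating-point error. A trained Transformer would replace the oracle and would additionally be conditioned on the text embeddings and the time step.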
