Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation

Abstract

Recent progress in text-to-music generation has enabled models to synthesize high-quality musical segments and full compositions, and even to respond to fine-grained control signals such as chord progressions. State-of-the-art (SOTA) systems differ significantly across many dimensions, including training datasets, modeling paradigms, and architectural choices. This diversity makes it hard to evaluate models fairly and to pinpoint which design choices most influence performance. While factors like data and architecture are important, in this study we focus exclusively on the modeling paradigm. We conduct a systematic empirical analysis to isolate its effects, offering insights into the associated trade-offs and emergent behaviors that can guide future text-to-music generation systems. Specifically, we compare what are arguably the two most common modeling paradigms: Auto-Regressive decoding and Conditional Flow-Matching. We conduct a controlled comparison by training all models from scratch using identical datasets, identical training configurations, and similar backbone architectures. Performance is evaluated across multiple axes, including generation quality, robustness to inference configurations, scalability, adherence to both textual and temporally aligned conditioning, and editing capabilities in the form of audio inpainting. This comparative study sheds light on the distinct strengths and limitations of each paradigm, providing actionable insights that can inform future architectural and training decisions in the evolving landscape of text-to-music generation. Audio samples are available at: https://huggingface.co/spaces/ortal1602/ARvsFM
