Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation
June 10, 2025
Authors: Or Tal, Felix Kreuk, Yossi Adi
cs.AI
Abstract
Recent progress in text-to-music generation has enabled models to synthesize
high-quality musical segments, full compositions, and even respond to
fine-grained control signals, e.g. chord progressions. State-of-the-art (SOTA)
systems differ significantly across many dimensions, such as training datasets,
modeling paradigms, and architectural choices. This diversity complicates
efforts to evaluate models fairly and pinpoint which design choices most
influence performance. While factors like data and architecture are important,
in this study we focus exclusively on the modeling paradigm. We conduct a
systematic empirical analysis to isolate its effects, offering insights into
associated trade-offs and emergent behaviors that can guide future
text-to-music generation systems. Specifically, we compare the two arguably
most common modeling paradigms: Auto-Regressive decoding and Conditional
Flow-Matching. We conduct a controlled comparison by training all models from
scratch using identical datasets, training configurations, and similar backbone
architectures. Performance is evaluated across multiple axes, including
generation quality, robustness to inference configurations, scalability,
adherence to both textual and temporally aligned conditioning, and editing
capabilities in the form of audio inpainting. This comparative study sheds
light on distinct strengths and limitations of each paradigm, providing
actionable insights that can inform future architectural and training decisions
in the evolving landscape of text-to-music generation. Audio samples are
available at: https://huggingface.co/spaces/ortal1602/ARvsFM
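For readers unfamiliar with the two paradigms, the following is a minimal sketch of their standard training objectives. It is illustrative only: it assumes a discrete-token (e.g., neural audio codec) representation for the auto-regressive model and a linear-interpolation probability path for conditional flow-matching, which may differ in detail from the exact setup used in the paper.

Auto-regressive decoding factorizes the token sequence x_{1:T} given the text condition c:

    p_\theta(x_{1:T} \mid c) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}, c)

Conditional flow-matching instead regresses a velocity field along the path x_\tau = (1-\tau)\,x_0 + \tau\,x_1, with x_0 \sim \mathcal{N}(0, I) noise and x_1 the data representation:

    \mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{\tau,\, x_0,\, x_1}\,\big\| v_\theta(x_\tau, \tau, c) - (x_1 - x_0) \big\|^2

At inference time, the first paradigm samples tokens one step at a time, while the second integrates the learned velocity field over \tau from 0 to 1 with a chosen ODE solver and step count.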