Auto-Regressive vs Flow-Matching: A Comparative Study of Modeling Paradigms for Text-to-Music Generation

Abstract

Recent progress in text-to-music generation has enabled models to synthesize high-quality musical segments and full compositions, and even to respond to fine-grained control signals such as chord progressions. State-of-the-art (SOTA) systems differ significantly across many dimensions, including training datasets, modeling paradigms, and architectural choices. This diversity makes it difficult to evaluate models fairly and to pinpoint which design choices most influence performance. While factors like data and architecture are important, in this study we focus exclusively on the modeling paradigm. We conduct a systematic empirical analysis to isolate its effects, offering insights into the associated trade-offs and emergent behaviors that can guide the design of future text-to-music generation systems. Specifically, we compare what are arguably the two most common modeling paradigms: Auto-Regressive decoding and Conditional Flow-Matching. We conduct a controlled comparison by training all models from scratch using identical datasets, identical training configurations, and similar backbone architectures. Performance is evaluated along multiple axes: generation quality, robustness to inference configurations, scalability, adherence to both textual and temporally aligned conditioning, and editing capabilities in the form of audio inpainting. This comparative study sheds light on the distinct strengths and limitations of each paradigm, providing actionable insights that can inform future architectural and training decisions in the evolving landscape of text-to-music generation. Audio samples are available at: https://huggingface.co/spaces/ortal1602/ARvsFM
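To make the two paradigms concrete, the following is a minimal sketch of the training objectives usually associated with them, not the training code used in this study. It assumes that Auto-Regressive decoding is trained with next-token cross-entropy over discrete audio tokens and that Flow-Matching regresses a velocity field along a straight path from Gaussian noise to data. The function names `ar_loss` and `cfm_loss`, the toy list-based inputs, and the omission of text conditioning are all illustrative assumptions.

```python
import math
import random


def ar_loss(logits, tokens):
    """Mean next-token cross-entropy.

    Assumption: logits[t] scores the token at position t + 1, so the
    last row of logits has no target and is ignored.
    """
    total = 0.0
    for step_logits, target in zip(logits[:-1], tokens[1:]):
        # Numerically stable log-sum-exp over the vocabulary.
        m = max(step_logits)
        log_z = m + math.log(sum(math.exp(v - m) for v in step_logits))
        total += log_z - step_logits[target]
    return total / (len(tokens) - 1)


def cfm_loss(velocity_fn, x1, rng):
    """Single-sample flow-matching loss on a linear noise-to-data path.

    Assumption: x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, I), so the
    regression target is the constant velocity x1 - x0. A real model
    would also receive the text condition; it is omitted here.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    t = rng.random()
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    pred = velocity_fn(xt, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(x1)


if __name__ == "__main__":
    # Uniform logits over a 4-token vocabulary (toy values).
    uniform_logits = [[0.0] * 4 for _ in range(5)]
    print(ar_loss(uniform_logits, [0, 1, 2, 3, 0]))

    # A placeholder "model" that always predicts zero velocity.
    rng = random.Random(0)
    print(cfm_loss(lambda xt, t: [0.0] * len(xt), [1.0, -1.0, 0.5], rng))
```

The contrast the sketch is meant to show is structural: the Auto-Regressive loss is a per-position classification problem over a token sequence, while the Flow-Matching loss is a regression problem over a whole continuous sample at a randomly drawn time `t`.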