Stemphonic: All-at-once Flexible Multi-stem Music Generation
February 10, 2026
Authors: Shih-Lun Wu, Ge Zhu, Juan-Pablo Caceres, Cheng-Zhi Anna Huang, Nicholas J. Bryan
cs.AI
Abstract
Music stem generation, the task of producing musically synchronized and isolated instrument audio clips, offers the potential for greater user control and better alignment with musician workflows compared to conventional text-to-music models. Existing stem generation approaches, however, either rely on fixed architectures that output a predefined set of stems in parallel, or generate only one stem at a time, resulting in slow inference despite flexibility in stem combination. We propose Stemphonic, a diffusion-/flow-based framework that overcomes this trade-off and generates a variable set of synchronized stems in one inference pass. During training, we treat each stem as a batch element, group synchronized stems within a batch, and apply a shared noise latent to each group. At inference time, we use a shared initial noise latent and stem-specific text inputs to generate synchronized multi-stem outputs in one pass. We further extend our approach to enable one-pass conditional multi-stem generation and stem-wise activity controls, empowering users to iteratively generate and orchestrate the temporal layering of a mix. We benchmark our results on multiple open-source stem evaluation sets and show that Stemphonic produces higher-quality outputs while accelerating full-mix generation by 25% to 50%. Demos at: https://stemphonic-demo.vercel.app.
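The core trick described in the abstract, sharing one initial noise latent across a batch of stems while conditioning each batch element on its own text prompt, can be sketched in a few lines of PyTorch. The sketch below is a hypothetical illustration only: `velocity_model`, `encode_text`, the latent shape, and the plain Euler sampler are assumptions for clarity, not the paper's actual interfaces or solver.

```python
import torch

def generate_stems(velocity_model, encode_text, stem_prompts,
                   latent_shape=(64, 1024), num_steps=50, device="cpu"):
    """Sketch of one-pass multi-stem generation with a shared noise latent.

    Each stem is one batch element; synchronization comes from sharing a
    single initial noise latent across the group, while the text
    conditioning differs per stem. (Names here are illustrative stand-ins.)
    """
    n = len(stem_prompts)
    # One noise latent, broadcast to every stem in the group.
    shared_noise = torch.randn(1, *latent_shape, device=device)
    x = shared_noise.repeat(n, 1, 1)          # (n, C, T), identical start
    text_emb = encode_text(stem_prompts)      # (n, D), stem-specific

    # Simple Euler integration of the learned flow from noise (t=0)
    # toward data (t=1); the paper's actual sampler may differ.
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = torch.full((n,), step * dt, device=device)
        v = velocity_model(x, t, text_emb)    # predicted velocity field
        x = x + dt * v
    return x  # (n, C, T) latents, one per stem, to be decoded to audio
```

Under this reading, the same call covers any stem combination the user requests, e.g. `generate_stems(model, encoder, ["punchy drums", "warm bass", "lead guitar"])`, which is what makes the number of stems variable at inference time without changing the architecture.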