Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators

May 21, 2026
Authors: Zachary Novack, Stephen Brade, Haven Kim, Hugo Flores García, Nithya Shikarpur, Chinmay Talegaonkar, Suwan Kim, Valerie K. Chen, Julian McAuley, Taylor Berg-Kirkpatrick, Cheng-Zhi Anna Huang
cs.AI

Abstract

Interactive streaming music generation promises the use of generative models for live performance and co-creation that is impossible with offline models. However, SOTA models exist in the discrete-AR regime, requiring industrial levels of compute for both training and inference. In this work, we investigate whether audio diffusion models, with their wide support in the open-source community but non-streaming bidirectional nature, can be repurposed efficiently into interactive models accessible on consumer hardware. By taking a critical look at the modern pipeline for block-wise outpainting diffusion, we identify critical inefficiencies during inference that result in strictly worse computational efficiency than their discrete-AR counterparts. We propose Live Music Diffusion Models (LMDMs), a simple modification of the generative diffusion process that recovers, and then outperforms, the inference complexity of the discrete Live Music Models (LMMs) through block-wise KV Caching. Unlike LMMs, LMDMs further enable stable post-training alignment through our novel ARC-Forcing paradigm, reducing error accumulation without any explicit RL or reward models. We demonstrate the application of LMDMs in a number of creative domains, including text-conditioned generation, sketch-based music synthesis, and jamming. We finally show how LMDMs can be used as a generative instrument in a real artist-AI collaboration, utilizing LMDMs as a "generative delay" to transform musicians' improvisation live for variable timbral effects while running locally on a consumer gaming laptop.