Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators

Abstract

Interactive streaming music generation promises to bring generative models into live performance and co-creation, which offline models cannot support. However, state-of-the-art models belong to the discrete autoregressive (AR) regime and require industrial-scale compute for both training and inference. In this work, we investigate whether audio diffusion models, which are widely supported in the open-source community but are non-streaming and bidirectional by nature, can be efficiently repurposed into interactive models that run on consumer hardware. By critically examining the modern pipeline for block-wise outpainting diffusion, we identify key inefficiencies during inference that make it strictly less computationally efficient than its discrete-AR counterparts. We propose Live Music Diffusion Models (LMDMs), a simple modification of the generative diffusion process that first matches, and then improves on, the inference complexity of discrete Live Music Models (LMMs) through block-wise KV caching. Unlike LMMs, LMDMs also enable stable post-training alignment through our novel ARC-Forcing paradigm, which reduces error accumulation without any explicit reinforcement learning (RL) or reward models. We demonstrate LMDMs in several creative domains, including text-conditioned generation, sketch-based music synthesis, and jamming. Finally, we show how LMDMs can serve as a generative instrument in a real artist-AI collaboration, acting as a "generative delay" that transforms musicians' improvisation live for variable timbral effects while running locally on a consumer gaming laptop.
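The abstract names block-wise KV caching without describing its mechanics, so the following is a minimal illustrative sketch of the general idea and not the LMDM implementation. It assumes a single attention layer in NumPy with block-causal masking: tokens attend bidirectionally within their own block and causally to all earlier blocks. All function names, shapes, and the weight matrices `wq`, `wk`, `wv` are hypothetical. The sketch compares a reference that recomputes attention over the whole sequence against a streaming version that appends each block's keys and values to a cache and computes queries only for the newest block.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def full_block_causal(x, wq, wk, wv, block):
    """Reference: recompute attention over the whole sequence with a block-causal mask."""
    n = x.shape[0]
    q, k, v = x @ wq, x @ wk, x @ wv
    blk = np.arange(n) // block              # block index of each position
    mask = blk[:, None] >= blk[None, :]      # may attend to own block and earlier blocks
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v


def cached_block_causal(x, wq, wk, wv, block):
    """Streaming: keep keys/values of finished blocks, compute queries only for the new block."""
    d = wk.shape[1]
    k_cache = np.zeros((0, d))
    v_cache = np.zeros((0, d))
    outputs = []
    for start in range(0, x.shape[0], block):
        xb = x[start:start + block]                      # the newly arrived block
        k_cache = np.concatenate([k_cache, xb @ wk])     # append this block's keys
        v_cache = np.concatenate([v_cache, xb @ wv])     # append this block's values
        scores = (xb @ wq) @ k_cache.T / np.sqrt(d)
        outputs.append(softmax(scores) @ v_cache)
    return np.concatenate(outputs)
```

In this simplified setting both functions return the same output, but the cached version never recomputes keys or values for past blocks. In a diffusion model the newest block would additionally be re-evaluated over several denoising steps against the same cache; that loop is omitted here.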