Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators

May 21, 2026
Authors: Zachary Novack, Stephen Brade, Haven Kim, Hugo Flores García, Nithya Shikarpur, Chinmay Talegaonkar, Suwan Kim, Valerie K. Chen, Julian McAuley, Taylor Berg-Kirkpatrick, Cheng-Zhi Anna Huang
cs.AI

Abstract

Interactive streaming music generation promises to bring generative models to live performance and co-creation, which offline models cannot support. However, state-of-the-art (SOTA) models are discrete autoregressive (AR) models that require industrial levels of compute for both training and inference. In this work, we investigate whether audio diffusion models, which are widely supported in the open-source community but are non-streaming and bidirectional by nature, can be efficiently repurposed into interactive models accessible on consumer hardware. Examining the modern pipeline for block-wise outpainting diffusion, we identify critical inference-time inefficiencies that make it strictly less computationally efficient than its discrete-AR counterparts. We propose Live Music Diffusion Models (LMDMs), a simple modification of the generative diffusion process that recovers, and then improves on, the inference complexity of discrete Live Music Models (LMMs) through block-wise KV caching. Unlike LMMs, LMDMs further enable stable post-training alignment through our novel ARC-Forcing paradigm, which reduces error accumulation without any explicit reinforcement learning (RL) or reward models. We demonstrate LMDMs in several creative domains, including text-conditioned generation, sketch-based music synthesis, and jamming. Finally, we show how LMDMs can serve as a generative instrument in a real artist-AI collaboration: used as a "generative delay", they transform musicians' improvisation live for variable timbral effects while running locally on a consumer gaming laptop.
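
The abstract does not describe how block-wise KV caching is implemented, so the following is only a generic sketch of the idea, not the authors' method. It assumes a single attention layer with a block-causal mask (tokens attend to their own block and all earlier blocks), and checks that attending from each new block to cached keys and values gives the same output as recomputing attention over the whole sequence. All names here (`BlockKVCache`, `block_causal_mask`, the sizes) are hypothetical and chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v, mask=None):
    """Scaled dot-product attention; mask[i, j] is True where i may see j."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v

def block_causal_mask(num_tokens, block_len):
    """Token i may attend to token j if j's block is not later than i's."""
    block_id = np.arange(num_tokens) // block_len
    return block_id[:, None] >= block_id[None, :]

class BlockKVCache:
    """Hypothetical cache holding keys/values of already generated blocks."""
    def __init__(self, dim):
        self.k = np.zeros((0, dim))
        self.v = np.zeros((0, dim))

    def append(self, k, v):
        self.k = np.concatenate([self.k, k], axis=0)
        self.v = np.concatenate([self.v, v], axis=0)

# Toy setup (assumed sizes): 3 blocks of 4 tokens, feature dimension 4.
rng = np.random.default_rng(0)
num_blocks, block_len, dim = 3, 4, 4
x = rng.normal(size=(num_blocks * block_len, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

# Baseline: recompute attention over the full sequence with a block mask.
mask = block_causal_mask(len(x), block_len)
out_full = attend(x @ w_q, x @ w_k, x @ w_v, mask)

# Cached: process one block at a time, reusing keys/values of past blocks.
cache = BlockKVCache(dim)
outputs = []
for b in range(num_blocks):
    x_b = x[b * block_len:(b + 1) * block_len]
    k_b, v_b = x_b @ w_k, x_b @ w_v
    keys = np.concatenate([cache.k, k_b], axis=0)
    values = np.concatenate([cache.v, v_b], axis=0)
    outputs.append(attend(x_b @ w_q, keys, values))
    cache.append(k_b, v_b)  # past blocks are never re-projected
out_cached = np.concatenate(outputs, axis=0)
```

In this toy setting each block only projects its own tokens and issues queries for its own tokens, so the per-block cost grows linearly rather than quadratically with the amount of past context. How this interacts with multiple denoising steps per block is outside what the abstract specifies.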