Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators

Abstract

Interactive streaming music generation promises to bring generative models to live performance and co-creation, which is impossible with offline models. However, state-of-the-art models operate in the discrete autoregressive (AR) regime, which requires industrial-scale compute for both training and inference. In this work, we investigate whether audio diffusion models, which are widely supported in the open-source community but are bidirectional and non-streaming by nature, can be repurposed efficiently into interactive models that run on consumer hardware. Examining the modern pipeline for block-wise outpainting diffusion, we identify critical inference-time inefficiencies that make it strictly less computationally efficient than its discrete-AR counterparts. We propose Live Music Diffusion Models (LMDMs), a simple modification of the generative diffusion process that first recovers, and then improves on, the inference complexity of discrete Live Music Models (LMMs) through block-wise KV caching. Unlike LMMs, LMDMs also enable stable post-training alignment through our novel ARC-Forcing paradigm, which reduces error accumulation without any explicit reinforcement learning (RL) or reward models. We demonstrate the application of LMDMs in several creative domains, including text-conditioned generation, sketch-based music synthesis, and jamming. Finally, we show how LMDMs can serve as a generative instrument in a real artist-AI collaboration, where they act as a "generative delay" that transforms musicians' improvisation live to produce variable timbral effects, all while running locally on a consumer gaming laptop.