Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators

Abstract

Interactive streaming music generation promises to bring generative models into live performance and co-creation, uses that offline models cannot support. However, state-of-the-art models belong to the discrete autoregressive (AR) regime and require industrial levels of compute for both training and inference. In this work, we investigate whether audio diffusion models, which are widely supported in the open-source community but are non-streaming and bidirectional by nature, can be efficiently repurposed into interactive models that run on consumer hardware. By examining the modern pipeline for block-wise outpainting diffusion, we identify key inefficiencies during inference that make it strictly less computationally efficient than its discrete-AR counterparts. We propose Live Music Diffusion Models (LMDMs), a simple modification of the generative diffusion process that, through block-wise KV caching, first recovers and then improves on the inference complexity of discrete Live Music Models (LMMs). Unlike LMMs, LMDMs also enable stable post-training alignment through our novel ARC-Forcing paradigm, which reduces error accumulation without any explicit reinforcement learning (RL) or reward models. We demonstrate the application of LMDMs in several creative domains, including text-conditioned generation, sketch-based music synthesis, and jamming. Finally, we show how LMDMs can be used as a generative instrument in a real artist-AI collaboration, where an LMDM acts as a "generative delay" that transforms musicians' improvisation live to produce varied timbral effects while running locally on a consumer gaming laptop.