DEMON: Diffusion Engine for Musically Orchestrated Noise

Abstract

We present DEMON, a real-time diffusion engine that makes the denoising process playable as a live musical instrument: a control surface that is both broad (many parameters shaped per frame across the output) and responsive (each control takes effect as fast as its place in the denoising loop allows). Built on ACE-Step 1.5 and StreamDiffusion's ring-buffer architecture with TensorRT acceleration, it sustains up to 12.3 decoder completions per second for 60-second music on a single consumer GPU (RTX 5090), or 11.3 generations per second at our production ring depth of 4. At these rates, denoising parameters become viable as live performance controls, but the ring buffer propagates per-request changes only at its drain rate, a floor of S denoising steps. We contribute four mechanisms. (1) Per-slot heterogeneous denoise scheduling: each ring-buffer slot owns its timestep schedule, so a moving denoise slider is tracked without wiping the in-flight queue, whereas the upstream global-schedule design must discard and rebuild that queue. (2) Shared mutable per-step state: any parameter consulted at every solver step takes effect on the next tick, bypassing ring-buffer drain. (3) Per-frame source blending: a sampling-time control on the standard SDE re-noise step, giving a framewise transformation-strength axis that complements scalar denoise scheduling. (4) Windowed VAE decode, which exploits receptive-field analysis for an 8.0x decode speedup. Together, these mechanisms separate streaming-diffusion parameters into four propagation classes, distinguished by onset and convergence latency.
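
The difference between mechanisms (1) and (2) can be illustrated with a toy ring buffer. This is a minimal sketch under stated assumptions, not DEMON's or StreamDiffusion's actual code: `SharedState`, `Slot`, `Ring`, `make_schedule`, and the linear schedule are hypothetical, and `guidance` merely stands in for any parameter that is read at every solver step. Latents and the model call are omitted; only the bookkeeping is shown.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SharedState:
    """Hypothetical shared per-step state, read by every slot at every step."""
    guidance: float = 1.0


@dataclass
class Slot:
    """One in-flight request that owns its own timestep schedule."""
    request_id: int
    schedule: List[float]                      # timesteps still to run
    start_t: float = 0.0                       # first timestep, kept for inspection
    guidance_seen: List[float] = field(default_factory=list)


def make_schedule(denoise: float, steps: int) -> List[float]:
    """Toy linear schedule (an assumption): from t=denoise down toward 0."""
    return [denoise * (steps - i) / steps for i in range(steps)]


class Ring:
    def __init__(self, depth: int, shared: SharedState):
        self.depth = depth
        self.shared = shared
        self.slots: List[Slot] = []
        self.next_id = 0

    def tick(self, denoise: float) -> Optional[Slot]:
        """Admit one request, advance every slot one step, emit a finished slot."""
        sched = make_schedule(denoise, self.depth)
        self.slots.append(Slot(self.next_id, sched, start_t=sched[0]))
        self.next_id += 1
        for slot in self.slots:
            slot.schedule.pop(0)               # one solver step at this slot's own t
            slot.guidance_seen.append(self.shared.guidance)  # read shared state now
        if not self.slots[0].schedule:         # oldest slot has run all its steps
            return self.slots.pop(0)
        return None


def demo():
    shared = SharedState(guidance=1.0)
    ring = Ring(depth=4, shared=shared)
    denoise = 1.0
    out = []
    for tick in range(1, 7):
        if tick == 3:                          # performer moves both controls
            denoise = 0.5                      # per-request: new slots only
            shared.guidance = 2.0              # shared: every slot, next step
        done = ring.tick(denoise)
        if done is not None:
            out.append((done.request_id, done.start_t, done.guidance_seen))
    return out


if __name__ == "__main__":
    for row in demo():
        print(row)
```

In this toy model, the first output emitted after the change (request 0, at tick 4) already carries the new `guidance` value in its last two steps, while the new `denoise` value first appears in request 2, which is emitted at tick 6, a full ring depth after the change. In-flight requests 0 and 1 keep the schedules they were admitted with, so nothing is discarded.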