DEMON: Diffusion Engine for Musical Orchestrated Noise
May 27, 2026
Author: Ryan Fosdick
cs.AI
Abstract
We present DEMON, a real-time diffusion engine that makes the denoising process playable as a live musical instrument: a control surface that is both broad (many parameters shaped per-frame across the output) and responsive (each control taking effect as fast as its place in the denoising loop allows). Built on ACE-Step 1.5 and StreamDiffusion's ring-buffer architecture with TensorRT acceleration, it sustains up to 12.3 decoder completions per second for 60-second music on a single consumer GPU (RTX 5090), or 11.3 generations per second at our production ring depth of 4. At these rates, denoising parameters become viable as live performance controls, but the ring buffer propagates per-request changes only at its drain rate, a floor of S denoising steps. We contribute four mechanisms. (1) Per-slot heterogeneous denoise scheduling: each ring-buffer slot owns its timestep schedule, so a moving denoise slider is tracked without wiping the in-flight queue, whereas the upstream global-schedule design must rebuild and discard it. (2) Shared mutable per-step state: any parameter consulted at every solver step takes effect on the next tick, bypassing ring-buffer drain. (3) Per-frame source blending: a sampling-time control on the standard SDE re-noise step, giving a framewise transformation-strength axis that complements scalar denoise scheduling. (4) Windowed VAE decode, which exploits receptive-field analysis for an 8.0x decode speedup. Together, these mechanisms separate streaming-diffusion parameters into four propagation classes, distinguished by onset and convergence latency.
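The difference between per-request and per-step propagation can be illustrated with a toy scheduling simulation. The sketch below is not the DEMON implementation and runs no model. It only tracks which timestep and which per-step parameter each ring slot would see on each tick. The names `Slot`, `Ring`, `make_schedule`, `simulate`, and the `guidance` parameter are all hypothetical, as is the linear schedule. It also assumes that ring depth equals the number of denoising steps S and that one request is admitted per tick.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    request_id: int
    schedule: list            # this slot's own timesteps, noisiest first
    step: int = 0             # index of the next timestep to run
    trace: list = field(default_factory=list)  # (timestep, guidance) per step


def make_schedule(denoise, num_steps):
    """Hypothetical linear schedule from `denoise` down toward 0."""
    return [denoise * (num_steps - i) / num_steps for i in range(num_steps)]


class Ring:
    def __init__(self, depth, shared):
        self.depth = depth      # assumed: ring depth == steps per request (S)
        self.shared = shared    # shared mutable per-step state
        self.slots = []
        self.next_id = 0

    def tick(self, denoise):
        # Admit one request; its schedule is built from the slider value now
        # and stays with the slot, so in-flight slots are never rebuilt.
        self.slots.append(Slot(self.next_id, make_schedule(denoise, self.depth)))
        self.next_id += 1
        # One batched solver step: each slot reads its own timestep, while
        # per-step parameters are read from the shared state at call time.
        for slot in self.slots:
            slot.trace.append((slot.schedule[slot.step], self.shared["guidance"]))
            slot.step += 1
        done = [s for s in self.slots if s.step == len(s.schedule)]
        self.slots = [s for s in self.slots if s.step < len(s.schedule)]
        return done


def simulate(depth=4, change_at=5, total_ticks=8):
    shared = {"guidance": 1.0}
    ring = Ring(depth, shared)
    denoise = 1.0
    finished = {}  # tick number -> Slot completed on that tick
    for t in range(1, total_ticks + 1):
        if t == change_at:
            denoise = 0.5             # per-request control (slider move)
            shared["guidance"] = 3.0  # per-step control (hypothetical name)
        for slot in ring.tick(denoise):
            finished[t] = slot
    return finished
```

With the default depth of 4 and both controls changed at tick 5, the slot completing on tick 5 already records the new `guidance` value on its final step, while the first slot built from the new `denoise` value does not complete until tick 8. Slots admitted before the change keep their original schedules and finish normally, which is the behavior the per-slot design is meant to preserve.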