DEMON: Diffusion Engine for Musically Orchestrated Noise

Abstract

We present DEMON, a real-time diffusion engine that makes the denoising process playable as a live musical instrument: a control surface both broad (many parameters shaped per-frame across the output) and responsive (each control taking effect as fast as its place in the denoising loop allows). Built on ACE-Step 1.5 and StreamDiffusion's ring-buffer architecture with TensorRT acceleration, it sustains up to 12.3 decoder completions per second for 60-second music on a single consumer GPU (RTX 5090), or 11.3 generations per second at our production ring-depth of 4. At these rates denoising parameters become viable as live performance controls, but the ring buffer propagates per-request changes only at its drain rate, a floor of S denoising steps. We contribute four mechanisms. (1) Per-slot heterogeneous denoise scheduling: each ring-buffer slot owns its timestep schedule, so a moving denoise slider is tracked without wiping the in-flight queue, where the upstream global-schedule design must rebuild and discard it. (2) Shared mutable per-step state, giving any parameter consulted at every solver step next-tick effect, bypassing ring-buffer drain. (3) Per-frame source blending: a sampling-time control on the standard SDE re-noise step, giving a framewise transformation-strength axis that complements scalar denoise scheduling. (4) Windowed VAE decode exploiting receptive-field analysis for an 8.0x decode speedup. Together these separate streaming-diffusion parameters into four propagation classes, by onset and convergence latency.
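
The difference between drain-rate propagation and next-tick propagation can be illustrated with a toy simulation. The sketch below is not DEMON's or StreamDiffusion's code: it runs no model, and every name in it (`Ring`, `Slot`, `SharedState`, `make_schedule`, the `guidance` field) is a hypothetical stand-in. It assumes a ring of depth S in which every slot advances one solver step per tick, each slot keeps the timestep schedule it was admitted with, and one shared object is read by all slots at every step.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SharedState:
    """Hypothetical per-step state, read by every slot at every solver step."""
    guidance: float = 1.0


@dataclass
class Slot:
    request_id: int
    schedule: List[float]  # this slot's own timesteps (toy values)
    step: int = 0
    seen_guidance: List[float] = field(default_factory=list)


def make_schedule(denoise: float, steps: int) -> List[float]:
    """Toy schedule: linearly spaced values from `denoise` down towards 0."""
    return [denoise * (1 - i / steps) for i in range(steps)]


class Ring:
    """Toy ring buffer; the number of slots equals the number of steps S."""

    def __init__(self, depth: int, shared: SharedState) -> None:
        self.depth = depth
        self.shared = shared
        self.denoise = 1.0  # per-request parameter, snapshotted at admission
        self.slots: List[Slot] = []
        self.next_id = 0

    def tick(self) -> Optional[Slot]:
        # Admit one request per tick. It gets a schedule built from the
        # current slider value; in-flight slots keep theirs untouched.
        if len(self.slots) < self.depth:
            schedule = make_schedule(self.denoise, self.depth)
            self.slots.append(Slot(self.next_id, schedule))
            self.next_id += 1
        # One batched solver step: every slot reads the shared state now.
        for slot in self.slots:
            slot.seen_guidance.append(self.shared.guidance)
            slot.step += 1
        # Retire the oldest slot once it has consumed its whole schedule.
        if self.slots[0].step == len(self.slots[0].schedule):
            return self.slots.pop(0)
        return None


def run_demo(depth: int = 4) -> List[Slot]:
    """Warm up the ring, change both kinds of parameter, collect outputs."""
    shared = SharedState()
    ring = Ring(depth, shared)
    for _ in range(depth):       # the warm-up ticks fill the ring
        ring.tick()
    ring.denoise = 0.5           # per-request change: waits for the drain
    shared.guidance = 3.0        # per-step change: seen on the next tick
    outputs = [ring.tick() for _ in range(depth)]
    return [slot for slot in outputs if slot is not None]


if __name__ == "__main__":
    for out in run_demo():
        print(out.request_id, out.schedule[0], out.seen_guidance)
```

In this toy model, the first output after the change already carries the new `guidance` value in its final step, while the new `denoise` value first appears S ticks later, once the requests admitted before the change have drained. Because each slot owns its schedule, moving the slider never forces the in-flight slots to be rebuilt or discarded.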