DEMON: Diffusion Engine for Musical Orchestrated Noise

May 27, 2026
Author: Ryan Fosdick
cs.AI

Abstract

We present DEMON, a real-time diffusion engine that makes the denoising process playable as a live musical instrument: a control surface both broad (many parameters shaped per-frame across the output) and responsive (each control taking effect as fast as its place in the denoising loop allows). Built on ACE-Step 1.5 and StreamDiffusion's ring-buffer architecture with TensorRT acceleration, it sustains up to 12.3 decoder completions per second for 60-second music on a single consumer GPU (RTX 5090), or 11.3 generations per second at our production ring-depth of 4. At these rates denoising parameters become viable as live performance controls, but the ring buffer propagates per-request changes only at its drain rate, a floor of S denoising steps. We contribute four mechanisms. (1) Per-slot heterogeneous denoise scheduling: each ring-buffer slot owns its timestep schedule, so a moving denoise slider is tracked without wiping the in-flight queue, where the upstream global-schedule design must rebuild and discard it. (2) Shared mutable per-step state, giving any parameter consulted at every solver step next-tick effect, bypassing ring-buffer drain. (3) Per-frame source blending: a sampling-time control on the standard SDE re-noise step, giving a framewise transformation-strength axis that complements scalar denoise scheduling. (4) Windowed VAE decode exploiting receptive-field analysis for an 8.0x decode speedup. Together these separate streaming-diffusion parameters into four propagation classes, by onset and convergence latency.
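To make the contrast between mechanisms (1) and (2) concrete, the toy sketch below models a ring in which every tick admits one request and advances each in-flight slot by one step. It is an illustration only, not DEMON's implementation: `ToyRing`, `Slot`, `SharedState`, `make_schedule`, the linear schedule, and the `guidance` parameter are all hypothetical names and simplifications, and no actual denoising is performed. Each slot keeps the schedule it was admitted with, so moving the denoise slider leaves in-flight work intact, while a value held in shared mutable state is read by every slot on the very next tick.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SharedState:
    """Hypothetical mutable state that every solver step reads."""
    guidance: float = 1.0


@dataclass
class Slot:
    """One in-flight request that owns its own timestep schedule."""
    request_id: int
    schedule: list                              # timesteps still to run
    trace: list = field(default_factory=list)   # (timestep, guidance) per step


def make_schedule(denoise: float, steps: int) -> list:
    """Toy linear schedule starting at `denoise` (an assumption, not ACE-Step's)."""
    return [denoise * (steps - i) / steps for i in range(steps)]


class ToyRing:
    def __init__(self, steps: int, shared: SharedState):
        self.steps = steps
        self.shared = shared
        self.slots = deque()
        self.next_id = 0

    def tick(self, denoise: float) -> list:
        """Admit one request at the current slider value, then step every slot."""
        self.slots.append(
            Slot(self.next_id, make_schedule(denoise, self.steps)))
        self.next_id += 1
        for slot in self.slots:
            t = slot.schedule.pop(0)
            # Shared state is read at step time, so edits apply on the next tick.
            slot.trace.append((t, self.shared.guidance))
        finished = []
        while self.slots and not self.slots[0].schedule:
            finished.append(self.slots.popleft())
        return finished


shared = SharedState()
ring = ToyRing(steps=4, shared=shared)
ring.tick(1.0)
ring.tick(1.0)
ring.tick(0.5)             # slider moves; slots 0 and 1 keep their schedules
shared.guidance = 2.0      # shared parameter changes between ticks
done = ring.tick(0.5)      # request 0 completes on this tick
first = done[0]
```

In this toy model, request 0 finishes with its original schedule, yet its final step already sees the new `guidance` value. The slider change made on the third tick first reaches a finished output three ticks later, when request 2 completes after its four steps, which is the drain latency that the abstract says shared per-step state bypasses.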