DEMON: Diffusion Engine for Musical Orchestrated Noise

Abstract

We present DEMON, a real-time diffusion engine that makes the denoising process playable as a live musical instrument: a control surface that is both broad (many parameters shaped per frame across the output) and responsive (each control takes effect as fast as its place in the denoising loop allows). Built on ACE-Step 1.5 and StreamDiffusion's ring-buffer architecture with TensorRT acceleration, it sustains up to 12.3 decoder completions per second for 60-second music on a single consumer GPU (RTX 5090), or 11.3 generations per second at our production ring depth of 4. At these rates denoising parameters become viable as live performance controls, but the ring buffer propagates per-request changes only at its drain rate, a floor of S denoising steps. We contribute four mechanisms. (1) Per-slot heterogeneous denoise scheduling: each ring-buffer slot owns its timestep schedule, so a moving denoise slider is tracked without wiping the in-flight queue, whereas the upstream global-schedule design must discard and rebuild it. (2) Shared mutable per-step state, which gives any parameter consulted at every solver step next-tick effect, bypassing ring-buffer drain. (3) Per-frame source blending: a sampling-time control on the standard SDE re-noise step that provides a framewise transformation-strength axis complementing scalar denoise scheduling. (4) Windowed VAE decode, which exploits receptive-field analysis for an 8.0x decode speedup. Together these separate streaming-diffusion parameters into four propagation classes, distinguished by onset and convergence latency.
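The difference between drain-rate propagation and next-tick propagation can be illustrated with a toy model. The sketch below is not DEMON's implementation: `Ring`, `Slot`, `SharedState`, `make_schedule`, the `gain` parameter, and the linear schedule are all hypothetical names and simplifications introduced here. It assumes one request is admitted per tick and every in-flight slot advances one solver step per tick, so a ring of depth S emits one finished slot per tick once full.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SharedState:
    """Mutable state read by every slot at every solver step (hypothetical)."""
    gain: float = 1.0


@dataclass
class Slot:
    """One in-flight request that owns its own timestep schedule."""
    denoise: float
    schedule: List[float]
    step: int = 0
    trace: List[Tuple[float, float]] = field(default_factory=list)


def make_schedule(denoise: float, steps: int) -> List[float]:
    """Toy schedule: `steps` timesteps spaced linearly from `denoise` toward 0."""
    return [denoise * (steps - i) / steps for i in range(steps)]


class Ring:
    def __init__(self, depth: int, shared: SharedState) -> None:
        self.depth = depth
        self.shared = shared
        self.slots: List[Slot] = []

    def tick(self, denoise: float) -> Optional[Slot]:
        """Admit one request, advance every slot one step, emit the oldest if done."""
        # The new slot gets its own schedule; existing slots keep theirs.
        self.slots.append(Slot(denoise, make_schedule(denoise, self.depth)))
        for slot in self.slots:
            t = slot.schedule[slot.step]              # per-slot timestep
            slot.trace.append((t, self.shared.gain))  # shared value read at this step
            slot.step += 1
        if self.slots[0].step == self.depth:
            return self.slots.pop(0)
        return None


def simulate(depth: int = 4, change_tick: int = 4, total_ticks: int = 10):
    """Change both controls at `change_tick`; report ticks until each reaches the output."""
    shared = SharedState()
    ring = Ring(depth, shared)
    denoise = 0.5
    gain_onset = denoise_onset = None
    mixed = False
    for tick in range(total_ticks):
        if tick == change_tick:
            denoise = 0.9      # per-request parameter: travels with new slots only
            shared.gain = 2.0  # shared per-step parameter: seen by all slots at once
        done = ring.tick(denoise)
        if tick == change_tick:
            # Old and new schedules coexist in the ring; nothing was discarded.
            mixed = len({s.denoise for s in ring.slots}) > 1
        if done is None:
            continue
        if gain_onset is None and done.trace[-1][1] == 2.0:
            gain_onset = tick - change_tick + 1
        if denoise_onset is None and done.denoise == 0.9:
            denoise_onset = tick - change_tick + 1
    return gain_onset, denoise_onset, mixed


if __name__ == "__main__":
    print(simulate())
```

Under these assumptions, `simulate()` returns `(1, 4, True)`: the shared per-step value reaches the output on the tick it is changed, the per-request value first appears S = 4 ticks later, and slots with different schedules share the ring in between. The toy does not model convergence latency, source blending, or decoding.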