Colored Noise Diffusion Sampling

Abstract

Diffusion models achieve state-of-the-art image synthesis, and their generative trajectories exhibit a fundamental spectral bias: low-frequency global structure is resolved early, and high-frequency fine detail later. Conventional stochastic differential equation (SDE) solvers ignore this dynamic, injecting uniform white noise throughout the entire process and thereby misusing a finite energy budget. In this work, we establish a mathematical framework that recasts SDE inference as a targeted, frequency-decoupled energy transfer. Building on this framework, we introduce Colored Noise Sampling (CNS), a novel, training-free stochastic solver. Rather than injecting uniform white noise, CNS uses a dynamic, timestep- and frequency-dependent schedule that allocates injected energy more efficiently toward structurally unresolved frequency bands. By actively exploiting the model's inherent spectral bias, CNS systematically steers the generated distribution toward the true data manifold. Extensive experiments demonstrate that CNS, used as a strictly plug-and-play, inference-time sampler substitution, significantly outperforms standard ODE and SDE baselines across diverse architectures (SiT, JiT, FLUX). Compared to standard sampling on ImageNet-256, CNS substantially reduces unguided FID, from 8.26 to 6.27 on SiT-XL/2, from 32.39 to 26.69 on JiT-B/16, and from 11.88 to 8.31 on JiT-H/16, while yielding consistent relative FID improvements with Classifier-Free Guidance. The project page is available at https://hadardavidson.github.io/CNS/.
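
To make the idea of timestep- and frequency-dependent noise concrete, the following is a minimal NumPy sketch of shaping white Gaussian noise in the Fourier domain while keeping its expected energy equal to that of white noise. It is not the CNS schedule, which the abstract does not specify. The function names, the Gaussian band shape, the `sharpness` parameter, and the convention that `t` runs from 1 (pure noise) to 0 (data) are all assumptions made purely for illustration.

```python
import numpy as np


def radial_frequency(h, w):
    """Radial frequency of every 2-D FFT bin, rescaled to [0, 1]."""
    fy = np.fft.fftfreq(h)[:, None]  # cycles per pixel, in [-0.5, 0.5)
    fx = np.fft.fftfreq(w)[None, :]
    r = np.sqrt(fx ** 2 + fy ** 2)
    return r / r.max()


def spectral_weight(r, t, sharpness=10.0):
    """Hypothetical schedule (an assumption, not the CNS schedule).

    A soft band centered at frequency 1 - t: near t = 1 it favors low
    frequencies, near t = 0 it favors high frequencies.
    """
    center = 1.0 - t
    return np.exp(-sharpness * (r - center) ** 2)


def colored_noise(shape, t, rng, sharpness=10.0):
    """Sample Gaussian noise whose spectrum follows spectral_weight at time t."""
    h, w = shape
    white = rng.standard_normal((h, w))
    weight = spectral_weight(radial_frequency(h, w), t, sharpness)
    # Rescale so the mean squared weight is 1. By Parseval's theorem the
    # expected per-pixel variance then stays 1, as for white noise, so the
    # energy budget is redistributed across frequencies rather than changed.
    weight = weight / np.sqrt(np.mean(weight ** 2))
    shaped = np.fft.ifft2(np.fft.fft2(white) * weight)
    # The weight is real and symmetric in frequency, so the imaginary part
    # is only floating-point rounding error.
    return shaped.real


def low_frequency_fraction(noise, cutoff=0.3):
    """Fraction of the signal energy at radial frequencies below cutoff."""
    power = np.abs(np.fft.fft2(noise)) ** 2
    r = radial_frequency(*noise.shape)
    return power[r < cutoff].sum() / power.sum()


rng = np.random.default_rng(0)
early = colored_noise((64, 64), t=0.9, rng=rng)  # energy mostly at low frequencies
late = colored_noise((64, 64), t=0.1, rng=rng)   # energy mostly at high frequencies
print(low_frequency_fraction(early), low_frequency_fraction(late))
```

In a stochastic sampler, a draw like this would take the place of the white-noise sample at each step. The sketch covers only the noise-shaping step; how the rest of the solver update must change to remain consistent with colored noise is not addressed here.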