Colored Noise Diffusion Sampling

Abstract

Diffusion models achieve state-of-the-art image synthesis, and their generative trajectories exhibit a spectral bias: low-frequency global structure is resolved early, and high-frequency fine detail later. Conventional stochastic differential equation (SDE) solvers do not account for this dynamic; they inject uniform white noise throughout the entire process and thereby misuse a finite energy budget. In this work, we establish a mathematical framework that recasts SDE inference as a targeted, frequency-decoupled energy transfer. Building on this framework, we introduce Colored Noise Sampling (CNS), a novel, training-free stochastic solver. Rather than injecting uniform white noise, CNS uses a dynamic, timestep- and frequency-dependent schedule that allocates injected energy more efficiently toward structurally unresolved frequency bands. By actively exploiting the model's inherent spectral bias, CNS systematically steers the generated distribution toward the true data manifold. Extensive experiments show that CNS, used as a strictly plug-and-play, inference-time sampler substitution, significantly outperforms standard ODE and SDE baselines across diverse architectures (SiT, JiT, FLUX). Compared to standard sampling on ImageNet-256, CNS substantially reduces unguided FID, from 8.26 to 6.27 on SiT-XL/2, from 32.39 to 26.69 on JiT-B/16, and from 11.88 to 8.31 on JiT-H/16, and it yields consistent relative FID improvements with classifier-free guidance. The project page is available at https://hadardavidson.github.io/CNS/.
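
To make the idea of timestep- and frequency-dependent noise concrete, the following is a minimal NumPy sketch of spectrally shaped ("colored") Gaussian noise. It is not the CNS schedule, which the abstract does not specify. The function names, the power-law weighting, the exponent schedule `alpha = 1 - 2t`, and the convention that `t = 1` is the start of sampling and `t = 0` the end are all assumptions made for illustration. The weights are normalized so that the expected total energy matches that of white noise, reflecting the fixed energy budget mentioned above.

```python
import numpy as np


def radial_frequency(height, width):
    """Radial frequency magnitude (cycles per pixel) of each 2-D FFT bin."""
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    return np.sqrt(fx**2 + fy**2)


def colored_noise(height, width, t, rng):
    """Draw Gaussian noise whose spectrum is tilted as a function of t.

    Hypothetical schedule (an assumption, not taken from the paper):
    alpha = 1 - 2t, so t = 1 emphasizes low frequencies, t = 0.5 gives
    plain white noise, and t = 0 emphasizes high frequencies.
    """
    alpha = 1.0 - 2.0 * t
    # Small offset keeps the zero-frequency weight finite when alpha < 0.
    eps = 1.0 / max(height, width)
    weight = (radial_frequency(height, width) + eps) ** alpha
    # Normalize so the mean squared weight is 1: the expected total energy
    # then equals that of unit-variance white noise.
    weight = weight / np.sqrt(np.mean(weight**2))

    white = rng.standard_normal((height, width))
    spectrum = np.fft.fft2(white) * weight
    # The weight is real and symmetric in frequency, so the inverse
    # transform is real up to floating-point error.
    return np.fft.ifft2(spectrum).real
```

In a stochastic sampler, a draw such as `colored_noise(64, 64, t, np.random.default_rng(0))` would stand in for the white-noise increment at timestep `t`. How the per-band energy should actually be scheduled is exactly what the framework in the paper addresses.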