Everything at Every Scale: Continuous Super-Resolution via Scale-Invariant Diffusion

Abstract

Creating images from noise is image generation; reconstructing fine details from coarse inputs is super-resolution. Despite their practical differences, both can be understood as reversing information loss across scales. We introduce SKILD, a Scale-invariant K-Space Image Learning Diffusion model that unifies generation and continuous super-resolution within a single unconditional framework. Both natural images and critical physical systems exhibit scale invariance, and we leverage this property to design a forward process that attenuates image content from fine to coarse scales while injecting spectrum-matched Gaussian noise, making scale an explicit coordinate of the diffusion dynamics. The same trained reverse process performs both generation and continuous super-resolution by varying only the starting timestep: no task-specific architecture, no conditioning branch, no classifier-free guidance, and no retraining per scale factor. Empirically, SKILD reaches an FID of 2.65 and an Inception Score of 9.63 on unconditional CIFAR-10, performs 2× to 8× super-resolution on ImageNet from a single unconditional checkpoint while outperforming conditional models across perceptual metrics, and reconstructs critical Ising models whose connected four-point correlations closely track the ground truth.
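The forward process described above can be illustrated with a minimal NumPy sketch. It is not the SKILD implementation: the Gaussian attenuation `exp(-tau * |k|^2)`, the power-law noise amplitude `|k|^(-alpha/2)`, and the names `radial_frequency`, `forward_sample`, `tau`, and `alpha` are all assumptions chosen for illustration, and the input is assumed to be a square single-channel image.

```python
import numpy as np


def radial_frequency(n):
    """|k| on an n x n FFT grid, in cycles per pixel."""
    f = np.fft.fftfreq(n)
    kx, ky = np.meshgrid(f, f, indexing="ij")
    return np.sqrt(kx**2 + ky**2)


def forward_sample(image, tau, alpha=2.0, rng=None):
    """Illustrative forward sample at blur time tau >= 0 (assumed schedule).

    Each Fourier mode keeps a fraction a(k) = exp(-tau * |k|^2) of its signal,
    so fine scales (large |k|) fade first. Gaussian noise is added with weight
    sqrt(1 - a^2) times an assumed power-law amplitude |k|^(-alpha / 2).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = image.shape[0]
    k = radial_frequency(n)
    k_safe = np.where(k == 0, 1.0 / n, k)      # avoid dividing by zero at DC

    signal_gain = np.exp(-tau * k**2)          # a(k, tau); equals 1 at tau = 0
    noise_amp = k_safe ** (-alpha / 2.0)       # assumed noise amplitude spectrum
    noise_gain = np.sqrt(1.0 - signal_gain**2) * noise_amp

    white = rng.standard_normal(image.shape)   # white Gaussian noise in pixel space
    mixed = signal_gain * np.fft.fft2(image) + noise_gain * np.fft.fft2(white)
    return np.fft.ifft2(mixed).real            # filters are radially symmetric
```

In this sketch, `forward_sample(img, 0.0)` returns the input unchanged, and larger `tau` replaces progressively coarser structure with noise. Under the same assumptions, super-resolution would correspond to starting the reverse process at the `tau` that matches the resolution of the input, and generation to starting at the largest `tau`.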